Build fast, sparse on-device models with the new TF MOT Pruning API

Posted by Yunlu Li and Artsiom Ablavatski

Introduction

Pruning is one of the core optimization techniques provided in the TensorFlow Model Optimization Toolkit (TF MOT). Not only does it help to significantly reduce model size, but it can also be used to accelerate CPU inference on mobile and web. With modern compute-intensive models, pruning as a model optimization technique has drawn significant attention, and it has been demonstrated that dense networks can be easily pruned (i.e. a fraction of the weights set to zero) with negligible quality degradation. Today, we are excited to announce a set of updates to the TF MOT Pruning API that simplify pruning and enable developers to build sparse models for fast on-device inference.

Updates to TF MOT

TensorFlow has long-standing support for neural network pruning via the TensorFlow Model Optimization Toolkit (TF MOT) Pruning API. The API, featured in 2019, introduced essential primitives for pruning and gave researchers throughout the world access to new optimization techniques. Today we are happy to announce experimental updates to the API that further advance model pruning. We are releasing tools that simplify the control of pruning and enable latency reduction for on-device inference.

The TF MOT Pruning API has extensive functionality that provides the user with tools for model manipulation:

  • prune_low_magnitude function applies PruneLowMagnitude wrapper to every layer in the model
  • PruneLowMagnitude wrapper handles low-level pruning logic
  • PruningSchedule controls when pruning is applied
  • PruningSummaries callback logs the pruning progress
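
To make the low-level pruning logic more concrete, here is a minimal pure-Python sketch of magnitude-based pruning on a flat list of weights. The function name `prune_by_magnitude` and its tie-breaking rule are assumptions made for illustration; this is not the PruneLowMagnitude implementation, which operates on Keras layer variables during training.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0.

    `sparsity` is the target fraction of zeros, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(round(sparsity * len(weights)))
    # Indices ordered by absolute value; ties keep their original order.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1]
sparse_weights = prune_by_magnitude(weights, sparsity=0.5)
print(sparse_weights)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

In actual training the sparsity target is not applied once but raised gradually, which is the job of the pruning schedule.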

These abstractions allow you to control almost any aspect of model pruning: how to prune (PruneLowMagnitude), when to prune (PruningSchedule) and how to track the progress of the pruning (PruningSummaries), with the exception of what to prune, i.e. where the PruneLowMagnitude wrapper is applied. We are happy to release an extension of TF MOT: PruningPolicy, a class that controls which parts of the model the PruneLowMagnitude wrapper is applied to. An instance of PruningPolicy is passed as an argument to the prune_low_magnitude function and provides the following functionality:

  • Controls where the pruning wrapper should be applied on per-layer basis through the allow_pruning function
  • Checks that the whole model supports pruning via ensure_model_supports_pruning function

PruningPolicy is an abstract interface, and it can have many implementations depending on the particular application. For latency improvements on CPU via XNNPACK, the concrete implementation PruneForLatencyOnXNNPack applies the pruning wrapper only to the parts of the model that can be accelerated via sparse on-device inference while leaving the rest of the network untouched. Such selective pruning allows an application to maintain model quality while targeting parts of the model that can be accelerated by sparsity.

The example below showcases the PruneForLatencyOnXNNPack policy in action on MobileNetV2 (the full example is available in a recently introduced colab):

import tensorflow as tf
import tensorflow_model_optimization as tfmot
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# See the implementation of the function below.
model = load_model_for_pruning()

model_for_pruning = prune_low_magnitude(
    model, pruning_policy=tfmot.sparsity.keras.PruneForLatencyOnXNNPack())

In order to comply with the constraints of XNNPACK sparse inference the Keras implementation of MobileNetV2 model requires slight modification of the padding for the first convolution operation:

def load_model_for_pruning():
  input_tensor = tf.keras.layers.Input((224, 224, 3))
  input_tensor = tf.keras.layers.ZeroPadding2D(1)(input_tensor)
  model = tf.keras.applications.MobileNetV2(input_tensor=input_tensor)

  def clone_fn(layer):
    if layer.name == 'Conv1':
      # The default padding `SAME` for the first convolution is incompatible
      # with XNNPACK sparse inference.
      layer.padding = 'valid'
      # We ask the model to rebuild since we've changed the padding parameter.
      layer.built = False
    return layer

  return tf.keras.models.clone_model(model, clone_function=clone_fn)

The PruneForLatencyOnXNNPack policy applies the pruning wrapper only to convolutions with 1×1 kernel size, since only these layers can be accelerated on CPU by as much as 2x using XNNPACK. The rest of the layers are left untouched, allowing the network to recover after quality degradation at the pruning step. The policy also verifies that the model is amenable to being pruned by using the ensure_model_supports_pruning method. Once the sparse model has been trained and converted, we recommend using the TensorFlow Lite benchmark utility in debug mode to confirm that the final model is compatible with XNNPACK’s sparse inference backend.
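
The per-layer decision described above can be pictured with a small pure-Python sketch. The `LayerInfo` class and this `allow_pruning` function are hypothetical stand-ins, not the TF MOT classes, and the layer names are only illustrative; the real policy works on Keras layers and may check further constraints.

```python
from dataclasses import dataclass


@dataclass
class LayerInfo:
    """Hypothetical, simplified description of a layer (not a Keras class)."""
    name: str
    kind: str           # e.g. "conv2d" or "depthwise_conv2d"
    kernel_size: tuple  # (height, width)


def allow_pruning(layer):
    """Mimic the rule stated above: only 1x1 convolutions are pruned."""
    return layer.kind == "conv2d" and layer.kernel_size == (1, 1)


layers = [
    LayerInfo("Conv1", "conv2d", (3, 3)),
    LayerInfo("expanded_conv_depthwise", "depthwise_conv2d", (3, 3)),
    LayerInfo("expanded_conv_project", "conv2d", (1, 1)),
    LayerInfo("block_1_expand", "conv2d", (1, 1)),
]
to_prune = [layer.name for layer in layers if allow_pruning(layer)]
print(to_prune)  # ['expanded_conv_project', 'block_1_expand']
```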

We hope that this newly introduced experimental API will be useful in practice and we will continue to improve its stability and flexibility in the future.

Compression and Latency Improvements

Model compression is another major benefit of applying pruning to a model. Using a smart compression format allows efficient storage of model weights which leads to a significant size reduction.

TFLite adopted the TACO format to encode sparse tensors. Compared to widely used formats like CSR and CSC, the TACO format has several advantages:

  1. It supports flexible traversal order to store a tensor in row-major or column-major formats easily.
  2. It supports multi-dimensional sparse tensors like the 4-D filter of a convolution op.
  3. It can represent block structure as the inner dimension of the tensor (example of a 4×4 tensor with 2×2 inner block structure).

We also adapted the format to use flexible data types for the metadata storing the indices of non-zero elements. This reduces the storage overhead for small tensors, or tensors with compact data types like int8_t.
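
A rough estimate shows why the metadata type matters for compact data types. The sketch below uses a plain CSR layout rather than TFLite's TACO-based format, and the helper `csr_size_bytes` and the tensor shape are assumptions chosen for illustration only.

```python
def csr_size_bytes(rows, nnz, value_bytes, index_bytes, pointer_bytes):
    """Approximate CSR size: values + column indices + row pointers."""
    values = nnz * value_bytes
    column_indices = nnz * index_bytes
    row_pointers = (rows + 1) * pointer_bytes
    return values + column_indices + row_pointers


rows, cols = 64, 128
dense_int8 = rows * cols   # 1 byte per int8 element
nnz = rows * cols // 4     # 75% sparsity leaves 25% non-zero elements

# int32 metadata everywhere versus the narrowest types that still fit:
# column indices < 128 fit in 1 byte, row pointers <= nnz fit in 2 bytes.
wide_metadata = csr_size_bytes(rows, nnz, 1, index_bytes=4, pointer_bytes=4)
narrow_metadata = csr_size_bytes(rows, nnz, 1, index_bytes=1, pointer_bytes=2)

print(dense_int8, wide_metadata, narrow_metadata)  # 8192 10500 4226
```

In this toy case the int8 tensor stored with 4-byte metadata is larger than its dense form even at 75% sparsity, while narrow metadata roughly halves the size.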

In order to realize size reductions in practice during the model conversion, the tf.lite.Optimize.EXPERIMENTAL_SPARSITY optimization needs to be applied. This optimization examines the model for sparse tensors and converts them to an efficient storage format. It also works seamlessly with quantization, and you can combine them to achieve more aggressive model compression. The full example of such a conversion is shown below:

# Remove the pruning wrappers from the model. 
model = tfmot.sparsity.keras.strip_pruning(model)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# We apply float16 quantization together with sparsity optimization that
# compactly stores pruned weights.
converter.optimizations = [
    tf.lite.Optimize.EXPERIMENTAL_SPARSITY,  # Enables size reduction optimization.
    tf.lite.Optimize.DEFAULT  # Enables quantization at conversion.
]
converter.target_spec.supported_types = [tf.float16]
tflite_buffer = converter.convert()

After applying the tf.lite.Optimize.EXPERIMENTAL_SPARSITY optimization together with PruneForLatencyOnXNNPack pruning policy, a ~2x size reduction can be achieved as is demonstrated in Figure 1:

Figure 1. Ablation study of MobileNetV2 model size (float32 and float16 types) with different sparsity levels using PruneForLatencyOnXNNPack pruning policy. Only the 1×1 convolutional layers are pruned and the rest of the layers are left dense.

In addition to size reduction, pruning can provide inference acceleration on CPU via XNNPACK. Using the PruneForLatencyOnXNNPack pruning policy, we’ve conducted an ablation study of CPU inference latency for a MobileNetV2 model on Pixel 4 using TensorFlow Lite benchmark with the use_xnnpack option enabled:

Figure 2. Ablation study of CPU inference speed of MobileNetV2 model with different sparsity levels on a Pixel 4 device.

The study in Figure 2 demonstrates a 1.7x latency improvement when running on mobile devices using XNNPACK. The strategies for training the sparse MobileNetV2 model, together with hyperparameters and pre-trained checkpoints, are described in Elsen et al.

Pruning techniques & tips

Pruning-aware training is a key step in model optimization. Many hyperparameters are involved in training, and some of them, like the pruning schedule and learning rate, can have a dramatic impact on the final quality of the model. Though many strategies have been proposed, a simple yet effective 3-step strategy (see Table 1) achieves strong performance for the majority of our use cases. The strategy builds on top of the well-proven approach from Zhu & Gupta and produces good results without extensive re-training:

Step 1. Pre-training or using pre-trained weights (optional)
  • Learning rate: The same as for the regular dense network: starting from a high value (possibly with warm-up) and ending with a low value
  • Duration: The same as for the regular dense network
  • Notes: Paired with weight decay regularization, this step helps the model to push unimportant weights towards 0 for pruning in the next step

Step 2. Iterative pruning
  • Learning rate: Constant, mean of the learning rate values for the regular training
  • Duration: 30 to 100 epochs
  • Notes: Iterative pruning step during which weights become sparse

Step 3. Fine-tuning
  • Learning rate: The same as at the first stage but without the warm-up stage
  • Duration: The same as at the first stage
  • Notes: Helps to mitigate quality degradation after the pruning step

Table 1. 3-step schedule for training the sparse model

The strategy inevitably leads to a substantial increase (~3x) in the training time. However, paired with the PolynomialDecay pruning schedule, this 3-step strategy achieves limited or no quality degradation with significantly pruned (>70%) neural networks.
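
As an illustration of how a polynomial decay schedule ramps up sparsity during the iterative pruning step, here is a minimal sketch of the formula from Zhu & Gupta. It is a simplification under stated assumptions: the function name and the step counts are hypothetical, the cubic exponent is the one used in that paper, and details of the TF MOT PolynomialDecay class such as pruning frequency are not modeled.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at `step` for a polynomial-decay pruning schedule."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** power


# Ramp from 0% to 80% sparsity over 1000 (hypothetical) training steps.
for step in (0, 250, 500, 750, 1000):
    target = polynomial_decay_sparsity(step, 0.0, 0.8, begin_step=0, end_step=1000)
    print(step, round(target, 4))
```

The schedule prunes aggressively early on, when many weights are redundant, and slows down as the target sparsity is approached.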

Pruned models in MediaPipe

Together with the updates to the TF MOT Pruning API, we are happy to release pruned models for some of the MediaPipe solutions. The released models include pose and face detectors as well as a pruned hand tracking model. All of these models have been trained with the newly introduced functionality using the 3-step pruning strategy. Compared with dense baselines, the released pruned models have demonstrated significant model size reduction as well as superior performance when running on CPU via XNNPACK. Quality-wise, the pruned models achieve similar metrics, including in the evaluation on our fairness datasets (see model cards for details). Side-by-side demos of the solutions are shown below:

MediaPipe example showing female waving at camera
MediaPipe example showing person jumping
Figure 3. Comparison of dense (left) and sparse (right) models in the end-to-end examples of face (top) and pose (bottom) detection

Pruning for GPU

While exploiting sparsity on GPUs can be challenging, recent work has made progress in improving the performance of sparse operations on these platforms. There is momentum for adding first-class support for sparse matrices and sparse operations in popular frameworks, and state-of-the-art GPUs have recently added hardware acceleration for some forms of structured sparsity. Going forward, improvements in software and hardware support for sparsity in both training and inference will be a key contributor to progress in the field.

Future directions

TF MOT offers a variety of model optimization methods, many of which have proven to be essential for efficient on-device model inference. We will continue to expand the TF MOT Pruning API with algorithms beyond low magnitude pruning, and also investigate the combination of pruning and quantization techniques to achieve even better results for on-device inference. Stay tuned!

Acknowledgments

Huge thanks to all who worked on this project: Karthik Raveendran, Ethan Kim, Marat Dukhan, Trevor Gale, Utku Evci, Erich Elsen, Frank Barchard, Yury Kartynnik, Valentin Bazarevsky, Matsvei Zhdanovich, Juhyun Lee, Chuo-Ling Chang, Ming Guang Yong, Jared Duke and Matthias Grundmann.


Real-World ML with Coral: Manufacturing

Posted by Michael Brooks, Coral

For over 3 years, Coral has been focused on enabling privacy-preserving Edge ML with low-power, high performance products. We’ve released many examples and projects designed to help you quickly accelerate ML for your specific needs. One of the most common requests we get after exploring the Coral models and projects is: How do we move to production?

With this in mind we’re introducing the first of our use-case specific demos. These demos are intended to take full advantage of the Coral Edge TPU™ with high performance, production-quality code that is easily customizable to meet your ML requirements. In this demo we focus on use cases that are specific to manufacturing: worker safety and quality grading / visual inspection.

Demo Overview

The Coral manufacturing demo targets an x86 or powerful ARM64 system with OpenGL acceleration that processes and displays two simultaneous inputs. The default demo, using the included example videos, looks like this:

two gifs side by side demonstrating the Coral manufacturing demo

The two examples being run are:

  • Worker Safety: Performs generic person detection (powered by COCO-trained SSDLite MobileDet) and then runs a simple algorithm to detect bounding box collisions to see if a person is in an unsafe region.
  • Visual Inspection: Performs apple detection (using the same COCO-trained SSDLite MobileDet from Worker Safety) and then crops the frame to the detected apple and runs a retrained MobileNetV2 that classifies fresh vs rotten apples.
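
The bounding box collision check used for worker safety can be sketched in a few lines of Python. The `Box` type, the function names and the one-pixel "feet" strip are assumptions for illustration and are not taken from the demo's source; the demo's own keepout logic may differ.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates, origin at the top-left corner."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def boxes_overlap(a, b):
    """True if the two boxes share any area."""
    return (a.xmin < b.xmax and b.xmin < a.xmax and
            a.ymin < b.ymax and b.ymin < a.ymax)


def is_unsafe(person, keepout, whole_box=False):
    """Check a detected person against a keepout region.

    By default only a thin strip at the bottom of the person box is tested;
    `whole_box=True` tests the entire box instead.
    """
    if whole_box:
        return boxes_overlap(person, keepout)
    feet = Box(person.xmin, person.ymax - 1, person.xmax, person.ymax)
    return boxes_overlap(feet, keepout)


keepout = Box(100, 0, 300, 150)
head_in_region = Box(150, 100, 200, 400)   # feet are well below the region
standing_inside = Box(150, 50, 200, 140)   # feet are inside the region

print(is_unsafe(head_in_region, keepout))                  # False
print(is_unsafe(head_in_region, keepout, whole_box=True))  # True
print(is_unsafe(standing_inside, keepout))                 # True
```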

By combining these two examples, we are able to demonstrate multiple Coral features that can enable this processing, including:

  • Co-compilation
  • Cascading models (using the output of one model to feed another)
  • Classification retraining
  • Real time processing of multiple inputs
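
The cascading pattern (detect, crop, then classify each crop) can be outlined as below. This is a hedged sketch only: `stub_detect` and `stub_classify` are hypothetical stand-ins for the two Edge TPU models, and the frame is a tiny grid of numbers rather than a real image.

```python
def crop(frame, box):
    """Crop a frame (list of pixel rows) to box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax] for row in frame[ymin:ymax]]


def run_cascade(frame, detect, classify, target_label="apple"):
    """Run `detect` on the frame, then `classify` on each matching crop."""
    results = []
    for label, box in detect(frame):
        if label != target_label:
            continue  # Only matching detections feed the second model.
        results.append((box, classify(crop(frame, box))))
    return results


def stub_detect(frame):
    """Stand-in for the detection model: returns fixed (label, box) pairs."""
    return [("apple", (0, 0, 2, 2)), ("person", (0, 2, 2, 4)),
            ("apple", (2, 2, 4, 4))]


def stub_classify(patch):
    """Stand-in for the classifier: an arbitrary brightness threshold."""
    pixels = [p for row in patch for p in row]
    return "fresh" if sum(pixels) / len(pixels) >= 5 else "rotten"


frame = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [5, 5, 2, 2],
    [5, 5, 2, 2],
]
print(run_cascade(frame, stub_detect, stub_classify))
# [((0, 0, 2, 2), 'fresh'), ((2, 2, 4, 4), 'rotten')]
```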

Creating The Demo

When designing a new ML application, it is critical to ensure that you can meet your latency and accuracy requirements. With the two applications described here, we went through the following process to choose models, train these models, and deploy to the Edge TPU – this process should be used when beginning any Coral application.

Choosing the Models

When deciding on a model to use, the new Coral Model Page is the best place to start. For this demo, we know that we need a detection model (which will be used for detection of both people and apples) as well as a classification model.

Detection

When picking a detection model from the Detection Model Page, there are four aspects to a model we want to look for:

  1. Training Dataset: In the case of the models page, all of our normal detection models use the COCO dataset. Referring to the labels, we can find both apples and people, so we can use just the one model for both detection tasks.
  2. Latency: We will need to run at least 3 inferences per frame and need this to keep up with our input (30 FPS). This means we need our detection to be as fast as possible. From the models page, we can see two good options: SSD MobileNet v2 (7.4 ms) and MobileDet (8.0 ms). This is the first point where we see the clear advantage of Coral – looking at the benchmarks at the bottom of our x86+USB CTS Output we can see even on a powerful workstation this would be 90 ms and 123 ms respectively.
  3. Accuracy/Precision: We also want as accurate a model as possible. This is evaluated using the primary challenge metric from the COCO evaluation metrics. Here MobileDet (32.8%) clearly outperforms MobileNet V2 (25.7%).
  4. Size: In order to fully co-compile this detection model with the classification model below, we need to ensure that we can fit both models in the 8 MB of cache on the Edge TPU. This means we want as small a model as possible. MobileDet is 5.1 MB, whereas MobileNet V2 is 6.6 MB.

With the above considerations, we chose SSDLite MobileDet.
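
The latency reasoning above comes down to simple arithmetic, sketched here in Python. The sketch assumes, as a simplification not stated in the original, that all three inferences run one after another and that each costs as much as one detection inference; the latency figures are the ones quoted above.

```python
FPS = 30
INFERENCES_PER_FRAME = 3  # "at least 3 inferences per frame"
frame_budget_ms = 1000 / FPS

# Per-inference detection latencies quoted above, in milliseconds.
latency_ms = {
    "SSD MobileNet v2 (Edge TPU)": 7.4,
    "MobileDet (Edge TPU)": 8.0,
    "SSD MobileNet v2 (workstation CPU)": 90.0,
    "MobileDet (workstation CPU)": 123.0,
}


def per_frame_ms(per_inference_ms):
    """Total time per frame under the serial, equal-cost assumption."""
    return INFERENCES_PER_FRAME * per_inference_ms


for name, latency in latency_ms.items():
    total = per_frame_ms(latency)
    verdict = "fits" if total <= frame_budget_ms else "exceeds"
    print(f"{name}: {total:.1f} ms per frame, "
          f"{verdict} the {frame_budget_ms:.1f} ms budget")
```

Under these assumptions both Edge TPU options leave headroom within the roughly 33 ms frame budget, while the CPU figures exceed it several times over.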

Classification

For the fresh-or-rotten apple classification, there are many more options on the Coral Classification Page. What we want to check is the same:

  1. Training Dataset: We’ll be retraining on our new dataset, so this isn’t critical in this application.
  2. Latency: We want the classification to be as fast as possible. Luckily many of the models on our page are extremely fast relative to the 30 FPS frame rate we demand. With this in mind we can eliminate all the Inception models and ResNet-50.
  3. Accuracy: Accuracy for Top-1 and Top-5 is provided. We want to be as accurate as possible for Top-1 (since we are only checking fresh vs rotten) – but still need to consider latency. With this in mind we eliminate MobileNet v1.
  4. Size: As mentioned above, we want to ensure we can fit both the detection and classification models (or as much as possible) so we can easily eliminate the EfficientNet options.

This leaves us with MobileNet v2 and MobileNet v3. We opted for v2 due to existing tutorials on retraining this model.

Retraining Classification

With the model decisions taken care of, we now need to retrain the classification model to identify fresh and rotten apples. Coral.ai offers training tutorials in Colab (uses post-training quantization) and Docker (uses quantization-aware training) formats – but we’ve also included the retraining Python script in this demo’s repo.

Our Fresh/Rotten data comes from the “Fruits fresh and rotten for classification” dataset – we simply omit everything but apples.

In our script, we first load the standard Keras MobileNetV2 – freezing the first 100 layers and adding a few extra layers at the end:

import tensorflow as tf

# `input_shape` is assumed to be defined earlier in the script.
base_model = tf.keras.applications.MobileNetV2(
    input_shape=input_shape,
    include_top=False,
    classifier_activation='softmax',
    weights='imagenet')
# Freeze first 100 layers
base_model.trainable = True
for layer in base_model.layers[:100]:
  layer.trainable = False
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=2, activation='softmax')
])
model.compile(
    loss='categorical_crossentropy',
    # `learning_rate` replaces the deprecated `lr` argument.
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),
    metrics=['accuracy'])
# summary() prints the table itself, so no print() wrapper is needed.
model.summary()

Next, with the dataset downloaded into ./dataset, we train our model:

import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images; validation images are only rescaled.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    zoom_range=0.3,
    rotation_range=50,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
val_datagen = ImageDataGenerator(rescale=1./255)
dataset_path = './dataset'
train_set_path = os.path.join(dataset_path, 'train')
val_set_path = os.path.join(dataset_path, 'test')
batch_size = 64
# `input_size` is assumed to be defined earlier in the script.
train_generator = train_datagen.flow_from_directory(
    train_set_path,
    target_size=input_size,
    batch_size=batch_size,
    class_mode='categorical')
val_generator = val_datagen.flow_from_directory(
    val_set_path,
    target_size=input_size,
    batch_size=batch_size,
    class_mode='categorical')
epochs = 15
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.n // batch_size,
    epochs=epochs,
    validation_data=val_generator,
    validation_steps=val_generator.n // batch_size,
    verbose=1)

Note that we’re only using 15 epochs. With the apple dataset, we can see that this model quickly reaches very high accuracy:

image of training and validation accuracy and loss

For your own dataset and model more epochs will likely be needed (the script will generate the above plots for validation).

We now have a Keras model that works for our apple quality inspector. In order to run this on a Coral Edge TPU, the model must be quantized and converted to TF Lite. We’ll do this using post-training quantization – quantizing based on a representative dataset after training:

def representative_data_gen():
  dataset_list = tf.data.Dataset.list_files('./dataset/test/*/*')
  # Use 100 (shuffled) test images as the calibration set.
  for image_path in dataset_list.take(100):
    image = tf.io.read_file(image_path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, input_size)
    image = tf.cast(image / 255., tf.float32)
    image = tf.expand_dims(image, 0)
    yield [image]

# Fix the batch dimension to 1 before conversion.
model.input.set_shape((1,) + model.input.shape[1:])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Full integer quantization with uint8 input and output.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
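
To give an intuition for what quantizing based on a representative dataset means, the following is a simplified pure-Python sketch of affine uint8 quantization for a single tensor. The function names are hypothetical and the min/max calibration is an assumption made for illustration; the TF Lite converter's actual calibration is more involved.

```python
def quantization_params(min_value, max_value, num_levels=256):
    """Derive scale and zero point for mapping [min_value, max_value] to uint8."""
    # The range must include 0.0 so that zero is exactly representable.
    min_value = min(min_value, 0.0)
    max_value = max(max_value, 0.0)
    if max_value == min_value:
        raise ValueError("range must not be empty")
    scale = (max_value - min_value) / (num_levels - 1)
    zero_point = int(round(-min_value / scale))
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to the nearest uint8 level, clamped to [0, 255]."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(255, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float value of a quantized level."""
    return scale * (q - zero_point)


# Values observed while running the representative samples (made up here).
observed = [0.0, 0.25, 0.5, 1.0]
scale, zero_point = quantization_params(min(observed), max(observed))
for x in observed:
    q = quantize(x, scale, zero_point)
    print(x, q, round(dequantize(q, scale, zero_point), 4))
```

The round-trip error of each value is at most half the scale, which is why a representative range that matches real inputs matters for accuracy.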

The script will then compile the model and evaluate both the Keras and TF Lite models – but we’ll need to take one extra step beyond the script: We must use the Edge TPU Compiler to co-compile the classification model with our detection model.

Co-compiling the models

We now have two quantized TF Lite models: classifier.tflite and the default CPU model for MobileDet taken from the Coral model page. We can compile them together to ensure that they share the same caching token – when either model is requested the parameter data will already be cached. This simply requires passing both models to the compiler:

edgetpu_compiler ssdlite_mobiledet_coco_qat_postprocess.tflite classifier.tflite 
Edge TPU Compiler version 15.0.340273435

Models compiled successfully in 1770 ms.

Input model: ssdlite_mobiledet_coco_qat_postprocess.tflite
Input size: 4.08MiB
Output model: ssdlite_mobiledet_coco_qat_postprocess_edgetpu.tflite
Output size: 5.12MiB
On-chip memory used for caching model parameters: 4.89MiB
On-chip memory remaining for caching model parameters: 2.74MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 125
Operation log: ssdlite_mobiledet_coco_qat_postprocess_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 124
Number of operations that will run on CPU: 1
See the operation log file for individual operation details.

Input model: classifier.tflite
Input size: 3.07MiB
Output model: classifier_edgetpu.tflite
Output size: 3.13MiB
On-chip memory used for caching model parameters: 2.74MiB
On-chip memory remaining for caching model parameters: 0.00B
Off-chip memory used for streaming uncached model parameters: 584.06KiB
Number of Edge TPU subgraphs: 1
Total number of operations: 72
Operation log: classifier_edgetpu.log
See the operation log file for individual operation details.

There are two things to note in this log. First, one operation runs on the CPU for the detection model – this is expected, since the TF Lite SSD PostProcess operation always runs on the CPU. Second, we couldn’t quite fit everything in on-chip memory: the classifier needs 584 KiB of off-chip memory. This is fine – we’ve still substantially reduced the amount of IO time needed. Both models will now be in the same folder, and because we co-compiled them they are aware of each other and the cache will persist parameters for both models.

Customizing For Your Application

The demo is optimized and ready for customization and deployment. It can be cross compiled for other architectures (currently it’s only for x86 or ARM64) and statically links libedgetpu to allow this binary to be deployed to many different Linux systems with an Edge TPU.

There are many things that can be done to customize the model to your application:

  • The quickest changes are the inputs, which can be adjusted via the --visual_inspection_input and --worker_safety_input flags. The demo accepts mp4 files and V4L2 camera devices.
  • The worker safety demo can be further improved with more complicated keepout algorithms (including consideration of angle/distance from camera) as well as retraining on overhead data. Currently the demo checks only the bottom of the bounding box, but the flag --safety_check_whole_box can be used to compare to the whole box (for situations like overhead cameras).
  • The apple inspection demonstrates simple quality grading / inspection – this cascaded model approach (using detection to determine bounding boxes and feeding into another model) can be applied to many different uses. By retraining the detection and classification model this can be customized for your application.

Conclusion

The Coral Manufacturing Demo demonstrates how Coral can be used in a production environment to solve multiple ML needs. The Coral accelerator provides a low-cost and low-power way to add enough ML compute to run both tasks in parallel without over-burdening the host. We hope that you can use the Coral Manufacturing Demo as a starting point for bringing Coral intelligence into your own manufacturing environment.

To learn more about ways edge ML can be used to benefit day to day operations across a variety of industries, visit our Industries page. For more information about Coral Products and Partner products with Coral integrated, please visit us at Coral.ai.


The TensorFlow Developer Certificate turns 1!

Posted by Alina Shinkarsky and Jocelyn Becker on behalf of the TensorFlow Team

The TensorFlow Developer Certificate exam is celebrating its first anniversary with a big milestone: more than 3,000 people have passed the exam! Successful test takers received the official TensorFlow Developer Certificate and badge. They also had the opportunity to showcase their skill set on social networks such as LinkedIn and the TensorFlow Certificate Network, where recruiters are able to seek out entry-level TensorFlow developers.

The TF Certificate program bridges the gap between companies’ demand for data-knowledgeable, production ML-capable engineers and the students and developers around the world interested in getting a job in ML.

The goal of this program is to provide developers around the world with the opportunity to demonstrate their skills in ML in an increasingly AI-driven global job market. This is a foundational certificate for students, developers, and data scientists who want to demonstrate practical machine learning skills through building and training basic models using TensorFlow.

We’ve followed up with folks who have taken the exam to understand the impact on their professional lives.

Fran shared, “Lost my job due to COVID 1 month before taking the exam, hired by Google in August, I think this cert helped a lot for my resume to be selected for interviews :)”

Fran is now a Conversational AI engineer, and has been working at Google for over 6 months.

photo of a googler wearing a noogler hat

Tia shared, “I was a stay-at-home mom when I started to learn Machine Learning at Google Bangkit Academy back in 2020. Bangkit supported us to take TF Certification and with this certificate, I was able to get back to work after years of hiatus. My current role is Machine Learning Curriculum Developer in Dicoding, an online technology education platform in Indonesia.”

Check out the short video below to hear Tia’s story.

Are you interested in earning the TensorFlow Developer Certificate? Learn more about the TensorFlow Developer Certificate on our website, including information on exam criteria, exam cost, and a stipend application to ensure taking this certificate exam is accessible, regardless of income.

If you’ve taken the exam and have feedback or would like to share your story, we would love to hear from you!

We look forward to growing this community of TensorFlow Developer Certificate recipients, and are immensely thankful for the continued contributions of our open source community!

Note: Participating in the program and/or obtaining this certificate are not endorsements of a participant’s employability nor guarantee of future work performance.


TensorFlow Hub for Real World Impact

Posted by Luiz GUStavo Martins and Elizabeth Kemp on behalf of the TensorFlow Hub team

As a developer, when you’re faced with a challenge, you might think: “How can I solve this?”, or, “What is the right tool for the job?” For a growing number of situations, machine learning can help! ML can solve many tasks that would be very challenging using classical programming, for example: detecting objects in images, classifying sound events, or understanding text.

But training machine learning models may take a lot of time, use large amounts of data, require deep expertise in the field, and be resource intensive. What if instead of starting from scratch, someone has already solved the same problem you have? Or at least solved a very similar problem that could give you a good starting point? This is where TensorFlow Hub can help you!

TensorFlow Hub is an open repository of pre-trained machine learning models ready for fine-tuning and deployable anywhere, from servers to mobile devices, microcontrollers and browsers.

Developers are using models available from TF Hub to solve real world problems across many domains, and at Google I/O 2021 we highlighted some examples of developers changing the world using models from TensorFlow Hub.

In this article, we’ll cover these use cases as well, with links so you can check them out.

Images

Image classification is one of the most popular use cases for machine learning. The development of this field helped the whole of machine learning by showing very good results and pushing the boundaries of research.

TensorFlow Hub has many models for the image problem domain for tasks like image classification, object detection, image segmentation, pose detection, style transfer and many others.

Many of the available models have a visualizer, like the one below, right on their documentation page, enabling you to try the model without any code or downloading anything.

TF Hub makes transfer learning simpler and makes it easier to experiment with many state-of-the-art models, like MobileNetV3 and EfficientNet V2, to find the best one for your data. A real-world use case can be seen in the CropNet tutorial, which creates the best model possible to detect diseases in cassava leaves and deploys it on device for use in the field.

Text

Understanding text has always been a very challenging task for computers because of all the context that is necessary, and the large number of words and phrases. Many state of the art Natural Language Processing (NLP) models are available on TensorFlow Hub and ready for you to use.

One example is the BERT family of models. Using them from TF Hub is easier than ever. Aside from the base BERT model, there are more advanced versions, available in many languages and ready to be used, as you can see in Making BERT Easier with Preprocessing Models From TensorFlow Hub.

Another good example is MuRIL, a multilingual BERT model trained on 17 Indian languages and used by developers to solve local NLP challenges in India.

An animation of the preprocessing model that makes it easy for you to input text into BERT.

Developers can also use the TF Hub spam detection model for detecting spam comments in online forums. The model is available for TF.js and TFLite for running in the browser and on-device.

Audio

TF Hub has many audio models that you can use on desktop, for on-device inference on mobile devices, or on the web. There are also audio embedding models that can be used with transfer learning which you can adapt to your own data.

gif of dog next to microphone and sound waves

Developers are using audio classification to understand what’s happening in a forest (How ZSL uses ML to classify gunshots to protect wildlife), in the ocean (Acoustic Detection of Humpback Whales Using a Convolutional Neural Network) or, even closer to us, in your own home (Important household sounds become more accessible).

Video

Video processing is increasingly important, and TensorFlow Hub also has models for this domain, such as the MoViNet collection for video classification and I3D for action recognition.

gif of video processor in TensorFlow Hub

TFHub also has tutorials for Action recognition, Video Interpolation and Text-to-video retrieval.

Summary

Reusing code is usually better than re-writing it. The same applies to machine learning models. If you can use a pre-trained model for a task, it can save you time, resources, and help you make an impact in the world. TensorFlow Hub has thousands of models available for you to deploy or customize to your task with transfer learning.

If you want to know more about how to use TensorFlow Hub and find great tutorials, check out the documentation at tensorflow.org/hub. To find models for your own real world impact, search on tfhub.dev.

Let us know what you build and also share with the community. You can talk to the team on the TensorFlow Forum and find a community that is eager to help!

Read More

Google demonstrates leading performance in latest MLPerf Benchmarks

Cross posted with the Google Cloud blog by Tao Wang, Software Engineer, Aarush Selvan, Product Manager.

The latest round of MLPerf benchmark results has been released, and Google’s TPU v4 supercomputers demonstrated record-breaking performance at scale. This is a timely milestone since large-scale machine learning training has enabled many of the recent breakthroughs in AI, with the latest models encompassing billions or even trillions of parameters (T5, Meena, GShard, Switch Transformer, and GPT-3).

Google’s TPU v4 Pod was designed, in part, to meet these expansive training needs, and TPU v4 Pods set performance records in four of the six MLPerf benchmarks Google entered using TensorFlow and JAX. These scores are a significant improvement over our winning submission from last year and demonstrate that Google once again has the world’s fastest machine learning supercomputers. These TPU v4 Pods are already widely deployed throughout Google data centers for our internal machine learning workloads and will be available via Google Cloud later this year.

Figure 1: Speedup of Google’s best MLPerf Training v1.0 TPU v4 submission over the fastest non-Google submission in any availability category – in this case, all baseline submissions came from NVIDIA. Comparisons are normalized by overall training time regardless of system size. Taller bars are better.¹

Let’s take a closer look at some of the innovations that delivered these ground-breaking results and what this means for large model training at Google and beyond.

Google’s continued performance leadership

Google’s submissions for the most recent MLPerf demonstrated leading top-line performance (fastest time to reach target quality), setting new performance records in four benchmarks. We achieved this by scaling up to 3,456 of our next-gen TPU v4 ASICs with hundreds of CPU hosts for the multiple benchmarks. We achieved an average of 1.7x improvement in our top-line submissions compared to last year’s results. This means we can now train some of the most common machine learning models in a matter of seconds.

Figure 2: Speedup of Google’s MLPerf Training v1.0 TPU v4 submission over Google’s MLPerf Training v0.7 TPU v3 submission (exception: DLRM results in MLPerf v0.7 were obtained using TPU v4). Comparisons are normalized by overall training time regardless of system size. Taller bars are better. Unet3D not shown since it is a new benchmark for MLPerf v1.0.²

We achieved these performance improvements through continued investment in both our hardware and software stacks. Part of the speedup comes from using Google’s fourth-generation TPU ASIC, which offers a significant boost in raw processing power over the previous generation, TPU v3. 4,096 of these TPU v4 chips are networked together to create a TPU v4 Pod, with each pod delivering 1.1 exaflop/s of peak performance.

Figure 3: A visual representation of 1 exaflop/s of computing power. If 10 million laptops were running simultaneously, then all that computing power would almost match the computing power of 1 exaflop/s.
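
As a rough back-of-the-envelope check, the pod-level peak quoted above can be divided across the chips in a pod, and the laptop comparison in Figure 3 implies a per-laptop throughput. The sketch below uses only numbers quoted in this post; the variable names are illustrative, and the results are simple divisions rather than measured per-device figures.

```python
# Figures quoted in the post.
POD_PEAK_FLOPS = 1.1e18   # 1.1 exaflop/s of peak performance per TPU v4 Pod
CHIPS_PER_POD = 4096      # TPU v4 chips networked into one pod
LAPTOPS = 10_000_000      # laptops in the Figure 3 comparison
ONE_EXAFLOP = 1e18        # 1 exaflop/s

# Implied peak per chip, in teraflop/s (1 teraflop/s = 1e12 flop/s).
per_chip_tflops = POD_PEAK_FLOPS / CHIPS_PER_POD / 1e12

# Implied throughput per laptop, in gigaflop/s (1 gigaflop/s = 1e9 flop/s).
per_laptop_gflops = ONE_EXAFLOP / LAPTOPS / 1e9

print(f"Implied peak per TPU v4 chip: ~{per_chip_tflops:.0f} teraflop/s")
print(f"Implied throughput per laptop: ~{per_laptop_gflops:.0f} gigaflop/s")
```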

In parallel, we introduced a number of new features into the XLA compiler to improve the performance of any ML model running on TPU v4. One of these features provides the ability to operate two (or potentially more) TPU cores as a single logical device using a shared uniform memory access system. This memory space unification allows the cores to easily share input and output data – allowing for a more performant allocation of work across cores. A second feature improves performance through a fine-grained overlap of compute and communication. Finally, we introduced a technique to automatically transform convolution operations such that space dimensions are converted into additional batch dimensions. This technique improves performance at the low batch sizes that are common at very large scales.

Enabling large model research using carbon-free energy

Though the margin of difference in top-line MLPerf benchmarks can be measured in mere seconds, this can translate to many days’ worth of training time on state-of-the-art models that comprise billions or trillions of parameters. To give an example, today we can train a 4 trillion parameter dense Transformer with GSPMD on 2048 TPU cores. For context, this is over 20 times larger than the GPT-3 model published by OpenAI last year. We are already using TPU v4 Pods extensively within Google to develop research breakthroughs such as MUM and LaMDA, and improve our core products such as Search, Assistant and Translate. The faster training times from TPUs result in efficiency savings and improved research and development velocity. Many of these TPU v4 Pods will be operating at or near 90% carbon-free energy. Furthermore, cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators – like TPUs – running inside them can be ~2-5X more effective than off-the-shelf systems.

We are also excited to soon offer TPU v4 Pods on Google Cloud, making the world’s fastest machine learning training supercomputers available to customers around the world, and we recently released an all-new Cloud TPU system architecture that provides direct access to TPU host machines, greatly improving the user experience.

Want to learn more?

Read how to get started using TPUs to train your model. We are excited to see how you will expand the machine learning frontier with access to exaflops of TPU computing power!

¹ All results retrieved from www.mlperf.org on June 30, 2021. MLPerf name and logo are trademarks. See www.mlperf.org for more information. Chart uses results 1.0-1067, 1.0-1070, 1.0-1071, 1.0-1072, 1.0-1073, 1.0-1074, 1.0-1075, 1.0-1076, 1.0-1077, 1.0-1088, 1.0-1089, 1.0-1090, 1.0-1091, 1.0-1092.

² All results retrieved from www.mlperf.org on June 30, 2021. MLPerf name and logo are trademarks. See www.mlperf.org for more information. Chart uses results 0.7-65, 0.7-66, 0.7-67, 1.0-1088, 1.0-1090, 1.0-1091, 1.0-1092.

Read More

2021 Request for Proposals: Faculty awards to support machine learning courses, diversity, and inclusion

Posted by Josh Gordon for the TensorFlow team

Google AI and the TensorFlow team have a funding opportunity open to universities. If you’re a faculty member interested in teaching machine learning courses, and/or leading or contributing to diversity initiatives, please read on to learn more. We have parallel goals for these awards, and you may apply for funding with one or both in mind.

  • We want to support you as you lead or contribute to diversity initiatives that widen access to ML education for currently underrepresented groups in computer science. We are especially interested in programs that include a clear focus on cultivating and retaining a “critical mass” of students, and that encourage students to pursue graduate study and/or future careers in ML.
  • We want to support you as you design, develop, and teach undergraduate or graduate level machine learning courses that include examples with open-source libraries. We would like to support courses that teach practical, in-demand skills, and to support courses that prepare students to solve new and challenging problems using ML in applied areas, such as healthcare, journalism, basic science, and others.

TensorFlow logo

We especially welcome proposals that combine these goals, proposals that include cross-institutional collaborations, and proposals that include collaborations between faculty and graduate students. We look forward to hearing your ideas!

To learn more, please see the RFP at goo.gle/tensorflow-rfp. Please note that the submission deadline is July 30th, 2021. For questions, you can reach out to tensorflow-rfp@google.com.

Read More

Meet TensorFlow community leads around the world

Posted by Joana Carrasqueira and Lynette Gaddi, Program Managers at Google

The TensorFlow community keeps growing every day, and includes many thousands of developers, educators, and researchers around the world. If you’d like to get involved with the community, there are many different organizations you can check out.

These include Special Interest Groups (SIGs), TensorFlow User Groups (TFUGs), and Google Developer Groups (GDGs). There are also many Google Developer Experts (GDEs) you can get in touch with. They’re knowledgeable about ML and help others in their community, and are a great point of contact to find future local events.

We spend a lot of time working with community leads, and in this article, we’d like to share some of their stories with you. We had the wonderful opportunity to interview several leads from different areas – including a SIG Lead, a Machine Learning GDE, and two TensorFlow User Group organizers, so you can learn about their background, how they got involved in the community, and how you can too.

TensorFlow branded banner with orange elements

Karl Lessard

TensorFlow SIG Lead for JVM

Montreal, Canada

Image of Karl Lessard

Karl has been working in software engineering and consulting for more than 20 years in various fields, including computer graphics and communications. He is now working full-time at Expedia in Montreal, focusing on delivering solutions for complex linguistic and localization challenges.

What does being a community leader mean to you?

What really matters is that all members of the group enjoy contributing to the project, and making it as fulfilling as possible for them. Because that’s what open-source is to me: a playground for grownups building something useful to the world! Being a community leader comes with a bunch of technical responsibilities, too, but for me that’s the most important thing.

How did you get involved in the TF community?

I started designing a few proposals to enhance the TensorFlow Java client, which at that time was offering minimal support for running model inference on Android devices. My proposed changes were welcomed by Google (special thanks to Asim Shankar), and I’ve submitted multiple pull requests over a couple years since then.

There was increasing general interest in supporting TensorFlow on the JVM following that, and I met the engineering team (and many others from the community) at a TensorFlow Dev summit in 2019 to suggest the idea of starting a group focusing on this topic. That’s how SIG JVM was born.

How do you contribute technically as a SIG Lead?

I still contribute to the design and the code of the project (like in the beginning), but I also review most of the pull requests, plan video calls, and make sure proposed changes are done with respect to the global vision of the project shared by other members, and they are being discussed properly and broadly.

Do you have any advice on how to get involved in the community?

If you can make it to the TensorFlow Dev/Contributor Summit, do it. That’s definitely a good place to meet people sharing the same interests as you. Also, you can get involved in the various discussions related to the topics of your interests. SIGs forums are a good place to start, and you can also get in touch with others on the new TensorFlow Forum. Finally, don’t be shy to make change proposals and/or to submit a few pull requests!

Ruqiya Bin Safi

Google Cloud GDE

Saudi Arabia

image of Ruqiya Bin Safi

Ruqiya Bin Safi is a Software Engineer who is interested in Artificial Intelligence, Machine Learning and Deep Learning as well as Data Science. Ruqiya started learning Machine Learning many years ago. She seeks to spread knowledge about Machine Learning and new technologies.

What does being a community leader mean to you?

It means a responsibility I take on, a goal I’ve accomplished, and a dream I fulfill. It means that I contribute to the development of my community and help others. To be a community leader means sharing useful knowledge so that everyone can benefit. Leadership is also a give and take: the community gives to me, and I give back to the community. It’s a cooperation. We share similar goals and interests, and vision and mission. And we seek to employ what we’ve learned to develop tools that make the world a better place.

How did you get involved in the TF community?

I’ve always loved learning new technologies and using them to solve problems. I started as a software engineer, and found machine learning increasingly interesting as I studied it, and eventually made it a focus. I got involved with the TF community through a Women Techmakers (WTM) event. I then joined a local WTM community to help others learn about ML, and later applied and was accepted into the GDE program.

How do you contribute as a ML GDE?

I love giving talks, and most of my contributions were as speaker and technical trainer through various tech talks, panel discussions, and workshops. My goal is to help and motivate people to learn more about ML. I also run a deep learning monthly workshop that aims to help participants gain foundational knowledge of common deep learning techniques as well as practical experience in building neural networks with TensorFlow. I also really enjoy mentoring for Google for Startups Accelerator MENA program as well as some hackathons, and I write articles from time to time about machine learning and TensorFlow.

Do you have any advice on how to get involved in the community?

One of the best ways is to try to learn something new, and then share what you have learned with your community through blogging or a local meetup. Love what you do, and as much as you can, find work that you enjoy – and trust in yourself as you become more active and involved.

Armel Yara

TensorFlow User Group, Organizer

Abidjan, West Africa

Photo of Armel Yara

Armel is a developer and TensorFlow community leader in Francophone Africa, where he organizes and hosts developer events in multiple languages (including large events like TensorFlow Everywhere SSA), and manages machine learning projects for local businesses.

What does being a community leader mean to you?

Being a community leader means for me sharing experiences, being available for others, and listening to their needs and expectations.

How did you get involved in the TF community?

I got involved in the TF community by sharing the latest news about the TF community on my blog and working on open source projects.

How do you contribute as a TFUG lead?

I organize events and give technical sessions at universities and online.

Do you have any advice on how to get involved in the community?

My advice to become more active and get more involved in the community is to look over the membership expectations, share projects that you have built, and use them to motivate others. Lead by example, let others know how much you enjoy what you do and showcase your work.

Nijat Zeynalov

TensorFlow User Group, Organizer

Azerbaijan

photo of Nijat Zeynalov

Nijat is a Certified TensorFlow Developer and a first-year master’s student at the University of Tartu. He’s passionate about data science and machine learning, and about organizing events to help others.

What does being a community leader mean to you?

I learned about leadership by running a local community where we aimed to provide free support to anyone interested in coding. In my mind, being a community leader means to help inspire others, and also to foster a community of respect that enables and encourages contributions of others. I strongly believe that leadership can be learned with practice.

How did you get involved in the TF community?

While I was preparing for the TensorFlow Developer certificate, I found a user groups page and I thought – why not set up a local user group for our country. I understood the responsibility of being a community leader, and I contacted a few TensorFlow User Group organizers to learn more about it. Their positive feedback about the overall impression made my decision even easier and motivated me to get started, and it’s been a great experience ever since.

How do you contribute as a community organizer?

We regularly discuss the latest TensorFlow updates in the user group, and we organise “Paper Reading Meetings” where we read and discuss one deep learning paper as a group. This has been a really great way for people to share their knowledge and ask questions. Additionally, in March, as a “TensorFlow User Group – Azerbaijan”, we held the 5-hour long “TensorFlow Everywhere – 2021” event which was the country’s largest machine learning event to date.

It was a pleasure to speak with Karl, Ruqiya, Armel and Nijat (thank you again for your time and contributions!) We hope their stories inspire you to get involved, and take on a leadership role in your local community in the future. If you’d like, you can start a conversation on the TensorFlow Forum and share how you got involved in the TensorFlow Community, and meet others. And check out the top of this post for more links to user and special interest groups.

Read More

Easier object detection on mobile with TensorFlow Lite

Posted by Khanh LeViet, Developer Advocate on behalf of the TensorFlow Lite team

Example of object detection on mobile

At Google I/O this year, we are excited to announce several product updates that simplify training and deployment of object detection models on mobile devices:

  • On-device ML learning pathway: a step-by-step tutorial on how to train and deploy a custom object detection model on mobile devices with no machine learning expertise required.
  • EfficientDet-Lite: a state-of-the-art object detection model architecture optimized for mobile devices.
  • TensorFlow Lite Model Maker for object detection: train custom models in just a few lines of code.
  • TensorFlow Lite Metadata Writer API: simplify metadata creation to generate custom object detection models compatible with TFLite Task Library.

Despite being a very common ML use case, object detection can be one of the most difficult to do. We’ve worked hard to make it easier for you, and in this blog post we’ll show you how to leverage the latest offerings from TensorFlow Lite to build a state-of-the-art mobile object detector using your own domain data.

On-device ML learning pathway: learn how to train and deploy a custom TensorFlow Lite object detection model in 12 minutes

Training a custom object detection model and deploying it to an Android app has become super easy with TensorFlow Lite. We released a learning pathway that teaches you step-by-step how to do it.

In the video, you can learn the steps to build a custom object detector:

  1. Prepare the training data.
  2. Train a custom object detection model using TensorFlow Lite Model Maker.
  3. Deploy the model on your mobile app using TensorFlow Lite Task Library.

There’s also a codelab with source code on GitHub for you to run through the code yourself. Please try it out and let us know your feedback!

EfficientDet-Lite: the state-of-the-art model architecture for object detection on mobile devices

Running machine learning models on mobile devices means we always need to consider the trade-off between model accuracy vs. inference speed and model size. The state-of-the-art mobile-optimized model doesn’t only need to be more accurate, but it also needs to run faster and be smaller. We adapted the neural architecture search technique published in the EfficientDet paper, then optimized the model architecture for running on mobile devices and came up with a novel mobile object detection model family called EfficientDet-Lite.

EfficientDet-Lite has 5 different versions: Lite0 to Lite4. Smaller versions run faster but are less accurate than larger ones. You can experiment with multiple versions of EfficientDet-Lite and choose the one that is most suitable for your use case.

Model architecture                 Size (MB)*   Latency (ms)**   Average Precision***
EfficientDet-Lite0                 4.4          37               25.69%
EfficientDet-Lite1                 5.8          49               30.55%
EfficientDet-Lite2                 7.2          69               33.97%
EfficientDet-Lite3                 11.4         116              37.70%
EfficientDet-Lite4                 19.9         260              41.96%
SSD MobileNetV2 320×320            6.7          24               20.2%
SSD MobileNetV2 FPNLite 640×640    4.3          191              28.2%

* Size of the integer quantized models.

** Latency measured on Pixel 4 using 4 threads on CPU.

*** Average Precision is the mAP (mean Average Precision) on the COCO 2017 validation dataset.
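
To make the size, latency and accuracy trade-off concrete, here is a minimal sketch that picks the most accurate model fitting a latency and size budget. The numbers are copied from the table above; the `ModelStats` type and the `pick_model` helper are hypothetical names introduced for illustration and are not part of any TensorFlow Lite API.

```python
from typing import List, NamedTuple, Optional


class ModelStats(NamedTuple):
    name: str
    size_mb: float      # size of the integer quantized model
    latency_ms: float   # Pixel 4, 4 CPU threads
    map_percent: float  # mAP on the COCO 2017 validation dataset


# Values copied from the comparison table in this post.
MODELS = [
    ModelStats("EfficientDet-Lite0", 4.4, 37, 25.69),
    ModelStats("EfficientDet-Lite1", 5.8, 49, 30.55),
    ModelStats("EfficientDet-Lite2", 7.2, 69, 33.97),
    ModelStats("EfficientDet-Lite3", 11.4, 116, 37.70),
    ModelStats("EfficientDet-Lite4", 19.9, 260, 41.96),
    ModelStats("SSD MobileNetV2 320x320", 6.7, 24, 20.2),
    ModelStats("SSD MobileNetV2 FPNLite 640x640", 4.3, 191, 28.2),
]


def pick_model(models: List[ModelStats],
               max_latency_ms: float,
               max_size_mb: float = float("inf")) -> Optional[ModelStats]:
    """Return the most accurate model within the budget, or None if none fits."""
    candidates = [
        m for m in models
        if m.latency_ms <= max_latency_ms and m.size_mb <= max_size_mb
    ]
    return max(candidates, key=lambda m: m.map_percent, default=None)


best = pick_model(MODELS, max_latency_ms=100)
print(best.name if best else "No model fits the budget")
```

Latency on your target device will differ from the Pixel 4 numbers, so treat the budget as a starting point and benchmark the shortlisted models yourself.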

We have released the EfficientDet-Lite models trained on the COCO dataset to TensorFlow Hub. You also can train EfficientDet-Lite custom models using your own training data with TensorFlow Lite Model Maker.

TensorFlow Lite Model Maker: train a custom object detection using transfer learning in a few lines of code

TensorFlow Lite Model Maker is a Python library that significantly simplifies the process of training a machine learning model using a custom dataset. It leverages transfer learning to enable training high quality models using just a handful of images.

Model Maker accepts datasets in the PASCAL VOC format and the Cloud AutoML’s CSV format. As you can create your own dataset using open-source GUI tools such as LabelImg or makesense.ai, everyone can create training data for Model Maker without writing a single line of code.
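
If your labeling tool exports PASCAL VOC XML and you would rather work with a single CSV file, the two formats can be bridged with a small script. The sketch below is illustrative only: the sample annotation, the `images` directory, and the `voc_to_csv_rows` helper are made up, and the CSV row layout (split, path, label, then normalized corner coordinates with unused fields left empty) reflects our understanding of the Cloud AutoML object detection format, so check it against the official format description before relying on it.

```python
import xml.etree.ElementTree as ET

# A made-up PASCAL VOC annotation for a single image with one labeled box.
VOC_XML = """<annotation>
  <filename>salad_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>Salad</name>
    <bndbox><xmin>64</xmin><ymin>48</ymin><xmax>320</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""


def voc_to_csv_rows(xml_text, split="TRAIN", image_dir="images"):
    """Convert one VOC annotation into CSV rows with normalized box corners."""
    root = ET.fromstring(xml_text)
    width = float(root.findtext("size/width"))
    height = float(root.findtext("size/height"))
    path = image_dir + "/" + root.findtext("filename")

    rows = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        # VOC stores pixel coordinates; the CSV uses values relative to image size.
        x_min = float(box.findtext("xmin")) / width
        y_min = float(box.findtext("ymin")) / height
        x_max = float(box.findtext("xmax")) / width
        y_max = float(box.findtext("ymax")) / height
        rows.append(
            f"{split},{path},{label},"
            f"{x_min:.4f},{y_min:.4f},,,{x_max:.4f},{y_max:.4f},,"
        )
    return rows


for row in voc_to_csv_rows(VOC_XML):
    print(row)
```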

Once you have your training data, you can start training a custom TensorFlow Lite object detector.

from tflite_model_maker import model_spec
from tflite_model_maker import object_detector

# Step 1: Choose the model architecture
spec = model_spec.get('efficientdet_lite2')

# Step 2: Load your training data
train_data, validation_data, test_data = object_detector.DataLoader.from_csv('gs://cloud-ml-data/img/openimage/csv/salads_ml_use.csv')

# Step 3: Train a custom object detector
model = object_detector.create(train_data, model_spec=spec, validation_data=validation_data)

# Step 4: Export the model in the TensorFlow Lite format
model.export(export_dir='.')

# Step 5: Evaluate the TensorFlow Lite model
model.evaluate_tflite('model.tflite', test_data)

Check out this notebook to learn more.

TensorFlow Lite Task Library: deploying object detection models on mobile in a few lines of code

TensorFlow Lite Task Library is a cross-platform library which simplifies TensorFlow Lite model deployments on mobile. Custom object detection models trained with TensorFlow Lite Model Maker can be deployed to an Android app in just a few lines of Kotlin code:

import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.task.vision.detector.ObjectDetector

// Step 1: Load the TensorFlow Lite model
val detector = ObjectDetector.createFromFile(context, "model.tflite")

// Step 2: Convert the input Bitmap into a TensorFlow Lite TensorImage object
val image = TensorImage.fromBitmap(bitmap)

// Step 3: Feed the image to the model and get the detection result
val results = detector.detect(image)

See our documentation to learn more about the customization options in Task Library, including how to configure the minimum detection threshold or the maximum number of detected objects.

TensorFlow Lite Metadata Writer API: simplify deployment of custom models trained with TensorFlow Object Detection API

Task Library relies on the model metadata bundled in the TensorFlow Lite model to execute the preprocessing and postprocessing logic required to run inference using the model. The metadata includes information such as how to normalize the input image and how to map class IDs to human-readable labels. Models trained using Model Maker include this metadata by default, making them compatible with Task Library. But if you train a TensorFlow Lite object detection model using a training pipeline other than Model Maker, you can add the metadata using the TensorFlow Lite Metadata Writer API.

For example, if you train a model using TensorFlow Object Detection API, you can add metadata to the TensorFlow Lite model using this Python code:

from tflite_support.metadata_writers import object_detector
from tflite_support.metadata_writers import writer_utils

LABEL_PATH = 'label_map.txt'
MODEL_PATH = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/model.tflite'
SAVE_TO_PATH = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/model_with_metadata.tflite'

# Step 1: Specify the preprocessing parameters and label file
writer = object_detector.MetadataWriter.create_for_inference(
    writer_utils.load_file(MODEL_PATH), input_norm_mean=[0],
    input_norm_std=[255], label_file_paths=[LABEL_PATH])

# Step 2: Export the model with metadata
writer_utils.save_file(writer.populate(), SAVE_TO_PATH)

Here we specify the normalization parameters (input_norm_mean=[0], input_norm_std=[255]) so that the input image will be normalized into the [0..1] range. You need to specify normalization parameters to be the same as in the preprocessing logic used during the model training.
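
To see what these parameters do, recall that normalization computes (pixel - mean) / std for every input value. The following minimal sketch applies that formula in plain Python; the `normalize` helper is a hypothetical name used for illustration, and the second parameter set (mean and std of 127.5) is included only as a common alternative that maps pixels into the [-1..1] range.

```python
def normalize(pixel, mean, std):
    """Apply the normalization described by the metadata: (pixel - mean) / std."""
    return (pixel - mean) / std


# Parameters from the example above: maps [0, 255] to [0, 1].
print([normalize(p, mean=0, std=255) for p in (0, 255)])

# A common alternative (an assumption, not taken from this post): maps [0, 255] to [-1, 1].
print([normalize(p, mean=127.5, std=127.5) for p in (0, 255)])
```

If the model was trained on inputs in [-1..1] but the metadata declares mean 0 and std 255, inference will still run but the predictions will likely degrade, which is why the two must match.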

See this notebook for a full tutorial on how to convert models trained with the TensorFlow Object Detection API to TensorFlow Lite and add metadata.

What’s next

Our goal is to make machine learning easier to use for every developer, with or without machine learning expertise. We are working with the Model Garden team to bring more object detection model architectures to Model Maker. We will also continue to work with researchers in Google to make future state-of-the-art object detection models available via Model Maker, shortening the path from cutting-edge research to production for everyone. Stay tuned for more updates!

Read More

Training with Multiple Workers using TensorFlow Quantum

Posted by Cheng Xing and Michael Broughton, Google

Training large machine learning models is a core ability for TensorFlow. Over the years, scale has become an important feature in many modern machine learning systems for NLP, image recognition, drug discovery, etc. Making use of multiple machines to boost computational power and throughput has led to great advances in the field. Similarly in quantum computing and quantum machine learning, the availability of more machine resources speeds up the simulation of larger quantum states and more complex systems. In this tutorial you will walk through how to use TensorFlow and TensorFlow Quantum to conduct large-scale and distributed QML simulations. Running larger simulations with greater FLOP/s counts unlocks new possibilities for research that otherwise wouldn’t be possible at smaller scales. In the figure below we have outlined approximate scaling capabilities for several different hardware settings for quantum simulation.

Running distributed workloads often comes with infrastructure complexity, but we can use Kubernetes to simplify this process. Kubernetes is an open source container orchestration system, and it is a proven platform to effectively manage large-scale workloads. While it is possible to have a multi-worker setup with a cluster of physical or virtual machines, Kubernetes offers many advantages, including:

  • Service discovery – workers can easily identify each other using well-known DNS names, rather than manually configuring IP destinations.
  • Automatic bin-packing – your workloads are automatically scheduled on different machines based on resource demand and current consumption.
  • Automated rollouts and rollbacks – the number of worker replicas can be changed by changing a configuration, and Kubernetes automatically adds/removes workers in response and schedules in machines where resources are available.

This tutorial guides you through a TensorFlow Quantum multi-worker setup using Google Cloud products, including Google Kubernetes Engine, a managed Kubernetes platform. You will have the chance to take the single-worker Quantum Convolutional Neural Network (QCNN) tutorial in TensorFlow Quantum and augment it for multi-worker training.

From our experiments in the multi-worker setting, training a 23-qubit QCNN with 1,000 training examples, which corresponds to roughly 3,000 circuits simulated using full state vector simulation, takes 5 minutes per epoch on a 32-node (512 vCPU) cluster, which costs a few US dollars. By comparison, the same training job on a single worker would take roughly 4 hours per epoch. Pushing things a little further, hundreds of thousands of 30-qubit circuits could be run in a few hours using more than 10,000 virtual CPUs, a job that could have taken weeks in a single-worker setting. The actual performance and cost may vary depending on your cloud setup, such as VM machine type, total cluster running time, etc. Before performing larger experiments, we recommend starting with a small cluster first, like the one used in this tutorial.
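
The arithmetic behind these figures can be sketched in a few lines. The epoch times and cluster size come from the paragraph above. The memory estimate assumes that each amplitude of the state vector is stored as a single-precision complex number (8 bytes); that is an assumption about the simulator's precision rather than something stated in this post, and the helper name is illustrative.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector: 2**n amplitudes.

    bytes_per_amplitude=8 assumes single-precision complex numbers.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


# Epoch times quoted above.
single_worker_minutes = 4 * 60   # roughly 4 hours per epoch
multi_worker_minutes = 5         # 5 minutes per epoch on the cluster
speedup = single_worker_minutes / multi_worker_minutes

# Cluster shape quoted above: 32 nodes, 512 vCPUs in total.
vcpus_per_node = 512 // 32

print(f"Speedup per epoch: {speedup:.0f}x on {vcpus_per_node} vCPUs per node")
print(f"23-qubit state vector: {state_vector_bytes(23) / 2**20:.0f} MiB")
print(f"30-qubit state vector: {state_vector_bytes(30) / 2**30:.0f} GiB")
```

Each additional qubit doubles the state vector, which is why 30-qubit circuits call for far more resources than 23-qubit ones.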

The source code for this tutorial is available in the TensorFlow Quantum GitHub repository. README.md contains the quickest way to get this tutorial up and running. This tutorial will instead focus on walking through each step in detail, to help you understand the underlying concepts and integrate them with your own projects. Let’s get started!

1. Setting up Infrastructure in Google Cloud

The first step is to create the infrastructure resources in Google Cloud. If you have an existing Google Cloud environment, the exact steps might vary, due to organizational policy constraints for example. This is a guideline to the most common set of necessary steps. Note that you will be charged for Google Cloud resources you create, and here is a summary of billable resources used in this tutorial. If you are a new Google Cloud user, you are eligible for $300 in credits. If you are part of an academic institution, you may be eligible for Google Cloud research credits.

You will be running several shell commands in this tutorial. For that, you can use either a local Unix shell available on your computer or the Cloud Shell, which already contains many of the tools mentioned later.

A script automating the steps below is available in setup.sh. This section walks through every step in detail, and if this is your first time using Google Cloud, we recommend that you walk through the entire section. If you prefer to automate the Google Cloud setup process and skip this section:

  • Open setup.sh and configure parameter values inside.
  • Run ./setup.sh infra.

In this tutorial, you will use a few Google Cloud products:

To get your cloud environment ready, first follow these quick start guides:

For purposes of this tutorial, you could stop the Kubernetes Engine quickstart right before the instructions for creating a cluster. In addition, install gsutil, the Cloud Storage command-line tool (if you are using Cloud Shell, gsutil is already installed):

gcloud components install gsutil

For reference, shell commands throughout the tutorial will refer to these variables. Some of them will make more sense later on in the tutorial in the context of each command.

  • ${CLUSTER_NAME}: your preferred Kubernetes cluster name on Google Kubernetes Engine.
  • ${PROJECT}: your Google Cloud project ID.
  • ${NUM_NODES}: the number of VMs in your cluster.
  • ${MACHINE_TYPE}: the machine type of VMs. This controls the amount of CPU and memory resources for each VM.
  • ${SERVICE_ACCOUNT_NAME}: The name of both the Google Cloud IAM service account and the associated Kubernetes service account.
  • ${ZONE}: Google Cloud zone for the Kubernetes cluster.
  • ${BUCKET_REGION}: Google Cloud region for Google Cloud Storage bucket.
  • ${BUCKET_NAME}: Name of the Google Cloud Storage bucket for storing training output.

To ensure you have permissions to run cloud operations in the rest of the tutorial, make sure you have either the IAM role of owner or all of the following roles:

  • container.admin
  • iam.serviceAccountAdmin
  • storage.admin

To check your roles, run:

gcloud projects get-iam-policy ${PROJECT}

with your Google Cloud project ID and search for your user account.

After you’ve completed the quickstart guides, run this command to create a Kubernetes cluster:

gcloud container clusters create ${CLUSTER_NAME} --workload-pool=${PROJECT}.svc.id.goog --num-nodes=${NUM_NODES} --machine-type=${MACHINE_TYPE} --zone=${ZONE} --preemptible

with your Google Cloud project ID and preferred cluster name.

--num-nodes is the number of Compute Engine virtual machines backing your Kubernetes cluster. This is not necessarily the same as the number of worker replicas you’d like to have for your QCNN job, as Kubernetes is able to schedule multiple replicas on the same node, provided that the node has enough CPU and memory resources. If you are trying this tutorial for the first time, we recommend 2 nodes.

--machine-type specifies the VM machine type. If you are trying this tutorial for the first time, we recommend “n1-standard-2”, with 2 vCPUs and 7.5GB of memory.

--zone is the Google Cloud zone where you’d like to run your cluster (for example “us-west1-a”).

--workload-pool enables the GKE Workload Identity feature, which ties Kubernetes service accounts to Google Cloud IAM service accounts. For fine-grained access control, we recommend using an IAM service account to access the various Google Cloud products. Here you’ll create a service account to be used by your QCNN jobs. A Kubernetes service account is the mechanism that injects the credentials of this IAM service account into your worker container.

--preemptible uses Compute Engine Preemptible VMs to back the Kubernetes cluster. They are up to 80% lower in cost compared to regular VMs, with the tradeoff that a VM may be preempted at any time, which will terminate the training process. This is well-suited for short-running training sessions with large clusters.

You can then create an IAM service account:

gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}

and integrate it with Workload Identity:

gcloud iam service-accounts add-iam-policy-binding --role roles/iam.workloadIdentityUser --member "serviceAccount:${PROJECT}.svc.id.goog[default/${SERVICE_ACCOUNT_NAME}]" ${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com

Now create a storage bucket, which is the basic container to store your data:

gsutil mb -p ${PROJECT} -l ${BUCKET_REGION} -b on gs://${BUCKET_NAME}

using your preferred bucket name. Bucket names must be globally unique, so we recommend including your project name as part of the bucket name. We recommend choosing the region that contains your cluster’s zone as the bucket region. The region of a zone is the zone name without the section after the last hyphen. For example, the region of zone “us-west1-a” is “us-west1”.
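If you script your setup, the zone-to-region rule above can be expressed in a couple of lines of Python. This is only a small sketch; the helper name region_from_zone is our own and is not part of setup.sh or any Google Cloud tool.

```python
def region_from_zone(zone: str) -> str:
    """Return the region of a zone by dropping the section after the last hyphen."""
    region, separator, _suffix = zone.rpartition("-")
    if not separator or not region:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


print(region_from_zone("us-west1-a"))  # us-west1
```

You could use the result as the value of ${BUCKET_REGION} so that the bucket sits in the same region as the cluster.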

To make your Cloud Storage data accessible by your QCNN jobs, give permissions to your IAM service account:

gsutil iam ch serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com:roles/storage.admin gs://${BUCKET_NAME}

2. Preparing Your Kubernetes Cluster

With the cloud environment set up, you can now install the necessary Kubernetes tools into the cluster. You’ll need tf-operator, a component from KubeFlow. KubeFlow is a toolkit for running machine learning workloads on Kubernetes, and tf-operator is a subcomponent which simplifies the management of TensorFlow jobs. tf-operator can be installed separately without the larger KubeFlow installation.

To install tf-operator, run:

docker pull k8s.gcr.io/kustomize/kustomize:v3.10.0
docker run k8s.gcr.io/kustomize/kustomize:v3.10.0 build "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.1.0" | kubectl apply -f -

(Note that tf-operator uses Kustomize to manage its deployment files, so it needs to be installed here as well.)

3. Training with MultiWorkerMirroredStrategy

You can now take the QCNN code found on the TensorFlow Quantum research branch and prepare it to run in a distributed fashion. Let’s clone the source code:

git clone https://github.com/tensorflow/quantum.git && cd quantum && git checkout origin/research && cd qcnn_multiworker

Or, if you are using SSH keys to authenticate to GitHub:

git clone git@github.com:tensorflow/quantum.git && cd quantum && git checkout origin/research && cd qcnn_multiworker

Code Setup

The training directory contains the necessary pieces for performing distributed training of your QCNN. The combination of training/qcnn.py and common/qcnn_common.py is the same as the hybrid QCNN example in TensorFlow Quantum, but with a few feature additions:

  • Training can now optionally leverage multiple machines with tf.distribute.MultiWorkerMirroredStrategy.
  • TensorBoard integration, which you will explore in more detail in the next section.

MultiWorkerMirroredStrategy is the mechanism in TensorFlow to perform synchronized distributed training. Your existing model has been augmented for distributed training with just a few extra lines of code.

At the beginning of training/qcnn.py, we set up MultiWorkerMirroredStrategy:

strategy = tf.distribute.MultiWorkerMirroredStrategy()

In the model preparation step, we then pass in this strategy as an argument:

... = qcnn_common.prepare_model(strategy)

Each worker of our QCNN distributed training job will run a copy of this Python code. Every worker needs to know the network endpoint of all other workers. The TF_CONFIG environment variable is typically used for this purpose, but in our case, the tf-operator injects it automatically behind the scenes.
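To make the role of TF_CONFIG more concrete, here is a minimal sketch of what the variable looks like and how a worker could read its own role from it. TF_CONFIG is a JSON document with a "cluster" section listing worker endpoints and a "task" section identifying the current worker. The endpoint names and port 2222 below are assumptions chosen for illustration, and read_task is a hypothetical helper, not a function from the tutorial code.

```python
import json
import os

# Hypothetical TF_CONFIG for worker 0 of a two-worker job. The endpoint
# names and port are assumptions for illustration only.
example_tf_config = {
    "cluster": {
        "worker": [
            "qcnn-worker-0.default.svc:2222",
            "qcnn-worker-1.default.svc:2222",
        ]
    },
    "task": {"type": "worker", "index": 0},
}


def read_task(env=os.environ):
    """Return (task_type, task_id, num_workers) parsed from TF_CONFIG."""
    config = json.loads(env.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    workers = config.get("cluster", {}).get("worker", [])
    return task.get("type"), task.get("index"), len(workers)


task_type, task_id, num_workers = read_task(
    {"TF_CONFIG": json.dumps(example_tf_config)}
)
# Worker 0 acts as the chief when no dedicated chief task is configured.
is_chief = task_type == "worker" and task_id == 0
print(task_type, task_id, num_workers, is_chief)
```

Every worker receives the same "cluster" section but a different "task" section, which is how each copy of the code knows which replica it is.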

After the model is trained, weights are uploaded to your Cloud Storage bucket to be accessed later by the inference job.

if task_type == 'worker' and task_id == 0:
    qcnn_weights_path='/tmp/qcnn_weights.h5'
    qcnn_model.save_weights(qcnn_weights_path)
    upload_blob(args.weights_gcs_bucket, qcnn_weights_path, f'qcnn_weights.h5')

Kubernetes Deployment Setup

Before proceeding to the Kubernetes deployment setup and launching your workers, several parameters need to be configured in the tutorial source code to match your own setup. The provided script, setup.sh, can be used to simplify this process.

Open setup.sh and configure parameter values inside, if you haven’t already done so in a previous step. Then run

./setup.sh param

At this point, the remaining steps in this section can be performed in one command:

make training

The rest of this section walks through the Kubernetes setup in detail.

Prior to running as containers in Kubernetes, the QCNN job needs to be packaged as a container image using Docker and uploaded to the Container Registry. The Dockerfile contains the specification for the image. To build and upload the image, run:

docker build -t gcr.io/${PROJECT}/qcnn .
docker push gcr.io/${PROJECT}/qcnn

Next, you’ll complete the Workload Identity setup by creating the Kubernetes service account using common/sa.yaml. This service account will be used by the QCNN containers.

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: ${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com
  name: ${SERVICE_ACCOUNT_NAME}

The annotation tells GKE this Kubernetes service account should be bound to the IAM service account you created previously. Let’s create this service account:

kubectl apply -f common/sa.yaml

The last step is to create the distributed training job. training/qcnn.yaml contains the Kubernetes specifications for your job. In Kubernetes, multiple containers with related functions are grouped into a single entity called a Pod, which is the most fundamental unit of work that can be scheduled. Typically, users leverage existing resource types such as Deployment and Job to create and manage workloads. You’ll instead use TFJob (as specified in the `kind` field), which is not a Kubernetes built-in resource type but rather a Custom Resource provided by the tf-operator, making it easier to work with TensorFlow workloads.

Notably, the TFJob spec contains the field tfReplicaSpecs.Worker, which lets you configure a MultiWorkerMirroredStrategy worker. Values of PS (parameter server), Chief, and Evaluator are also supported for asynchronous and other forms of distributed training. Under the hood, tf-operator creates two Kubernetes resources for each worker replica:

  • A Pod, using the Pod spec template at tfReplicaSpecs.Worker.template. This runs the container you’ve built previously on Kubernetes.
  • A Service, which exposes a well-known network endpoint visible within the Kubernetes cluster to give access to the worker’s gRPC training server. Other workers can communicate with its server by simply pointing to <service_name>:<port> (the alternative form of <service_name>.<service_namespace>.svc:<port> works as well).
TFJob
The TFJob generates one Service and Pod per worker replica. Once the TFJob is updated, changes are reflected in the underlying Services and Pods. Worker status is also reported in the TFJob.
The Service
The Service exposes worker servers to the rest of the cluster. Each worker communicates with other workers by using the destination worker’s Service name as the DNS name.

Within the worker spec, there are a few notable fields:

  • replicas: Number of worker replicas. It’s possible for multiple replicas to be scheduled on the same node, so this number is not limited to the number of nodes.
  • template: the Pod spec template for each worker replica
    • serviceAccountName: this gives the Pod access to the Kubernetes service account.
    • containers:
      • image: Points to the Container Registry image you’ve built previously.
      • command: the container’s entry point command.
      • args: command-line arguments.
      • ports: opens up one port for workers to communicate with each other, and another port for profiling.
    • affinity: this tells Kubernetes that you prefer to schedule worker Pods on different nodes as much as possible, to maximize resource utilization.
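To show how these fields nest, here is a rough sketch of a TFJob-shaped object built as a Python dictionary and printed as JSON (kubectl accepts JSON manifests as well as YAML). This is not the content of training/qcnn.yaml: the job name, service account name, image path, command, arguments, port names, and port numbers are all placeholder assumptions, and the affinity section is left out for brevity.

```python
import json

# Illustrative placeholders only; consult training/qcnn.yaml for the real values.
worker_spec = {
    "replicas": 2,
    "template": {
        "spec": {
            "serviceAccountName": "my-service-account",  # hypothetical name
            "containers": [
                {
                    "name": "tensorflow",  # assumed container name
                    "image": "gcr.io/my-project/qcnn",  # hypothetical project ID
                    "command": ["python3"],  # assumed entry point
                    "args": ["training/qcnn.py"],  # assumed arguments
                    "ports": [
                        # One port for worker-to-worker traffic, one for profiling.
                        {"name": "tfjob-port", "containerPort": 2222},
                        {"name": "profiler-port", "containerPort": 2223},
                    ],
                }
            ],
        }
    },
}

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "qcnn"},  # assumed job name
    "spec": {"tfReplicaSpecs": {"Worker": worker_spec}},
}

print(json.dumps(tfjob, indent=2))
```

The point of the sketch is the shape: everything under template is an ordinary Pod spec, and replicas tells tf-operator how many copies of it to create.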

To create the TFJob:

kubectl apply -f training/qcnn.yaml

Inspecting the Deployment

Congratulations! Your distributed training is now underway. To check the job’s status, run kubectl get pods a few times (or add -w to stream the output). Eventually you should see there are the same number of qcnn-worker Pods as your replicas parameter, and they all have status Running:

NAME            READY   STATUS    RESTARTS
qcnn-worker-0   1/1     Running   0
qcnn-worker-1   1/1     Running   0

To access the worker’s log output, run:

kubectl logs <worker_pod_name>

or add -f to stream the output. The output of qcnn-worker-0 looks like this:


I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://qcnn-worker-0.default.svc:2222

I tensorflow/core/profiler/rpc/profiler_server.cc:46] Profiler server listening on [::]:2223 selected port:2223

Epoch 1/50

4/4 [==============================] - 7s 940ms/step - loss: 0.9387 - accuracy: 0.0000e+00 - val_loss: 0.7432 - val_accuracy: 0.0000e+00

I tensorflow/core/profiler/lib/profiler_session.cc:71] Profiler session collecting data.
I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.

Epoch 50/50
4/4 [==============================] - 1s 222ms/step - loss: 0.1468 - accuracy: 0.4101 - val_loss: 0.2043 - val_accuracy: 0.4583
File /tmp/qcnn_weights.h5 uploaded to qcnn_weights.h5.

The output of qcnn-worker-1 should be similar except the last line is missing. The chief worker (worker 0) is responsible for saving weights of the entire model.

You can also verify that model weights are saved by visiting the Storage Browser in Cloud Console and browsing through the storage bucket you created previously.

To delete the training job, run

kubectl delete -f training/qcnn.yaml

4. Understanding Training Performance Using TensorBoard

TensorBoard is TensorFlow’s visualization toolkit. By integrating your TensorFlow Quantum model with TensorBoard, you get many visualizations of your model out of the box, such as training loss and accuracy, the model graph, and program profiling.

Code Setup

To enable TensorBoard for your job, create a TensorBoard callback and pass it into model.fit():

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=args.logdir,
                                                      histogram_freq=1,
                                                      update_freq=1,
                                                      profile_batch='10, 20')

history = qcnn_model.fit(x=train_excitations,
                         y=train_labels,
                         batch_size=32,
                         epochs=50,
                         verbose=1,
                         validation_data=(test_excitations, test_labels),
                         callbacks=[tensorboard_callback])

The profile_batch parameter enables the TensorFlow Profiler in programmatic mode, which samples the program during the training step range you specify here. You can also enable the sampling mode,

tf.profiler.experimental.server.start(args.profiler_port)

which allows on-demand profiling initiated either by a different program or through the TensorBoard UI.

TensorBoard Features

Here we’ll highlight a subset of TensorBoard’s many powerful features used in this tutorial. Check out the TensorBoard guide to learn more.

Loss and Accuracy

Loss is the quantity that the model aims to minimize during training, computed via a loss function. Accuracy is the fraction of samples during training where predictions match labels. The loss metric is exported by default. To enable the accuracy metric, add the following to the model.compile() step:

qcnn_model.compile(..., metrics=['accuracy'])

Custom Metrics

In addition to loss and accuracy, TensorBoard also supports custom metrics. For example, the tutorial code exports the QCNN readout tensor as a histogram.

Profiler

The TensorFlow Profiler is a helpful tool in debugging performance bottlenecks in your model training job.

In this tutorial, we use both the programmatic mode, in which profiling is done for a predefined training step range, as well as the sampling mode, in which profiling can be done on-demand. For a MultiWorkerMirroredStrategy setup, currently programmatic mode only outputs profiling data from the chief (worker 0), whereas sampling mode is able to profile all workers.

When you first open the Profiler, the data displayed is from the programmatic mode. The overview page gives you a sense of how long training took during each step. This will act as a reference as you experiment with different methods of improving training performance, whether that’s by scaling infrastructure (adding more VMs to the cluster, using VMs with more CPU and memory, integrating with hardware accelerators) or improving code efficiency.

Performance Summary

The trace viewer gives the duration breakdown of all the training instructions under the hood, providing a detailed view to identify execution time bottlenecks.

Kubernetes Deployment Setup

To view the TensorBoard UI, you can create a TensorBoard instance in Kubernetes. The Kubernetes setup is in training/tensorboard.yaml. This file contains two objects:

  • A Deployment containing 1 Pod replica of the same worker container image, but run with a TensorBoard command: tensorboard --logdir=gs://${BUCKET_NAME}/${LOGDIR_NAME} --port=5001 --bind_all
  • A Service creating a network load balancer to make the TensorBoard UI accessible on the Internet, so you can view it in your browser.

It is also possible to run a local instance of TensorBoard on your workstation by pointing --logdir to the same Cloud Storage bucket, although additional IAM permissions setup is required.

Create this Kubernetes setup:

kubectl apply -f training/tensorboard.yaml

In the output of kubectl get pods, you should see there’s a Pod with the prefix qcnn-tensorboard which is eventually in Running status. To get the IP address of the TensorBoard instance, run

kubectl get svc tensorboard-service -w
NAME                  TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)
tensorboard-service   LoadBalancer   10.123.240.9   <pending>     5001:32200/TCP

The load balancer takes some time to provision so you may not see the IP right away. Once it’s available, go to <ip>:5001 in your browser to access the TensorBoard UI.

With TensorFlow 2.4 and higher, it’s possible to profile multiple workers in sampling mode: workers can be profiled while a training job is running by clicking “Capture Profile” in the TensorBoard Profiler and setting “Profile Service URL” to qcnn-worker-<replica_id>:2223. To enable this, the profiler port needs to be exposed by the worker service. The tutorial source code provides a script which patches all worker Services generated by the TFJob with a profiler port. Run

training/apply_profiler_ports.sh

Note that the need to manually patch Services is temporary, and there is currently planned work in tf-operator to support specifying additional ports in the TFJob.

5. Running Inference

After completing the distributed training job, model weights are stored in your Cloud Storage bucket. You can then use these weights to construct an inference program, and then create an inference job in the Kubernetes cluster. It is also possible to run an inference program on your local workstation, although it requires additional IAM permissions to grant access to Cloud Storage.

Code Setup

Inference source code is available in the inference/ directory. The main file, qcnn_inference.py, mostly reuses the model construction code in common/qcnn_common.py, but loads model weights from your Cloud Storage bucket instead:

qcnn_weights_path = '/tmp/qcnn_weights.h5'
download_blob(args.weights_gcs_bucket, args.weights_gcs_path, qcnn_weights_path)
qcnn_model.load_weights(qcnn_weights_path)

It then applies the model to a test set and computes the mean squared error.

results = qcnn_model(test_excitations).numpy().flatten()
loss = tf.keras.losses.mean_squared_error(test_labels, results)
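For readers less familiar with the metric, here is a plain-Python sketch of what the mean squared error computes for one-dimensional inputs like these: the average of the squared differences between labels and predictions. The four label and prediction values are made up for illustration and are not taken from the tutorial’s output.

```python
def mean_squared_error(labels, predictions):
    """Average of squared differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sum(squared_errors) / len(squared_errors)


# Toy values for illustration: labels are +1/-1, predictions are real-valued.
toy_labels = [-1, 1, -1, 1]
toy_predictions = [-0.8, 0.4, -0.8, 0.5]
print(mean_squared_error(toy_labels, toy_predictions))
```

A prediction with the right sign but small magnitude (such as 0.4 for a label of 1) still contributes noticeably to the error, which is why the loss can stay well above zero even when most signs are correct.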

Kubernetes Deployment Setup

The remaining steps in this section can be performed in one command:

make inference

The inference program is built into the Docker image from the training step, so you don’t need to build a new image here. The inference job spec, inference/inference.yaml, contains a Job with its Pod spec pointing to the image but executes qcnn_inference.py instead. Run kubectl apply -f inference/inference.yaml to create the job.

The Pod prefixed with inference-qcnn should eventually be in Running status (kubectl get pods). In the log output of the inference Pod (kubectl logs <pod_name>), the mean squared error should be close to the final loss shown in the TensorBoard UI.


Blob qcnn_weights.h5 downloaded to /tmp/qcnn_weights.h5.
[-0.8220097 0.40201923 -0.82856977 0.46476707 -1.1281478 0.23317486
0.00584182 1.3351855 0.35139582 -0.09958048 1.2205497 -1.3038696
1.4065738 -1.1120421 -0.01021352 1.4553616 -0.70309246 -0.0518395
1.4699622 -1.3712595 -0.01870352 1.2939589 1.2865802 0.847203
0.3149605 1.1705848 -1.0051676 1.2537074 -0.2943283 -1.3489063
-1.4727883 1.4566276 1.3417912 0.9123422 0.2942805 -0.791862
1.2984066 -1.1139404 1.4648925 -1.6311806 -0.17530376 0.70148027
-1.0084027 0.09898916 0.4121615 0.62743163 -1.4237025 -0.6296255 ]
Test Labels
[-1 1 -1 1 -1 1 1 1 1 -1 1 -1 1 -1 1 1 -1 -1 1 -1 -1 1 1 1
1 1 -1 1 -1 -1 -1 1 1 1 -1 -1 1 -1 1 -1 -1 1 -1 1 1 1 -1 -1]
Mean squared error: tf.Tensor(0.29677835, shape=(), dtype=float32)

6. Cleaning Up

And this wraps up our journey through distributed training! After you are done experimenting with the tutorial, this section walks you through the steps to clean up Google Cloud resources.

First, remove the Kubernetes deployments. Run:

make delete-inference
kubectl delete -f training/tensorboard.yaml

and, if you haven’t done so already,

make delete-training

Then, delete the GKE cluster. This deletes the underlying VMs as well.

gcloud container clusters delete ${CLUSTER_NAME} --zone=${ZONE}

Next, delete the training data in your Google Cloud Storage.

gsutil rm -r gs://${BUCKET_NAME}

And lastly, remove the worker container image from Container Registry following these instructions using the Cloud Console. Look for the image name qcnn.

Next Steps

Now that you’ve tried out the multi-worker setup, try setting it up with your project! As all the tools mentioned in this tutorial continue to grow, best practices for training with multiple workers will change over time. Check back on the tutorial directory in the TensorFlow Quantum GitHub repository for updates!

As you continue to scale your experiment, you might eventually hit infrastructure limitations that require advanced configuration of the technologies used in this tutorial due to the complexity of working in a distributed environment. For a deeper dive into some of them, check out these resources:

If you are interested in conducting large-scale QML research in TensorFlow Quantum, check out our research credit application page to apply for cloud credits on Google Cloud.

Leveraging Machine Learning for Unstructured Data Processing at Pixie

A guest post by James Bartlett and Zain Asgar of Pixie.

At Pixie, our goal is to enable developers to quickly understand and debug production systems. We achieve this by providing developers easy access to an assortment of metric and log data from their production system. For example, we collect structured information about CPU and memory usage for each process in their system, as well as many types of unstructured data (for example, the body of an HTTP request, or the error message from a program).

These are just two examples; we collect many other types of data as well. For this blog post, we will focus on the vast amounts of unstructured data we collect in Pixie, such as HTTP request/response bodies. We foresee a future where this unstructured machine data can be queried as easily and efficiently as the structured data. To achieve this, we leverage state-of-the-art NLP techniques to learn the structure of the data.

In this article, we’d like to share our experience and efforts here, in the hopes they are useful to inform your thinking on similar problems.

HTTP clustering

Suppose a developer using Pixie wanted to get an idea of which types of HTTP requests are particularly slow. Instead of forcing the developer to sift through many individual HTTP requests by hand, we can instead cluster the HTTP requests semantically and then show them a timeseries of latencies for each type of semantically clustered request. To demonstrate this, let’s walk through the end result and then we’ll come back to how we got to this point. We will use Pixie to explore a demo application called Online Boutique. Once we have Pixie deployed to a Kubernetes cluster running Online Boutique, we can start to explore. For example, we can look at a graph of the network connections within the Online Boutique application:

graph of the network connections within the Online Boutique application

As you can see in the service graph, there’s a frontend service that handles incoming requests and sends them to their respective microservices. So let’s delve into the HTTP requests sent to the frontend service and their corresponding latencies.

HTTP Request Body                                               Latency (ms)
“product_id=L9ECAV7KIM&quantity=3”                              3.325302
“email=someone%40example.com&street_address=1600+Amphitheatr…   102.625462
“product_id=OLJCESPC7Z&quantity=3”                              3.4530790000000002
“product_id=L9ECAV7KIM&quantity=5”                              4.828718
“product_id=0PUK6V6EV0&quantity=2”                              5.319163
“email=someone%40example.com&street_address=1600+Amphitheatr…   107.361424
“product_id=0PUK6V6EV0&quantity=4”                              3.81733
“currency_code=EUR”                                             0.203676
“currency_code=USD”                                             0.220932
“product_id=0PUK6V6EV0&quantity=4”                              4.538055

From this small sample of requests, it’s not immediately clear what’s going on. It looks like the requests with `email=…?address=…` are much slower than the others, but we can’t be sure these examples weren’t just outliers. Instead of looking at more data, let’s use our soon-to-be-explained unstructured text clustering techniques to cluster the HTTP requests semantically by the contents of their bodies.

plot of the average 99th percentile response latency for requests for each semantic cluster

Here you can see a plot of the average 99th percentile response latency for requests for each semantic cluster. Using this view, you can quickly determine the three broad categories of requests coming into the frontend service, as well as the latency profiles of those requests. Immediately, we see that the “email” cluster of requests has significantly higher average p99 latency than the other clusters, and we see that the “product” cluster has occasional latency spikes. Both of these are actionable insights we can debug further. Now let’s dive in and discuss how we got to this point.
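Before getting into the model, the aggregation step behind a view like this can be sketched in a few lines of Python. This is only an illustration and not how Pixie computes its clusters: here the name of the first form field stands in for the semantic cluster ID that the model would provide, and we compute a simple mean rather than a p99. The request bodies and latencies are the ten samples listed earlier, with the truncated bodies left truncated.

```python
from collections import defaultdict
from statistics import mean

# The ten sample requests shown earlier: (request body, latency in ms).
sample_requests = [
    ("product_id=L9ECAV7KIM&quantity=3", 3.325302),
    ("email=someone%40example.com&street_address=1600+Amphitheatr", 102.625462),
    ("product_id=OLJCESPC7Z&quantity=3", 3.4530790000000002),
    ("product_id=L9ECAV7KIM&quantity=5", 4.828718),
    ("product_id=0PUK6V6EV0&quantity=2", 5.319163),
    ("email=someone%40example.com&street_address=1600+Amphitheatr", 107.361424),
    ("product_id=0PUK6V6EV0&quantity=4", 3.81733),
    ("currency_code=EUR", 0.203676),
    ("currency_code=USD", 0.220932),
    ("product_id=0PUK6V6EV0&quantity=4", 4.538055),
]


def cluster_key(body):
    """Stand-in for a semantic cluster ID: the name of the first form field."""
    return body.split("=", 1)[0]


latencies_by_cluster = defaultdict(list)
for body, latency_ms in sample_requests:
    latencies_by_cluster[cluster_key(body)].append(latency_ms)

mean_latency = {key: mean(values) for key, values in latencies_by_cluster.items()}
for key, value in sorted(mean_latency.items(), key=lambda item: -item[1]):
    count = len(latencies_by_cluster[key])
    print(f"{key:15s} {count:2d} requests, mean {value:8.2f} ms")
```

The hand-written cluster_key only works because these bodies happen to be form-encoded; the appeal of a learned clustering is that it does not need a parsing rule per data format.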

Model Development Details

Requirements

Since our models will be deployed on customers’ production clusters, they must be lightweight and performant; ideally fast enough to handle data at line rate with minimal CPU overhead. Any training on customer data must occur on the customer cluster to maintain data isolation guarantees. Furthermore, since the data plane is entirely contained on customer clusters, we have strict storage limitations for data, so we must leverage ML techniques to intelligently sample the data we collect.

Dataset

Due to our stringent data isolation requirements, we’re using the loghub dataset to bootstrap our model training. This dataset is a collection of log messages from various contexts (Android sys logs, Apache Server logs, supercomputer/HPC logs, etc). To test the model’s generalization to unseen log formats, we reserved the Android log data for testing, and trained on the remainder of the log data.

We use Google’s SentencePiece to tokenize the log messages. In particular, we use their implementation of unigram language model based subword tokenization with a vocab size of 16k. The following image shows a word cloud of all 16k vocabulary subword pieces that are generated by our tokenization. The size of each word indicates its frequency in the dataset.

Word cloud showing vocabulary subword pieces from Logpai Loghub machine log dataset tokenization.

This word cloud provides insight into the biases of our dataset. For example, about 30% of the data is Windows logs, as you can see from the high frequency of the tokens “windows” and “microsoft”. Also, if you have a keen eye, you might think we have a lot of frowny faces in our data set, but in fact “):” is almost always preceded by an opening parenthesis, as in the following examples:

[Thu Jan 26 12:23:07 2006] [error] config.update(): Can't create vm
[Fri Jan 27 11:55:16 2006] [error] [client 202.143.128.18] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /

Model Architecture

Using this tokenized dataset, we train a self-attention based model using left-to-right next word prediction (à la OpenAI’s GPT models). Left-to-right next word prediction is the task of trying to predict the next token given a sequence of prior context tokens. The left-to-right part distinguishes it from BERT style models that use bidirectional context (we will try bidirectional models in the future). This TensorFlow tutorial demonstrates training of a similar architecture, the only difference being that we drop the encoder side of the tutorial’s architecture.
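As a small illustration of the training task, the sketch below turns one tokenized message into (context, target) pairs, where each target is the token that follows its context. It is a simplification under stated assumptions: whitespace splitting stands in for SentencePiece subword tokenization, and next_token_examples is a hypothetical helper rather than part of our training pipeline.

```python
def next_token_examples(tokens):
    """Build (context, target) pairs for left-to-right next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting is a stand-in for subword tokenization.
tokens = "File does not exist".split()
examples = next_token_examples(tokens)
for context, target in examples:
    print(context, "->", target)
```

Each context only contains tokens to the left of its target, which is the property that separates this setup from bidirectional models.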

The architecture we use is displayed in the figure below. We have 6 decoder blocks, each with a self-attention layer and a feed-forward layer. Note that, for simplicity, the diagram leaves out the skip connections over the self-attention and feed-forward layers, as well as the layer normalizations that go with those skip connections.

GPT-style language model architecture

All in all, this architecture has 6.47M parameters, making it quite small in comparison to state-of-the-art language models. DistilBERT, for instance, has 66M parameters. On the other hand, the largest version of GPT-3 has 175B parameters.

We trained this architecture for 10 epochs with roughly 100 million unique log messages per epoch. After each epoch, we ran the model on a validation set and the model from the epoch with the highest validation accuracy was used as the final model. We achieved a test accuracy of 63.13% for next word prediction on the holdout Android log data. Given that we haven’t yet explored hyperparameter tuning, or other optimizations, this level of accuracy is a promising starting point.

We now have a way to predict future tokens in machine log data from context, with somewhat decent accuracy. However, this doesn’t immediately help us with our original goal of extracting structured features from the data. To further this goal, we will explore the feature space generated by the language model, rather than the predictions of the language model.

The goal is to transform our complicated data space into a fixed-dimensional feature space which we can then use for subsequent tasks. In order to do this we need to transform the outputs of the language model into a fixed-dimensional vector, which we will call the feature vector. One way to do this comes from BERT style models.

With BERT style models the way to extend the pre-trained language model to supervised tasks is to add a fully connected network on the output of the <CLS> (or <s>) token of the sentence, and then fine-tune the model with the fully-connected network on some classification task (this is illustrated in the figure below). This leads to a natural feature vector as the output prior to the softmax layer.

Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/

We plan to explore this method in the future, however for now we would like to see what results we can get without adding any extra supervision. This requires a heuristic approach to turn our sequence of outputs into a fixed-length vector. One way to do this is to use a max-pooling operator on the sequence dimension of the output. Suppose our language model outputs a sequence of 256-dimensional vectors, then a max-pooling on the sequence dimension will output a single 256-dimensional vector, where each dimension is the maximum value of that dimension across all outputs in the sequence. The idea behind this approach is that neurons that have stronger responses are more important to include in the final representation.
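The pooling step itself is simple enough to sketch in plain Python. The example below is an illustration rather than our production code: it uses three made-up output vectors of dimension 3 instead of 256, and max_pool_over_sequence is a name we chose for this sketch.

```python
def max_pool_over_sequence(outputs):
    """Collapse a sequence of equal-length vectors into one vector by
    taking the maximum of each dimension across the sequence."""
    if not outputs:
        raise ValueError("need at least one output vector")
    return [max(column) for column in zip(*outputs)]


# Three tokens, each with a 3-dimensional output (toy values).
toy_outputs = [
    [0.1, -2.0, 0.5],
    [0.7, -1.5, 0.2],
    [-0.3, 0.0, 0.4],
]
feature_vector = max_pool_over_sequence(toy_outputs)
print(feature_vector)  # [0.7, 0.0, 0.5]
```

Whatever the length of the input sequence, the result has the same number of dimensions as a single output vector, which is what makes it usable as a fixed-length feature vector.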

Results

We can test how well this method works for clustering on a subset of the loghub data that I’ve hand-labelled into semantic clusters. Below are three of the log messages in the hand-labelled test data set. The first two are labelled as belonging to the same semantic cluster, since both relate to failures to find files; the last is from a different cluster, since it’s an unrelated message about flushing data.

[Wed Nov 09 22:30:05 2005] [error] [client 216.138.114.25] script not found or unable to stat: /var/www/cgi-bin/awstats.p

[Sat Jan 28 19:29:29 2006] [error] [client 211.154.174.50] File does not exist: /var/www/html/modules

20171230-12:25:37:318|Step_StandStepCounter|30002312|flush sensor data

Using the hand-labelled test set, we can measure how well the model separates the different clusters. To do this, we use the KMeans algorithm to generate a clustering based on the output of the model, and then compare this clustering to the hand-labelled clustering. On this test set, the model’s adjusted Rand score, a metric where 0.0 is random labelling and 1.0 is perfect labelling, was 0.53. As with next word prediction accuracy, the performance isn’t great, but it is a good starting point.
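To make the metric more concrete, here is a small pure-Python sketch of the adjusted Rand index using its standard pair-counting formula. It is meant as an illustration of what the score measures, not as the code used to produce the 0.53 above; in practice a library implementation such as scikit-learn's `adjusted_rand_score` would be the usual choice. The function name and the toy labels are made up for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labellings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labellings must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Pairs of items that share a cluster in both labellings, in the
    # hand labels alone, and in the predicted labels alone.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    true_pairs = sum(comb(c, 2) for c in Counter(labels_true).values())
    pred_pairs = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected from a random labelling with the same cluster sizes.
    expected = true_pairs * pred_pairs / total_pairs
    max_index = (true_pairs + pred_pairs) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in one cluster
    return (both - expected) / (max_index - expected)


# Hypothetical hand labels and two hypothetical clusterings.
hand_labels = ["file_missing", "file_missing", "flush", "flush"]
same_up_to_renaming = [1, 1, 0, 0]
mixed_up = [0, 1, 0, 1]

perfect_score = adjusted_rand_index(hand_labels, same_up_to_renaming)
mixed_score = adjusted_rand_index(hand_labels, mixed_up)
print(perfect_score, mixed_score)
```

The score only depends on which items are grouped together, so the cluster IDs produced by KMeans do not need to match the names of the hand-labelled clusters.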

We can also view a low-dimensional representation of the feature space for the model, using PCA to reduce the dimensionality to two. The figure below shows the first two PCA dimensions of the embeddings for each point in the test data set. The colors represent the semantic cluster the point belongs to. Note that since this is a plot in a two-dimensional subspace of the embedding space, the absolute position of points carries little meaning; more meaning is derived from the tightness of each of the clusters. We can see that the model separates some of the classes reasonably well, but fails on others.

2-dimensional representation of the feature space of the model.
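For readers who want to reproduce this kind of plot, the following is a minimal sketch of projecting embeddings onto their first two principal components with NumPy. The function name `pca_2d` and the random stand-in embeddings are assumptions for illustration only; they are not the test data or the code behind the figure.

```python
import numpy as np


def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n_points, dim) embeddings onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random stand-in for 50 pooled 256-dimensional embeddings (hypothetical data).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 256))

points = pca_2d(fake_embeddings)
print(points.shape)  # (50, 2)
```

Each row of `points` can then be drawn as one dot in a scatter plot and colored by its hand-labelled cluster.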

Using this method, we should be able to cluster unstructured data in Pixie, and tag it with its semantic cluster ID, hence extracting a structured feature from our unstructured data. This particular feature is, as yet, not very human-interpretable, but we will get to that later.

Inference

So let’s try to implement this method within the Pixie system. In order to do that, we first need to convert our model into TensorFlow Lite and then load it into the Pixie execution engine. We decided to use TensorFlow Lite because we need to minimize overhead as much as possible, and in the future we would like the flexibility to deploy to heterogeneous edge devices including Raspberry Pis and ARM microcontrollers.

Converting to TensorFlow Lite is pretty simple. We create a TF function for our model and call the built-in converter to generate a TensorFlow Lite model (serialized as a FlatBuffer):

import tensorflow as tf

# model_path, max_length and create_padded_lookahead_mask are assumed to be
# defined elsewhere (they come from the training code).
model = tf.keras.models.load_model(model_path)

@tf.function(input_signature=[tf.TensorSpec([1, max_length], dtype=tf.int32)])
def pred_fn(encoded_text):
    # Create a mask that masks out 0 tokens, and future tokens for next word prediction.
    mask = create_padded_lookahead_mask(max_length)
    # Our saved model outputs both its next word predictions, and the activations of its
    # final layer. We only use the activations of the final layer for clustering purposes.
    model_preds, last_layer_output = model([encoded_text, mask], training=False)
    # Max pool over the seq dimension.
    return tf.reduce_max(last_layer_output, axis=1)

converter = tf.lite.TFLiteConverter.from_concrete_functions([pred_fn.get_concrete_function()])
tflite_model = converter.convert()

Pixie’s query engine allows querying and manipulating data collected by Pixie. This engine already has a KMeans operator, so all we need to do is load our TensorFlow Lite model into the engine, and then write a custom PxL script (a script in Pixie’s scripting language based on Python/Pandas) to cluster our data. We are working on a public API to load custom ML models into the engine, but for now we will use some internal features to do that. Once the model is loaded, we can use it on any unstructured data in the Pixie Platform.

Some of the areas we are currently exploring include our vision of federated differentially-private training of models, bidirectional language models à la BERT, compression schemes for unstructured data based on learned structural representations of the data, and anomaly detection on unstructured data.

Our goal on the Pixie ML team is to harness ML to simplify developers’ access to monitoring data, while operating in heterogeneous edge environments. If any of this interests you, or you have other questions, feel free to join our Slack group.

Pixie is an open-source project that gives you instant visibility into your application. It provides access to metrics, events, traces and logs without changing code. Pixie is in the process of being contributed to the CNCF (Cloud Native Computing Foundation). Pixie was originally created at Pixie Labs, Inc., and was contributed to open source by New Relic, Inc.

James is a software engineer at New Relic on the Pixie team. He was a founding engineer at Pixie Labs.

Zain is the GM/GVP of Pixie and Open Source at New Relic. He is also an Adjunct Professor of Computer Science at Stanford University. He was the Co-founder/CEO of Pixie Labs.
