Faster Dynamically Quantized Inference with XNNPack

Posted by Alan Kelly, Software Engineer

We are excited to announce that XNNPack’s Fully Connected and Convolution 2D operators now support dynamic range quantization. XNNPack is TensorFlow Lite’s CPU backend, and CPUs deliver the widest reach for ML inference and remain the default target for TensorFlow Lite, so improving CPU inference performance is a top priority. By adding support for dynamic range quantization to the Fully Connected and Convolution operators, we quadrupled inference performance in TensorFlow Lite’s XNNPack backend compared to the single-precision baseline. This means that more AI-powered features may be deployed to older and lower-tier devices.

Previously, XNNPack offered users a choice between full integer quantization, where the weights and activations are stored as signed 8-bit integers, and half-precision (fp16) or single-precision (fp32) floating-point inference. In this article we demonstrate the benefits of dynamic range quantization.

Dynamic Range Quantization

Dynamically quantized models are similar to fully-quantized models in that the weights for the Fully Connected and Convolution operators are quantized to 8-bit integers during model conversion. All other tensors are not quantized; they remain float32 tensors. During model inference, the floating-point layer activations are converted to 8-bit integers before being passed to the Fully Connected and Convolution operators. The quantization parameters (the zero point and scale) for each row of the activation tensor are calculated dynamically based on the observed range of activations. This maximizes the accuracy of the quantization process, as the activations make full use of the 8 quantized bits. In fully-quantized models, these parameters are fixed during model conversion, based on the range of the activation values observed using a representative dataset. The second difference between full quantization and dynamic range quantization is that the output of the Fully Connected and Convolution operators is in 32-bit floating-point format, as opposed to 8-bit integer for fully-quantized operators. With dynamic range quantization, we get most of the performance gains of full quantization, yet with higher overall accuracy.
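
To make the per-row calculation concrete, here is a small, illustrative sketch of how a scale and zero point could be derived from the observed range of one activation row and used to quantize it to signed 8-bit integers. It mirrors the idea described above but is not XNNPack’s actual kernel, and the function name is made up.

import numpy as np

def quantize_row_dynamic(row: np.ndarray):
    """Illustrative per-row dynamic range quantization to int8 (not XNNPack's kernel)."""
    # The quantized range must include 0.0 so that zero maps exactly to an integer.
    rmin, rmax = min(row.min(), 0.0), max(row.max(), 0.0)
    scale = (rmax - rmin) / 255.0 if rmax > rmin else 1.0
    zero_point = int(round(-128 - rmin / scale))
    quantized = np.clip(np.round(row / scale) + zero_point, -128, 127).astype(np.int8)
    return quantized, scale, zero_point

q, scale, zp = quantize_row_dynamic(np.array([-0.4, 0.1, 2.3], dtype=np.float32))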

Traditionally, the inference of such models was done using TensorFlow Lite’s native operators. Now dynamically quantized models can benefit from XNNPack’s highly optimized per-architecture implementations of the Fully Connected and Convolution 2D operators. These operators are optimized for all architectures supported by XNNPack (ARM, ARM64, x86 SSE/AVX/AVX512 and WebAssembly), including the latest Armv9 processors such as the Pixel 8’s Tensor G3 CPU or the OnePlus 11’s Snapdragon 8 Gen 2 CPU.

How can you use it?

Two steps are required to use dynamic range quantization. First, your model must be converted from TensorFlow with support for dynamic range quantization; existing models already converted using dynamic range quantization do not need to be reconverted. Dynamic range quantization is enabled during model conversion by setting the converter.optimizations = [tf.lite.Optimize.DEFAULT] converter flag. Unlike full integer quantization, no representative dataset is required and unsupported operators do not prevent conversion from succeeding. Dynamic range quantization is therefore far more accessible to non-expert users than full integer quantization.
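
For reference, a minimal conversion sketch could look like the following; the SavedModel path and output filename are placeholders, not part of the original post.

import tensorflow as tf

# Placeholder path to your own SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables dynamic range quantization
tflite_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)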

From TensorFlow 2.17, dynamically quantized XNNPack inference will be enabled by default in prebuilt binaries. If you want to use it sooner, the nightly TensorFlow builds may be used.

Mixed Precision Inference

In our previous article we presented the impressive performance gains from using half precision inference. Half-precision and dynamic range quantization may now be combined within XNNPack to get the best possible on-device CPU inference performance on devices which have hardware fp16 support (most phones on sale today do). The Fully Connected and Convolution 2D operators can output fp16 data instead of fp32. The Pixel 3, released in 2018, was the first Pixel model with fp16 support. fp16 uses half as many bits to store a floating-point value compared to fp32, meaning that the relative accuracy of each value is reduced due to the significantly shorter mantissa (10 vs 23 bits). Not all models support fp16 inference, but if a model supports it, the computational cost of vectorized floating-point operators can be reduced by half as the CPU can process twice as much data per instruction. Dynamically quantized models with compute-intensive floating point operators, such as Batch Matrix Multiply and Softmax, can benefit from fp16 inference as well.
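
As a quick, self-contained illustration of the precision gap (not a benchmark), NumPy reports the mantissa width and machine epsilon of both formats:

import numpy as np

# fp16 keeps a 10-bit mantissa versus 23 bits for fp32 (roughly 3 vs 7 decimal digits).
print(np.finfo(np.float16).nmant, np.finfo(np.float16).eps)  # 10 0.000977
print(np.finfo(np.float32).nmant, np.finfo(np.float32).eps)  # 23 1.1920929e-07
print(np.float16(1.2345678), np.float32(1.2345678))          # 1.234 1.2345678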

Performance Improvements

Below, we present benchmarks on four public models covering common computer vision tasks:

  1. EfficientNetV2 – image classification and feature extraction
  2. Inception-v3 – image classification
  3. Deeplab-v3 – semantic segmentation
  4. Stable Diffusion – image generation (diffusion model)

Each model was converted three times where possible: full float, full 8 bit signed integer quantization and dynamic range quantization. Stable Diffusion’s diffusion model could not be converted using full integer quantization due to unsupported operators. The speed-up versus the original float32 model using TFLite’s kernels is shown below.

  • FP32 refers to the baseline float32 model.
  • QS8 refers to full signed 8 bit integer quantization using XNNPack.
  • QD8-F32 refers to dynamically quantized 8 bit integers with fp32 activations using XNNPack.
  • QD8-F16 refers to dynamically quantized 8 bit integers with fp16 activations using XNNPack.
Graph showing speed-up versus float32 on Pixel 8

The speed-up versus TFLite’s dynamically quantized Fully Connected and Convolution operators is shown below. By simply using the latest version of TensorFlow Lite, you can benefit from these speed-ups.

Graph showing speed-up versus TFLite DQ on Pixel 8

We can clearly see that dynamic range quantization is competitive with, and in some cases can exceed, the performance of full integer quantization. Stable Diffusion’s diffusion model runs up to 6.2 times faster than the original float32 model! This is a game changer for on-device performance.

We would expect full integer quantization to be faster than dynamic range quantization, as all operations are calculated using integer arithmetic; furthermore, dynamic range quantization has the additional overhead of converting the floating-point activations to quantized 8-bit integers. Surprisingly, in two of the three models tested, this is not the case. Profiling the models using TFLite’s profiler solves the mystery: these models are slower due to a combination of quantization parameters (quantized arithmetic is more efficient when the ratio of the input and output scales falls within a certain range) and missing op support in XNNPack. These quantization parameters are determined during model conversion based on a provided representative dataset. Since the ratio of the scales falls outside the optimal range, a less optimal path must be taken and performance suffers.

Finally, we demonstrate that model precision is mostly preserved when using mixed precision inference. We compare the image generated by the dynamically quantized Stable Diffusion model using fp32 activations with the image generated using fp16 activations, seeding the random number generator with the same value in both cases, to verify that using fp16 activations for the diffusion model does not impact the quality of the generated image.

Image generated using fp16 inference (left) and fp32 inference (right)

Both generated images are indeed lovely cats, corresponding to the given prompt. The images are indistinguishable, which is a very strong indication that the diffusion model is suited to fp16 inference. Of course, as for all neural networks, any quantization strategy should be validated using a large validation dataset, and not just a single test.

Conclusions

Full integer quantization is hard: converting models is difficult and error prone, and accuracy is not guaranteed. The representative dataset must be truly representative to minimize quantization errors. Dynamic range quantization offers a compromise between full integer quantization and fp32 inference: the models are of similar size to fully-quantized models, and the performance gains are often similar and sometimes exceed those of fully-quantized models. Using Stable Diffusion, we showed that dynamic range quantization and fp16 inference can be combined, giving game-changing performance improvements. XNNPack’s dynamic range quantization is now powering Gemini, Google Meet, and Chrome OS audio denoising, and will launch in many other products this year. This same technology is now available to our open source users, so try it for yourself using the models linked above and following the instructions in the “How can you use it?” section!

Acknowledgements

We would like to thank Frank Barchard and Quentin Khan for contributions towards dynamic range quantization inference in TensorFlow Lite and XNNPack.

What’s new in TensorFlow 2.16

Posted by the TensorFlow team

TensorFlow 2.16 has been released! Highlights of this release (and 2.15) include Clang as the default compiler for building TensorFlow CPU wheels on Windows, Keras 3 as the default Keras version, support for Python 3.12, and much more! For the full list of changes, please see the release notes.

Note: Release updates on the new multi-backend Keras will be published on keras.io starting with Keras 3.0. For more information, please see https://keras.io/keras_3/.

TensorFlow Core

Clang 17

Clang is now the preferred compiler to build TensorFlow CPU wheels on the Windows platform starting with this release. The currently supported version is LLVM/Clang 17. The official wheels published on PyPI will be built with Clang; however, users retain the option to build wheels using the MSVC compiler by following the documented steps, as has been the case before. Intel owned the implementation and delivery of this change within the 3P Official Build program.

Keras 3

Keras 3 will be the default Keras version for TensorFlow 2.16 onwards. You may need to update your scripts to use Keras 3. Please refer to the new Keras documentation for Keras 3 (https://keras.io/keras_3). Keras 2 will continue to be released alongside TensorFlow as tf_keras. To continue using Keras 2 with TensorFlow 2.16+:

  • Install tf-keras via pip install tf-keras~=2.16
  • Switch tf.keras to use Keras 2 (tf-keras) by setting the environment variable TF_USE_LEGACY_KERAS=1 directly, or in your Python program with import os; os.environ["TF_USE_LEGACY_KERAS"] = "1", as sketched below. Please note that this needs to be set before importing TensorFlow and will apply to all packages in your Python runtime program.
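
For example, after installing tf-keras, a minimal sketch of the environment-variable approach looks like this:

import os

# Must be set before TensorFlow is imported anywhere in the program.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf

print(tf.keras.__version__)  # expected to report a tf-keras (Keras 2.x) version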

Estimator API

The tf.estimator API has been removed. If you need to use the Estimator API, you will need to use TF 2.15 or an earlier version.

Apple Silicon

If you previously installed TensorFlow using pip install tensorflow-macos, please update your installation method. Use pip install tensorflow from now on. The tensorflow-macos package will no longer receive updates; future updates will be released to the tensorflow package.

Graph neural networks in TensorFlow

Posted by Dustin Zelle – Software Engineer, Research and Arno Eigenwillig – Software Engineer, CoreML

This article is also shared on the Google Research Blog


Objects and their relationships are ubiquitous in the world around us, and relationships can be as important to understanding an object as its own attributes viewed in isolation; consider, for example, transportation networks, production networks, knowledge graphs, or social networks. Discrete mathematics and computer science have a long history of formalizing such networks as graphs, consisting of nodes connected by edges in various irregular ways. Yet most machine learning (ML) algorithms allow only for regular and uniform relations between input objects, such as a grid of pixels, a sequence of words, or no relation at all.

Graph neural networks, or GNNs for short, have emerged as a powerful technique to leverage both the graph’s connectivity (as in the older algorithms DeepWalk and Node2Vec) and the input features on the various nodes and edges. GNNs can make predictions for graphs as a whole (Does this molecule react in a certain way?), for individual nodes (What’s the topic of this document, given its citations?) or for potential edges (Is this product likely to be purchased together with that product?). Apart from making predictions about graphs, GNNs are a powerful tool used to bridge the chasm to more typical neural network use cases. They encode a graph’s discrete, relational information in a continuous way so that it can be included naturally in another deep learning system.

We are excited to announce the release of TensorFlow GNN 1.0 (TF-GNN), a production-tested library for building GNNs at large scale. It supports both modeling and training in TensorFlow as well as the extraction of input graphs from huge data stores. TF-GNN is built from the ground up for heterogeneous graphs where types and relations are represented by distinct sets of nodes and edges. Real-world objects and their relations occur in distinct types and TF-GNN’s heterogeneous focus makes it natural to represent them.

Inside TensorFlow, such graphs are represented by objects of type tfgnn.GraphTensor. This is a composite tensor type (a collection of tensors in one Python class) accepted as a first-class citizen in tf.data.Dataset, tf.function, etc. It stores both the graph structure and its features attached to nodes, edges and the graph as a whole. Trainable transformations of GraphTensors can be defined as Layer objects in the high-level Keras API, or directly using the tfgnn.GraphTensor primitive.
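
For illustration, a tiny GraphTensor with one node set and one edge set can be assembled from its pieces roughly as follows. This is a sketch based on TF-GNN’s documented constructors; the feature values are made up.

import tensorflow as tf
import tensorflow_gnn as tfgnn

# Three "paper" nodes with a 1-dimensional hidden state, and two "cites" edges.
graph = tfgnn.GraphTensor.from_pieces(
    node_sets={
        "paper": tfgnn.NodeSet.from_fields(
            sizes=tf.constant([3]),
            features={"hidden_state": tf.constant([[1.0], [2.0], [3.0]])},
        ),
    },
    edge_sets={
        "cites": tfgnn.EdgeSet.from_fields(
            sizes=tf.constant([2]),
            adjacency=tfgnn.Adjacency.from_indices(
                source=("paper", tf.constant([0, 1])),
                target=("paper", tf.constant([1, 2])),
            ),
        ),
    },
)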

GNNs: Making predictions for an object in context

For illustration, let’s look at one typical application of TF-GNN: predicting a property of a certain type of node in a graph defined by cross-referencing tables of a huge database. For example, consider a citation database of Computer Science (CS) arXiv papers with one-to-many cites and many-to-one cited relationships, where we would like to predict the subject area of each paper.

Like most neural networks, a GNN is trained on a dataset of many labeled examples (~millions), but each training step consists only of a much smaller batch of training examples (say, hundreds). To scale to millions, the GNN gets trained on a stream of reasonably small subgraphs from the underlying graph. Each subgraph contains enough of the original data to compute the GNN result for the labeled node at its center and train the model. This process — typically referred to as subgraph sampling — is extremely consequential for GNN training. Most existing tooling accomplishes sampling in a batch way, producing static subgraphs for training. TF-GNN provides tooling to improve on this by sampling dynamically and interactively.

Pictured, the process of subgraph sampling where small, tractable subgraphs are sampled from a larger graph to create input examples for GNN training.

TF-GNN 1.0 debuts a flexible Python API to configure dynamic or batch subgraph sampling at all relevant scales: interactively in a Colab notebook (like this one), for efficient sampling of a small dataset stored in the main memory of a single training host, or distributed by Apache Beam for huge datasets stored on a network filesystem (up to hundreds of millions of nodes and billions of edges). For details, please refer to our user guides for in-memory and beam-based sampling, respectively.

On those same sampled subgraphs, the GNN’s task is to compute a hidden (or latent) state at the root node; the hidden state aggregates and encodes the relevant information of the root node’s neighborhood. One classical approach is message-passing neural networks. In each round of message passing, nodes receive messages from their neighbors along incoming edges and update their own hidden state from them. After n rounds, the hidden state of the root node reflects the aggregate information from all nodes within n edges (pictured below for n = 2). The messages and the new hidden states are computed by hidden layers of the neural network. In a heterogeneous graph, it often makes sense to use separately trained hidden layers for the different types of nodes and edges.

Pictured, a simple message-passing neural network where, at each step, the node state is propagated from outer to inner nodes where it is pooled to compute new node states. Once the root node is reached, a final prediction can be made.

The training setup is completed by placing an output layer on top of the GNN’s hidden state for the labeled nodes, computing the loss (to measure the prediction error), and updating model weights by backpropagation, as usual in any neural network training.

Beyond supervised training (i.e., minimizing a loss defined by labels), GNNs can also be trained in an unsupervised way (i.e., without labels). This lets us compute a continuous representation (or embedding) of the discrete graph structure of nodes and their features. These representations are then typically utilized in other ML systems. In this way, the discrete, relational information encoded by a graph can be included in more typical neural network use cases. TF-GNN supports a fine-grained specification of unsupervised objectives for heterogeneous graphs.

Building GNN architectures

The TF-GNN library supports building and training GNNs at various levels of abstraction.

At the highest level, users can take any of the predefined models bundled with the library that are expressed in Keras layers. Besides a small collection of models from the research literature, TF-GNN comes with a highly configurable model template that provides a curated selection of modeling choices that we have found to provide strong baselines on many of our in-house problems. The templates implement GNN layers; users need only initialize the Keras layers.

import tensorflow as tf
import tensorflow_gnn as tfgnn
from tensorflow_gnn.models import mt_albis

def model_fn(graph_tensor_spec: tfgnn.GraphTensorSpec):
  """Builds a GNN as a Keras model."""
  graph = inputs = tf.keras.Input(type_spec=graph_tensor_spec)

  # Encode input features (callback omitted for brevity).
  graph = tfgnn.keras.layers.MapFeatures(
      node_sets_fn=set_initial_node_states)(graph)

  # For each round of message passing...
  for _ in range(2):
    # ... create and apply a Keras layer.
    graph = mt_albis.MtAlbisGraphUpdate(
        units=128, message_dim=64,
        attention_type="none", simple_conv_reduce_type="mean",
        normalization_type="layer", next_state_type="residual",
        state_dropout_rate=0.2, l2_regularization=1e-5,
    )(graph)

  return tf.keras.Model(inputs, graph)

At the lowest level, users can write a GNN model from scratch in terms of primitives for passing data around the graph, such as broadcasting data from a node to all its outgoing edges or pooling data into a node from all its incoming edges (e.g., computing the sum of incoming messages). TF-GNN’s graph data model treats nodes, edges and whole input graphs equally when it comes to features or hidden states, making it straightforward to express not only node-centric models like the MPNN discussed above but also more general forms of GraphNets. This can, but need not, be done with Keras as a modeling framework on the top of core TensorFlow. For more details, and intermediate levels of modeling, see the TF-GNN user guide and model collection.
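
As a rough sketch of these primitives, assuming the small “paper”/“cites” GraphTensor from the earlier example and TF-GNN’s documented broadcast/pool operations, one message-passing step could be expressed as:

import tensorflow_gnn as tfgnn

# Broadcast each source node's hidden state onto its outgoing "cites" edges...
messages = tfgnn.broadcast_node_to_edges(
    graph, "cites", tfgnn.SOURCE, feature_name="hidden_state")

# ...then pool (sum) the incoming messages at each target node.
pooled = tfgnn.pool_edges_to_node(
    graph, "cites", tfgnn.TARGET, reduce_type="sum", feature_value=messages)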

Training orchestration

While advanced users are free to do custom model training, the TF-GNN Runner also provides a succinct way to orchestrate the training of Keras models in the common cases. A simple invocation may look like this:

import tensorflow as tf
from tensorflow_gnn import runner

runner.run(
    task=runner.RootNodeBinaryClassification("papers", ...),
    model_fn=model_fn,
    trainer=runner.KerasTrainer(tf.distribute.MirroredStrategy(), model_dir="/tmp/model"),
    optimizer_fn=tf.keras.optimizers.Adam,
    epochs=10,
    global_batch_size=128,
    train_ds_provider=runner.TFRecordDatasetProvider("/tmp/train*"),
    valid_ds_provider=runner.TFRecordDatasetProvider("/tmp/validation*"),
    gtspec=...,
)

The Runner provides ready-to-use solutions for ML pains like distributed training and tfgnn.GraphTensor padding for fixed shapes on Cloud TPUs. Beyond training on a single task (as shown above), it supports joint training on multiple (two or more) tasks in concert. For example, unsupervised tasks can be mixed with supervised ones to inform a final continuous representation (or embedding) with application-specific inductive biases. Callers need only substitute the task argument with a mapping of tasks:

from tensorflow_gnn import runner
from tensorflow_gnn.models import contrastive_losses

runner.run(
    task={
        "classification": runner.RootNodeBinaryClassification("papers", ...),
        "dgi": contrastive_losses.DeepGraphInfomaxTask("papers"),
    },
    ...
)

Additionally, the TF-GNN Runner includes an implementation of integrated gradients for use in model attribution. The integrated gradients output is a GraphTensor with the same connectivity as the observed GraphTensor, but with its features replaced by gradient values, where larger values contribute more than smaller values to the GNN prediction. Users can inspect the gradient values to see which features their GNN uses most.

Conclusion

In short, we hope TF-GNN will be useful to advance the application of GNNs in TensorFlow at scale and fuel further innovation in the field. If you’re curious to find out more, please try our Colab demo with the popular OGBN-MAG benchmark (in your browser, no installation required), browse the rest of our user guides and Colabs, or take a look at our paper.

Acknowledgements

The TF-GNN release 1.0 was developed by a collaboration between Google Research (Sami Abu-El-Haija, Neslihan Bulut, Bahar Fatemi, Johannes Gasteiger, Pedro Gonnet, Jonathan Halcrow, Liangze Jiang, Silvio Lattanzi, Brandon Mayer, Vahab Mirrokni, Bryan Perozzi, Anton Tsitsulin, Dustin Zelle), Google Core ML (Arno Eigenwillig, Oleksandr Ferludin, Parth Kothari, Mihir Paradkar, Jan Pfeifer, Rachael Tamakloe), and Google DeepMind (Alvaro Sanchez-Gonzalez and Lisa Wang).

TensorFlow 2.15 update: hot-fix for Linux installation issue

Posted by the TensorFlow team

We are releasing a hot-fix for an issue affecting the TensorFlow installation process. The TensorFlow 2.15.0 Python package was released such that it requested tensorrt-related packages that cannot be found unless the user installs them beforehand or provides additional installation flags. This dependency affected anyone installing TensorFlow 2.15 alongside NVIDIA CUDA dependencies via pip install tensorflow[and-cuda]. Depending on the installation method, TensorFlow 2.14 would be installed instead of 2.15, or users could receive an installation error due to those missing dependencies.

To solve this issue as quickly as possible, we have released TensorFlow 2.15.0.post1 for the Linux x86_64 platform. This version removes the tensorrt Python package dependencies from the tensorflow[and-cuda] installation method. Support for TensorRT is otherwise unaffected as long as TensorRT is already installed on the system. Now, pip install tensorflow[and-cuda] works as originally intended for TensorFlow 2.15.

Using .post1 instead of a full minor release allowed us to push this release out quickly. However, please be aware of the following caveat: for users wishing to pin their Python dependency in a requirements file or other situation, under Python’s version specification rules, tensorflow[and-cuda]==2.15.0 will not install this fixed version. Please use ==2.15.0.post1 to specify this exact version on Linux platforms, or a fuzzy version specification, such as ==2.15.*, to specify the most recent compatible version of TensorFlow 2.15 on all platforms.

Half-precision Inference Doubles On-Device Inference Performance

Posted by Marat Dukhan and Frank Barchard, Software Engineers

CPUs deliver the widest reach for ML inference and remain the default target for TensorFlow Lite. Consequently, improving CPU inference performance is a top priority, and we are excited to announce that we doubled floating-point inference performance in TensorFlow Lite’s XNNPack backend by enabling half-precision inference on ARM CPUs. This means that more AI powered features may be deployed to older and lower tier devices.

Traditionally, TensorFlow Lite supported two kinds of numerical computation in machine learning models: a) floating-point using the IEEE 754 single-precision (32-bit) format and b) quantized using low-precision integers. While single-precision floating-point numbers provide maximum flexibility and ease of use, they come at the cost of 4X overhead in storage and memory and exhibit a performance overhead compared to 8-bit integer computations. In contrast, half-precision (FP16) floating-point numbers pose an interesting alternative balancing ease of use and performance: the processor needs to transfer only half as many bytes, and each vector operation produces twice as many elements. By virtue of this property, FP16 inference paves the way for a 2X speedup for floating-point models compared to the traditional FP32 way.

For a long time, FP16 inference on CPUs remained primarily a research topic, as the lack of hardware support for FP16 computations limited production use cases. However, around 2017 new mobile chipsets started to include support for native FP16 computations, and by now most mobile phones, both high-end and low-end, support it. Building upon this broad availability, we are pleased to announce the general availability of half-precision inference in TensorFlow Lite and XNNPack.

Performance Improvements

Half-precision inference has already been battle-tested in production across Google Assistant, Google Meet, YouTube, and ML Kit, and demonstrated close to 2X speedups across a wide range of neural network architectures and mobile devices. Below, we present benchmarks on nine public models covering common computer vision tasks:

  1. MobileNet v2 image classification [download]
  2. MobileNet v3-Small image classification [download]
  3. DeepLab v3 segmentation [download]
  4. BlazeFace face detection [download]
  5. SSDLite 2D object detection [download]
  6. Objectron 3D object detection [download]
  7. Face Mesh landmarks [download]
  8. MediaPipe Hands landmarks [download]
  9. KNIFT local feature descriptor [download]

These models were benchmarked on 5 popular mobile devices, including recent and older devices (Pixel 3a, Pixel 5a, Pixel 7, Galaxy M12 and Galaxy S22). The average speedup is shown below.

Single-threaded inference speedup with half-precision (FP16) inference compared to single-precision (FP32) across 5 mobile devices. Higher numbers are better.

The same models were also benchmarked on three laptop computers (MacBook Air M1, Surface Pro X, and Surface Pro 9).

Single-threaded inference speedup with half-precision (FP16) inference compared to single-precision (FP32) across 3 laptop computers. Higher numbers are better.

Currently, the FP16-capable hardware supported in XNNPack is limited to ARM and ARM64 devices with the ARMv8.2 FP16 arithmetic extension, which includes Android phones starting with the Pixel 3, the Galaxy S9 (Snapdragon SoC) and Galaxy S10 (Exynos SoC), iOS devices with the A11 or newer SoCs, all Apple Silicon Macs, and Windows ARM64 laptops based on the Snapdragon 850 SoC or newer.

How Can I Use It?

To benefit from the half-precision inference in XNNPack, the user must provide a floating-point (FP32) model with FP16 weights and special “reduced_precision_support” metadata to indicate model compatibility with FP16 inference. The metadata can be added during model conversion using the _experimental_supported_accumulation_type attribute of the tf.lite.TargetSpec object:

...  # converter created earlier, e.g. with tf.lite.TFLiteConverter.from_saved_model (elided in the original post)
converter.target_spec.supported_types = [tf.float16]
converter.target_spec._experimental_supported_accumulation_type = tf.dtypes.float16

When the compatible model is delegated to XNNPack on a hardware with native support for FP16 computations, XNNPack will transparently replace FP32 operators with their FP16 equivalents, and insert additional operators to convert model inputs from FP32 to FP16 and convert model outputs back from FP16 to FP32. If the hardware is not capable of FP16 arithmetics, XNNPack will perform model inference with FP32 calculations. Therefore, a single model can be transparently deployed on both recent and legacy devices.

Additionally, the XNNPack delegate provides an option to force FP16 inference regardless of the model metadata. This option is intended for development workflows, and in particular for testing end-to-end accuracy of the model when FP16 inference is used. In addition to devices with native FP16 arithmetics support, forced FP16 inference is supported on x86/x86-64 devices with AVX2 extension in emulation mode: all elementary floating-point operations are computed in FP32, then converted to FP16 and back to FP32. Note that such simulation is slow and not a bit-exact equivalent to native FP16 inference, but simulates the effects of restricted mantissa precision and exponent range in the native FP16 arithmetics. To force FP16 inference, either build TensorFlow Lite with --define xnnpack_force_float_precision=fp16 Bazel option, or apply XNNPack delegate explicitly and add TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16 flag to the TfLiteXNNPackDelegateOptions.flags bitmask passed into the TfLiteXNNPackDelegateCreate call:

TfLiteXNNPackDelegateOptions xnnpack_options =
    TfLiteXNNPackDelegateOptionsDefault();
...
xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;
TfLiteDelegate* xnnpack_delegate =
    TfLiteXNNPackDelegateCreate(&xnnpack_options);

XNNPack provides full feature parity between FP32 and FP16 operators: all operators that are supported for FP32 inference are also supported for FP16 inference, and vice versa. In particular, sparse inference operators are supported for FP16 inference on ARM processors. Therefore, users can combine the performance benefits of sparse and FP16 inference in the same model.

Future Work

In addition to most ARM and ARM64 processors, the most recent Intel processors, code-named Sapphire Rapids, support native FP16 arithmetic via the AVX512-FP16 instruction set, and the recently announced AVX10 instruction set promises to make this capability widely available on the x86 platform. We plan to optimize XNNPack for these instruction sets in a future release.

Acknowledgements

We would like to thank Alan Kelly, Zhi An Ng, Artsiom Ablavatski, Sachin Joglekar, T.J. Alumbaugh, Andrei Kulik, Jared Duke, Matthias Grundmann for contributions towards half-precision inference in TensorFlow Lite and XNNPack.

What’s new in TensorFlow 2.15

Posted by the TensorFlow team

TensorFlow 2.15 has been released! Highlights of this release (and 2.14) include a much simpler installation method for NVIDIA CUDA libraries for Linux, oneDNN CPU performance optimizations for Windows x64 and x86, full availability of tf.function types, an upgrade to Clang 17.0.1, and much more! For the full list of changes, please see the release notes.

Note: Release updates on the new multi-backend Keras will be published on keras.io starting with Keras 3.0. For more information, please see https://keras.io/keras_3/.

TensorFlow Core

NVIDIA CUDA libraries for Linux

The tensorflow pip package has a new, optional installation method for Linux that installs necessary NVIDIA CUDA libraries through pip. As long as the NVIDIA driver is already installed on the system, you may now run pip install tensorflow[and-cuda] to install TensorFlow’s NVIDIA CUDA library dependencies in the Python environment. Aside from the NVIDIA driver, no other pre-existing NVIDIA CUDA packages are necessary. In TensorFlow 2.15, CUDA has been upgraded to version 12.2.

oneDNN CPU performance optimizations

For Windows x64 & x86 packages, oneDNN optimizations are now enabled by default on X86 CPUs. These optimizations can be enabled or disabled by setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1 (enable) or 0 (disable) before running TensorFlow. To fall back to default settings, simply unset the environment variable.
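
For example, one way to set the flag from a Python program (the environment variable must be set before TensorFlow is imported) is sketched below:

import os

# Set to "0" to disable oneDNN optimizations, or "1" to enable them explicitly.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

import tensorflow as tf  # the setting is picked up when TensorFlow loads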

tf.function

tf.function types are now fully available.

  • tf.types.experimental.TraceType now allows custom tf.function inputs to declare Tensor decomposition and type casting support. 
  • Introducing tf.types.experimental.FunctionType as the comprehensive representation of the signature of tf.function callables. It can be accessed through the function_type property of tf.functions and ConcreteFunctions (see the short sketch after this list). See the tf.types.experimental.FunctionType documentation for more details.
  • Introducing tf.types.experimental.AtomicFunction as the fastest way to perform TF computations in Python. This capability can be accessed through the inference_fn property of ConcreteFunctions. (Does not support gradients.) See the tf.types.experimental.AtomicFunction documentation for how to call and use it.
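
As a small sketch of accessing function_type (the exact printed representation may vary between versions):

import tensorflow as tf

@tf.function
def add(a, b):
    return a + b

concrete = add.get_concrete_function(
    tf.TensorSpec([], tf.int32), tf.TensorSpec([], tf.int32))

print(add.function_type)       # tf.types.experimental.FunctionType of the callable
print(concrete.function_type)  # FunctionType specialized to the int32 scalar inputs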

Upgrade to Clang 17.0.1 and CUDA 12.2

TensorFlow PIP packages are now being built with Clang 17 and CUDA 12.2 to improve performance for NVIDIA Hopper-based GPUs. Moving forward, Clang 17 will be the default C++ compiler for TensorFlow. We recommend upgrading your compiler to Clang 17 when building TensorFlow from source.

Join us at the third Women in ML Symposium!

Posted by Sharbani Roy – Senior Director, Product Management, Google

We’re back with the third annual Women in Machine Learning Symposium on December 7, 2023! Join us virtually from 9:30 am to 1:00 pm PT for an immersive and insightful set of deep dives for every level of Machine Learning experience.

The Women in ML Symposium is an inclusive event for anyone passionate about the transformative fields of Machine Learning (ML) and Artificial Intelligence (AI). Dive into the latest advancements in generative AI, explore the intricacies of privacy-preserving AI, dig into the underlying accelerators and ML frameworks that power models, and uncover practical applications of ML across multiple industries.

Our event offers sessions for all expertise levels, from beginners to advanced practitioners. Hear about what’s new in ML and building with Google AI from our keynote speakers, gain insights from seasoned industry leaders across Google Health, Nvidia, Adobe, and more – and discover a wealth of knowledge on topics ranging from foundational AI concepts to open source tools, techniques, and beyond.

RSVP today to secure your spot and explore our exciting agenda. We can’t wait to see you there!

Simulated Spotify Listening Experiences for Reinforcement Learning with TensorFlow and TF-Agents

Posted by Surya Kanoria, Joseph Cauteruccio, Federico Tomasi, Kamil Ciosek, Matteo Rinaldi, and Zhenwen Dai – Spotify

Introduction

Many of our music recommendation problems involve providing users with ordered sets of items that satisfy users’ listening preferences and intent at that point in time. We base current recommendations on previous interactions with our application and, in the abstract, are faced with a sequential decision making process as we continually recommend content to users.

Reinforcement Learning (RL) is an established tool for sequential decision making that can be leveraged to solve sequential recommendation problems. We decided to explore how RL could be used to craft listening experiences for users. Before we could start training Agents, we needed to pick an RL library that allowed us to easily prototype, test, and potentially deploy our solutions.

At Spotify we leverage TensorFlow and the extended TensorFlow Ecosystem (TFX, TensorFlow Serving, and so on) as part of our production Machine Learning Stack. We made the decision early on to leverage TensorFlow Agents as our RL Library of choice, knowing that integrating our experiments with our production systems would be vastly more efficient down the line.

One missing bit of technology we required was an offline Spotify environment we could use to prototype, analyze, explore, and train Agents offline prior to online testing. The flexibility of the TF-Agents library, coupled with the broader advantages of TensorFlow and its ecosystem, allowed us to cleanly design a robust and extendable offline Spotify simulator.

We based our simulator design on TF-Agents Environment primitives and using this simulator we developed, trained and evaluated sequential models for item recommendations, vanilla RL Agents (PPG, DQN) and a modified deep Q-Network, which we call the Action-Head DQN (AH-DQN), that addressed the specific challenges imposed by the large state and action space of our RL formulation.

Through live experiments we were able to show that our offline performance estimates were strongly correlated with online results. This then opened the door for large scale experimentation and application of Reinforcement Learning across Spotify, enabled by the technological foundations unlocked by TensorFlow and TF-Agents.

In this post we’ll provide more details about our RL problem and how we used TF-Agents to enable this work end to end.

The RL Loop and Simulated Users

Reinforcement Learning loop
In RL, Agents interact with the environment continuously. At a given time step the Agent consumes an observation from the environment and, using this observation, produces an action according to its policy at time t. The environment then processes the action and emits both a reward and the next observation. (Note that although the terms are often used interchangeably, State is the complete information required to summarize the environment after the action, while Observation is the portion of this information actually exposed to the Agent.)
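
In pseudocode, the loop described above looks roughly like this; env and agent are hypothetical stand-ins, not TF-Agents classes:

# Generic sketch of the Agent/environment interaction loop (not Spotify-specific).
observation = env.reset()
done = False
while not done:
    action = agent.policy(observation)            # action from the policy at time t
    observation, reward, done = env.step(action)  # environment emits reward + next observation
    agent.observe(action, reward, observation)    # stored for later training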

In our case the reward emitted from the environment is the response of a user to music recommendations driven by the Agent’s action. In the absence of a simulator we would need to expose real users to Agents to observe rewards. We utilize a model-based RL approach to avoid letting an untrained Agent interact with real users (with the potential of hurting user satisfaction in the training process).

In this model-based RL formulation the Agent is not trained online against real users. Instead, it makes use of a user model that predicts responses to a list of tracks derived via the Agent’s action. Using this model we optimize actions in such a way as to maximize a (simulated) user satisfaction metric. During the training phase the environment makes use of this user model to return a predicted user response to the action recommended by the Agent.

We use Keras to design and train our user model. The serialized user model is then unpacked by the simulator and used to calculate rewards during Agent training and evaluation.

Simulator Design

In the abstract, what we needed to build was clear. We needed a way to simulate user listening sessions for the Agent. Given a simulated user and some content, instantiate a listening session and let the Agent drive recommendations in that session. Allow the simulated user to “react” to these recommendations and let the Agent adjust its strategy based on this result to drive some expected cumulative reward.

The TensorFlow Agents environment design guided us in developing the modular components of our system, each of which was responsible for different parts of the overall simulation.

In our codebase we define an environment abstraction that requires the following be defined for every concrete instantiation:

from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class AbstractEnvironment(ABC):
    _user_model: AbstractUserModel = None
    _track_sampler: AbstractTrackSampler = None
    _episode_tracker: EpisodeTracker = None
    _episode_sampler: AbstractEpisodeSampler = None

    @abstractmethod
    def reset(self) -> List[float]:
        pass

    @abstractmethod
    def step(self, action: float) -> Tuple[List[float], float, bool]:
        pass

    def observation_space(self) -> Dict:
        pass

    @abstractmethod
    def action_space(self) -> Dict:
        pass

Set-Up

At the start of Agent training we need to instantiate a simulation environment that has representations of hypothetical users and the content we’re looking to recommend to them. We base these instantiations on both real and hypothetical Spotify listening experiences. The critical information that defines these instantiations is passed to the environment via _episode_sampler. As mentioned, we also need to provide the simulator with a trained user model, in this case via _user_model.
Flow chart of agent training set up

Actions and Observations

Just like any Agent environment, our simulator requires that we specify the action_spec and observation_spec. Actions in our case may be continuous or discrete depending both on our Agent selection and how we propose to translate an Agent’s action into actual recommendations. We typically recommend ordered lists of items drawn from a pool of potential items. Formulating this action space directly would lead to it being combinatorially complex. We also assume the user will interact with multiple items, and as such previous work in this area that relies on single choice assumptions doesn’t apply.

In the absence of a discrete action space consisting of item collections, we need to provide the simulator with a method for turning the Agent’s action into actual recommendations. This logic is contained in the _track_sampler. The “example play modes” proposed by the episode sampler contain information on items that can be presented to the simulated user. The track sampler consumes these and the Agent’s action and returns actual item recommendations.
Flow chart of Agent actions_spec and observation_spec combining to create a recommendation

Termination and Reset

We also need to handle the episode termination dynamics. In our simulator, the reset rules are set by the model builder and are based on empirical investigations of interaction data relevant to a specific music listening experience. As a hypothetical, we may determine that 92% of listening sessions terminate after 6 sequential track skips, and we would construct our simulation termination logic to match. This also requires that we design abstractions in our simulator that allow us to check whether the episode should be terminated after each step.

When the episode is reset the simulator will sample a new hypothetical user listening session pair and begin the next episode.

Episode Steps

As with standard TF Agents Environments, we need to define the step dynamics for our simulation. There are optional simulation dynamics that we need to make sure are enforced at each step. For example, we may require that the same item cannot be recommended more than once. If the Agent’s action indicates a recommendation of an item that was previously recommended, we need to build in the functionality to pick the next best item based on this action.
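
A hypothetical helper for the “no repeated recommendations” rule described above might look like the following sketch; the names are made up rather than taken from our codebase.

def pick_next_best(ranked_items, already_recommended):
    """Return the highest-ranked item that has not been recommended yet."""
    for item in ranked_items:          # items ordered best-first by the Agent's action
        if item not in already_recommended:
            return item
    return None                        # the candidate pool is exhausted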

We also need to call the termination (and other supporting functions) mentioned above as needed at each step.

Episode Storage and Replay

The functionality mentioned up until this point collectively created a very complex simulation setup. While the TF Agents replay buffer provided us with the functionality required to store episodes for Agent training and evaluation, we quickly realized the need to be able to store more episode data for debugging purposes, and more detailed evaluations specific to our simulation distinct from standard Agent performance measures.

We thus allowed for the inclusion of an expanded _episode_tracker that would store additional information about the user model predictions, information noting the sampled users/content pairs, and more.

Creating TF-Agent Environments

Our environment abstraction gives us a template that matches that of a standard TF-Agents Environment class. Some inputs to our environment need to be resolved before we can actually create the concrete TF-Agents environment instance. This happens in three steps.

First we define a specific simulation environment that conforms to our abstraction. For example:

class PlaylistEnvironment(AbstractEnvironment):
    def __init__(
        self,
        user_model: AbstractUserModel,
        track_sampler: AbstractTrackSampler,
        episode_tracker: EpisodeTracker,
        episode_sampler: AbstractEpisodeSampler,
        ...
    ):
        ...

Next we use an Environment Builder Class that takes as input a user model, track sampler, etc. and an environment class like PlaylistEnvironment. The builder creates a concrete instance of this environment:

self.playlist_env: PlaylistEnvironment = environment_ctor(
    user_model=user_model,
    track_sampler=track_sampler,
    episode_tracker=episode_tracker,
    episode_sampler=self._eps_sampler,
)

Lastly, we utilize a conversion class that constructs a TF-Agents Environment from a concrete instance of ours:

class TFAgtPyEnvironment(py_environment.PyEnvironment):
    def __init__(self, environment: AbstractEnvironment):
        super().__init__()
        self.env = environment

This is then executed internally to our Environment Builder:

class EnvironmentBuilder(AbstractEnvironmentBuilder):

    def __init__(self, ...):
        ...

    def get_tf_env(self):
        ...
        tf_env: TFAgtPyEnvironment = TFAgtPyEnvironment(
            self.playlist_env
        )
        return tf_env

The resulting TensorFlow Agents environment can then be used for Agent training.
Flow chart showing simulator design
This simulator design allows us to easily create and manage multiple environments with a variety of different configurations as needed.

We next discuss how we used our simulator to train RL Agents to generate Playlists.

A Customized Agent for Playlist Generation

As mentioned, Reinforcement Learning provides us with a method set that naturally accommodates the sequential nature of music listening, allowing us to adapt to users’ ever-evolving preferences as sessions progress.

One specific problem we can attempt to use RL to solve is that of automatic music playlist generation. Given a (large) set of tracks, we want to learn how to create one optimal playlist to recommend to the user in order to maximize satisfaction metrics. Our use case is different from standard slate recommendation tasks, where usually the target is to select at most one item in the sequence. In our case, we assume we have a user-generated response for multiple items in the slate, making slate recommendation systems not directly applicable. Another complication is that the set of tracks from which recommendations are drawn is ever changing.

We designed a DQN variant capable of handling these constraints that we call the Action Head DQN (AH-DQN).
Moving image of AH-DQN network creating recommendations based on changing variables
The AH-DQN network takes as input the current state and an available action to produce a single Q value for the input action. This process is repeated for every possible item in the input. Finally, the item with the highest Q value is selected and added to the slate, and the process continues until the slate is full.
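
Conceptually, the selection loop can be sketched as follows; q_network, state and candidate_items are hypothetical stand-ins for the trained Action-Head network, the current observation and the available item pool.

import numpy as np

def build_slate(state, candidate_items, q_network, slate_size):
    """Greedy slate construction: repeatedly pick the item with the highest Q value."""
    slate, available = [], list(candidate_items)
    while len(slate) < slate_size and available:
        q_values = [q_network(state, item) for item in available]  # one Q value per (state, item)
        best = available[int(np.argmax(q_values))]
        slate.append(best)
        available.remove(best)
    return slate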

Experiments In Brief

We tested our approach both offline and online at scale to assess the ability of the Agent to power our real-world recommender systems. In addition to testing the Agent itself we were also keen to assess the extent to which our offline performance estimates for various policies returned by our simulator matched (or at least directionally aligned) with our online results.
Graph measuring simulated performance assessment by scaled online reward for different policies

We observed this directional alignment for numerous naive, heuristic, model driven, and RL policies.

Please refer to our KDD paper for more information on the specifics of our model-based RL approach and Agent design.

Federico Tomasi, Joseph Cauteruccio, Surya Kanoria, Kamil Ciosek, Matteo Rinaldi, and Zhenwen Dai
KDD 2023

Acknowledgements

We’d like to thank all our Spotify teammates past and present who contributed to this work. Particularly, we’d like to thank Mehdi Ben Ayed for his early work in helping to develop our RL codebase. We’d also like to thank the TensorFlow Agents team for their support and encouragement throughout this project (and for the library that made it possible).

Building a board game with the TFLite plugin for Flutter

Posted by Wei Wei, Developer Advocate

In our previous blog posts Building a board game app with TensorFlow: a new TensorFlow Lite reference app and Building a reinforcement learning agent with JAX, and deploying it on Android with TensorFlow Lite, we demonstrated how to train a reinforcement learning (RL) agent with TensorFlow, TensorFlow Agents and JAX respectively, and then deploy the converted TFLite model in an Android app using TensorFlow Lite, to play a simple board game ‘Plane Strike’.

While these end-to-end tutorials are helpful for Android developers, we have heard from the Flutter developer community that it would be interesting to make the app cross-platform. Inspired by the officially released TensorFlow Lite Plugin for Flutter recently, we are going to write one last tutorial and port the app to Flutter.
Flow chart illustrating training a Reinforcement Learning (RL) agent with TensorFlow, TensorFlow Agents and JAX, and deploying the converted model in an Android app and in Flutter using the TensorFlow Lite plugin

Since we already have the model trained with TensorFlow and converted to TFLite, we can just load the model with TFLite interpreter:

void _loadModel() async {
  // Create the interpreter
  _interpreter = await Interpreter.fromAsset(_modelFile);
}

Then we pass in the user board state and help the game agent identify the most promising position to strike next (please refer to our previous blog posts if you need a refresher on the game rules) by running TFLite inference:

int predict(List<List<double>> boardState) {
  var input = [boardState];
  var output = List.filled(_boardSize * _boardSize, 0)
      .reshape([1, _boardSize * _boardSize]);

  // Run inference
  _interpreter.run(input, output);

  // Argmax
  double max = output[0][0];
  int maxIdx = 0;
  for (int i = 1; i < _boardSize * _boardSize; i++) {
    if (max < output[0][i]) {
      maxIdx = i;
      max = output[0][i];
    }
  }

  return maxIdx;
}

That’s it! With some additional Flutter frontend code to render the game boards and track game progress, we can immediately run the game on both Android and iOS (currently the plugin only supports these two mobile platforms). You can find the complete code on GitHub.

If you want to dig deeper, there are a couple of things you can try:
  1. Convert the TFAgents-trained model to TFLite and run it with the plugin
  2. Leverage the RL technique we have used and build a new agent for the tic tac toe game in the Flutter Casual Games Toolkit. You will need to create a new RL environment and train the model from scratch before deployment, but the core concept and technique are pretty much the same.

This concludes this mini-series of blogs on leveraging TensorFlow/JAX to build games for Android and Flutter. And we very much look forward to all the exciting things you build with our tooling, so be sure to share them with @googledevs, @TensorFlow, and your developer communities!

People of AI: Season 2

Posted by Ashley Oldacre

If you are joining us for the first time, you can binge listen to our amazing 8 episodes from Season 1 wherever you get your podcasts.

We are back for another season of People of AI with a new lineup of incredible guests! I am so excited to introduce my new co-host Luiz Gustavo Martins as we meet inspiring people with interesting stories in the field of Artificial Intelligence.

Last season we focused on the incredible journeys that our guests took to get into the field of AI. Through our stories, we highlighted that no matter who you are, what your interests are, or what you work on, there is a place for anyone to get into this field. We also explored how much more accessible the technology has become over the years, as well as the importance of building AI-related products responsibly and ethically. It is easier than ever to use tools, platforms and services powered by machine learning to leverage the benefits of AI, and break down the barrier of entry.

For season 2, we will feature amazing conversations, focusing on Generative AI! Specifically, we will be discussing the explosive growth of Generative AI tools and the major technology shift that has happened in recent months. We will dive into various topics to explore areas where Generative AI can contribute tremendous value, as well as boost both productivity and economic growth. We will also continue to explore the personal paths and career development of this season’s guests as they share how their interest in technology was sparked, how they worked hard to get to where they are today, and explore what it is that they are currently working on.

Starting today, we will release one new episode of season 2 per week. Listen to the first episode on the People of AI site or wherever you get your podcasts. And stay tuned for later in the season when we premiere our first video podcasts as well!

  • Episode 1: meet your hosts, Ashley and Gus and learn about Generative AI, Bard and the big shift that has dramatically changed the industry. 
  • Episode 2: meet Sunita Verma, a long-time Googler, as she shares her personal journey from Engineering to CS, and into Google. As an early pioneer of AI and Google Ads, we will talk about the evolution of AI and how Generative AI will transform the way we work. 
  • Episode 3: meet Sayak Paul, a Google Developer Expert (GDE) as we explore what it means to be a GDE and how to leverage the power of your community through community contributions. 
  • Episode 4: meet Crispin Velez, the lead for Cloud’s Vertex AI as we dig into his experience in Cloud working with customers and partners on how to integrate and deploy AI. We also learn how he grew his AI developer community in LATAM from scratch. 
  • Episode 5: meet Joyce Shen, venture capital/private equity investor. She shares her fascinating career in AI and how she has worked with businesses to spot AI talent, incorporate AI technology into workflows and implement responsible AI into their products. 
  • Episode 6: meet Anne Simonds and Brian Gary, founders of Muse https://www.museml.com. Join us as we talk about their recent journeys into AI and their new company which uses the power of Generative AI to spark creativity. 
  • Episode 7: meet Tulsee Doshi, product lead for Google’s Responsible AI efforts as we discuss the development of Google-wide resources and best practices for developing more inclusive, diverse, and ethical algorithm driven products. 
  • Episode 8: meet Jeanine Banks, Vice President and General Manager of Google Developer X and Head of Developer Relations. Join us as we debunk AI and get down to what Generative AI really is, how it has changed over the past few months and will continue to change the developer landscape. 
  • Episode 9: meet Simon Tokumine, Director of Product Management at Google. We will talk about how AI has brought us into the era of task-orientated products and is fueling a new community of makers.

Listen now to the first episode of Season 2. We can’t wait to share the stories of these exceptional People of AI with you!

This podcast is sponsored by Google. Any remarks made by the speakers are their own and are not endorsed by Google.
