Posted by Eric Johnson, TFRT Product Manager and Mingsheng Hong, TFRT Tech Lead/Manager
TensorFlow aims to make it easy for you to build and deploy ML models across many different devices. Yet, what it means to “build and deploy ML models” is not static and continues to change with increased investment in the ML ecosystem.
At the top-half of the TensorFlow stack, innovation is leading to more complex models and deployment scenarios. Researchers are inventing new algorithms that require more compute, while application developers are enhancing their products with these new techniques across edge and server.
At the bottom-half of the stack, the tension from increasing compute needs and rising compute costs due to the ending of Moore’s law has sparked a proliferation of new hardware aimed at specific ML use cases. Traditional chip makers, startups, and software companies alike (including Google) have invested in specialized silicon.
The result is that the needs of the ML ecosystem are vastly different than they were 4 or 5 years ago when TensorFlow was first created. Of course, we’ve continued to iterate with the release of 2.x, but the current TensorFlow stack is optimized for graph execution, and incurs non-trivial overhead when dispatching a single op. A high-performance low-level runtime is a key to enable the trends of today and empower the innovations of tomorrow.
Enter TFRT, a new TensorFlow RunTime. It aims to provide a unified, extensible infrastructure layer with best-in-class performance across a wide variety of domain specific hardware. It provides efficient use of multithreaded host CPUs, supports fully asynchronous programming models, and focuses on low-level efficiency.
TFRT will benefit a broad range of users, including:
- Researchers looking for faster iteration time and better error reporting when developing complex new models in eager mode.
- Application developers looking for improved performance when training and serving models in production.
- Hardware makers looking to integrate edge and datacenter devices into TensorFlow in a modular way.
What is TFRT?
TFRT is a new runtime that will replace the existing TensorFlow runtime. It is responsible for efficient execution of kernels – low-level device-specific primitives – on targeted hardware. It plays a critical part in both eager and graph execution, which is illustrated by this simplified diagram of the TensorFlow training stack:
|TFRT’s role in graph and eager execution within the TensorFlow training stack|
Note that everything in grey is part of TFRT. In eager execution, TensorFlow APIs call directly into the new runtime. In graph execution, your program’s computational graph is lowered to an optimized target-specific program and dispatched to TFRT. In both execution paths, the new runtime invokes a set of kernels that call into the underlying hardware devices to complete the model execution, as shown by the black arrows.
Key design points
Whereas the existing TensorFlow runtime was initially built for graph execution and training workloads, the new runtime will make eager execution and inference first-class citizens, while putting special emphasis on architecture extensibility and modularity. More specifically, TFRT has the following selected design highlights:
- To achieve higher performance, TFRT has a lock-free graph executor that supports concurrent op execution with low synchronization overhead, and a thin eager op dispatch stack so that eager API calls will be asynchronous and more efficient.
- To make extending the TF stack easier, we decoupled device runtimes from the host runtime, the core TFRT component that drives host CPU and I/O work.
- To get consistent behavior, TFRT leverages common abstractions, such as shape functions and kernels, across both eager and graph.
The power of MLIR
TFRT is also tightly-integrated with MLIR. For example:
- TFRT utilizes MLIR’s compiler infrastructure to generate an optimized, target-specific representation of your computational graph that the runtime executes.
- TFRT uses MLIR’s extensible type system to support arbitrary C++ types in the runtime, which removes tensor-specific limitations.
Together, TFRT and MLIR will improve TensorFlow’s unification, flexibility, and extensibility.
Early performance results from the inference and serving use case are encouraging. As part of a benchmarking study for TensorFlow Dev Summit 2020, we integrated TFRT with TensorFlow Serving and measured the latency of sending requests to the model and getting prediction results back. We picked a common MLPerf model, ResNet-50, and chose a batch size of 1 and a data precision of FP16 to focus our study on runtime related op dispatch overhead. In comparing performance of GPU inference over TFRT to the current runtime, we saw an improvement of 28% in average inference time. These early results are strong validation for TFRT, and we expect it to provide a big boost to performance. We hope you are as excited as we are!
TFRT is being integrated with TensorFlow, and will be enabled initially through an opt-in flag, giving the team time to fix any bugs and fine-tune performance. Eventually, it will become TensorFlow’s default runtime. Although it is still an early stage project, we have made the GitHub repository available to the community. We are limiting contributions to begin with, but encourage participation in the form of requirements and design discussions.
To learn more, please check out our Dev Summit 2020 presentation, where we first introduced TFRT to the world, and our MLIR Open Design Deep Dive presentation, where we provided a detailed overview of TFRT’s core components, low-level abstractions, and general design principles. And finally, if you want to keep up with all things TFRT, please join our new mailing list. Thanks!