Profiling XNNPACK with TFLite

Posted by Alan Kelly, Software Engineer

We are happy to share that detailed profiling information for XNNPACK is now available in TensorFlow 2.9.1 and later. XNNPACK is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms, and it is the default TensorFlow Lite CPU inference engine for floating-point models.

The most common and expensive neural network operators, such as fully connected layers and convolutions, are executed by XNNPACK so that you get the best performance possible from your model. Historically, the profiler measured the runtime of the entire delegated section of the graph as a single unit: the runtimes of all delegated operators were accumulated into one result, which made it difficult to identify the individual operators that were slow.

Previous TFLite profiling results when XNNPACK was used. The runtime of all delegated operators was accumulated in one row.

If you are using TensorFlow Lite 2.9.1 or later, the profiler reports a per-operator profile even for the section that is delegated to XNNPACK, so you no longer need to choose between fast inference and detailed performance information. The operator name, data layout (NHWC, for example), data type (FP32), and microkernel type (if applicable) are shown.

New detailed per-operator profiling information is now shown. The operator name, data layout, data type and microkernel type are visible.

Now you get lots of helpful information, such as the runtime per operator and the percentage of the total runtime that it accounts for. The runtime of each node is given in the order in which the nodes were executed. The most expensive operators are also listed.

The most expensive operators are listed. In this example, you can see that a deconvolution accounted for 33.91% of the total runtime.
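
To make the "percentage of total runtime" column concrete, here is a minimal sketch of how such a ranking could be computed from per-node timings. It is not the profiler's own code: the function name `rank_operators`, the input format (a list of name and microsecond pairs), and the sample numbers are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple


def rank_operators(
    timings_us: List[Tuple[str, float]], top_n: int = 3
) -> List[Tuple[str, float, float]]:
    """Return (name, total_us, percent_of_total) for the top_n slowest operators.

    timings_us holds one (operator name, runtime in microseconds) entry per
    executed node. Entries with the same name are summed before ranking.
    """
    totals: Dict[str, float] = {}
    for name, runtime_us in timings_us:
        totals[name] = totals.get(name, 0.0) + runtime_us

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    # Sort by accumulated runtime, slowest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, runtime_us, 100.0 * runtime_us / grand_total)
        for name, runtime_us in ranked[:top_n]
    ]


if __name__ == "__main__":
    # Made-up timings, not taken from a real profile.
    sample = [
        ("Convolution", 10.0),
        ("Deconvolution", 60.0),
        ("Convolution", 20.0),
        ("Add", 10.0),
    ]
    for name, runtime_us, percent in rank_operators(sample):
        print(f"{name}: {runtime_us:.1f} us ({percent:.2f}%)")
```

Summing by operator name before ranking is a design choice for this sketch; a per-node ranking would simply skip the accumulation step.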

XNNPACK can also perform inference in half-precision (16-bit) floating-point format when three conditions hold: the hardware supports these operations natively, IEEE FP16 inference is supported for every floating-point operator in the model, and the model’s `reduced_precision_support` metadata indicates that it is compatible with FP16 inference. FP16 inference can also be forced. More information is available here. If half precision has been used, then F16 will be present in the Name column:

FP16 inference has been used.

Here, unsigned quantized inference has been used (QU8).

QU8 indicates that unsigned quantized inference has been used.
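
If you post-process profiles in a script, the fields encoded in the Name column (operator, data layout, data type, microkernel) can be split apart. The sketch below assumes entries look like `Convolution (NHWC, F16) IGEMM`, that is, operator name, then layout and data type in parentheses, then an optional microkernel type. That exact format is an assumption based on the description above, so check it against your own profiler output; `parse_op_name` and `OpName` are hypothetical helpers, not part of TensorFlow Lite.

```python
import re
from typing import NamedTuple, Optional


class OpName(NamedTuple):
    operator: str
    layout: Optional[str]       # e.g. "NHWC" or "NCHW"
    datatype: Optional[str]     # e.g. "F32", "F16", "QU8"
    microkernel: Optional[str]  # e.g. "IGEMM", "SPMM"; None if absent


# Assumed shape: "<operator> (<layout>, <datatype>) <microkernel>"
_NAME_PATTERN = re.compile(
    r"^(?P<op>[^()]+?)\s*\((?P<fields>[^)]*)\)\s*(?P<kernel>.*)$"
)


def parse_op_name(name: str) -> OpName:
    """Split a profile Name entry into its parts; unknown parts become None."""
    text = name.strip()
    match = _NAME_PATTERN.match(text)
    if match is None:
        # No parenthesized fields: treat the whole string as the operator name.
        return OpName(text, None, None, None)

    fields = [f.strip() for f in match.group("fields").split(",") if f.strip()]
    layout = fields[0] if len(fields) > 0 else None
    datatype = fields[1] if len(fields) > 1 else None
    kernel = match.group("kernel").strip() or None
    return OpName(match.group("op").strip(), layout, datatype, kernel)


def uses_half_precision(name: str) -> bool:
    """True if the entry's data type field is F16."""
    return parse_op_name(name).datatype == "F16"
```

A check such as `uses_half_precision(row_name)` over every row is one way to confirm that FP16 inference was actually used for the whole delegated section.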

Finally, in this example sparse inference has been used. Sparse operators require that the data layout change from NHWC to NCHW, as this is more efficient. This can be seen in the operator name.

SPMM microkernel indicates that the operator is evaluated via SParse matrix-dense Matrix Multiplication. Note that sparse inference uses NCHW layout (vs the typical NHWC) for the operators.
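
For readers unfamiliar with the two layouts, the following sketch shows what moving a tensor from NHWC to NCHW means for a flat buffer: in NHWC the channel values of one pixel are adjacent in memory, while in NCHW all pixels of one channel are adjacent. This is plain Python for illustration only, with hypothetical function names; it says nothing about how XNNPACK performs the conversion internally.

```python
from typing import List, Sequence


def nhwc_offset(n: int, h: int, w: int, c: int, H: int, W: int, C: int) -> int:
    """Flat index of element (n, h, w, c) in an NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n: int, c: int, h: int, w: int, C: int, H: int, W: int) -> int:
    """Flat index of element (n, c, h, w) in an NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nhwc_to_nchw(data: Sequence[float], N: int, H: int, W: int, C: int) -> List[float]:
    """Copy a flat NHWC buffer into a new flat NCHW buffer."""
    if len(data) != N * H * W * C:
        raise ValueError("buffer size does not match the given shape")
    out: List[float] = [0.0] * len(data)
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    src = nhwc_offset(n, h, w, c, H, W, C)
                    dst = nchw_offset(n, c, h, w, C, H, W)
                    out[dst] = data[src]
    return out


if __name__ == "__main__":
    # One image, 1x2 pixels, 3 channels: pixel 0 = (0, 1, 2), pixel 1 = (3, 4, 5).
    print(nhwc_to_nchw([0, 1, 2, 3, 4, 5], N=1, H=1, W=2, C=3))
```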

Note that when some operators are delegated to XNNPACK, and others aren’t, two sets of profile information are shown. This happens when not all operators in the model are supported by XNNPACK. The next step in this project is to merge profile information from XNNPACK operators and TensorFlow Lite into one profile.

Next Steps

You can learn more about performance measurement and profiling in TensorFlow Lite by visiting this guide. Thanks for reading!
