Model dynamism support in Amazon SageMaker Neo

Amazon SageMaker Neo was launched at AWS re:Invent 2018. It delivered notable performance improvements for models with statically known input and output data shapes, typically image classification models. These models are usually composed of a stack of blocks containing compute-intensive operators, such as convolution and matrix multiplication. Neo applies a series of optimizations to boost the model’s performance and reduce memory usage. Static shapes significantly simplify compilation: runtime decisions such as memory sizes can be made ahead of time by a dedicated analysis pass, and the runtime acts as little more than a topological graph walker that invokes each operator sequentially.

However, we have seen a growing number of customers requiring more advanced models to fulfill tasks like object detection. These models contain dynamic features, such as control flow, dynamic operations, dynamic data structures, and dynamic input and output shapes. This poses significant challenges to existing deep learning compilers because they have been mainly confined to static models. To address this problem, existing solutions either use just-in-time compilation to compile and run the dynamic portion (XLA), which incurs extra compilation overhead, or first convert the dynamic model into a static representation (TFLite). To meet your requirements, we designed and implemented a suite of techniques ranging from the front-end parser to the backend runtime to handle object detection and segmentation models trained with TensorFlow, PyTorch, and MXNet. In this post, we walk you through how Neo supports object detection and semantic segmentation models. We also compare inference performance improvements for Neo-compiled object detection and segmentation models on both instance and edge devices.

Methodology

This section describes how object detection and semantic segmentation models are supported in Neo. We discuss the following:

  • How the front end handles popular frameworks differently
  • How the backend is designed to support dynamism
  • An example using the AWS Command Line Interface (AWS CLI) to demonstrate how easy it is to perform inference for an object detection model in Neo

Frontend

The approaches vary for each framework because each handles dynamism, particularly control flow, differently. For example, MXNet doesn’t use any control flow to implement its object detection and segmentation models, which allows a quick one-to-one mapping from MXNet operators to Relay operators. PyTorch has control flow primitives, such as If and Loop, which largely simplifies the conversion because we can create Relay If expressions and recursive functions correspondingly.
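To make the PyTorch path concrete, the following is a minimal sketch (not the exact Neo pipeline) that scripts a toy module containing a branch and imports it through TVM’s Relay PyTorch frontend; the module, input name, and shape are placeholders:

    import torch
    from tvm import relay

    class ToyBranch(torch.nn.Module):
        def forward(self, x):
            # torch.jit.script preserves this Python branch as a prim::If node
            if bool(x.sum() > 0):
                return x * 2.0
            return x - 1.0

    scripted = torch.jit.script(ToyBranch())
    # The Relay frontend maps prim::If to a Relay If expression and prim::Loop
    # to a recursive Relay function
    mod, params = relay.frontend.from_pytorch(scripted, [("x", (1, 3, 224, 224))])
    print(mod["main"])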

Among the most popular frameworks, TensorFlow is the most difficult to support because it doesn’t directly employ conditional and looping operators to implement control flow. Instead, low-level data flow primitives, such as Merge, Exit, Switch, NextIteration, and Enter, are used to express complex control flow logic to better support parallel and distributed execution. For more information, see Implementation of Control Flow in TensorFlow.

To decompile these primitives back into the original control flow operators, we proposed dedicated analysis and pattern matching techniques that have been contributed back to Apache TVM. For more information, see the RFC Decompile TensorFlow Control Flow Primitives to Relay and Enhance TensorFlow Frontend Control Flow Support.
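The decompilation happens inside the parser, so from a user’s perspective importing such a graph is a single call. The following is a minimal sketch, assuming a TF 1.x Object Detection API frozen graph; the file path, input shape, and output tensor names are placeholders:

    import tensorflow as tf
    from tvm import relay

    with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    # During import, the frontend pattern-matches Switch/Merge/Enter/Exit/NextIteration
    # and rebuilds them as structured Relay If expressions and recursive loop functions
    mod, params = relay.frontend.from_tensorflow(
        graph_def,
        shape={"image_tensor": (1, 512, 512, 3)},
        outputs=["detection_boxes", "detection_scores", "detection_classes", "num_detections"],
    )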

Backend

The backend compiler has worked well for static models, where the input data type and shape of each tensor are known at compile time. However, this assumption doesn’t hold for dynamic models, such as TensorFlow SSD, because the data shapes can only be determined at runtime.

To support dynamic data shapes, we introduced a special dimension called Any to represent statically unknown dimensions. For instance, a tensor type can be represented as Tensor[(5, Any), float32], where the second dimension is unknown. Accordingly, we defined type inference rules to infer the type of a tensor when an Any shape is involved.
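In Apache TVM terms, this is the relay.Any dimension. The following is a minimal sketch of declaring and type-inferring such a tensor; the operator choice is purely illustrative:

    import tvm
    from tvm import relay

    # A tensor whose second dimension is unknown until runtime: Tensor[(5, Any), float32]
    x = relay.var("x", shape=(5, relay.Any()), dtype="float32")
    y = relay.nn.relu(x)              # type inference propagates the unknown dimension
    mod = tvm.IRModule.from_expr(relay.Function([x], y))
    print(mod)                        # the unknown dimension prints as ? (or Any)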

To get the data shape of a tensor at runtime, we defined shape functions that compute the output shape of a tensor in order to determine the size of the required memory. Based on the categories of the operators, shape functions fall into three patterns:

  • Data-independent shapes – Used for operators whose output shape is determined only by the shapes of the inputs, such as 2D convolution.
  • Data-dependent shapes – Require the real input values instead of the shapes to compute the output shapes. For example, arange needs the values of start, stop, and step to compute the output shape (see the sketch after this list).
  • Upper bound shapes – Used to quickly estimate an upper bound on the output shape in order to avoid redundant computation. This is useful because operators such as Non Maximum Suppression (NMS) involve non-trivial computation to infer the output shape at runtime, and the amount of computation for the shape function could be on par with that of running the operator.
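As a concrete illustration of the data-dependent case, the following sketch computes the output shape of arange from the runtime values of start, stop, and step (pure NumPy, for illustration only):

    import numpy as np

    def arange_out_shape(start, stop, step):
        # The number of elements depends on the runtime *values*, not on any input shape
        return (max(int(np.ceil((stop - start) / step)), 0),)

    print(arange_out_shape(0.0, 10.0, 2.0))   # (5,)
    print(np.arange(0.0, 10.0, 2.0).shape)    # (5,) -- matches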

To efficiently run dynamic models, we designed a virtual machine as an execution engine to invoke runtime type inference, handle control flow, and dispatch operator kernels. The model is compiled into machine-dependent kernel code and machine-independent bytecode, which are then loaded and run by the virtual machine.

Because each instruction works on coarse-grained data, such as tensors, the instructions are compact, so dispatching overhead isn’t a concern. We designed the virtual machine in a register-based manner to simplify the design and allow users to read and modify the code easily. We designed a set of instructions to cover each type of runtime behavior, such as storage allocation, tensor memory allocation on that storage, control flow, and kernel invocation.

After the virtual machine loads the compiled bytecode and kernels, it interprets the bytecode in a dispatching loop, checking each instruction’s op-code and invoking the appropriate logic. For more information, see Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference.
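Outside of Neo, the same machinery is exposed through Apache TVM’s Relay VM APIs. The following is a minimal sketch, assuming a Relay module mod and params produced by one of the frontends above; the target, input name, shape, and dtype are placeholders:

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.runtime.vm import VirtualMachine

    # Compile to machine-independent bytecode plus machine-dependent kernels
    exe = relay.vm.compile(mod, target="llvm", params=params)
    vm = VirtualMachine(exe, tvm.cpu())

    # Run the "main" function of the module on a placeholder input
    image = np.random.randint(0, 255, size=(1, 512, 512, 3)).astype("uint8")
    outputs = vm.invoke("main", tvm.nd.array(image))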

Performing inference for object detection in Neo

This section provides an example to illustrate how you can compile a Faster R-CNN model from TensorFlow 1.15 and deploy it on an Amazon EC2 C5 instance using Neo.

  1. Prepare the pre-trained model by downloading it from the TensorFlow Detection Model Zoo and extracting it:
    $ wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet50_coco_2018_01_28.tar.gz
    $ tar -xzf faster_rcnn_resnet50_coco_2018_01_28.tar.gz

  2. Package the frozen protobuf file (frozen_inference_graph.pb) and upload it to Amazon Simple Storage Service (Amazon S3):
    $ tar -czf tf_frcnn.tar.gz faster_rcnn_resnet50_coco_2018_01_28/frozen_inference_graph.pb
    $ aws s3 cp tf_frcnn.tar.gz s3://<your-bucket>/<your-input-folder>

We can now compile the model using Neo. For this post, we use the AWS CLI. We first create a configuration JSON file that provides the required information, such as the input size, the framework, the location of the output artifacts, and the target platform we compile the model for:

  3. Create the configuration file with the following code:
    {
        "CompilationJobName": "compile-tf-ssd",
        "RoleArn": "arn:aws:iam::<your-account>:role/service-role/AmazonSageMaker-ExecutionRole-yyyymmddThhmmss",
        "InputConfig": {
            "S3Uri": "s3://<your-bucket>/<your-input-folder>/tf_frcnn.tar.gz",
            "DataInputConfig":  "{'image_tensor': [1,512,512,3]}",
            "Framework": "TENSORFLOW"
        },
        "OutputConfig": {
            "S3OutputLocation": "s3://<your-bucket>/<your-output-folder>",
            "TargetPlatform": {
                "Os": "LINUX",    
                "Arch": "X86_64"
            },
            "CompilerOptions": "{'mcpu': 'skylake-avx512'}"
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 1800}
    }

  4. Compile the model with the SageMaker CLI:
    $ aws sagemaker create-compilation-job --cli-input-json file://config.json --region us-west-2
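The compilation job runs asynchronously and typically completes within a few minutes. You can poll its status with describe-compilation-job (using the same job name and Region) and wait for it to reach COMPLETED:

    $ aws sagemaker describe-compilation-job \
        --compilation-job-name compile-tf-ssd \
        --region us-west-2 \
        --query CompilationJobStatus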

Finally, we’re ready to deploy the compiled model with DLR.

  5. Before deployment, download the compiled artifacts from the S3 bucket where they were saved:
    $ aws s3 cp s3://<your-bucket>/<output-folder>/output_artifacts.tar.gz tf_frcnn_compiled.tar.gz
    $ mkdir compiled_model
    $ tar -xzf tf_frcnn_compiled.tar.gz -C compiled_model

  6. Install DLR for inference:
    $ pip install dlr

  7. Perform inference as follows:
    import cv2
    import numpy as np
    import dlr

    if __name__ == "__main__":
        data = cv2.imread("input_image.jpg")
        data = cv2.resize(data, (512, 512), interpolation=cv2.INTER_AREA)
        data = np.expand_dims(data, 0)
        model = dlr.DLRModel('compiled_model', 'cpu', 0)
        result = model.run(data)
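model.run returns a list of NumPy arrays. The number, order, and meaning of the outputs depend on the model; for TF Object Detection API models they typically correspond to detection boxes, scores, classes, and the number of detections, so it’s worth printing the shapes first:

    # Inspect the raw outputs before writing model-specific post-processing
    for i, out in enumerate(result):
        print("output {}: shape={}, dtype={}".format(i, out.shape, out.dtype))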

Performance comparison

In this section, we compare the performance of the most widely used TensorFlow object detection and segmentation models on a variety of EC2 server instances and NVIDIA Jetson based edge devices. We use the models from the TensorFlow Detection Model Zoo. As discussed earlier, these models exhibit dynamism and are significantly more complex than static models like ResNet-50. We use Neo to compile these models and generate high-performance machine code for a variety of target platforms. Here, we show the performance comparison for these models across several hardware devices against the best baseline available for each hardware platform.

EC2 c5.9xlarge server instance

C5 instances are Intel Xeon based server instances suitable for compute-intensive deep learning applications. For this comparison, we report the average latency for the TensorFlow baseline and the Neo-compiled model. All reported latency numbers are in milliseconds. We observe that Neo outperforms TensorFlow for all three models, by up to 20% for the Mask R-CNN ResNet-50 model.

Model name                        TF 1.15.0 (ms)    Neo (ms)    Speedup
ssd_mobilenet_v1_coco             17.96             16.39       1.10x
faster_rcnn_resnet50_coco         152.62            142.30      1.07x
mask_rcnn_resnet50_atrous_coco    391.91            326.44      1.20x

EC2 m6g.8xlarge server instance

M6g instances are Arm-based AWS Graviton server instances suitable for compute-intensive deep learning applications. For the baseline, we use the TensorFlow packages provided by ARM Tool-Solutions. Our observations are similar to those on C5 instances: Neo outperforms TensorFlow, and we observe significant speedups for large models like Faster R-CNN and Mask R-CNN.

Model name                        TF 1.15.0 (ms)    Neo (ms)    Speedup
ssd_mobilenet_v1_coco             29.04             28.75       1.01x
faster_rcnn_resnet50_coco         290.64            202.71      1.43x
mask_rcnn_resnet50_atrous_coco    623.98            368.81      1.69x

NVIDIA Jetson edge devices

Finally, we compare the performance of the MobileNet SSD model on NVIDIA Jetson based edge devices: Jetson Xavier and Jetson Nano. MobileNet SSD is a popular object detection model for edge devices because its low compute and memory requirements suit these resource-constrained platforms. To establish a performance baseline, we use the TF-TRT package, in which TensorFlow is integrated with NVIDIA TensorRT as the backend. We present the comparison in the following table. We observe that Neo achieves significant speedups on both the Xavier and Nano edge devices.

Performance comparison for ssd_mobilenet_v1_coco
Hardware device         TF 1.15 (ms)    Neo (ms)    Speedup
NVIDIA Jetson Nano      163             140         1.16x
NVIDIA Jetson Xavier    109             56          1.95x

Summary

This post described how Neo supports model dynamism. Multiple techniques, from the front-end parser to the backend runtime, were introduced to enable this support. We compared the inference performance of Neo-compiled object detection and segmentation models against the TensorFlow framework and TensorFlow backed by TensorRT, and observed that Neo achieves speedups for these models on both instances and edge devices.

This solution doesn’t require any service API changes, so you can still use the original API to compile new models. All code has been contributed back to Apache TVM. For more information about compiling a model using Apache TVM, see Compile PyTorch Object Detection Models.

Acknowledgements: We sincerely thank the following engineers and applied scientists who contributed to the support of dynamic models: Haichen Shen, Wei Chen, Yong Wu, Yao Wang, Animesh Jain, Trevor Morris, Rohan Mukherjee, and Ricky Das.

 


About the Author

Zhi Chen is a Senior Software Engineer at AWS AI who leads the deep learning compiler development in Amazon SageMaker Neo. He helps customers deploy pre-trained deep learning models from different frameworks on various platforms. Zhi obtained his PhD in Computer Science from the University of California, Irvine, where he focused on compilers and performance optimization.
