New Amazon SageMaker Neo features to run more models faster and more efficiently on more hardware platforms

Amazon SageMaker Neo enables developers to train machine learning (ML) models once and optimize them to run on Amazon SageMaker endpoints in the cloud and on supported devices at the edge. Since Neo was first announced at re:Invent 2018, we have been continuously working with the Neo-AI open-source communities and several hardware partners to expand the types of ML models Neo can compile, broaden the range of target hardware Neo can compile for, and add new inference performance optimizations.

As of this writing, Neo optimizes models trained in DarkNet, Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, and XGBoost for inference on Android, iOS, Linux, and Windows machines based on processors from Ambarella, Apple, ARM, Intel, NVIDIA, NXP, Qualcomm, Texas Instruments, and Xilinx. Models optimized by Neo can perform up to 25 times faster with no loss in accuracy.

Over the past few months, Neo has added a number of key new features:

  • Expanded support for PC and mobile devices
  • Heterogeneous execution with NVIDIA TensorRT
  • Bring Your Own Codegen (BYOC) framework
  • Inference optimized containers
  • Compilation for dynamic models

In this post, we summarize how these new features allow you to run more models on more hardware platforms both faster and more efficiently.

Expanded support for PC and mobile devices

Earlier in 2020, Neo launched support for Windows on x86 processor-based devices, allowing you to run your models faster and more efficiently on personal computers and other Windows devices. In addition, Neo launched support for Android on ARM-based processors and Qualcomm processors with Hexagon DSP.
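
As a quick illustration, the following is a minimal sketch of compiling a trained model for an Android ARM64 device with the Neo compilation API through boto3. The bucket paths, role ARN, framework, and input shape are placeholders to replace with your own.

```python
# Minimal sketch (placeholder values): compile a model for Android/ARM64 with Neo.
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="mobilenet-android-arm64",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerNeoRole",  # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/models/mobilenet.tar.gz",       # placeholder model artifact
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',         # input name and shape
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetPlatform": {"Os": "ANDROID", "Arch": "ARM64"},
        "CompilerOptions": '{"ANDROID_PLATFORM": 28}',           # Android API level
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```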

Most recently, Apple and AWS partnered to automate model conversion to Core ML format using Neo. As a result, ML app developers can now train models in SageMaker, convert them to Core ML format with the click of a button, and deploy them on iOS and macOS devices.
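
Targeting Core ML uses the same create_compilation_job call sketched above; only the OutputConfig changes. A minimal sketch, again with placeholder values (the class_labels option is shown as an example and depends on your model):

```python
# Minimal sketch: OutputConfig for a Core ML target, used with the same
# create_compilation_job call shown earlier. Values are placeholders.
output_config = {
    "S3OutputLocation": "s3://my-bucket/compiled-coreml/",
    "TargetDevice": "coreml",
    "CompilerOptions": '{"class_labels": "imagenet_labels_1000.txt"}',  # optional metadata
}
```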

Heterogeneous execution with NVIDIA TensorRT

Neo uses the NVIDIA TensorRT acceleration library to speed up ML models on NVIDIA Jetson devices at the edge and on G4dn and P3 instances in the AWS Cloud. The TensorRT library supports a subset of operators commonly used in deep learning models.

Previously, Neo used TensorRT only when the entire computational graph of the model and all its operators could be accelerated by the library. As a result, few models could take advantage of TensorRT acceleration.

Recently, Neo added the capability to partition a model into sub-graphs, so that one part of the model is handled by TensorRT while the rest is compiled by Apache TVM. To execute the compiled model, the Neo runtime uses a heterogeneous execution mechanism to run both parts on the hardware. With this approach, Neo can provide the best available performance for a broader range of frameworks and models.
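
From the user's perspective, this is still an ordinary compilation job for an NVIDIA target; the TensorRT/TVM partitioning happens inside Neo. The following minimal sketch targets a Jetson Xavier device, with placeholder paths, role, and input shape:

```python
# Minimal sketch (placeholder values): compile a model for an NVIDIA Jetson Xavier device.
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="detector-jetson-xavier",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerNeoRole",  # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/models/detector.tar.gz",        # placeholder model artifact
        "DataInputConfig": '{"data": [1, 3, 416, 416]}',
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled-jetson/",
        "TargetDevice": "jetson_xavier",
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```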

Bring your own codegen

We also expanded the heterogeneous execution approach to other hardware targets. Neo partnered with chip vendors to use the Bring Your Own Codegen (BYOC) mechanism in TVM to plug partners’ proprietary toolchains for their ML accelerators, such as Ambarella’s CV Tools and Texas Instruments’ TIDL, into the Neo compilation API.

When you compile, Neo partitions a model so that the supported portion runs on the ML accelerator and the rest runs on the host CPU. With this approach, Neo maximizes the utilization of the ML accelerator on the chip, increases the types of models you can compile for the chip, and makes it easier for you to take advantage of new ML accelerators from chip vendors.
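
To give a feel for the underlying mechanism, the following is a minimal sketch of the open-source TVM BYOC partitioning flow that Neo builds on. The codegen name example_accel and the list of supported operators are made up for illustration; a real integration registers the vendor's own codegen and operator patterns.

```python
# Minimal sketch of TVM's BYOC partitioning; "example_accel" is a hypothetical codegen.
import tvm
from tvm import relay
from tvm.relay import transform

# Declare which operators the hypothetical accelerator claims to support.
@tvm.ir.register_op_attr("nn.conv2d", "target.example_accel")
def _conv2d_supported(expr):
    return True

@tvm.ir.register_op_attr("nn.relu", "target.example_accel")
def _relu_supported(expr):
    return True

# A toy Relay model: conv2d -> relu -> softmax (softmax stays on the host CPU).
data = relay.var("data", shape=(1, 3, 224, 224))
weight = relay.var("weight", shape=(16, 3, 3, 3))
out = relay.nn.softmax(relay.nn.relu(relay.nn.conv2d(data, weight, padding=(1, 1))))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Partition: supported operators are grouped into sub-graphs handed to the
# external codegen; everything else is compiled by TVM for the host.
mod = transform.InferType()(mod)
mod = transform.AnnotateTarget("example_accel")(mod)
mod = transform.MergeCompilerRegions()(mod)
mod = transform.PartitionGraph()(mod)
print(mod)  # conv2d and relu now live in a function marked Compiler="example_accel"
```

In Neo, this partitioning and the hand-off to the vendor toolchain happen automatically when you compile for a supported accelerator target.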

Inference optimized containers

Like all deep learning compilers, Neo supports a subset of operators and models in a given framework. Before this feature, Neo could only compile a model if all of its operators were supported by Neo. Now, when you use Neo to compile an MXNet, PyTorch, or TensorFlow model for CPU or GPU inference on SageMaker hosted endpoints, Neo partitions the model, compiles a portion of it to accelerate performance, and leaves the uncompiled part to run natively in the framework. You can use Neo’s inference optimized containers to deploy on SageMaker hosted endpoints. As a result, you can optimize any MXNet, PyTorch, or TensorFlow model with Neo for any SageMaker hosted endpoint.
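
For example, with the SageMaker Python SDK you can compile a trained model and deploy it behind a hosted endpoint in a few lines. The following is a minimal sketch; the model artifact, role, framework version, and input shape are placeholders:

```python
# Minimal sketch (placeholder values): compile a TensorFlow model with Neo and
# deploy it to a SageMaker hosted endpoint, which uses an inference optimized container.
from sagemaker.tensorflow import TensorFlowModel

role = "arn:aws:iam::111122223333:role/SageMakerRole"   # placeholder role

model = TensorFlowModel(
    model_data="s3://my-bucket/models/model.tar.gz",     # placeholder model artifact
    role=role,
    framework_version="2.3",                             # placeholder version
)

# Neo compiles the supported portion of the graph and leaves the rest to run
# natively in the framework, as described above.
compiled_model = model.compile(
    target_instance_family="ml_c5",
    input_shape={"input_1": [1, 224, 224, 3]},           # placeholder input name/shape
    output_path="s3://my-bucket/compiled/",
    role=role,
    framework="tensorflow",
    framework_version="2.3",
)

predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)
```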

Compilation for dynamic models

Some deep learning models contain dynamic features, such as control flow, dynamic operations, dynamic data structures, and dynamic input and output shapes, that pose significant challenges to existing deep learning compilers. These models, including some object detection and semantic segmentation models, are becoming increasingly popular. Recently, we added the ability in Neo to compile these dynamic models. You can now use Neo to optimize models with dynamic features and get up to a 2x speedup.

Summary

We continually make improvements and add supported hardware endpoints, models, and frameworks to Neo based on your feedback. We encourage you to sign in to the SageMaker console or use the Neo compilation API to compile your trained models for the target hardware of interest. For more information, see the Amazon SageMaker Neo webpage and documentation.

About the Authors

Tingwei Huang is a product management leader at AWS AI Services.

Vin Sharma is an Engineering Leader for AWS Deep Learning. He leads the team building Neo, which helps customers train ML models once and run them anywhere in the cloud and at the edge.
