AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS has expanded the availability of Amazon EC2 Inf1 instances to four new AWS Regions, bringing the total number of supported Regions to 11: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Paris), and South America (São Paulo).

Amazon EC2 Inf1 instances are powered by AWS Inferentia chips, which are custom-designed to provide you with the lowest cost per inference in the cloud and lower the barriers for everyday developers to use machine learning (ML) at scale. Customers running models such as YOLO v3 and YOLO v4 can get up to 1.85 times higher throughput and up to 40% lower cost per inference compared to EC2 G4 GPU-based instances.

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure cost spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to deliver faster performance and more cost-effective ML inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances natively with popular ML frameworks such as TensorFlow, PyTorch, and MXNet. The AWS Neuron SDK is integrated with these frameworks, so you can move your existing models to Amazon EC2 Inf1 instances with minimal code changes. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.
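As an illustration, the following sketch shows what compiling a model for Inferentia might look like with the PyTorch integration (torch-neuron); the ResNet-50 model, input shape, and file name are placeholders chosen for this example, and the TensorFlow and MXNet integrations follow a similar compile-then-deploy pattern.

```python
import torch
import torch_neuron  # AWS Neuron SDK integration; registers the torch.neuron namespace
import torchvision.models as models

# Load a trained model as usual -- ResNet-50 is only a placeholder here.
model = models.resnet50(pretrained=True)
model.eval()

# Compile (trace) the model for Inferentia using a representative input.
example_input = torch.zeros([1, 3, 224, 224], dtype=torch.float32)
model_neuron = torch.neuron.trace(model, example_inputs=[example_input])

# Save the compiled artifact. On an Inf1 instance it can be reloaded with
# torch.jit.load() and invoked like any TorchScript module.
model_neuron.save("resnet50_neuron.pt")
```

The rest of the application code stays the same; only the compilation step and the saved artifact change, which is why existing models typically need minimal code changes.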

Since its launch, the Neuron SDK has dramatically expanded the range of models that deliver best-in-class performance at a fraction of the cost. This includes natural language processing models like the popular BERT, image classification models (ResNet and VGG), pose estimation models (OpenPose), and object detection models (SSD). The latest Neuron release (1.8.0) provides optimizations that improve the performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new Neuron tool that lets you easily scale monitoring of large fleets of inference applications.

Customer success stories

Since the launch of Inf1 instances, a broad spectrum of customers, from large enterprises to startups, as well as Amazon services, have begun using them to run production workloads.

Anthem is one of the nation’s leading health benefits companies, serving the healthcare needs of over 40 million members across dozens of states. They use deep learning to automate the generation of actionable insights from customer opinions via natural language models.

“Our application is computationally intensive and needs to be deployed in a highly performant manner,” says Numan Laanait, PhD, Principal AI/Data Scientist at Anthem. “We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide two times higher throughput than GPU-based instances and allowed us to streamline our inference workloads.”

Condé Nast, another AWS customer, has a global portfolio that encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair.

“Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips,” says Paul Fryzel, Principal Engineer in AI Infrastructure at Condé Nast. “This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a performance improvement and a 72% reduction in cost compared to the previously deployed GPU instances.”

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or using Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.
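As a rough sketch, deploying a model to an Inf1-backed SageMaker endpoint can look like the following; the S3 model artifact, entry point script, IAM role, and framework version are placeholders, so check the SageMaker and Neuron documentation for the versions that support Inf1 in your Region.

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel

# Assumes this runs where a SageMaker execution role is available
# (for example, a SageMaker notebook instance).
role = sagemaker.get_execution_role()

pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/models/resnet50_neuron.tar.gz",  # placeholder artifact
    role=role,
    entry_point="inference.py",   # your model loading / prediction handlers
    framework_version="1.7.1",    # example version; pick one with Inf1 support
    py_version="py3",
)

# Choosing an ml.inf1.* instance type places the endpoint on AWS Inferentia hardware.
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)
```

The same deploy() call works with other instance families, so moving between GPU-based and Inferentia-based endpoints is largely a matter of changing the instance type and supplying a Neuron-compiled model artifact.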

For more information, see Amazon EC2 Inf1 Instances.


About the Author

Gadi Hutt is a Sr. Director, Business Development at AWS. Gadi has over 20 years of experience in engineering and business disciplines. He started his career as an embedded software engineer and later moved into product lead positions. Since 2013, Gadi has led Annapurna Labs technical business development and product management, focused on hardware acceleration software and hardware products such as the EC2 FPGA F1 instances and AWS Inferentia alongside its Neuron SDK, accelerating machine learning in the cloud.