Machine learning inference at scale using AWS serverless

With the growing adoption of machine learning (ML) across industries, there is an increasing demand for faster and easier ways to run ML inference at scale. ML use cases, such as manufacturing defect detection, demand forecasting, and fraud surveillance, involve tens or thousands of datasets, including images, videos, files, documents, and other artifacts. These inference use cases typically require the workloads to scale to tens of thousands of parallel processing units. The simplicity and automated scaling offered by AWS serverless solutions make them a great choice for running ML inference at scale. With serverless, you can run inference without provisioning or managing servers, and you pay only for the time it takes to run. ML practitioners can bring their own ML models and inference code to AWS by using containers.

This post shows you how to run and scale ML inference using AWS serverless solutions: AWS Lambda and AWS Fargate.

Solution overview

The following diagram illustrates the solution architecture for both batch and real-time inference options. The solution is demonstrated using a sample image classification use case. Source code for this sample is available on GitHub.

The diagram illustrates the solution architecture for batch and real-time inference. Batch inference uses AWS Fargate and AWS Batch, along with Amazon S3 and Amazon ECR. Real-time inference uses AWS Lambda and Amazon API Gateway.

AWS Fargate: Lets you run batch inference at scale using serverless containers. A Fargate task loads the container image with the inference code for image classification.

AWS Batch: Provides job orchestration for batch inference by dynamically provisioning Fargate containers as per job requirements.

AWS Lambda: Lets you run real-time ML inference at scale. The Lambda function loads the inference code for image classification. A second Lambda function is used to submit batch inference jobs.

Amazon API Gateway: Provides a REST API endpoint for the inference Lambda function.

Amazon Simple Storage Service (S3): Stores input images and inference results for batch inference.

Amazon Elastic Container Registry (ECR): Stores the container image with inference code for Fargate containers.
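To make the real-time path (API Gateway in front of the inference Lambda function) more concrete, here is a minimal sketch of what such a handler could look like. It is not the sample's actual code: `classify_stub` and the JSON response shape are hypothetical placeholders, and the only behavior relied on is that API Gateway proxy events carry the payload in `body` and flag binary payloads with `isBase64Encoded`.

```python
import base64
import json
from typing import Any, Callable, Dict, List, Tuple

Prediction = Tuple[str, float]


def classify_stub(image_bytes: bytes) -> List[Prediction]:
    """Hypothetical stand-in for the model call; the real sample runs ResNet-50 here."""
    return [("placeholder-class", 1.0)]


def handler(event: Dict[str, Any], context: Any,
            classify: Callable[[bytes], List[Prediction]] = classify_stub) -> Dict[str, Any]:
    """Decode the request body from an API Gateway proxy event and return predictions."""
    body = event.get("body")
    if not body:
        return {"statusCode": 400, "body": json.dumps({"error": "empty request body"})}
    # API Gateway base64-encodes binary payloads and sets isBase64Encoded.
    if event.get("isBase64Encoded"):
        image_bytes = base64.b64decode(body)
    else:
        image_bytes = body.encode("utf-8")
    predictions = classify(image_bytes)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # Response shape is an assumption made for this sketch.
        "body": json.dumps([{"class": c, "probability": p} for c, p in predictions]),
    }
```

Passing the classifier in as a parameter keeps the request handling testable without loading a model; Lambda itself only supplies `event` and `context`, so the default is used at run time.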

Deploying the solution

We have created an AWS Cloud Development Kit (CDK) template to define and configure the resources for the sample solution. CDK lets you provision the infrastructure and build deployment packages for both the Lambda function and the Fargate container. The packages include Python and commonly used ML libraries, such as Apache MXNet, along with their dependencies. The solution runs the inference code using a ResNet-50 model trained on the ImageNet dataset to recognize objects in an image. The model can classify images into 1000 object categories, such as keyboard, pointer, pencil, and many animals. The inference code downloads the input image and returns the five classes that best match the image, each with its probability.
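The "top five classes with probabilities" step can be sketched independently of any ML framework. The following is an illustrative example only, not the sample's inference code: it assumes the model emits one raw score per class, and the labels and scores are made up.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores: Sequence[float], labels: Sequence[str], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most probable labels with their probabilities, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy scores for six made-up classes; a real ResNet-50 emits 1000 scores.
    labels = ["keyboard", "pencil", "tabby cat", "dog", "teapot", "pointer"]
    scores = [2.0, 0.5, 4.0, 1.0, -1.0, 3.0]
    for label, prob in top_k(scores, labels):
        print(f"{label}: {prob:.3f}")
```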

To follow along and run the solution, you need access to:

To deploy the solution, open your terminal window and complete the following steps.

  1. Clone the GitHub repo
    $ git clone

  2. Navigate to the project directory and deploy the CDK application.
    $ ./
    $ ./ #If you are using AWS Cloud9

Enter Y to proceed with the deployment.

This performs the following steps to deploy and configure the required resources in your AWS account. It may take around 30 minutes for the initial deployment, as it builds the Docker image and other artifacts. Subsequent deployments typically complete within a few minutes.

  • Creates a CloudFormation stack (“MLServerlessStack”).
  • Creates a container image from the Dockerfile and the inference code for batch inference.
  • Creates an ECR repository and publishes the container image to this repo.
  • Creates a Lambda function with the inference code for real-time inference.
  • Creates a batch job configuration with a Fargate compute environment in AWS Batch.
  • Creates an S3 bucket to store inference images and results.
  • Creates a Lambda function to submit batch jobs in response to image uploads to the S3 bucket.

Running inference

The sample solution lets you get predictions either for a set of images using batch inference or for a single image at a time using the real-time API endpoint. Complete the following steps to run inference for each scenario.

Batch inference

Get batch predictions by uploading image files to Amazon S3.

  1. Using the Amazon S3 console or the AWS CLI, upload one or more image files to the S3 bucket path ml-serverless-bucket-<acct-id>-<aws-region>/input.
    $ aws s3 cp <path to jpeg files> s3://ml-serverless-bucket-<acct-id>-<aws-region>/input/ --recursive

  2. This triggers the batch job, which spins up Fargate tasks to run the inference. You can monitor the job status in the AWS Batch console.
  3. Once the job is complete (this may take a few minutes), inference results can be accessed from the ml-serverless-bucket-<acct-id>-<aws-region>/output path.
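The upload-triggers-job flow in the steps above can be sketched as a Lambda handler that turns S3 event records into AWS Batch `submit_job` calls. This is a hedged illustration rather than the sample's code: the queue and job definition names, the `INPUT_BUCKET` and `INPUT_KEY` environment variables, and the job naming scheme are all hypothetical. The Batch client is injectable so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def build_job_requests(event: Dict[str, Any], job_queue: str,
                       job_definition: str) -> List[Dict[str, Any]]:
    """Turn an S3 event notification into keyword arguments for Batch submit_job."""
    requests = []
    for i, record in enumerate(event.get("Records", [])):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "jobName": f"inference-{i}",  # hypothetical naming scheme
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            "containerOverrides": {
                "environment": [
                    # Hypothetical variable names the container would read.
                    {"name": "INPUT_BUCKET", "value": bucket},
                    {"name": "INPUT_KEY", "value": key},
                ]
            },
        })
    return requests


def handler(event, context, batch_client=None,
            job_queue="example-queue", job_definition="example-job-def"):
    """Submit one Batch job per uploaded object and return the job IDs."""
    if batch_client is None:
        import boto3  # available in the Lambda Python runtime
        batch_client = boto3.client("batch")
    job_ids = []
    for kwargs in build_job_requests(event, job_queue, job_definition):
        response = batch_client.submit_job(**kwargs)
        job_ids.append(response["jobId"])
    return job_ids
```

Submitting one job per object is the simplest mapping; for large uploads, grouping several objects into one job may be a better fit.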

Real-time inference

Get real-time predictions by invoking the REST API endpoint with an image payload.

  1. Navigate to the CloudFormation console and find the API endpoint URL (httpAPIUrl) from the stack output.
  2. Use an API client, such as Postman or curl, to send a POST request to the /predict API endpoint with the image file as the payload.
    $ curl --request POST -H "Content-Type: application/jpeg" --data-binary @<your jpg file name> <your-api-endpoint-url>/predict

  3. Inference results are returned in the API response.
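If you prefer to call the endpoint from code, the curl command above could be reproduced with Python's standard library along these lines. This is a sketch: the function names are illustrative, the endpoint URL is whatever your stack outputs, and the response is returned as raw text because its exact format is not specified here.

```python
import urllib.request


def build_predict_request(api_url: str, image_bytes: bytes) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/predict",
        data=image_bytes,
        headers={"Content-Type": "application/jpeg"},  # same header as the curl example
        method="POST",
    )


def predict(api_url: str, image_path: str, timeout: float = 30.0) -> str:
    """Send the image file and return the raw response body as text."""
    with open(image_path, "rb") as f:
        request = build_predict_request(api_url, f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `predict("<your-api-endpoint-url>", "cat.jpg")` would send a local file named `cat.jpg` (a placeholder name) and return the response body.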

Additional recommendations and tips

Here are some additional recommendations and options to consider for fine-tuning the sample to meet your specific requirements:

  • Scaling – Update AWS Service Quotas in your account and Region as per your scaling and concurrency needs to run the solution at scale. For example, if your use case requires scaling beyond the default Lambda concurrent executions limit, then you must increase this limit to reach the desired concurrency. You also need to size your VPC and subnets with a wide enough IP address range to allow the required concurrency for Fargate tasks.
  • Performance – Perform load tests and fine-tune performance across each layer to meet your needs.
  • Use container images with Lambda – This lets you use containers with both AWS Lambda and AWS Fargate, and you can simplify source code management and packaging.
  • Use AWS Lambda for batch inferences – You can use Lambda functions for batch inferences as well if the inference storage and processing times are within Lambda limits.
  • Use Fargate Spot – This lets you run interruption-tolerant tasks at a discounted rate compared to the Fargate price, reducing the cost of compute resources.
  • Use Amazon ECS container instances with Amazon EC2 – For use cases that need a specific type of compute, you can make use of EC2 instances instead of Fargate.
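For the subnet sizing point under Scaling, a quick back-of-the-envelope check can help. Each Fargate task gets its own network interface with a private IP address, and AWS reserves five IPv4 addresses in every subnet. The sketch below estimates an upper bound on concurrent tasks for a set of subnets; the CIDR ranges are examples, and other resources in the same subnets will reduce the real number.

```python
import ipaddress
from typing import Iterable

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of IPv4 addresses available to your resources in a subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def can_host(subnet_cidrs: Iterable[str], concurrent_tasks: int) -> bool:
    """True if the subnets have enough addresses for one IP per concurrent task."""
    return sum(usable_ips(cidr) for cidr in subnet_cidrs) >= concurrent_tasks


if __name__ == "__main__":
    subnets = ["10.0.0.0/24", "10.0.1.0/24"]  # example ranges
    print(usable_ips(subnets[0]))   # 251
    print(can_host(subnets, 500))   # True
    print(can_host(subnets, 1000))  # False
```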

Cleaning up

Navigate to the project directory from the terminal window and run the following command to destroy all resources and avoid incurring future charges.

$ cdk destroy


Conclusion

This post demonstrated how to bring your own ML models and inference code and run them at scale using serverless solutions in AWS. The solution deploys your inference code to AWS Fargate and AWS Lambda. It also deploys an API endpoint using Amazon API Gateway for real-time inference, and batch job orchestration using AWS Batch for batch inference. This lets you focus on building ML models while the solution provides an efficient and cost-effective way to serve predictions at scale.

Try it out today, and we look forward to seeing the exciting machine learning applications that you bring to AWS Serverless!

About the Authors

Poornima Chand is a Senior Solutions Architect in the Strategic Accounts Solutions Architecture team at AWS. She works with customers to help solve their unique challenges using AWS technology solutions. She focuses on Serverless technologies and enjoys architecting and building scalable solutions.

Greg Medard is a Solutions Architect with AWS Business Development and Strategic Industries. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. His passion is to influence cultural perceptions by adopting DevOps concepts that withstand organizational challenges along the way. Outside of work, you may find him spending time with his family, playing with a new gadget, or traveling to explore new places and flavors.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges on AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge, so she has created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.

Vasu Sankhavaram is a Senior Manager of Solutions Architecture in Amazon Web Services (AWS). He leads Solutions Architects dedicated to Hitech accounts. Vasu holds an MBA from U.C. Berkeley, and a Bachelor’s degree in Engineering from University of Mysore, India. Vasu and his wife have their hands full with a son who’s a sophomore at Purdue, twin daughters in third grade, and a golden doodle with boundless energy.

Chitresh Saxena is a Senior Technical Account Manager at Amazon Web Services. He has a strong background in ML, Data Analytics and Web technologies. His passion is solving customer problems, building efficient and effective solutions on the cloud with AI, Data Science and Machine Learning.
