Reduce the time taken to deploy your models to Amazon SageMaker for testing

Data scientists often train their models locally and look for a proper hosting service to deploy their models. Unfortunately, there’s no one set mechanism or guide to deploying pre-trained models to the cloud. In this post, we look at deploying trained models to Amazon SageMaker hosting to reduce your deployment time.

SageMaker is a fully managed machine learning (ML) service. With SageMaker, you can quickly build and train ML models and directly deploy them into a production-ready hosted environment. Additionally, you don’t need to manage servers. You get an integrated Jupyter notebook environment with easy access to your data sources. You can perform data analysis, train your models, and test them using your own algorithms or use SageMaker-provided ML algorithms that are optimized to run efficiently against large datasets spread across multiple machines. Training and hosting are billed by minutes of usage, with no minimum fees and no upfront commitments.

Solution overview

Data scientists sometimes train models locally using their IDE and either ship those models to the ML engineering team for deployment or just run predictions locally on powerful machines. In this post, we introduce a Python library that simplifies the process of deploying models to SageMaker for hosting on real-time or serverless endpoints.

This Python library gives data scientists a simple interface to quickly get started on SageMaker without needing to know any of the low-level SageMaker functionality.

If you have models trained locally using your preferred IDE and want to benefit from the scale of the cloud, you can use this library to deploy your model to SageMaker. With SageMaker, in addition to all the scaling benefits of a cloud-based ML platform, you have access to purpose-built training tools (distributed training, hyperparameter tuning), experiment management, model management, bias detection, model explainability, and many other capabilities that can help you in any aspect of the ML lifecycle. You can choose from the three most popular frameworks for ML: Scikit-learn, PyTorch, and TensorFlow, and can pick the type of compute you want. Defaults are provided along the way so users of this library can deploy their models without needing to make complex decisions or learn new concepts. In this post, we show you how to get started with this library and optimize deploying your ML models on SageMaker hosting.

The library can be found in the GitHub repository.

The SageMaker Migration Toolkit

The SageMakerMigration class is available through a Python library published to GitHub. Instructions to install this library are provided in the repository; make sure that you follow the README to properly set up your environment. After you install this library, the rest of this post talks about how you can use it.

The SageMakerMigration class consists of high-level abstractions over SageMaker APIs that significantly reduce the steps needed to deploy your model to SageMaker, as illustrated in the following figure. This is intended for experimentation so developers can quickly get started and test SageMaker. It is not intended for production migrations.

For Scikit-learn, PyTorch, and TensorFlow models, this library supports deploying trained models to a SageMaker real-time endpoint or serverless endpoint. To learn more about the inference options in SageMaker, refer to Deploy Models for Inference.

Real-time vs. serverless endpoints

Real-time inference is ideal for workloads with interactive, low-latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling.

SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers.

Depending on your use case, you may want to quickly host your model on SageMaker without actually having an instance always on and incurring costs, in which case a serverless endpoint is a great solution.

Prepare your trained model and inference script

After you’ve identified the model you want to deploy on SageMaker, you must ensure the model is presented to SageMaker in the right format. SageMaker endpoints generally consist of two components: the trained model artifact (.pth, .pkl, and so on) and an inference script. The inference script is not always mandatory, but if not provided, the default handlers for the serving container that you’re using are applied. It’s essential to provide this script if you need to customize your input/output functionality for inference.

The trained model artifact is simply a saved Scikit-learn, PyTorch, or TensorFlow model. For Scikit-learn, this is typically a pickle file; for PyTorch, a .pt or .pth file; and for TensorFlow, a SavedModel folder containing assets, variables, and .pb files.
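For illustration, the following sketch shows how these artifacts are typically produced; the variable names (sklearn_model, pytorch_model, keras_model) stand in for models you have already trained, and the paths are placeholders.

import joblib
import torch
import tensorflow as tf

# Scikit-learn: serialize the fitted estimator to a pickle/joblib file
joblib.dump(sklearn_model, "model_folder/model.pkl")

# PyTorch: save the trained weights (state_dict) to a .pth file
torch.save(pytorch_model.state_dict(), "model_folder/model.pth")

# TensorFlow: export a SavedModel folder (assets/, variables/, saved_model.pb)
tf.keras.models.save_model(keras_model, "model_folder/0000001")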

Generally, you need to be able to control how your model processes input, performs inference, and formats the output it returns. With SageMaker, you can provide an inference script to add this customization. Any inference script used by SageMaker must have one or more of the following four handler functions: model_fn, input_fn, predict_fn, and output_fn.

Note that these four functions apply specifically to the PyTorch and Scikit-learn containers. TensorFlow has slightly different handlers because it's integrated with TensorFlow Serving. For a TensorFlow inference script, you have two model handlers: input_handler and output_handler. These serve the same preprocessing and postprocessing purposes, but they're configured slightly differently to integrate with TensorFlow Serving. For PyTorch models, model_fn is a compulsory function to include in the inference script.
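As a rough sketch, assuming JSON input and following the conventions of the SageMaker TensorFlow Serving container, the two TensorFlow handlers might look like the following:

def input_handler(data, context):
    # Pre-process the request before it is forwarded to TensorFlow Serving
    if context.request_content_type == 'application/json':
        return data.read().decode('utf-8')
    raise ValueError(f"Unsupported content type: {context.request_content_type}")

def output_handler(response, context):
    # Post-process the TensorFlow Serving response before returning it to the client
    prediction = response.content
    return prediction, context.accept_header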

model_fn

This is the first of the handler functions; SageMaker calls it to load your model. This is where you write your code to load the model artifact. For example:

import os
import torch

def model_fn(model_dir):
    # Instantiate the model class and load the trained weights from model_dir
    model = Your_Model()
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model

Depending on the framework and type of model, this code may change, but the function must return an initialized model.
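For example, a Scikit-learn model saved with joblib might be loaded like this (a minimal sketch, assuming the artifact is named model.pkl):

import os
import joblib

def model_fn(model_dir):
    # Load the serialized Scikit-learn estimator from the model directory
    return joblib.load(os.path.join(model_dir, "model.pkl"))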

input_fn

This is the second function that is called when your endpoint is invoked. This function takes the data sent to the endpoint for inference and parses it into the format required for the model to generate a prediction. For example:

import torch
from io import BytesIO

def input_fn(request_body, request_content_type):
    """An input_fn that loads a pickled tensor"""
    if request_content_type == 'application/python-pickle':
        return torch.load(BytesIO(request_body))
    else:
        # Handle other content types here, or raise an exception
        # if the content type is not supported.
        raise ValueError(f"Unsupported content type: {request_content_type}")

The request_body contains the data to be used for generating inference from the model and is parsed in this function so that it’s in the required format.
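If your clients send JSON instead of pickled tensors, a sketch of input_fn might look like the following; the "inputs" key is an arbitrary choice for illustration.

import json
import numpy as np

def input_fn(request_body, request_content_type):
    # Parse a JSON payload such as {"inputs": [[1.0, 2.0, 3.0]]} into a NumPy array
    if request_content_type == "application/json":
        payload = json.loads(request_body)
        return np.array(payload["inputs"])
    raise ValueError(f"Unsupported content type: {request_content_type}")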

predict_fn

This is the third function that is called when your model is invoked. This function takes the preprocessed input data returned from input_fn and uses the model returned from model_fn to make the prediction. For example:

import torch

def predict_fn(input_data, model):
    # Run inference on GPU if available, otherwise on CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    with torch.no_grad():
        return model(input_data.to(device))

You can optionally add output_fn to parse the output of predict_fn before returning it to the client. The function signature is def output_fn(prediction, content_type).
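A minimal output_fn sketch, assuming the prediction is a tensor or NumPy array and the client accepts JSON, could look like this:

import json

def output_fn(prediction, content_type):
    # Serialize the prediction before it is returned to the client
    if content_type == "application/json":
        return json.dumps({"predictions": prediction.tolist()})
    raise ValueError(f"Unsupported content type: {content_type}")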

Move your pre-trained model to SageMaker

After you have your trained model file and inference script, you must put these files in a folder as follows:

# SKLearn Model

model_folder/
    model.pkl
    inference.py
    
# TensorFlow Model
model_folder/
    0000001/
        assets/
        variables/
        keras_metadata.pb
        saved_model.pb
    inference.py
    
# PyTorch Model
model_folder/
    model.pth
    inference.py

After your model and inference script have been prepared and saved in this folder structure, your model is ready for deployment on SageMaker. See the following code:

from sagemaker_migration import frameworks as fwk

if __name__ == "__main__":
    # Deploy a locally trained Scikit-learn model to a SageMaker real-time endpoint
    sk_model = fwk.SKLearnModel(
        version = "0.23-1",
        model_data = 'model.joblib',
        inference_option = 'real-time',
        inference = 'inference.py',
        instance_type = 'ml.m5.xlarge'
    )
    sk_model.deploy_to_sagemaker()

After deployment of your endpoint, make sure to clean up any resources you won’t utilize via the SageMaker console or through the delete_endpoint Boto3 API call.
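For example, a minimal cleanup sketch with Boto3 might look like the following; the endpoint name is a placeholder, and the endpoint config and model names may differ depending on how the endpoint was created.

import boto3

sm_client = boto3.client("sagemaker")

endpoint_name = "my-sklearn-endpoint"  # placeholder: use the name returned at deployment

# Delete the endpoint so it stops incurring charges
sm_client.delete_endpoint(EndpointName=endpoint_name)

# Optionally also remove the endpoint configuration and the model object
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm_client.delete_model(ModelName=endpoint_name)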

Conclusion

The goal of the SageMaker Migration Toolkit project is to make it easy for data scientists to onboard their models onto SageMaker to take advantage of cloud-based inference. The repository will continue to evolve and support more options for migrating workloads to SageMaker. The code is open source and we welcome community contributions through pull requests and issues.

Check out the GitHub repository to explore more on utilizing the SageMaker Migration Toolkit, and feel free to also contribute examples or feature requests to add to the project!


About the authors

Kirit Thadaka is an ML Solutions Architect working in the Amazon SageMaker Service SA team. Prior to joining AWS, Kirit worked at early-stage AI startups and then in consulting, in roles spanning AI research, MLOps, and technical leadership.

Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Automated Deployment of TensorFlow Models with TensorFlow Serving and GitHub Actions

Posted by Chansung Park and Sayak Paul (ML-GDEs)


If you are an application developer, or if your organization doesn't have a dedicated ML engineering team, it is common to deploy a machine learning model without worrying about the end-to-end machine learning pipeline or MLOps. TFX and TensorFlow Serving can help you create the heart of an MLOps infrastructure.

In this post, we will share how we serve a TensorFlow image classification model as RESTful and gRPC based services with TensorFlow Serving on a Kubernetes (k8s) cluster running on Google Kubernetes Engine (GKE) through a set of GitHub Actions workflows. 

Overview

In any GitHub project, you can make releases, with up to 2 GB of assets included in each release when using a free account. This makes GitHub Releases a convenient place to manage different versions of machine learning models. You can also replace this with a more private component for managing model versions, such as a Google Cloud Storage bucket. For our purposes, the 2 GB of space provided by GitHub Releases is enough.

Figure 1. Three steps to deploy TF Serving on GKE (original).

The basic idea is to:

  1. Automatically detect a newly released version of a TensorFlow-based ML model in GitHub Releases
  2. Build a custom TensorFlow Serving Docker image containing the released ML model
  3. Deploy it on a k8s cluster running on GKE through a set of GitHub Actions.

The entire workflow can be logically divided into three subtasks, so it’s a good idea to write three separate composite GitHub Actions:

  • First subtask handles the environment setup
    • GCP authentication (GCP credentials are injected from a GitHub Actions secret)
    • Install the gcloud CLI to access the GKE cluster in the third subtask
    • Authenticate Docker to push images to Google Container Registry (GCR)
    • Connect to the designated GKE cluster for further access
  • Second subtask builds a custom TensorFlow Serving image
    • Download and extract your latest released SavedModel from your GitHub repository
    • Run the official or a custom built TensorFlow Serving docker image
    • Copy the extracted SavedModel into the running TensorFlow Serving docker container
    • Commit the changes to the running container and tag the resulting image with the GCR host, the GCP project ID, and latest
    • Push the committed image to the GCR
  • Third subtask deploys the custom built TensorFlow Serving image to the GKE cluster
    • Download the Kustomize toolkit to handle overlay configurations
    • Pick one of the scenarios from the various experiments
    • Apply Deployment, Service, and ConfigMap according to the selected experiment to the currently connected GKE cluster
      • ConfigMap is used for batching-enabled scenarios to inject batching configurations dynamically into the Deployment.

There are a number of parameters that you can customize such as the GCP project ID, GKE cluster name, the repository where the ML model will be released, and so on. The full list of parameters can be found here. As noted above, the GCP credentials should be set as a GitHub Action Secret beforehand. If the entire workflow goes without any errors, you will see something similar to the output below.

NAME         TYPE            CLUSTER-IP      EXTERNAL-IP     PORT(S)                            AGE
tfs-server   LoadBalancer    xxxxxxxxxx      xxxxxxxxxx       8500:30869/TCP,8501:31469/TCP      23m

The combinations of the EXTERNAL-IP and the PORT(S) represent endpoints where external users can connect to the TensorFlow Serving pods in the k8s cluster. As you see, two ports are exposed; 8500 and 8501 are for the gRPC and RESTful services respectively. One thing to note is that we used LoadBalancer as the service type, but you may want to consider including Ingress controllers such as GKE Ingress for securing the k8s clusters with SSL/TLS and defining more flexible routing rules in production. You can check out the complete logs from the past runs.
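As an illustration (not part of the workflow itself), you could query the RESTful service with a short Python client like the one below; the IP address and model name are placeholders, and the expected input shape depends on your model.

import json
import requests

SERVING_HOST = "203.0.113.10"  # placeholder: the EXTERNAL-IP from kubectl get svc
MODEL_NAME = "resnet"          # placeholder: the MODEL_NAME baked into the image

url = f"http://{SERVING_HOST}:8501/v1/models/{MODEL_NAME}:predict"

# TensorFlow Serving's REST API expects a JSON body with an "instances" key
image = [[[0.0, 0.0, 0.0]] * 224] * 224  # dummy 224x224x3 input
payload = {"instances": [image]}

response = requests.post(url, data=json.dumps(payload))
print(response.json())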

Build a Custom TensorFlow Serving Image within a GitHub Action

As described in the overview and the official document, a custom TensorFlow Serving Docker image can be built in five steps. We also provide a notebook for local testing of these steps. In this section, we show how to write a composite GitHub Action for this partial subtask of the whole workflow (note that .inputs, .env, and ${{ }} for the environment variables are omitted for brevity).

First, the model is downloaded using the external robinraju/release-downloader GitHub Action, configured with the URL of the GitHub repository and the filename of the asset to fetch from the latest release. The default filename is saved_model.tar.gz.

Second, the downloaded file should be decompressed to fetch the actual SavedModel that TensorFlow Serving can understand.

runs:
  using: "composite"
  steps:
      - name: Download the latest SavedModel release
        uses: robinraju/release-downloader@v1.3
        with:
          repository: $MODEL_RELEASE_REPO
          fileName: $MODEL_RELEASE_FILE
          latest: true

      - name: Extract the SavedModel
        run: |
          mkdir $MODEL_NAME
          tar -xvf $MODEL_RELEASE_FILE --strip-components=1 --directory $MODEL_NAME

      - name: Run the CPU Optimized TensorFlow Serving container
        run: |
          docker run -d --name serving_base $BASE_IMAGE_TAG

      - name: Copy the SavedModel to the running TensorFlow Serving container
        run: |
          docker cp $MODEL_NAME serving_base:/models/$MODEL_NAME

      - id: push-to-registry
        name: Commit and push the changed running TensorFlow Serving image
        run: |
          export NEW_IMAGE_NAME=tfserving-$MODEL_NAME:latest
          export NEW_IMAGE_TAG=gcr.io/$GCP_PROJECT_ID/$NEW_IMAGE_NAME
          echo "::set-output name=NEW_IMAGE_TAG::$(echo $NEW_IMAGE_TAG)"
          docker commit --change "ENV MODEL_NAME $MODEL_NAME" serving_base $NEW_IMAGE_TAG
          docker push $NEW_IMAGE_TAG

Third, we can modify a running TensorFlow Serving Docker container by placing a custom SavedModel inside. In order to do this, we need to run the base TensorFlow Serving container instantiated either from the official image or a custom-built image. We have used the CPU-optimized version as the base image by compiling from source, and it is publicly available here.

Fourth, the SavedModel should be copied to the /models directory inside the running TensorFlow Serving container. In the last step, we set the MODEL_NAME environment variable to let TensorFlow Serving know which model to expose as services, and commit the two changes that we made to the base image. Finally, the updated TensorFlow Serving Docker image can be pushed into the designated GCR.

Notes on the TensorFlow Serving Parameters

We consider three TensorFlow Serving-specific parameters in this post: tensorflow_inter_op_parallelism, tensorflow_intra_op_parallelism, and the batching option. Here, we provide brief overviews of each of them.

Parallelism threads: tensorflow_intra_op_parallelism controls the number of threads used to parallelize the execution of an individual operation. tensorflow_inter_op_parallelism controls the number of threads used to parallelize the execution of multiple independent operations. To know more, refer to this resource.

Batching: As mentioned above, we can allow TensorFlow Serving to batch requests by setting the enable_batching parameter to True. If we do so, we also need to define the batching configurations for TensorFlow in a separate file (passed via the batching_parameters_file argument). Please refer to this resource for more information about the options we can specify in that file.

Configuring TensorFlow Serving

Once you have a custom TensorFlow Serving Docker image, you can deploy it with the k8s resource objects Deployment and ConfigMap, as shown below. This section shows how to write a ConfigMap that holds the batching configuration and a Deployment that adds TensorFlow Serving-specific runtime options. We also show how to mount the ConfigMap so that the batching configuration is passed to TensorFlow Serving's batching_parameters_file option.

apiVersion: apps/v1
kind: Deployment
...
    spec:
      containers:
      - image: gcr.io/gcp-ml-172005/tfs-resnet-cpu-opt:latest
        name: tfs-k8s
        imagePullPolicy: Always
        args: ["--tensorflow_inter_op_parallelism=2",
               "--tensorflow_intra_op_parallelism=8",
               "--enable_batching=true",
               "--batching_parameters_file=/etc/tfs-config/batching_config.txt"]
        ...
        volumeMounts:
          - mountPath: /etc/tfs-config/batching_config.txt
            subPath: batching_config.txt
            name: tfs-config

The URI of the custom built TensorFlow Serving Docker image can be specified in spec.containers.image, and the behavior of TensorFlow Serving can be customized by providing arguments in the spec.containers.args in the Deployment. This post shows how to configure three kinds of custom behavior: tensorflow_inter_op_parallelism, tensorflow_intra_op_parallelism, and enable_batching.

apiVersion: v1
kind: ConfigMap
metadata:
  name: tfs-config
data:
  batching_config.txt: |
    max_batch_size { value: 128 }
    batch_timeout_micros { value: 0 }
    max_enqueued_batches { value: 2 }
    num_batch_threads { value: 2 }

When enable_batching is set to true, we can further customize the batch inference by defining its specific batching-related configurations in a ConfigMap. Then, the ConfigMap can be mounted as a file with spec.containers.volumeMounts, and we can specify which file to look up for the batching_parameters_file argument in Deployment.

Kustomize to Manage Various Experiments

As you see, there are lots of parameters to determine the behavior of TensorFlow Serving, and the optimal values for them are usually found by running experiments. Indeed, we have experimented with various parameters within a number of different environmental setups: different numbers of nodes, different numbers of vCPU cores, and different RAM capacity.

├── base
|   ├── kustomization.yaml
|   ├── deployment.yaml
|   └── service.yaml
└── experiments
    ├── 2vCPU+4GB+inter_op2
    ...
    ├── 4vCPU+8GB+inter_op2
    ...
    ├── 8vCPU+64GB+inter_op2_w_batch
    |   ├── kustomization.yaml
    |   ├── deployment.yaml
    |   └── tfs-config.yaml
    ...

We used kustomize to manage the YAML files of various experiments. We keep common YAML files of Deployment and Service in the base directory while having specific YAML files for certain experimental environments and configurations under the experiments directory. With this and kustomize, the contents of the base YAML files could be easily overlaid with different numbers of replicas, different values of tensorflow_inter_op_parallelism, tensorflow_intra_op_parallelism, enable_batching, and batch configurations.

runs:
  using: "composite"
  steps:
    - name: Setup Kustomize
      ...

    - name: Deploy to GKE
      working-directory: .kube/
      run: |-
        ./kustomize build experiments/$TARGET_EXPERIMENT | kubectl apply -f -
You can simply select the experiment that you want to test, or the one you think is optimal, by setting $TARGET_EXPERIMENT. For example, the best experiment that we found was "8vCPU+16GB+inter_op4", which means each VM is configured with 8 vCPUs and 16 GB of RAM while tensorflow_inter_op_parallelism is set to 4. The kustomize build command then renders the YAML files for the selected experiment, and kubectl apply deploys them to the connected k8s cluster.

Costs

We used the GCP cost estimator to estimate the cost of each experiment. Pricing for each configuration assumed the cluster was live for 24 hours per month (which was sufficient for our experiments).


Machine Configuration (E2 series)    Pricing (USD)
2vCPUs, 4GB RAM, 8 Nodes             11.15
4vCPUs, 8GB RAM, 4 Nodes             11.15
8vCPUs, 16GB RAM, 2 Nodes            11.15
8vCPUs, 64GB RAM, 2 Nodes            18.21

Conclusion

In this post, we discussed how to automatically deploy and experiment with an already trained model with various configurations. We leveraged TensorFlow Serving, Kubernetes, and GitHub Actions to streamline the deployment and experiments. We hope that you found this setup useful and reliable and that you will use this in your own model deployment projects.


Acknowledgements

We are grateful to the ML Developer Programs team that provided GCP credits for supporting our experiments. We also thank Hannes Hapke and Robert Crowe for providing us with helpful feedback and guidance.
