Amazon AWS – Page 153

Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes

April 25, 2023

by Dipankar Patro Amazon AWS

We recently introduced a new capability in the Amazon SageMaker Python SDK that lets data scientists run their machine learning (ML) code authored in their preferred integrated developer environment (IDE) and notebooks along with the associated runtime dependencies as Amazon SageMaker training jobs with minimal code changes to the experimentation done locally. Data scientists typically carry out several iterations of experimentation in data processing and training models while working on any ML problem. They want to run this ML code and carry out the experimentation with ease of use and minimal code change. Amazon SageMaker Model Training helps data scientists run fully managed large-scale training jobs on AWS’s compute infrastructure. SageMaker Training also helps data scientists with advanced tools such as Amazon SageMaker Debugger and Profiler to debug and analyze their large-scale training jobs.

For customers with small budgets, small teams, and tight timelines, every single new concept and line of code rewritten to run on SageMaker makes them less productive towards their core tasks, namely data processing and training ML models. They want to write code once in the framework of their choice and be able to move seamlessly from running code in their notebooks or laptops to running code at scale using SageMaker capabilities.

With this new capability of the SageMaker Python SDK, data scientists can onboard their ML code to the SageMaker Training platform in a few minutes. You just need to add a single line of code to your ML code, and SageMaker intelligently comprehends your code along with the datasets and workspace environment setup and runs it as a SageMaker Training job. You can then take advantage of the key capabilities of the SageMaker Training platform, like the ability to scale jobs easily, and other associated tools like Debugger and Profiler. In this release, you can run your local machine learning (ML) Python code as a single-node Amazon SageMaker training job or multiple parallel jobs. Distributed training jobs(across multiple nodes) are not supported by remote functions.

In this post, we show you how to use this new capability to run local ML code as a SageMaker Training job.

Solution overview

You can now run your ML code written in your IDE or notebook as a SageMaker Training job by annotating the function, which acts as an entry point to the user’s code base, with a simple decorator. Upon invocation, this capability automatically takes a snapshot of all the associated variables, functions, packages, environment variables, and other runtime requirements from your ML code, serializes them, and submits them as a SageMaker Training job. It integrates with the recently announced SageMaker Python SDK feature for setting default values for parameters. This capability simplifies the SageMaker constructs that you need to learn to be able to run code using SageMaker Training. Data scientists can write, debug, and iterate their code in any preferred IDE (such as Amazon SageMaker Studio, notebooks, VS Code, or PyCharm). When ready, you can annotate your Python function with the @remote decorator and run it as a SageMaker job at scale.

This capability takes familiar open-source Python objects as arguments and outputs. Furthermore, you don’t need to understand container lifecycle management and can simply run your workloads across different compute contexts (such as a local IDE, Studio, or training jobs) with minimal configuration overheads. To run any local code as a SageMaker Training job, this capability infers the configurations required to run jobs, such as the AWS Identity and Access Management (IAM) role, encryption key, and network configuration, from the Studio or IDE settings (which can be the default settings) and passes them to the platform by default. You have the flexibility to customize your runtime in the SageMaker managed infrastructure using the inferred configuration or override them at the SDK-level by passing them as arguments to the decorator.

This new capability of the SageMaker Python SDK transforms your ML code in an existing workspace environment and any associated data processing code and datasets into a SageMaker Training job. This capability looks for ML code wrapped inside a @remote decorator and automatically translates it into a job that runs in either Studio or a local IDE such as PyCharm.

In the following sections, we walk through the features of this new capability and how to launch python functions as SageMaker Training jobs.

Prerequisites

To use this new SageMaker Python SDK capability and run the code associated with this post, you need the following prerequisites:

An AWS account that will contain all your AWS resources
An IAM role to access SageMaker
Access to Studio or a SageMaker notebook instance or an IDE such as PyCharm

Use the SDK from Studio and SageMaker notebooks

You can use this capability from Studio by launching a notebook and wrapping your code with a @remote decorator inside the notebook. You first need to import the remote function using the following code:

from sagemaker.remote_function import remote

When you use the decorator function, this capability will automatically interpret the function of your code and run it as a SageMaker Training job.

You can also use this capability from a SageMaker notebook instance. You first need to start a notebook instance, open Jupyter or Jupyter Lab on it, and launch a notebook. Then import the remote function as shown in the preceding code and wrap your code with the @remote decorator. We include an example of how to use the decorator function and the associated settings later in this post.

Use the SDK from your local environment

You can also use this capability from your local IDE. As a prerequisite, you must have the AWS Command Line Interface (AWS CLI), SageMaker Python SDK, and AWS SDK for Python (Boto3) installed in your local environment. You need to import these libraries in your code, set the SageMaker session, specify settings, and decorate your function with the @remote decorator. In the following example code, we run a simple divide function as a SageMaker Training job:

import boto3
import sagemaker
from sagemaker.remote_function import remote

sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
settings = dict(
    sagemaker_session=sm_session,
    role=<IAM_ROLE_NAME>
    instance_type="ml.m5.xlarge",
)
@remote(**settings)
def divide(x, y):
    return x / y
if __name__ == "__main__":
    print(divide(2, 3.0))

We can use a similar methodology to run advanced functions as training jobs, as shown in the next section.

Launch Python functions as SageMaker jobs

The new SageMaker Python SDK feature allows you to run Python functions as SageMaker Training jobs. Any Python code, ML training code developed by data scientists using their preferred local IDEs (PyCharm, VS Code), SageMaker notebooks, or Studio notebooks can be launched as a managed SageMaker job.

In ML workloads using this capability, associated datasets, dependencies, and workspace environment setups are serialized using the ML code and run as a SageMaker job synchronously and asynchronously.

You can add a @remote decorator annotation to any Python code including a local ML processing or training function to launch it as a managed SageMaker Training job, thereby taking advantage of the scale, performance, and cost benefits of SageMaker. This can be achieved with minimal code changes by adding a decorator to the Python function code. Invocation to the decorated function is run synchronously, and the function run waits until the SageMaker job is complete.

In the following example, we use the @remote decorator to launch SageMaker jobs in decorator mode using an ml.m5.large instance. SageMaker uses training jobs to launch this function as a managed job.

from sagemaker.remote_function import remote
from numpy as np

@remote(instance_type="ml.m5.large")
def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0], [0, 1]])
b = np.array([1, 2])

assert matrix_multiply(a, b) == np.array([1,2])

You can also use decorator mode to launch SageMaker jobs, Python packages, and dependencies. You can include environment variables such as VPC, subnets, and security groups to launch SageMaker training jobs in the environment.yml file. This allows ML engineers and admins to configure these environment variables so data scientists can focus on ML model building and iterate faster. See the following code:

from sagemaker.remote_function import remote

@remote(instance_type="ml.g4dn.xlarge",dependencies = "./environment.yml")
def train_hf_model(
    train_input_path,test_input_path,s3_output_path = None,
    *,epochs = 1, train_batch_size = 32, eval_batch_size = 64,
    warmup_steps = 500,learning_rate = 5e-5
    ):  
    model_name = "distilbert-base-uncased"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    ... <TRUCNATED>
    return os.path.join(s3_output_path, model_dir), eval_result

You can use RemoteExecutor to launch Python functions as SageMaker jobs asynchronously. The executor asynchronously polls SageMaker Training jobs to update the status of the job. The RemoteExecutor class is an implementation of the concurrent.futures.Executor, which is used to submit SageMaker Training jobs asynchronously. See the following code:

from sagemaker.remote_function import RemoteExecutor

def train_hf_model(
    train_input_path,test_input_path,s3_output_path = None,
    *,epochs = 1, train_batch_size = 32, eval_batch_size = 64,
    warmup_steps = 500,learning_rate = 5e-5
    ):  
    model_name = "distilbert-base-uncased"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    ...<TRUNCATED>
    return os.path.join(s3_output_path, model_dir), eval_result


with RemoteExecutor(instance_type="ml.g4dn.xlarge", dependencies = './requirements.txt') as e:
    future = e.submit(divide, train_input_path,test_input_path,s3_output_path,
                      epochs, train_batch_size, eval_batch_size,warmup_steps,learning_rate)

Customize the runtime environment

Decorator mode and RemoteExecutor allow you to define and customize the runtime environments for the SageMaker job. The runtime dependencies, including Python packages and environment variables for SageMaker jobs, can be specified to customize the runtime. In order to run local Python code as SageMaker managed jobs, the Python package and dependencies need to be made available to SageMaker. ML engineers or data science administrators can configure networking and security configurations such as VPC, subnets, and security groups for SageMaker jobs, so data scientists can use these centrally managed configurations while launching SageMaker jobs. You can use either a requirements.txt file or a Conda environment.yaml file.

When dependencies are defined with requirements.txt, the packages will be installed using pip in the job runtime. If the image used for running the job comes with Conda environments, packages will be installed in the Conda environment declared to use for jobs. The following code shows an example requirements.txt file:

datasets
transformers
torch
scikit-learn
s3fs==0.4.2
sagemaker>=2.148.0

You can pass your Conda environment.yaml file to create the Conda environment you would like your code to run in during the training job. If the image used for running the job declares a Conda environment to run the code under, we will update the declared Conda environment with the given specification. The following code is an example of a Conda environment.yaml file:

name: sagemaker_example
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - pip:
      - sagemaker

Alternatively, you can set dependencies=”auto_capture” to let the SageMaker Python SDK capture the installed dependencies in the active Conda environment. You must have an active Conda environment for auto_capture to work. Note that there are prerequisites for auto_capture to work; we recommend that you pass in your dependencies as a requirement.txt or Conda environment.yml file as described in the previous section.

For more details, refer to Run your local code as a SageMaker Training job.

Configurations for SageMaker jobs

Infrastructure-related settings can be offloaded to a configuration file that admin users could help set up. You only need to set it up one time. Infrastructure settings cover the network configuration, IAM roles, Amazon Simple Storage Service (Amazon S3) folder for input, output data, and tags. Refer to Configuring and using defaults with the SageMaker Python SDK for more details.

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: path/to/requirements.txt
        EnvironmentVariables: {"EnvVarKey": "EnvVarValue"}
        ImageUri: 366666666666.dkr.ecr.us-west-2.amazonaws.com/my-image:latest
        InstanceType: ml.m5.large
        RoleArn: arn:aws:iam::366666666666:role/MyRole
        S3KmsKeyId: somekmskeyid
        S3RootUri: s3://my-bucket/my-project
        SecurityGroupIds:
          - sg123
        Subnets:
          - subnet-1234
        Tags:
          - {"Key": "someTagKey", "Value": "someTagValue"}
        VolumeKmsKeyId: somekmskeyid

Implementation

Deep learning models like PyTorch or TensorFlow can also be run within Studio by running the code as a training job within the notebook. To showcase this capability in Studio, you can clone this repo into your Studio and run the notebook located in the GitHub repository.

This example demonstrates an end-to-end binary text classification use case. We are using the Hugging Face transformers and datasets library to fine-tune a pre-trained transformer on binary text classification. In particular, the pre-trained model will be fine-tuned using the IMDb dataset.

When you clone the repository, you should locate the following files:

config.yaml – Most of the decorator arguments can be offloaded to the configuration file in order to separate out the infrastructure-related settings from the code base
huggingface.ipynb – This contains the code to train a pre-trained HuggingFace model, which will be fine-tuned using the IMDB dataset
requirements.txt – This file contains all the dependencies to run the function that will be used in this notebook for running the code and running the training remotely in a GPU instance as a training job

When you open the notebook, you will be prompted to set up the notebook environment. You can select the Data Science 3.0 image with the Python 3 kernel and ml.m5.large as the fast launch instance type for running the notebook code. This instance type is significantly faster in spinning up an environment.

The training job will be run in an ml.g4dn.xlarge instance as defined in the config.yaml file:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        # role arn is not required if in SageMaker Notebook instance or SageMaker Studio
        # Uncomment the following line and replace with the right execution role if in a local IDE
        # RoleArn: <IAM_ROLE_ARN>
        InstanceType: ml.g4dn.xlarge
        Dependencies: ./requirements.txt

The requirements.txt file dependencies to run the function for training the Hugging Face model include the following:

datasets
transformers
torch
scikit-learn
# lock s3fs to this specific version as more recent ones introduce dependency on aiobotocore, which is not compatible with botocore
s3fs==0.4.2
sagemaker>=2.148.0,<3

The Hugging Face notebook showcases how to run the training remotely via the @remote function, which is run synchronously. Therefore, the function run for training the model will wait until the SageMaker Training job is complete. The training will be run remotely with a GPU instance wherein the instance type is defined in the preceding configuration file.

from sagemaker.remote_function import remote

@remote(s3_root_uri=s3_root_folder, keep_alive_period_in_seconds=600)
def train_hf_model(
    train_input_path,
    test_input_path,
    s3_output_path = None,
    *,
    epochs = 1,
    train_batch_size = 32,
    eval_batch_size = 64,
    warmup_steps = 500,
    learning_rate = 5e-5
):  
    model_dir = 'model'

    train_dataset = load_from_disk(train_input_path, keep_in_memory=True)
    test_dataset = load_from_disk(test_input_path, keep_in_memory=True)
    
    model_name = 'distilbert-base-uncased'
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    
    training_args = TrainingArguments(
        output_dir=model_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
        warmup_steps=warmup_steps,
        evaluation_strategy="epoch",
        logging_dir="logs/",
        learning_rate=float(learning_rate),
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )
    
    print("Starting model training..")
    trainer.train()
        
    trainer.save_model(model_dir)

After you run the training job, you can run the rest of the cells in the notebook to inspect the evaluation metrics and classify the text on our trained model.

You can also view the training job status that got remotely triggered in the GPU instance on the SageMaker dashboard by navigating back to the SageMaker console.

As soon as the training job is complete, it continues to run the instructions in the notebook for evaluation and classification. Similar jobs can be trained and run via the remote executor function embedded within Studio notebooks to carry out the runs asynchronously.

Integration with SageMaker experiments inside a @remote function

You can pass your experiment name, run name, and other parameters into your remote function to create a SageMaker experiments run. The following code example imports the experiment name, the name of the run, and the parameters to log for each run:

from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run
# Define your remote function
@remote
def train(value_1, value_2, exp_name, run_name):
...
...
#Creates the experiment
with Run(
  experiment_name=exp_name,
  run_name=run_name,
  sagemaker_session=sagemaker_session
) as run:
...
...
#Define values for the parameters to log
run.log_parameter("param_1", value_1)
run.log_parameter("param_2", value_2)
...
...
#Define metrics to log
run.log_metric("metric_a", 0.5)
run.log_metric("metric_b", 0.1)

# Invoke your remote function
train(1.0, 2.0, "my-exp-name", "my-run-name")

In the preceding example, the parameters p1 and p2 are logged over time inside a training loop. Common parameters may include batch size or epochs. In the example, the metrics A and B are logged for a run over time inside a training loop. Common metrics may include accuracy or loss. For more information, see Create an Amazon SageMaker Experiment.

Conclusion

In this post, we introduced a new SageMaker Python SDK capability that enables data scientists to run their ML code in their preferred IDE as SageMaker Training jobs. We discussed the prerequisites needed to use this capability along with its features. We also showed how to use this capability in Studio, SageMaker notebook instances, and your local IDE. In addition, we provided sample code examples to demonstrate how to use this capability. As a next step, we recommend trying this capability in your IDE or SageMaker by following the code examples referenced in this post.

About the Authors

Dipankar Patro is a Software Development Engineer at AWS SageMaker, innovating and building MLOps solutions to help customers adopt AI/ML solutions at scale. He has an MS in Computer Science and his areas of interest are Computer Security, Distributed Systems and AI/ML.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

Shikhar Kwatra is an AI/ML Specialist Solutions Architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping.

Perform intelligent search across emails in your Google workspace using the Gmail connector for Amazon Kendra

April 25, 2023

by Roshan Thomas Amazon AWS

Many organizations use Gmail for their business email needs. Gmail for Business is part of Google Workspace, which provides a set of productivity and collaboration tools like Google Drive, Google Docs, Google Sheets, and more. For any organization, emails contain a wealth of information, which could be within the subject of an email, the message content, or even email attachments. Performing an intelligent search on email interactions with coworkers can help find answers to questions, thereby improving employee productivity and enhancing the overall customer experience for the organization.

Amazon Kendra is a highly accurate and intelligent search service that allows your users to search unstructured and structured data using natural language processing (NLP) and advanced search algorithms. You can now use the Gmail connector for Amazon Kendra to index emails and email attachments in Gmail, and search for answers to your questions on this content using intelligent search in Amazon Kendra, powered by machine learning (ML).

This post walks you through the process of configuring the Gmail connector for Amazon Kendra for your organization’s Google Workspace, allowing you to index emails based on a defined scope and take advantage of the intelligent search capabilities of Amazon Kendra.

Solution overview

A data source is a data repository or location that Amazon Kendra connects to and indexes your documents or content. After you create an Amazon Kendra index, you can create one or many data sources and configure them to start ingesting documents from the data source. In our solution, we ingest emails and attachments from Gmail by configuring the new Gmail data source connector to filter for emails that meet a certain filter criterion. After the configuration is complete, we can synchronize the data source to index the documents, allowing you to perform intelligent search on the Amazon Kendra index.

Prerequisites

To enable the Gmail connector for Amazon Kendra, you need the following:

An AWS account
A Google Workspace account and an organization for your business with one or many users that have access to Gmail
Administrator account credentials to Google Workspace and the Google Cloud console

Configure Google Workspace

To enable Amazon Kendra to access and index emails from Gmail accounts within the organization and perform intelligent search on them, it’s essential to configure your organization’s Google Workspace. In the steps that follow, we create a service account that the Gmail connector uses to index emails. The service account is provided with authorization scopes to allow access to certain Gmail APIs. The authorization scopes express the permissions you request users to authorize for your app and are applicable for all emails within your organization’s Google Workspace.

Log in to your organization’s Google Cloud account.
Create a new project with an appropriate name and assign it to your organization. In our example, we name the project KendraGmailConnector.
Choose Create.

Monitor the progress of creation of the new project on the Notifications menu on the top right of the Google Cloud console.

After the project is created, choose the options menu, choose API & Services¸ and choose Library to view the API Library.

On the API Library, search for Admin SDK API and choose Enable. The Admin SDK API enables managing the Google Workspace account resources and audit usage.

Similarly, search for Gmail API on the API Library page and choose Enable. The Gmail API can help in viewing and managing Gmail mailbox data like threads, messages, and labels.

We now create a service account, which the Gmail connector for Amazon Kendra uses to access your organization’s emails based on the allowed API scope.

On the options menu, choose IAM & Admin, then choose Service Accounts.

Choose Create service account.

Enter a name for your service account. For this post, we name our service account AmazonKendraGmailConnector.
Enter your service account ID and account description.
Skip the optional steps Grant this service account access to project and Grant users access to this service account and choose Done.

Choose the service account you created to open the service account details page.
Note the unique ID for the service account (also known as a client ID), to use in a later step.

Next, we create keys for the service account, which allows it to be used by the Gmail connector for Amazon Kendra.

On the Keys tab, choose Add key.

For Key type, select JSON.
Choose Create.

This step downloads the private key to your computer, which must be kept safe to allow configuration on the Amazon Kendra console.

Choose Close.

The following screenshot shows an example of the credentials JSON file.

On the Details tab, expand the Advanced settings section.
Under Domain-wide delegation, choose View Google Workspace admin console.

Granting access to the service account via a domain-wide delegation to your organization’s data must be done with caution, and can be reversed by disabling or deleting the service account or removing access through the Google Workspace admin console.

Log in to the admin console using your Google Workspace admin credentials.
In the navigation pane, under Security, choose Access and data control, then choose API controls.
In the Domain-wide delegation section, choose Manage domain-wide delegation.

Choose Add new.

This brings up the Add a new client ID dialog.

Enter the unique ID for the service account you created earlier, and enter the following scopes to allow the service account to access the emails from Gmail:
1. https://www.googleapis.com/auth/gmail.readonly
2. https://www.googleapis.com/auth/admin.directory.user.readonly
Choose Authorize.

This concludes the configuration within the Google Cloud console and Google Workspace admin console.

Configure the Gmail connector for Amazon Kendra

In this section, we walk through the configuration steps for the Gmail connector for Amazon Kendra:

On the Amazon Kendra console, create a new index or open an existing index. For this post, we use the existing index EnterpriseKendraIndex.

Under Data management in the navigation pane, choose Data sources.
Choose Add data source.

On the list of data sources, find the Gmail connector and choose Add connector.

On the Specify data source details page, complete the following steps:
1. For Data source name, enter a name.
2. For Description, enter an optional description.
3. Leave the language as the default setting, English (en).
Amazon Kendra supports a select set of languages with full semantic search. These languages include Spanish, Japanese, French, and others. For more information, see Adding documents in languages other than English.
1. Add any tags to the index, then choose Next.

Next, we create an AWS Secrets Manager secret to store the Gmail authentication details, and use the values in the credentials JSON file that we downloaded earlier.

On the Define access and security page, complete the following steps:
1. 1. In the Authentication section, choose Create and add new secret, which opens the Create an AWS Secrets Manager secret dialog.
  2. For Secret name, enter a name.
  3. For Client email, enter the client email ID from the credentials JSON file.
  4. For Admin account email, enter the admin email for the Google Cloud console.
  5. For Private key, enter the private key from the credentials JSON file.
  6. Choose Save to return to the Define access and security page.
1. 1. In the Configure VPC and security group section, you can choose a VPC and the subnets that will contain the data source and security group that will grant access to the host. For our configuration, we choose No VPC.
  2. In the IAM role section, choose Create a new role and enter a role name.
  3. Choose Next.

On the Configure sync settings page, set the following parameters to sync all emails and email attachments sent from the admin email address:
1. In the Sync scope section, select Message attachments.
2. Under Additional configuration, configure filters for the emails to ingest into the Amazon Kendra index:
  1. For Date range, enter the start and end dates for emails to be crawled. Emails received on or after the start date and before the end date are included in the sync scope.
  2. For Email domains, enter the email from domains, email to domains, subject, CC, and BCC emails you wish to include or exclude in your index. For this post, we set the email from domain as the admin email address.
  3. For Keywords in subjects, include or exclude any documents with at least one keyword mentioned in their subjects.
  4. For Labels, add regular expression patterns to include or exclude certain labels or attachment types (up to 100 patterns).
  5. For Attachments, add regular expression patterns to include or exclude certain attachments (up to 100 patterns).

1. In the Sync mode section, you can either specify a full sync to sync and index all contents in all entities regardless of the previous sync status, or only sync new, modified, or deleted content. For this post, we select Full sync.

1. Lastly, we set an appropriate frequency for the sync. For this post, we choose Run on demand.
2. Choose Next.
On the Set field mappings page, you associate or create a mapping of the required data source fields with fields in your index. You can also create mappings for custom index fields. You can specify mapping for both messages and message attachments. For this post, we add field mappings in the Message section:
1. Select the Gmail field mappings subject, from, and to.
2. Choose Next.

On the Review and create page, review all the steps and choose Add data source to create your Gmail connector data source.
After the data source is created, on the Data sources page, select the data source (kendra-gmail-connector) and choose Sync now.

The amount of time the sync takes depends on the number of the emails that match the sync scope and the size of attachments that need be indexed. You can check the status of the sync operation for the Gmail data source if you choose the data source and scroll down to the Sync run history section. Choose the status of the individual sync to view more details.

This section shows the start and end times of the sync and also the number of documents that were added, deleted, failed, or modified during the sync. A status of Completed denotes a sync where there are no failures. In cases where a document being ingested is blank, the sync status is set to Completed with Errors with the number of failed documents listed as Failed, as shown in the following screenshot. In case of a sync failure, you can investigate the reason by either choosing the number of failed documents or by choosing the entry in the Details column, which brings up the Amazon CloudWatch logs. In the following example, two documents failed ingestion because they were blank.

After the sync is successful, you can perform a search on the Amazon Kendra index.

Search indexed content

To search on the indexed content, choose Search indexed content in the navigation pane on the Amazon Kendra console.

On the search console, enter any natural language question. In our example, we ask “What is SageMaker.” Amazon Kendra performs an intelligent search on the emails ingested into the index based on the scope of the sync and finds an answer, as shown in the following screenshot.

In this example, the Document fields section shows the field mappings that we specified while configuring our data source connector.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Gmail connector, delete the added data source.

Conclusion

In this post, we showed how organizations can now use the Gmail connector for Amazon Kendra to allow users to perform intelligent search on emails and email attachments, thereby improving employee productivity and customer satisfaction.

Additionally, we walked through how to define field mappings to the Amazon Kendra data source, allowing users to refine their search results.

To learn more about the Gmail connector for Amazon Kendra, refer to Gmail data source connector for Amazon Kendra.

About the Author

Roshan Thomas is a Senior Solutions Architect at Amazon Web Services. He is based in Melbourne, Australia, and works closely with power and utilities customers to accelerate their journey in the cloud. He is passionate about technology and helping customers architect and build solutions on AWS.

Amazon SageMaker Data Wrangler for dimensionality reduction

April 24, 2023

by Adeleke Coker Amazon AWS

In the world of machine learning (ML), the quality of the dataset is of significant importance to model predictability. Although more data is usually better, large datasets with a high number of features can sometimes lead to non-optimal model performance due to the curse of dimensionality. Analysts can spend a significant amount of time transforming data to improve model performance. Additionally, large datasets are costlier and take longer to train. If time is a constraint, model performance may be limited as a result.

Dimension reduction techniques can help reduce the size of your data while maintaining its information, resulting in quicker training times, lower cost, and potentially higher-performing models.

Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for ML. Data Wrangler simplifies the process of data preparation and feature engineering like data selection, cleansing, exploration, and visualization from a single visual interface. Data Wrangler has more than 300 preconfigured data transformations that can effectively be used in transforming the data. In addition, you can write custom transformation in PySpark, SQL, and pandas.

Today, we’re excited to add a new transformation technique that is commonly used in the ML world to the list of Data Wrangler pre-built transformations: dimensionality reduction using Principal Component Analysis. With this new feature, you can reduce the high number of dimensions in your datasets to one that can be used with popular ML algorithms with just a few clicks on the Data Wrangler console. This can have significant improvements in your model performance with minimal effort.

In this post, we provide an overview of this new feature and show how to use it in your data transformation. We will show how to use dimensionality reduction on large sparse datasets.

Overview of Principal Component Analysis

Principal Component Analysis (PCA) is a method by which the dimensionality of features can be transformed in a dataset with many numerical features into one with fewer features while still retaining as much information as possible from the original dataset. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. Several features in a dataset often have less impact on the final result and may increase the processing time of ML models. It can become difficult for humans to understand and solve such high-dimensional problems. Dimensionality reduction techniques like PCA can help solve this for us.

Solution overview

In this post, we show how you can use the dimensionality reduction transform in Data Wrangler on the MNIST dataset to reduce the number of features by 85% and still achieve similar or better accuracy than the original dataset. The MNIST (Modified National Institute of Standards and Technology) dataset, which is the de facto “hello world” dataset in computer vision, is a dataset of handwritten images. Each row of the dataset corresponds to a single image that is 28 x 28 pixels, for a total of 784 pixels. Each pixel is represented by a single feature in the dataset with a pixel value ranging from 0–255.

To learn more about the new dimensionality reduction feature, refer to Reduce Dimensionality within a Dataset.

Prerequisites

This post assumes that you have an Amazon SageMaker Studio domain set up. For details on how to set it up, refer to Onboard to Amazon SageMaker Domain Using Quick setup.

To get started with the new capabilities of Data Wrangler, open Studio after upgrading to the latest release and choose the File menu, New, and Flow, or choose New data flow from the Studio launcher.

Perform a Quick Model analysis

The dataset we use in this post contains 60,000 training examples and labels. Each row consists of 785 values: the first value is the label (a number from 0–9) and the remaining 784 values are the pixel values (a number from 0–255). First, we perform a Quick Model analysis on the raw data to get performance metrics and compare them with the model metrics post-PCA transformations for evaluation. Complete the following steps:

Download the MNIST dataset training dataset.
Extract the data from the .zip file and upload into an Amazon Simple Storage Service (Amazon S3) bucket.
In Studio, choose New and Data Wrangler Flow to create a new Data Wrangler flow.
Choose Import data to load the data from Amazon S3.
Choose Amazon S3 as your data source.
Select the dataset uploaded to your S3 bucket.
Leave the default settings and choose Import.

After the data is imported, Data Wrangler automatically validates the datasets and detects the data types for all the columns based on its sampling. In the MNIST dataset, because all the columns are long, we leave this step as is and go back to the data flow.

Choose Data flow at the top of the Data types page to return to the main data flow.

The flow editor now shows two blocks showcasing that the data was imported from a source and the data types recognized. You can also edit the data types if needed.

After confirming that the data quality is acceptable, we go back to the data flow and use Data Wrangler’s Data Quality and Insights Report. This report performs an analysis on the imported dataset and provides information about missing values, outliers, target leakage, imbalanced data, and a Quick Model analysis. Refer to Get Insights On Data and Data Quality for more information.

For this analysis, we only focus on the Quick Model part of the Data Quality report.

Choose the plus sign next to Data types, then choose Add analysis.
For Analysis type¸ choose Data Quality And Insights Report.
For Target column, choose label.
For Problem type, select Classification (this step is optional).
Choose Create.

For this post, we use the Data Quality and Insights Report to show how the model performance is mostly preserved using PCA. We recommend that you use a deep learning-based approach for better performance.

The following screenshot shows a summary of the dataset from the report. Fortunately, we don’t have any missing values. The time taken for the report to generate depends on the size of the dataset, number of features, and the instance size used by Data Wrangler.

The following screenshot shows how the model performed on the raw dataset. Here we notice that the model has an accuracy of 93.7% utilizing 784 features.

Use the Data Wrangler dimensionality reduction transform

Now let’s use the Data Wrangler dimensionality reduction transform to reduce the number of features in this dataset.

On the data flow page, choose the plus sign next to Data types, then choose Add transform.
Choose Add step.
Choose Dimensionality Reduction.

If you don’t see the dimensionality reduction option listed, you need to update Data Wrangler. For instructions, refer to Update Data Wrangler.

Configure the key variables that go into PCA:
1. For Transform, choose the dimensionality reduction technique that you want to use. For this post, we choose Principal component analysis.
2. For Input Columns, choose the columns that you want to include in the PCA analysis. For this example, we choose all the features except the target column label (you can also use the Select all feature to select all features and deselect features not needed). These columns need to be of numeric data type.
3. For Number of principal components, specify the number of target dimensions.
4. For Variance threshold percentage, specify the percentage of variation in the data that you want to explain by the principal components. The default value is 95; for this post, we use 80.
5. Select Center to center the data with the mean before scaling.
6. Select Scale to scale the data with the unit standard deviation.
  PCA gives more emphasis to variables with high variance. Therefore, if the dimensions are not scaled, we will get inconsistent results. For example, the value for one variable might lie in the range of 50–100, and another variable is 5–10. In this case, PCA will give more weight to the first variable. Such issues can be resolved by scaling the dataset before applying PCA.
7. For Output Format, specify if you want to output components into separate columns or vectors. For this post, we choose Columns.
8. For Output column, enter a prefix for column names generated by PCA. For this post, we enter PCA80_.
Choose Preview to preview the data, then choose Update.

After applying PCA, the number of columns will be reduced from 784 to 115—this is an 85% reduction in the number of features.

We can now use the transformed dataset and generate another Data Quality and Insights Report as shown in the following screenshot to observe the model performance.

We can see in the second analysis that the model performance has improved and accuracy increased to 91.8% compared to the first Quick Model report. PCA reduced the number of features in our dataset by 85% while maintaining the model accuracy at similar levels.

Based on the Quick Model analysis from the report, model performance is at 91.8%. With PCA, we reduced the columns by 85% while still maintaining the model accuracy at similar levels. For better results, you can try deep learning models, which might offer even better performance.

We found the following comparison in training time using Amazon SageMaker Autopilot with and without PCA dimensionality reduction:

With PCA dimensional reduction – 25 minutes
Without PCA dimensional reduction – 45 minutes

Operationalizing PCA

As data changes over time, it’s often desirable to retrain our parameters to new unseen data. Data Wrangler offers this capability through the use of refitting parameters. For more information on refitting trained parameters, refer to Refit trained parameters on large datasets using Amazon SageMaker Data Wrangler.

Previously, we applied PCA to a sample of the MNIST dataset containing 50,000 sample rows. Consequently, our flow file contains a model that has been trained on this sample and used for all created jobs unless we specify that we want to relearn those parameters.

To refit your model parameters on the MNIST training dataset, complete the following steps:

Create a destination for our flow file in Amazon S3 so we can create a Data Wrangler processing job.
Create a job and select Refit to learn new training parameters.

The Trained parameters section shows that there are 784 parameters. That is one parameter for each column because we excluded the label column in our PCA reduction.

Note that if we don’t select Refit in this step, the trained parameters learned during interactive mode will be used.

Create the job.
Choose the processing job link to monitor the job and find the location of the resulting flow file on Amazon S3.

This flow file contains the model learned on the entire MNIST train dataset.

Load this file into Data Wrangler.

Clean up

To clean up the environment so you don’t incur additional charges, delete the datasets and artifacts in Amazon S3. Additionally, delete the data flow file in Studio and shut down the instance it runs on. Refer to Shut Down Data Wrangler for more information.

Conclusion

Dimensionality reduction is a great technique to remove the unwanted variables from a model. It can be used to reduce the model complexity and noise in the data, thereby mitigating the common problem of overfitting in machine learning and deep learning models. In this blog we demonstrated that by reducing the number of features, we were still able to accomplish similar or higher accuracy for our models.

For more information about using PCA, refer to Principal Component Analysis (PCA) Algorithm. To learn more about the dimensionality reduction transform, refer to Reduce Dimensionality within a Dataset.

About the authors

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in DataWrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.

Vishaal Kapoor is a Senior Applied Scientist with AWS AI. He is passionate about helping customers understand their data in Data Wrangler. In his spare time, he mountain bikes, snowboards, and spends time with his family.

Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings over 11+ years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.

Identify objections in customer conversations using Amazon Comprehend to enhance customer experience without ML expertise

April 24, 2023

by Ray Wang Amazon AWS

According to a PWC report, 32% of retail customers churn after one negative experience, and 73% of customers say that customer experience influences their purchase decisions. In the global retail industry, pre- and post-sales support are both important aspects of customer care. Numerous methods, including email, live chat, bots, and phone calls, are used to provide customer assistance. Since conversational AI has improved in recent years, many businesses have adopted cutting-edge technologies like AI-powered chatbots and AI-powered agent support to improve customer service while increasing productivity and lowering costs.

Amazon Comprehend is a fully managed and continuously trained natural language processing (NLP) service that can extract insight about the content of a document or text. In this post, we explore how AWS customer Pro360 used the Amazon Comprehend custom classification API, which enables you to easily build custom text classification models using your business-specific labels without requiring you to learn machine learning (ML), to improve customer experience and reduce operational costs.

Pro360: Accurately detect customer objections in chatbots

Pro360 is a marketplace that aims to connect specialists with industry-specific talents with potential clients, allowing them to find new opportunities and expand their professional network. It allows customers to communicate directly with experts and negotiate a customized price for their services based on their individual requirements. Pro360 charges clients when successful matches occur between specialists and clients.

Pro360 had to deal with a problem related to unreliable charges that led to consumer complaints and reduced trust with the brand. The problem was that it was difficult to understand the customer’s objective during convoluted conversations filled with multiple aims, courteous denials, and indirect communication. Such conversations were leading to erroneous charges that reduced customer satisfaction. As an example, a customer may start a conversation and stop immediately, or end the conversation by politely declining by saying “I am busy” or “Let me chew on it.” Also, due to cultural differences, some customers might not be used to expressing their intentions clearly, particularly when they want to say “no.” This made it even more challenging.

To solve this problem, Pro360 initially added options and choices for the customer, such as “I would like more information” or “No, I have other options.” Instead of typing their own question or query, the customer simply chooses the options provided. Nonetheless, the problem was still not solved because customers preferred to speak plainly and in their own natural language while interacting with the system. Pro360 identified that the problem was a result of rules-based systems, and that moving to an NLP-based solution would result in a better understanding of customer intent, and lead to better customer satisfaction.

Custom classification is a feature of Amazon Comprehend, which allows you to develop your own classifiers using small datasets. Pro360 utilized this feature to build a model with 99.2% accuracy by training on 800 data points and testing on 300 data points. They followed a three-step approach to build and iterate the model to achieve their desired level of accuracy from 82% to 99.3%. Firstly, Pro360 defined two classes, reject and non-reject, that they wanted to use for classification. Secondly, they removed irrelevant emojis and symbols such as ~ and ... and identified negative emojis to improve the model’s accuracy. Lastly, they defined three additional content classifications to improve the misidentification rate, including small talk, ambiguous response, and reject with a reason, to further iterate the model.

In this post, we share how Pro360 utilized Amazon Comprehend to track down consumer objections during discussions and used a human-in-the-loop (HITL) mechanism to incorporate customer feedback into the model’s improvement and accuracy, demonstrating the ease of use and efficiency of Amazon Comprehend.

“Initially, I believed that implementing AI would be costly. However, the discovery of Amazon Comprehend enables us to efficiently and economically bring an NLP model from concept to implementation in a mere 1.5 months. We’re grateful for the support provided by the AWS account team, solution architecture team, and ML experts from the SSO and service team.”

– LC Lee, founder and CEO of Pro360.

Solution overview

The following diagram illustrates the solution architecture covering real-time inference, feedback workflow, and human review workflow, and how those components contribute to the Amazon Comprehend training workflow.

In the following sections, we walk you through each step in the workflow.

Real-time text classification

To use Amazon Comprehend custom classification in real time, you need to deploy an API as the entry point and call an Amazon Comprehend model to conduct real-time text classification. The steps are as follows:

The client side calls Amazon API Gateway as the entry point to provide a client message as input.
API Gateway passes the request to AWS Lambda and calls the API from Amazon DynamoDB and Amazon Comprehend in Steps 3 and 4.
Lambda checks the current version of the Amazon Comprehend endpoint that stores data in DynamoDB, and calls an Amazon Comprehend endpoint to get real-time inference.
Lambda, with a built-in rule, checks the score to determine whether it’s under the threshold or not. It then stores that data in DynamoDB and waits for human approval to confirm the evaluation result.

Feedback workflow

When the endpoint returns the classification result to the client side, the application prompts the end-user with a hint to get their feedback, and stores the data in the database for the next round (the training workflow). The steps for the feedback workflow are as follows:

The client side sends the user feedback by calling API Gateway.
API Gateway bypasses the request to Lambda. Lambda checks the format and stores it in DynamoDB.
The user feedback from Lambda is stored in DynamoDB and will be used for the next training process.

Human review workflow

The human review process helps us clarify data with a confidence score below the threshold. This data is valuable for improving the Amazon Comprehend model, and is added to the next iteration of retraining. We used Elastic Load Balancing as the entry point to conduct this process because the Pro360 system is built on Amazon Elastic Complute Cloud (Amazon EC2). The steps for this workflow are as follows:

We use an existing API on the Elastic Load Balancer as the entry point.
We use Amazon EC2 as the compute resource to build a front-end dashboard for the reviewer to tag the input data with lower confidence scores.
After the reviewer identifies the objection from the input data, we store the result in a DynamoDB table.

Amazon Comprehend training workflow

To start the training the Amazon Comprehend model, we need to prepare the training data. The following steps show you how to train the model:

We use AWS Glue to conduct extract, transform, and load (ETL) jobs and merge the data from two different DynamoDB tables and store it in Amazon Simple Storage Service (Amazon S3).
When the Amazon S3 training data is ready, we can trigger AWS Step Functions as the orchestration tool to run the training job, and we pass the S3 path into the Step Functions state machine.
We invoke a Lambda function to validate that the training data path exists, and then trigger an Amazon Comprehend training job.
After the training job starts, we use another Lambda function to check the training job status. If the training job is complete, we get the model metric and store it in DynamoDB for further evaluation.
We check the performance of the current model with a Lambda model selection function. If the current version’s performance is better than the original one, we deploy it to the Amazon Comprehend endpoint.
Then we invoke another Lambda function to check the endpoint status. The function updates information in DynamoDB for real-time text classification when the endpoint is ready.

Summary and next steps

In this post, we showed how Amazon Comprehend enables Pro360 to build an AI-powered application without ML expert practitioners, which is able to increase the accuracy of customer objection detection. Pro360 was able to build a custom-purposed NLP model in just 1.5 months, and now is able to identify 90% of customer polite rejections and detect customer intent with 99.2% overall accuracy. This solution not only enhances the customer experience, increasing 28.5% retention rate growth, but also improves financial outcomes, decreasing the operation cost by 8% and reducing the workload for customer service agents.

However, identifying customer objections is just the first step in improving the customer experience. By continuing to iterate on the customer experience and accelerate revenue growth, the next step is to identify the reasons for customer objections, such as lack of interest, timing issues, or influence from others, and to generate the appropriate response to increase the sales conversion rate.

To use Amazon Comprehend to build custom text classification models, you can access the service through the AWS Management Console. To learn more about how to use Amazon Comprehend, check out Amazon Comprehend developer resources.

About the Authors

Ray Wang is a Solutions Architect at AWS. With 8 years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in NoSQL, big data, and machine learning. As a hungry go-getter, he passed all 12 AWS certificates to make his technical field not only deep but wide. He loves to read and watch sci-fi movies in his spare time.

Josie Cheng is a HKT AI/ML Go-To-Market at AWS. Her current focus is on business transformation in retail and CPG through data and ML to fuel tremendous enterprise growth. Before joining AWS, Josie worked for Amazon Retail and other China and US internet companies as a Growth Product Manager.

Shanna Chang is a Solutions Architect at AWS. She focuses on observability in modern architectures and cloud-native monitoring solutions. Before joining AWS, she was a software engineer. In her spare time, she enjoys hiking and watching movies.

Wrick Talukdar is a Senior Architect with the Amazon Comprehend Service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.

Create SageMaker Pipelines for training, consuming and monitoring your batch use cases

April 21, 2023

by Shelbee Eigenbrode Amazon AWS

Batch inference is a common pattern where prediction requests are batched together on input, a job runs to process those requests against a trained model, and the output includes batch prediction responses that can then be consumed by other applications or business functions. Running batch use cases in production environments requires a repeatable process for model retraining as well as batch inference. That process should also include monitoring that model to measure performance over time.

In this post, we show how to create repeatable pipelines for your batch use cases using Amazon SageMaker Pipelines, Amazon SageMaker model registry, SageMaker batch transform jobs, and Amazon SageMaker Model Monitor. This solution highlights the ability to use the fully managed features within SageMaker MLOps to reduce operational overhead through fully managed and integrated capabilities.

Solution overview

There are multiple scenarios for performing batch inference. In some cases, you may be retraining your model every time you run batch inference. Alternatively, you may be training your model less frequently than you are performing batch inference. In this post, we focus on the second scenario. For this example, let’s assume you have a model that is trained periodically, roughly one time per month. However, batch inference is performed against the latest model version on a daily basis. This is a common scenario, in which the model training lifecycle is different than the batch inference lifecycle.

The architecture supporting the introduced batch scenario contains two separate SageMaker pipelines, as shown in the following diagram.

We use the first pipeline to train the model and baseline the training data. We use the generated baseline for ongoing monitoring in the second pipeline. The first pipeline includes the steps needed to prepare data, train the model, and evaluate the performance of the model. If the model performs acceptably according to the evaluation criteria, the pipeline continues with a step to baseline the data using a built-in SageMaker Pipelines step. For the data drift Model Monitor type, the baselining step uses a SageMaker managed container image to generate statistics and constraints based on your training data. This baseline is then used to monitor for signals of data drift during batch inference. Finally, the first pipeline completes when a new model version is registered into the SageMaker model registry. At this point, the model can be approved automatically, or a secondary manual approval can be required based on a peer review of model performance and any other identified criteria.

In the second pipeline, the first step queries the model registry for the latest approved model version and runs the data monitoring job, which compares the data baseline generated from the first pipeline with the current input data. The final step in the pipeline is performing batch inference against the latest approved model.

The following diagram illustrates the solution architecture for each pipeline.

For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The last attribute, Churn, is the attribute that we want the ML model to predict. The target attribute is binary, meaning the model predicts the output as one of two categories (True or False).

The following GitHub repo contains the code for demonstrating the steps performed in each pipeline. It contains three notebooks: to perform the initial setup, to create the model train and baseline pipeline, and create the batch inference and Model Monitor pipeline. The repository also includes additional Python source code with helper functions, used in the setup notebook, to set up required permissions.

|-Custom_IAM_policies
	| |—Custom_IAM_roles_policy
	| |—Custom_Lambda_policy
|— pipeline_scripts
	| |— evaluate.py
	| |— preprocessing.py
|— 0.Setup.ipynb
|— 1.SageMakerPipeline-BaselineData-Train.ipynb
|— 2.SageMakerPipeline-ModelMonitoring-DataQuality-BatchTransform.ipynb
|— iam_helper.py
|— lambda_getapproved_model.py

Prerequisites

The following screenshot lists some permission policies that are required by the SageMaker execution role for the workflow. You can enable these permission policies through AWS Identity and Access Management (IAM) role permissions.

AmazonSageMaker-ExecutionPolicy-<...> is the execution role associated with the SageMaker user and has the necessary Amazon Simple Storage Service (Amazon S3) bucket policies. Custom_IAM_roles_policy and Custom_Lambda_policy are two custom policies created to support the required actions for the AWS Lambda function. To add the two custom policies, go to the appropriate role (associated with your SageMaker user) in IAM, click on Add permissions and then Create inline policy. Then, choose JSON inside Create policy, add the policy code for first custom policy and save the policy. Repeat the same for the second custom policy.

0.Setup.ipynb is a prerequisite notebook required before running notebooks 1 and 2. The code sets up the S3 paths for pipeline inputs, outputs, and model artifacts, and uploads scripts used within the pipeline steps. This notebook also uses one of the provided helper functions, create_lambda_role, to create a Lambda role that is used in notebook 2, 2.SageMakerPipeline-ModelMonitoring-DataQuality-BatchTransform.ipynb. See the following code:

# Create Lambda execution role for Lambda Function using helper function
from iam_helper import create_lambda_role

lambda_role = create_lambda_role("Lambda-SageMaker-GetModelRole")
print('Lambda Role:', lambda_role)

After you’ve successfully completed all of the tasks in the setup notebook, you’re ready to build the first pipeline to train and baseline the model.

Pipeline 1: Train and baseline pipeline

In this section, we take a deep dive into the SageMaker pipeline used to train and baseline the model. The necessary steps and code are in the 1.SageMakerPipeline-BaselineData-Train.ipynb notebook. This pipeline takes the raw customer churn data as input, and then performs the steps required to prepare the data, train the model, evaluate the model, baseline the model, and register the model in the model registry.

To build a SageMaker pipeline, you configure the underlying job (such as SageMaker Processing), configure the pipeline steps to run the job, and then configure and run the pipeline. We complete the following steps:

Configure the model build pipeline to prepare the data, train the model, and evaluate the model.
Configure the baseline step for the data drift with Model Monitor.
Configure steps to package the model and register the model version.
Configure a conditional step to evaluate model performance.

Configure the model build pipeline

The model build pipeline is a three-step process:

Prepare the data.
Train the model.
Evaluate the model.

To prepare the data, we configure a data processing step. This step runs a SageMaker Processing job, using the built-in ProcessingStep, to prepare the raw data on input for training and evaluation.

To train the model, we configure a training job step. This step runs a SageMaker Training job, using the built-in TrainingStep. For this use case, we perform binary classification using XGBoost. The output of this step is a model artifact, model.tar.gz, stored in Amazon S3.

The last step is responsible for evaluating model performance using the test holdout dataset. This step uses the built-in ProcessingStep with the provided code, evaluation.py, to evaluate performance metrics (accuracy, area under curve).

Configure the baseline step

To monitor the model and data, a baseline is required.

Monitoring for data drift requires a baseline of training data. The baseline step uses Pipelines’ built-in QualityCheckStep. This step automatically runs a SageMaker Processing job that uses the Model Monitor pre-built container image. We use this same container image for the baselining as well as the model monitoring; however, the parameters used during configuration of this step direct the appropriate behavior. In this case, we are baselining the data, so we need to ensure that the quality_check_config parameter is using DataQualityCheckConfig, which identifies the S3 input and output paths. We’re also setting register_new_baseline and skip_check to true. When these values are both set to true, it tells SageMaker to run this step as a baseline job and create a new baseline. To get a better understanding of the parameters that control the behavior of the SageMaker pre-built container image, refer to Baseline calculation, drift detection and lifecycle with ClarifyCheck and QualityCheck steps in Amazon SageMaker Model Building Pipelines.

See the following code:

# Configure the Data Quality Baseline Job

# Configure the transient compute environment
check_job_config = CheckJobConfig(
    role=role_arn,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    volume_size_in_gb=120,
    sagemaker_session=session,
)

# Configure the data quality check input (training data), dataset format, and S3 output path
data_quality_check_config = DataQualityCheckConfig(
    baseline_dataset=data_preparation_step.properties.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri,
    dataset_format=DatasetFormat.csv(header=False, output_columns_position="START"),
    output_s3_uri=Join(on='/', values=['s3:/', bucket, bucket_prefix, ExecutionVariables.PIPELINE_EXECUTION_ID, 'dataqualitycheckstep'])
)

# Configure Pipeline Step - 'QualityCheckStep'
baseline_model_data_step = QualityCheckStep(
        name="DataQualityCheckStep",
        # skip_check, indicates a baselining job
        skip_check=True,
        register_new_baseline=True,
        quality_check_config=data_quality_check_config,
        check_job_config=check_job_config,
        model_package_group_name=model_package_group_name
    )

This step generates two JSON files as output:

statistics.json – Contains calculated statistics for each feature of the training dataset
constraints.json – Suggests data constraints based on the statistics collected

These constraints can also be modified and are used to detect signals of drift during model monitoring.

Configure steps to package and register the model version

Next, we configure the steps to package for deployment and register the model in the model registry using two additional pipeline steps.

The package model step packages the model for use with the SageMaker batch transform deployment option. model.create() creates a model entity, which will be included in the custom metadata registered for this model version and later used in the second pipeline for batch inference and model monitoring. See the following code:

# Configure step to package model for inference using Model object, model.create(

step_args = model.create()
    instance_type="ml.m5.large",
    accelerator_type="ml.eia1.medium",
)

create_model_step = ModelStep(
    name="CustomerChurnCreateModel",
    step_args=step_args,
)

The register model step registers the model version and associated metadata to the SageMaker model registry. This includes model performance metrics as well as metadata for the data drift baseline, including the Amazon S3 locations of the statistics and constraints files produced through the baselining step. You’ll also notice the additional custom metadata noted customer_metadata_properties pulling the model entity information that will be used later in the inference pipeline. The ability to provide custom metadata within the model registry is a great way to incorporate additional metadata that should be collected that isn’t explicitly defined in native SageMaker parameters. See the following code:

# Configure step to register model version using metadata and Model object: model.register()
model_registry_args = model.register(
    content_types=['text/csv'],
    response_types=['text/csv'],
    inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
    transform_instances=['ml.m5.xlarge'],
    model_package_group_name=model_package_group_name,
    customer_metadata_properties={"ModelName": create_model_step.properties.ModelName},
    drift_check_baselines=drift_check_baselines,
    approval_status='PendingManualApproval',
    model_metrics=model_metrics
)

register_step = ModelStep(
    name='RegisterModel',
    step_args=model_registry_args
)

Configure a conditional step to evaluate model performance

The conditional step, ConditionStep, compares model accuracy against an identified threshold and checks the quality of the trained model.

It reads the evaluation.json file and checks if the model accuracy, or whatever objective metric you are optimizing for, meets the criteria you’ve defined. In this case, the criteria is defined using one of the built-in conditions, ConditionGreaterThanOrEqualTo. If the condition is satisfied, the pipeline continues to baseline the data and perform subsequent steps in the pipeline. The pipeline stops if the condition is not met. Because the condition explicitly calls out the next steps in the pipeline, we have to ensure those steps are configured prior to configuring our conditional step. See the following code:

condition_step = ConditionStep(
    name='PerformanceConditionalCheck',
    conditions=[cond_gte],
    if_steps=[baseline_model_data_step,create_model_step, register_step],
    else_steps=[],
)

Define, create, and start the SageMaker pipeline

At this point, all the steps of the train and baseline pipeline are defined and configured. Now it’s time to define, create, and start the pipeline.

First, we define the pipeline, Pipeline(), providing a pipeline name and a list of steps previously configured to include in the pipeline. Next, we create the pipeline using training_pipeline.upsert(). Finally, we start the pipeline using training_pipeline.start(). See the following code:

step_list = [
             data_preparation_step,
             training_step,
             evaluation_step,
             condition_step]

training_pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
      ],
    steps=step_list
)

When the pipeline starts running, you can visualize its status on Studio. The following diagram shows which steps from the pipeline process relate to the steps of the pipeline directed acyclic graph (DAG). After the train and baseline pipeline run successfully, it registers the trained model as part of the model group in the model registry. The pipeline is currently set up to register the model in a Pending state, which requires a manual approval. Optionally, you can configure the model registration step to automatically approve the model in the model registry. The second pipeline will pull the latest approved model from the registry for inference.

In Studio, you can choose any step to see its key metadata. As an example, the data quality check step (baseline step) within the pipeline DAG shows the S3 output locations of statistics.json and constraints.json in the Reports section. These are key files calculated from raw data used as a baseline.

After the pipeline has run, the baseline (statistics and constraints) for data quality monitoring can be inspected, as shown in the following screenshots.

Pipeline 2: Batch inference and Model Monitor pipeline

In this section, we dive into the second pipeline used for monitoring the new batch input data for signals of data drift and running batch inference using SageMaker Pipelines. The necessary steps and code are within 2.SageMakerPipeline-ModelMonitoring-DataQuality-BatchTransform.ipynb. This pipeline includes the following steps:

A Lambda step to retrieve the latest approved model version and associated metadata from the model registry.
A Model Monitor step to detect signals of data drift using the new input data and the baseline from Pipeline 1.
A batch transform step to process the batch input data against the latest approved model.

Configure a Lambda Step

Before we start the model monitoring and batch transform job, we need to query the model registry to get the latest approved model that we will use for batch inference.

To do this, we use a Lambda step, which allows us to include custom logic within our pipeline. The lambda_getapproved_model.py Lambda function queries the SageMaker model registry for a specific model package group provided on input to identify the latest approved model version and return related metadata. The output includes metadata created from our first pipeline:

Model package ARN
Packaged model name
S3 URI for statistics baseline
S3 URI for constraints baseline

The output is then used as input in the next step in the pipeline, which performs batch monitoring and scoring using the latest approved model.

To create and run the Lambda function as part of the SageMaker pipeline, we need to add the function as a LambdaStep in the pipeline:

lambda_getmodel_step = LambdaStep(
    name="LambdaStepGetApprovedModel",
    lambda_func=func,
    inputs={
        "model_package_group_name": model_package_group_name
     },
    outputs=[output_param_1, output_param_2,output_param_3,output_param_4,output_param_5])

Configure the data monitor and batch transform steps

After we create the Lambda step to get the latest approved model, we can create the MonitorBatchTransformStep. This native step orchestrates and manages two child tasks that are run in succession. The first task includes the Model Monitor job that runs a Processing job using a built-in container image used to monitor the batch input data and compare it against the constraints from the previously generated baseline from Pipeline 1. In addition, this step kicks off the batch transform job, which processes the input data against the latest approved model in the model registry.

This batch deployment and data quality monitoring step takes the S3 URI of the batch prediction input data on input. This is parameterized to allow for each run of the pipeline to include a new input dataset. See the following code:

transform_input_param = ParameterString(   
	name="transform_input",
    default_value=batch_prediction_data,
)

Next, we need to configure the transformer for the batch transform job that will process the batch prediction requests. In the following code, we pass in the model name that was pulled from the custom metadata of the model registry, along with other required parameters:

transformer = Transformer(
    model_name=lambda_getmodel_step.properties.Outputs["modelName"],
    instance_count=1,
    instance_type="ml.m5.xlarge",
    accept="text/csv",
    assemble_with="Line",
    output_path=batch_transform_output_path,
    sagemaker_session=pipeline_session,
)

transform_arg = transformer.transform(
    transform_input_param,
    content_type="text/csv",
    split_type="Line",
    input_filter="$[1:]",
)

The data quality monitor accepts the S3 URI of the baseline statistics and constraints for the latest approved model version from the model registry to run the data quality monitoring job during the pipeline run. This job compares the batch prediction input data with the baseline data to identify any violations signaling potential data drift. See the following code:

job_config = CheckJobConfig(role=role)
data_quality_config = DataQualityCheckConfig(
    baseline_dataset=transform_input_param,
    dataset_format=DatasetFormat.csv(header=False),
    output_s3_uri=batch_monitor_reports_output_path,
)

Next, we use MonitorBatchTransformStep to run and monitor the transform job. This step runs a batch transform job using the transformer object we configured and monitors the data passed to the transformer before running the job.

Optionally, you can configure the step to fail if a violation to data quality is found by setting the fail_on_violation flag to False.

See the following code:

from sagemaker.workflow.monitor_batch_transform_step import MonitorBatchTransformStep

transform_and_monitor_step = MonitorBatchTransformStep(
    name="MonitorCustomerChurnDataQuality",
    transform_step_args=transform_arg,
    monitor_configuration=data_quality_config,
    check_job_configuration=job_config,
    monitor_before_transform=True,
    # if violation is detected in the monitoring, you can skip it and continue running batch transform
    fail_on_violation=False,
    supplied_baseline_statistics=lambda_getmodel_step.properties.Outputs["s3uriStatistics"],
    supplied_baseline_constraints=lambda_getmodel_step.properties.Outputs["s3uriConstraints"],
)

Define, create, and start the pipeline

After we define the LambdaStep and MonitorBatchTransformStep, we can create the SageMaker pipeline.

See the following code:

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = 'sagemaker-batch-inference-monitor'

batch_monitor_pipeline = Pipeline(
    name=pipeline_name,
    parameters=[transform_input_param],
    steps=[
        lambda_getmodel_step,
        transform_and_monitor_step
    ],
)

We can now use the upsert() method, which will create or update the SageMaker pipeline with the configuration we specified:

batch_monitor_pipeline.upsert(role_arn=role)

Although there are multiple ways to start a SageMaker pipeline, when the pipeline has been created, we can run the pipeline using the start() method.

Note that in order for the LambdaStep to successfully retrieve an approved model, the model that was registered as part of Pipeline 1 needs to have an Approved status. This can be done in Studio or using Boto3. Refer to Update the Approval Status of a Model for more information.

execution = batch_monitor_pipeline.start()

To run the SageMaker pipeline on a schedule or based on an event, refer to Schedule a Pipeline with Amazon EventBridge.

Review the Model Monitor reports

Model Monitor uses a SageMaker Processing job that runs the DataQuality check using the baseline statistics and constraints. The DataQuality Processing job emits a violations report to Amazon S3 and also emits log data to Amazon CloudWatch Logs under the log group for the corresponding Processing job. Sample code for querying Amazon CloudWatch logs is provided in the notebook.

We’ve now walked you through how to create the first pipeline for model training and baselining, as well as the second pipeline for performing batch inference and model monitoring. This allows you to automate both pipelines while incorporating the different lifecycles between training and inference.

To further mature this reference pattern, you can identify a strategy for feedback loops, providing awareness and visibility of potential signals of drift across key stakeholders. At a minimum, it’s recommended to automate exception handling by filtering logs and creating alarms. These alarms may need additional analysis by a data scientist, or you can implement additional automation supporting an automatic retraining strategy using new ground truth data by integrating the model training and baselining pipeline with Amazon EventBridge. For more information, refer to Amazon EventBridge Integration.

Clean up

After you run the baseline and batch monitoring pipelines, make sure to clean up any resources that won’t be utilized, either programmatically via the SageMaker console, or through Studio. In addition, delete the data in Amazon S3, and make sure to stop any Studio notebook instances to not incur any further charges.

Conclusion

In this post, you learned how to create a solution for a batch model that is trained less frequently than batch inference is performed against that trained model using SageMaker MLOps features, including Pipelines, the model registry, and Model Monitor. To expand this solution, you could incorporate this into a custom SageMaker project that also incorporates CI/CD and automated triggers using standardized MLOps templates. To dive deeper into the solution and code shown in this demo, check out the GitHub repo. Also, refer to Amazon SageMaker for MLOps for examples related to implementing MLOps practices with SageMaker.

About the Authors

Shelbee Eigenbrode is a Principal AI and Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has experience in designs and solutions for machine learning, business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.

Marc Karp is a ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Improved ML model deployment using Amazon SageMaker Inference Recommender

April 20, 2023

by Shiva Raaj Kotini Amazon AWS

Each machine learning (ML) system has a unique service level agreement (SLA) requirement with respect to latency, throughput, and cost metrics. With advancements in hardware design, a wide range of CPU- and GPU-based infrastructures are available to help you speed up inference performance. Also, you can build these ML systems with a combination of ML models, tasks, frameworks, libraries, tools, and inference engines, making it important to evaluate the ML system performance for the best possible deployment configurations. You need recommendations on finding the most cost-effective ML serving infrastructure and the right combination of software configuration to achieve the best price-performance to scale these applications.

Amazon SageMaker Inference Recommender is a capability of Amazon SageMaker that reduces the time required to get ML models in production by automating load testing and model tuning across SageMaker ML instances. In this post, we highlight some of the recent updates to Inference Recommender:

SageMaker Python SDK support for running Inference Recommender
Inference Recommender usability improvements
New APIs that provide flexibility in running Inference Recommender
Deeper integration with Amazon CloudWatch for logging and metrics

Credit card fraud detection use case

Any fraudulent activity that is not detected and mitigated immediately can cause significant financial loss. Particularly, credit card payment fraud transactions need to be identified right away to protect the individual’s and company’s financial health. In this post, we discuss a credit card fraud detection use case, and learn how to use Inference Recommender to find the optimal inference instance type and ML system configurations that can detect fraudulent credit card transactions in milliseconds.

We demonstrate how to set up Inference Recommender jobs for a credit card fraud detection use case. We train an XGBoost model for a classification task on a credit card fraud dataset. We use Inference Recommender with a custom load to meet inference SLA requirements to satisfy peak concurrency of 30,000 transactions per minute while serving predictions results in less than 100 milliseconds. Based on Inference Recommender’s instance type recommendations, we can find the right real-time serving ML instances that yield the right price-performance for this use case. Finally, we deploy the model to a SageMaker real-time endpoint to get prediction results.

The following table summarizes the details of our use case.

Model Framework	XGBoost
Model Size	10 MB
End-to-End Latency	100 milliseconds
Invocations per Second	500 (30,000 per minute)
ML Task	Binary Classification
Input Payload	10 KB

We use a synthetically created credit card fraud dataset. The dataset contains 28 numerical features, time of the transaction, transaction amount, and class target variables. The class column corresponds to whether or not a transaction is fraudulent. The majority of data is non-fraudulent (284,315 samples), with only 492 samples corresponding to fraudulent examples. In the data, Class is the target classification variable (fraudulent vs. non-fraudulent) in the first column, followed by other variables.

In the following sections, we show how to use Inference Recommender to get ML hosting instance type recommendations and find optimal model configurations to achieve better price-performance for your inference application.

Which ML instance type and configurations should you select?

With Inference Recommender, you can run two types of jobs: default and advanced.

The default Instance Recommender job runs a set of load tests to recommended the right ML instance types for any ML use case. SageMaker real-time deployment supports a wide range of ML instances to host and serve the credit card fraud detection XGBoost model. The default job can run a load test on a selection of instances that you provide in the job configuration. If you have an existing endpoint for this use case, you can run this job to find the cost-optimized performant instance type. Inference Recommender will compile and optimize the model for a specific hardware of inference endpoint instance type using Amazon SageMaker Neo. It’s important to note that not all compilation results in improved performance. Inference Recommender will report compilation details when the following conditions are met:

Successful compilation of the model using Neo. There could be issues in the compilation process such as invalid payload, data type, or more. In this case, compilation information is not available.
Successful inference using the compiled model that shows performance improvement, which appears in the inference job response.

An advanced job is a custom load test job that allows you to perform extensive benchmarks based on your ML application SLA requirements, such as latency, concurrency, and traffic pattern. You can configure a custom traffic pattern to simulate credit card transactions. Additionally, you can define the end-to-end model latency to predict if a transaction is fraudulent and define the maximum concurrent transactions to the model for prediction. Inference Recommender uses this information to run a performance benchmark load test. The latency, concurrency, and cost metrics from the advanced job help you make informed decisions about the ML serving infrastructure for mission-critical applications.

Solution overview

The following diagram shows the solution architecture for training an XGBoost model on the credit card fraud dataset, running a default job for instance type recommendation, and performing load testing to decide the optimal inference configuration for the best price-performance.

The diagram shows the following steps:

Train an XGBoost model to classify credit card transactions as fraudulent or legit. Deploy the trained model to a SageMaker real-time endpoint. Package the model artifacts and sample payload (.tar.gz format), and upload them to Amazon Simple Storage Service (Amazon S3) so Inference Recommender can use these when the job is run. Note that the training step in this post is optional.
Configure and run a default Inference Recommender job on a list of supported instance types to find the right ML instance type that gives the best price-performance for this use case.
Optionally, run a default Inference Recommender job on an existing endpoint.
Configure and run an advanced Inference Recommender job to perform a custom load test to simulate user interactions with the credit card fraud detection application. This helps you find the right configurations to satisfy latency, concurrency, and cost for this use case.
Analyze the default and advanced Inference Recommender job results, which include ML instance type recommendation latency, performance, and cost metrics.

A complete example is available in our GitHub notebook.

Prerequisites

To use Inference Recommender, make sure to meet the prerequisites.

Python SDK support for Inference Recommender

We recently released Python SDK support for Inference Recommender. You can now run default and advanced jobs using a single function: right_size. Based on the parameters of the function call, Inference Recommender infers if it should run default or advanced jobs. This greatly simplifies the use of Inference Recommender using the Python SDK. To run the Inference Recommender job, complete the following steps:

Create a SageMaker model by specifying the framework, version, and image scope:

model = Model(
    model_data=model_url,
    role=role,
    image_uri = sagemaker.image_uris.retrieve(framework="xgboost", 
    region=region, 
    version="1.5-1", 
    py_version="py3", 
    image_scope='inference'),
    sagemaker_session=sagemaker_session
    )

Optionally, register the model in the SageMaker model registry. Note that parameters such as domain and task during model package creation are also optional parameters in the recent release.

model_package = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name=model_package_group_name,
    image_uri=model.image_uri,
    approval_status="Approved",
    framework="XGBOOST"
)

Run the right_size function on the supported ML inference instance types using the following configuration. Because XGBoost is a memory-intensive algorithm, we provide ml.m5 type instances to get instance type recommendations. You can call the right_size function on the model registry object as well.

model.right_size(
    sample_payload_url=sample_payload_url,
    supported_content_types=["text/csv"],
    supported_instance_types=["ml.m5.large", 
                              "ml.m5.xlarge", 
                              "ml.m5.2xlarge", 
                              "ml.m5.4xlarge", 
                              "ml.m5.12xlarge"],
    framework="XGBOOST",
    job_name="credit-card-fraud-default-job"
)
INFO:sagemaker:Advance Job parameters were not specified. Running Default job...

Define additional parameters to the right_size function to run an advanced job and custom load test on the model:
1. Configure the traffic pattern using the phases parameter. In the first phase, we start the load test with two initial users and create two new users for every minute for 2 minutes. In the following phase, we start the load test with six initial users and create two new users for every minute for 2 minutes. Stopping conditions for the load tests are p95 end-to-end latency of 100 milliseconds and concurrency to support 30,000 transactions per minute or 500 transactions per second.
2. We tune the endpoint against the environment variable OMP_NUM_THREADS with values [3,4,5] and we aim to limit the latency requirement to 100 milliseconds and achieve max concurrency of 30,000 invocations per minute. The goal is to find which value for OMP_NUM_THREADS provides the best performance.

from sagemaker.parameter import CategoricalParameter 
from sagemaker.inference_recommender.inference_recommender_mixin import (  
    Phase,  
    ModelLatencyThreshold 
) 
hyperparameter_ranges = [ 
    { 
        "instance_types": CategoricalParameter(["ml.m5.4xlarge"]), 
        "OMP_NUM_THREADS": CategoricalParameter(["3", "4", "6"]),
    } 
] 
phases = [ 
    Phase(duration_in_seconds=120, initial_number_of_users=2, spawn_rate=2), 
    Phase(duration_in_seconds=120, initial_number_of_users=6, spawn_rate=2) 
] 
model_latency_thresholds = [ 
    ModelLatencyThreshold(percentile="P95", value_in_milliseconds=100) 
]

model.right_size( 
    sample_payload_url=sample_payload_url, 
    supported_content_types=["text/csv"], 
    framework="XGBOOST", 
    job_duration_in_seconds=7200, 
    hyperparameter_ranges=hyperparameter_ranges, 
    phases=phases, # TrafficPattern 
    max_invocations=30000, # StoppingConditions 
    model_latency_thresholds=model_latency_thresholds,
    job_name="credit-card-fraud-advanced-job"
)
INFO:sagemaker:Advance Job parameters were specified. Running Advanced job...

Run Inference Recommender jobs using the Boto3 API

You can use the Boto3 API to launch Inference Recommender default and advanced jobs. You need to use the Boto3 API (create_inference_recommendations_job) to run Inference Recommender jobs on an existing endpoint. Inference Recommender infers the framework and version from the existing SageMaker real-time endpoint. The Python SDK doesn’t support running Inference Recommender jobs on existing endpoints.

The following code snippet shows how to create a default job:

sagemaker_client.create_inference_recommendations_job(
    JobName = "credit-card-fraud-default-job",
    JobType = 'Default',
    RoleArn = <ROLE_ARN>,
    InputConfig = {
        'ModelPackageVersionArn': <MODEL_PACKAGE_ARN>, #optional
        'Endpoints': ['EndpointName': <ENDPOINT_POINT>]
    }
)

Later in this post, we discuss the parameters needed to configure an advanced job.

Configure a traffic pattern using the TrafficPattern parameter. In the first phase, we start a load test with two initial users (InitialNumberOfUsers) and create two new users (SpawnRate) for every minute for 2 minutes (DurationInSeconds). In the following phase, we start the load test with six initial users and create two new users for every minute for 2 minutes. Stopping conditions (StoppingConditions) for the load tests are p95 end-to-end latency (ModelLatencyThresholds) of 100 milliseconds (ValueInMilliseconds) and concurrency to support 30,000 transactions per minute or 500 transactions per second (MaxInvocations). See the following code:

env_parameter_ranges = [{"Name": "OMP_NUM_THREADS", "Value": ["3", "4", "5"]}]

sagemaker_client.create_inference_recommendations_job(JobName=load_test_job_name,
        JobType='Advanced', RoleArn=role_arn, InputConfig={
    'ModelPackageVersionArn': model_package_arn, #optional
    'JobDurationInSeconds': 7200,
    'TrafficPattern': {'TrafficType': 'PHASES',
                       'Phases': [
                       {'InitialNumberOfUsers': 2,
                       'SpawnRate': 2, 
                       'DurationInSeconds': 120
                       },
                       {'InitialNumberOfUsers': 6, 
                       'SpawnRate': 6,
                       'DurationInSeconds': 120
                       }]},
    'ResourceLimit': {'MaxNumberOfTests': 10, 'MaxParallelOfTests': 3},
    'EndpointConfigurations': [{'InstanceType': 'ml.m5.4xlarge'
                                'EnvironmentParameterRanges': 
                                {'CategoricalParameterRanges': env_parameter_ranges}
                                }],
    }, StoppingConditions={'MaxInvocations': 30000,
                           'ModelLatencyThresholds': 
                           [{'Percentile': 'P95',
                            'ValueInMilliseconds': 100
                           }]})

Inference Recommender job results and metrics

The results of the default Inference Recommender job contain a list of endpoint configuration recommendations, including instance type, instance count, and environment variables. The results contain configurations for SAGEMAKER_MODEL_SERVER_WORKERS and OMP_NUM_THREADS associated with the latency, concurrency, and throughput metrics. OMP_NUM_THREADS is the model server tunable environment parameter. As shown in the details in the following table, with an ml.m5.4xlarge instance with SAGEMAKER_MODEL_SERVER_WORKERS=3 and OMP_NUM_THREADS=3, we got a throughput of 32,628 invocations per minute and model latency under 10 milliseconds. ml.m5.4xlarge had 100% improvement in latency, an approximate 115% increase in concurrency compared to the ml.m5.xlarge instance configuration. Also, it was 66% more cost-effective compared to the ml.m5.12xlarge instance configurations while achieving comparable latency and throughput.

Instance Type	Initial Instance Count	OMP_NUM_THREADS	Cost Per Hour	Max Invocations	Model Latency	CPU Utilization	Memory Utilization	SageMaker Model Server Workers
ml.m5.xlarge	1	2	0.23	15189	18	108.864	1.62012	1
ml.m5.4xlarge	1	3	0.922	32628	9	220.57001	0.69791	3
ml.m5.large	1	2	0.115	13793	19	106.34	3.24398	1
ml.m5.12xlarge	1	4	2.765	32016	4	215.32401	0.44658	7
ml.m5.2xlarge	1	2	0.461	32427	13	248.673	1.43109	3

We have included CloudWatch helper functions in the notebook. You can use the functions to get detailed charts of your endpoints during the load test. The charts have details on invocation metrics like invocations, model latency, overhead latency, and more, and instance metrics such as CPUUtilization and MemoryUtilization. The following example shows the CloudWatch metrics for our ml.m5.4xlarge model configuration.

You can visualize Inference Recommender job results in Amazon SageMaker Studio by choosing Inference Recommender under Deployments in the navigation pane. With a deployment goal for this use case (high latency, high throughput, default cost), the default Inference Recommender job recommended an ml.m5.4xlarge instance because it provided the best latency performance and throughput to support a maximum 34,600 invocations per minute (576 TPS). You can use these metrics to analyze and find the best configurations that satisfy latency, concurrency, and cost requirements of your ML application.

We recently introduced ListInferenceRecommendationsJobSteps, which allows you to analyze subtasks in an Inference Recommender job. The following code snippet shows how to use the list_inference_recommendations_job_steps Boto3 API to get the list of subtasks. This can help with debugging Inference Recommender job failures at the step level. This functionality is not supported in the Python SDK yet.

sm_client = boto3.client("sagemaker", region_name=region)
list_job_steps_response = sm_client.list_inference_recommendations_job_steps
                          (JobName='<JOB_NAME>')
print(list_job_steps_response)

The following code shows the response:

{
    "Steps": [
        {
            "StepType": "BENCHMARK",
            "JobName": "SMPYTHONSDK-<JOB_NAME>",
            "Status": "COMPLETED",
            "InferenceBenchmark": {
                "Metrics": {
                    "CostPerHour": 1.8359999656677246,
                    "CostPerInference": 1.6814110495033674e-06,
                    "MaxInvocations": 18199,
                    "ModelLatency": 40,
                    "CpuUtilization": 106.06400299072266,
                    "MemoryUtilization": 0.3920480012893677
                },
                "EndpointConfiguration": {
                    "EndpointName": "sm-epc-<ENDPOINTNAME>",
                    "VariantName": "sm-epc-<VARIANTNAME>",
                    "InstanceType": "ml.c5.9xlarge",
                    "InitialInstanceCount": 1
                },
                "ModelConfiguration": {
                    "EnvironmentParameters": [
                        {
                            "Key": "SAGEMAKER_MODEL_SERVER_WORKERS",
                            "ValueType": "String",
                            "Value": "1"
                        },
                        {
                            "Key": "OMP_NUM_THREADS",
                            "ValueType": "String",
                            "Value": "28"
                        }
                    ]
                }
            }
        },
     ...... <TRUNCATED>
    "ResponseMetadata": {
        "RequestId": "<RequestId>",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "<x-amzn-requestid>",
            "content-type": "application/x-amz-json-1.1",
            "content-length": "1443",
            "date": "Mon, 20 Feb 2023 16:53:30 GMT"
        },
        "RetryAttempts": 0
    }
}

Run an advanced Inference Recommender job

Next, we run an advanced Inference Recommender job to find optimal configurations such as SAGEMAKER_MODEL_SERVER_WORKERS and OMP_NUM_THREADS on an ml.m5.4xlarge instance type. We set the hyperparameters of the advanced job to run a load test on different combinations:

hyperparameter_ranges = [ 
    { 
        "instance_types": CategoricalParameter(["ml.m5.4xlarge"]), 
        "OMP_NUM_THREADS": CategoricalParameter(["3", "4", "6"]),
    } 
]

You can view the advanced Inference Recommender job results on the Studio console, as shown in the following screenshot.

Using the Boto3 API or CLI commands, you can access all the metrics from the advanced Inference Recommender job results. InitialInstanceCount is the number of instances that you should provision in the endpoint to meet ModelLatencyThresholds and MaxInvocations mentioned in StoppingConditions. The following table summarizes our results.

Instance Type	Initial Instance Count	OMP_NUM_THREADS	Cost Per Hour	Max Invocations	Model Latency	CPU Utilization	Memory Utilization
ml.m5.2xlarge	2	3	0.922	39688	6	86.732803	3.04769
ml.m5.2xlarge	2	4	0.922	42604	6	177.164993	3.05089
ml.m5.2xlarge	2	5	0.922	39268	6	125.402	3.08665
ml.m5.4xlarge	2	3	1.844	38174	4	102.546997	2.68003
ml.m5.4xlarge	2	4	1.844	39452	4	141.826004	2.68136
ml.m5.4xlarge	2	5	1.844	40472	4	107.825996	2.70936

Clean up

Follow the instructions in the notebook to delete all the resources created as part of this post to avoid incurring additional charges.

Summary

Finding the right ML serving infrastructure, including instance type, model configurations, and auto scaling polices, can be tedious. This post showed how you can use the Inference Recommender Python SDK and Boto3 APIs to launch default and advanced jobs to find the optimal inference infrastructure and configurations. We also discussed the new improvements to Inference Recommender, including Python SDK support and usability improvements. Check out our GitHub repository to get started.

About the Authors

Shiva Raaj Kotini works as a Principal Product Manager in the AWS SageMaker inference product portfolio. He focuses on model deployment, performance tuning, and optimization in SageMaker for inference.

John Barboza is a Software Engineer at AWS. He has extensive experience working on distributed systems. His current focus is on improving the SageMaker inference experience. In his spare time, he enjoys cooking and biking.

Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 10 years and has worked on various AWS services like Amazon EMR, Amazon EFA, and Amazon RDS. Currently, he is focused on improving the SageMaker inference experience. In his spare time, he enjoys hiking and marathons.

Ram Vegiraju is an ML Architect with the SageMaker service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Vikram Elango is an Sr. AIML Specialist Solutions Architect at AWS, based in Virginia USA. He is currently focused on Generative AI, LLMs, prompt engineering, large model inference optimization and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Recent honors and awards for Amazon scientists

April 20, 2023

by Amazon AWS

Researchers honored for their contributions to the scientific community.Read More

Amazon Comprehend document classifier adds layout support for higher accuracy

April 19, 2023

by Anjan Biswas Amazon AWS

The ability to effectively handle and process enormous amounts of documents has become essential for enterprises in the modern world. Due to the continuous influx of information that all enterprises deal with, manually classifying documents is no longer a viable option. Document classification models can automate the procedure and help organizations save time and resources. Traditional categorization techniques, such as manual processing and keyword-based searches, become less efficient and more time-consuming as the volume of documents increases. This inefficiency causes lower productivity and higher operating expenses. Additionally, it can prevent crucial information from being accessible when needed, which could lead to a poor customer experience and impact decision-making. At AWS re:Invent 2022, Amazon Comprehend, a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text, launched support for native document types. This new feature gave you the ability to classify documents in native formats (PDF, TIFF, JPG, PNG, DOCX) using Amazon Comprehend.

Today, we are excited to announce that Amazon Comprehend now supports custom classification model training with documents like PDF, Word, and image formats. You can now train bespoke document classification models on native documents that support layout in addition to text, increasing the accuracy of the results.

In this post, we provide an overview of how you can get started with training an Amazon Comprehend custom document classification model.

Overview

The capacity to understand the relative placements of objects within a defined space is referred to as layout awareness. In this case, it aids the model in understanding how headers, subheadings, tables, and graphics relate to one another inside a document. The model can more effectively categorize a document based on its content when it’s aware of the structure and layout of the text.

In this post, we walk through the data preparation steps involved, demonstrate the model training process, and discuss the benefits of using the new custom document classification model in Amazon Comprehend. As a best practice, you should consider the following points before you begin training the custom document classification model.

Evaluate your document classification needs

Identify the various types of documents they you may need to classify, along with the different classes or categories to support your use case. Determine the suitable classification structure or taxonomy after evaluating the amount and types of documents that need to be categorized. Document types may vary from PDF, Word, images, and so on. Ensure you have authorized access to a diverse set of labeled documents either via a document management system or other storage mechanisms.

Prepare your data

Ensure that the document files you intend to use for model training aren’t encrypted or locked—for example, make sure that your PDF files aren’t encrypted and locked with a password. You must decrypt such files before you can use them for training purposes. Label a sample of your documents with the appropriate categories or labels (classes). Determine whether single-label classification (multi-class mode) or multi-label classification is appropriate for your use case. Multi-class mode associates only a single class with each document, whereas multi-label mode associates one or more class with a document.

Consider model evaluation

Use the labeled dataset to train the model so it can learn to classify new documents accurately and evaluate how the newly trained model version performs by understanding the model metrics. To understand the metrics provided by Amazon Comprehend post-model training, refer to Custom classifier metrics. After the training process is complete, you can begin classifying documents asynchronously or in real time. We walk through how to train a custom classification model in the following sections.

Prepare the training data

Before we train our custom classification model, we need to prepare the training data. Training data is comprised of a set of labeled documents, which can be pre-identified documents from a document repository that you already have access to. For our example, we trained a custom classification model with a few different document types that are typically found in a health insurance claim adjudication process: patient discharge summary, invoices, receipts, and so on. We also need to prepare an annotations file in CSV format. Following is an example of an annotations file CSV data required for the training:

 discharge_summary,summary-1.pdf,1
 discharge_summary,summary-2.pdf,1
 invoice,invoice-1.pdf,1
 invoice,invoice-1.pdf,2
 invoice,invoice-2.pdf,1

The annotations CSV file must contain three columns. The first column contains the desired class (label) for the document, the second column is the document name (file name), and the last column is the page number of the document that you want to include in the training dataset. Because the training process supports native multi-page PDF and DOCX files, you must specify the page number in case the document is a multi-page document. If you want to include all pages of a multi-page document in the training dataset, you must specify each page as a separate line in the CSV annotations file. For example, in the preceding annotations file, invoice-1.pdf is a two-page document, and we want to include both pages in the classification dataset. Because files like PDF, PNG, and TIFF are image formats, the page number (third column) value must always be 1. If your dataset contains multi-frame (multi-page) TIF files, you must split them into separate TIF files in order to use them in the training process.

We prepared an annotations file called test.csv with the appropriate data to train a custom classification model. For each sample document, the CSV file contains the class that document belongs to, the location of the document in Amazon Simple Storage Service (Amazon S3), such as path/to/prefix/document.pdf, and the page number (if applicable). Because most of our documents are either single-page DOCX, PDF files, or TIF, JPG, or PNG files, the page number assigned is 1. Because our annotations CSV and sample documents are all under the same Amazon S3 prefix, we don’t need to explicitly specify the prefix in the second column. We also prepare at least 10 document samples or more for each class, and we used a mix of JPG, PNG, DOCX, PDF, and TIF files for training the model. Note that it’s usually recommended to have a diverse set of sample documents for model training to avoid overfitting of the model, which impacts its ability to recognize new documents. It’s also recommended that the number of samples per class is balanced, although it’s not required to have an exact same number of samples per class. Next, we upload the test.csv annotations file and all the documents into Amazon S3. The following image shows part of our annotations CSV file.

Train a custom classification model

Now that we have the annotations file and all our sample documents ready, we set up a custom classification model and train it. Before you begin setting up custom classification model training, make sure that the annotations CSV and sample documents exist in an Amazon S3 location.

On the Amazon Comprehend console, choose Custom classification in the navigation pane.
Choose Create new model.
For Model name, enter a unique name.
For Version name, enter a unique version name.
For Training model type, select Native documents.

This tells Amazon Comprehend that you intend to use native document types to train the model instead of serialized text.

For Classifier mode, select Using single-label mode.

This mode tells the classifier that we intend to classify documents into a single class. If you need to train a model with multi-label mode, meaning a document may belong to one or more than one class, you must set up the annotations file appropriately by specifying the classes of the document separated by a special character in the annotations CSV file. In that case, you would select the Using multi-label mode option.

For Annotation location on S3, enter the path of the annotations CSV file.
For Training data location on S3, enter the Amazon S3 location where your documents reside.
Leave all other options as default in this section.
In the Output data section, specify an Amazon S3 location for your output.

This is optional, but it’s a good practice to provide an output location because Amazon Comprehend will generate the post-model training evaluation metrics in this location. This data is useful to evaluate model performance, iterate, and improve the accuracy of your model.

In the IAM role section, choose an appropriate AWS Identity and Access Management (IAM) role that allows Amazon Comprehend to access the Amazon S3 location and write and read from it.
Choose Create to initiate the model training.

The model may take several minutes to train, depending on the number of classes and the dataset size. You can review the training status on the Custom classification page. The training process will display a Submitted status right after the training process starts and will change to Training status when the training process begins. After your model is trained, the Version status will change to Trained. If Amazon Comprehend finds inconsistencies in your training data, the status will show In error along with an alert that shows the appropriate error message so that you can take corrective action and restart the training process with the corrected data.

In this post, we demonstrated the steps to train a custom classifier model using the Amazon Comprehend console. You can also use the AWS SDK in any language (for example, Boto3 for Python) or the AWS Command Line Interface (AWS CLI) to initiate a custom classification model training. With either the SDK or AWS CLI, you can use the CreateDocumentClassifier API to initiate the model training, and subsequently use the DescribeDocumentClassifier API to check the status of the model.

After the model is trained, you can perform either real-time analysis or asynchronous (batch) analysis jobs on new documents. To perform real-time classification on documents, you must deploy an Amazon Comprehend real-time endpoint with the trained custom classification model. Real-time endpoints are best suited for use cases that require low-latency, real-time inference results, whereas for classifying a large set of documents, an asynchronous analysis job is more appropriate. To learn how you can perform asynchronous inference on new documents using a trained classification model, refer to Introducing one-step classification and entity recognition with Amazon Comprehend for intelligent document processing.

Benefits of the layout-aware custom classification model

The new classifier model offers a number of improvements. It’s not only easier to train the new model, but you can also train a new model with just a few samples for each class. Additionally, you no longer have to extract serialized plain text out of scanned or digital documents such as images or PDFs to prepare the training dataset. The following are some additional noteworthy improvements that you can expect from the new classification model:

Improved accuracy – The model now takes into account the layout and structure of documents, which leads to a better understanding of the structure and content of the documents. This helps distinguish between documents with similar text but different layouts or structures, resulting in increased classification accuracy.
Robustness – The model now handles variations in document structure and formatting. This makes it better suited for classifying documents from different sources with varying layouts or formatting styles, which is a common challenge in real-world document classification tasks. It’s compatible with several document types natively, making it versatile and applicable to different industries and use cases.
Reduced manual intervention – Higher accuracy leads to less manual intervention in the classification process. This can save time and resources, and increase operational efficiency in your document processing workload.

Conclusion

The new Amazon Comprehend document classification model, which incorporates layout awareness, is a game-changer for businesses dealing with large volumes of documents. By understanding the structure and layout of documents, this model offers improved classification accuracy and efficiency. Implementing a robust and accurate document classification solution using a layout-aware model can help your business save time, reduce operational costs, and enhance decision-making processes.

As a next step, we encourage you to try the new Amazon Comprehend custom classification model via the Amazon Comprehend console. We also recommend revisiting our custom classification model improvement announcements from last year and visit the GitHub repository for code samples.

About the authors

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Godwin Sahayaraj Vincent is an Enterprise Solutions Architect at AWS who is passionate about Machine Learning and providing guidance to customers to design, deploy and manage their AWS workloads and architectures. In his spare time, he loves to play cricket with his friends and tennis with his three kids.

Use streaming ingestion with Amazon SageMaker Feature Store and Amazon MSK to make ML-backed decisions in near-real time

April 19, 2023

by Mark Roy Amazon AWS

Businesses are increasingly using machine learning (ML) to make near-real-time decisions, such as placing an ad, assigning a driver, recommending a product, or even dynamically pricing products and services. ML models make predictions given a set of input data known as features, and data scientists easily spend more than 60% of their time designing and building these features. Furthermore, highly accurate predictions depend on timely access to feature values that change quickly over time, adding even more complexity to the job of building a highly available and accurate solution. For example, a model for a ride-sharing app can choose the best price for a ride from the airport, but only if it knows the number of ride requests received in the past 10 minutes and the number of passengers projected to land in the next 10 minutes. A routing model in a call center app can pick the best available agent for an incoming call, but it’s only effective if it knows the customer’s latest web session clicks.

Although the business value of near-real-time ML predictions is enormous, the architecture required to deliver them reliably, securely, and with good performance is complicated. Solutions need high-throughput updates and low-latency retrieval of the most recent feature values in milliseconds, something most data scientists aren’t prepared to deliver. As a result, some enterprises have spent millions of dollars inventing their own proprietary infrastructure for feature management. Other firms have limited their ML applications to simpler patterns like batch scoring until ML vendors provide more comprehensive off-the-shelf solutions for online feature stores.

To address these challenges, Amazon SageMaker Feature Store provides a fully managed central repository for ML features, making it easy to securely store and retrieve features without having to build and maintain your own infrastructure. Feature Store lets you define groups of features, use batch ingestion and streaming ingestion, retrieve the latest feature values with single-digit millisecond latency for highly accurate online predictions, and extract point-in-time correct datasets for training. Instead of building and maintaining these infrastructure capabilities, you get a fully managed service that scales as your data grows, enables sharing features across teams, and lets your data scientists focus on building great ML models aimed at game-changing business use cases. Teams can now deliver robust features once and reuse them many times in a variety of models that may be built by different teams.

This post walks through a complete example of how you can couple streaming feature engineering with Feature Store to make ML-backed decisions in near-real time. We show a credit card fraud detection use case that updates aggregate features from a live stream of transactions and uses low-latency feature retrievals to help detect fraudulent transactions. Try it out for yourself by visiting our GitHub repo.

Credit card fraud use case

Stolen credit card numbers can be bought in bulk on the dark web from previous leaks or hacks of organizations that store this sensitive data. Fraudsters buy these card lists and attempt to make as many transactions as possible with the stolen numbers until the card is blocked. These fraud attacks typically happen in a short time frame, and this can be easily spotted in historical transactions because the velocity of transactions during the attack differs significantly from the cardholder’s usual spending pattern.

The following table shows a sequence of transactions from one credit card where the cardholder first has a genuine spending pattern, and then experiences a fraud attack starting on November 4.

cc_num	trans_time	amount	fraud_label
…1248	Nov-01 14:50:01	10.15	0
… 1248	Nov-02 12:14:31	32.45	0
… 1248	Nov-02 16:23:12	3.12	0
… 1248	Nov-04 02:12:10	1.01	1
… 1248	Nov-04 02:13:34	22.55	1
… 1248	Nov-04 02:14:05	90.55	1
… 1248	Nov-04 02:15:10	60.75	1
… 1248	Nov-04 13:30:55	12.75	0

For this post, we train an ML model to spot this kind of behavior by engineering features that describe an individual card’s spending pattern, such as the number of transactions or the average transaction amount from that card in a certain time window. This model protects cardholders from fraud at the point of sale by detecting and blocking suspicious transactions before the payment can complete. The model makes predictions in a low-latency, real-time context and relies on receiving up-to-the-minute feature calculations so it can respond to an ongoing fraud attack. In a real-world scenario, features related to cardholder spending patterns would only form part of the model’s feature set, and we can include information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

Because our use case relies on profiling an individual card’s spending patterns, it’s crucial that we can identify credit cards in a transaction stream. Most publicly available fraud detection datasets don’t provide this information, so we use the Python Faker library to generate a set of transactions covering a 5-month period. This dataset contains 5.4 million transactions spread across 10,000 unique (and fake) credit card numbers, and is intentionally imbalanced to match the reality of credit card fraud (only 0.25% of the transactions are fraudulent). We vary the number of transactions per day per card, as well as the transaction amounts. See our GitHub repo for more details.

Overview of the solution

We want our fraud detection model to classify credit card transactions by noticing a burst of recent transactions that differs significantly from the cardholder’s usual spending pattern. Sounds simple enough, but how do we build it?

The following diagram shows our overall solution architecture. We feel that this same pattern will work well for a variety of streaming aggregation use cases. At a high level, the pattern involves the following five pieces:

Feature store – We use Feature Store to provide a repository of features with high-throughput writes and secure low-latency reads, using feature values that are organized into multiple feature groups.
Batch ingestion – Batch ingestion takes labeled historical credit card transactions and creates the aggregate features and ratios needed for training the fraud detection model. We use an Amazon SageMaker Processing job and the built-in Spark container to calculate aggregate weekly counts and transaction amount averages and ingest them into the feature store for use in online inference.
Model training and deployment – This aspect of our solution is straightforward. We use Amazon SageMaker to train a model using the built-in XGBoost algorithm on aggregated features created from historical transactions. The model is deployed to a SageMaker endpoint, where it handles fraud detection requests on live transactions.
Streaming ingestion – An Amazon Kinesis Data Analytics for Apache Flink application backed by Apache Kafka topics in Amazon Managed Streaming for Apache Kafka (MSK) (Amazon MSK) calculates aggregated features from a transaction stream, and an AWS Lambda function updates the online feature store. Apache Flink is a popular framework and engine for processing data streams.
Streaming predictions – Lastly, we make fraud predictions on a stream of transactions, using Lambda to pull aggregate features from the online feature store. We use the latest feature data to calculate transaction ratios and then call the fraud detection endpoint.

Prerequisites

We provide an AWS CloudFormation template to create the prerequisite resources for this solution. The following table lists the stacks available for different Regions.

AWS Region	Link
us-east-1
us-east-2
us-west-1
eu-west-1
ap-northeast-1

In the following sections, we explore each component of our solution in more detail.

Feature store

ML models rely on well-engineered features coming from a variety of data sources, with transformations as simple as calculations or as complicated as a multi-step pipeline that takes hours of compute time and complex coding. Feature Store enables the reuse of these features across teams and models, which improves data scientist productivity, speeds up time to market, and ensures consistency of model input.

Each feature inside Feature Store is organized into a logical grouping called a feature group. You decide which feature groups you need for your models. Each one can have dozens, hundreds, or even thousands of features. Feature groups are managed and scaled independently, but they’re all available for search and discovery across teams of data scientists responsible for many independent ML models and use cases.

ML models often require features from multiple feature groups. A key aspect of a feature group is how often its feature values need to be updated or materialized for downstream training or inference. You refresh some features hourly, nightly, or weekly, and a subset of features must be streamed to the feature store in near-real time. Streaming all feature updates would lead to unnecessary complexity, and could even lower the quality of data distributions by not giving you the chance to remove outliers.

In our use case, we create a feature group called cc-agg-batch-fg for aggregated credit card features updated in batch, and one called cc-agg-fg for streaming features.

The cc-agg-batch-fg feature group is updated nightly, and provides aggregate features looking back over a 1-week time window. Recalculating 1-week aggregations on streaming transactions don’t offer meaningful signals, and would be a waste of resources.

Conversely, our cc-agg-fg feature group must be updated in a streaming fashion, because it offers the latest transaction counts and average transaction amounts looking back over a 10-minute time window. Without streaming aggregation, we couldn’t spot the typical fraud attack pattern of a rapid sequence of purchases.

By isolating features that are recalculated nightly, we can improve ingestion throughput for our streaming features. Separation lets us optimize the ingestion for each group independently. When designing for your use cases, keep in mind that models requiring features from a large number of feature groups may want to make multiple retrievals from the feature store in parallel to avoid adding excessive latency to a real-time prediction workflow.

The feature groups for our use case are shown in the following table.

cc-agg-fg	cc-agg-batch-fg
cc_num (record id)	cc_num (record id)
trans_time	trans_time
num_trans_last_10m	num_trans_last_1w
avg_amt_last_10m	avg_amt_last_1w

Each feature group must have one feature used as a record identifier (for this post, the credit card number). The record identifier acts as a primary key for the feature group, enabling fast lookups as well as joins across feature groups. An event time feature is also required, which enables the feature store to track the history of feature values over time. This becomes important when looking back at the state of features at a specific point in time.

In each feature group, we track the number of transactions per unique credit card and its average transaction amount. The only difference between our two groups is the time window used for aggregation. We use a 10-minute window for streaming aggregation, and a 1-week window for batch aggregation.

With Feature Store, you have the flexibility to create feature groups that are offline only, online only, or both online and offline. An online store provides high-throughput writes and low-latency retrievals of feature values, which is ideal for online inference. An offline store is provided using Amazon Simple Storage Service (Amazon S3), giving firms a highly scalable repository, with a full history of feature values, partitioned by feature group. The offline store is ideal for training and batch scoring use cases.

When you enable a feature group to provide both online and offline stores, SageMaker automatically synchronizes feature values to an offline store, continuously appending the latest values to give you a full history of values over time. Another benefit of feature groups that are both online and offline is that they help avoid the problem of training and inference skew. SageMaker lets you feed both training and inference with the same transformed feature values, ensuring consistency to drive more accurate predictions. The focus in our post is to demonstrate online feature streaming, so we implemented online-only feature groups.

Batch ingestion

To materialize our batch features, we create a feature pipeline that runs as a SageMaker Processing job on a nightly basis. The job has two responsibilities: producing the dataset for training our model, and populating the batch feature group with the most up-to-date values for aggregate 1-week features, as shown in the following diagram.

Each historical transaction used in the training set is enriched with aggregated features for the specific credit card involved in the transaction. We look back over two separate sliding time windows: 1 week back, and the preceding 10 minutes. The actual features used to train the model include the following ratios of these aggregated values:

amt_ratio1 =avg_amt_last_10m / avg_amt_last_1w
amt_ratio2 =transaction_amount / avg_amt_last_1w
count_ratio =num_trans_last_10m / num_trans_last_1w

For example, count_ratio is the transaction count from the prior 10 minutes divided by the transaction count from the last week.

Our ML model can learn patterns of normal activity vs. fraudulent activity from these ratios, rather than relying on raw counts and transaction amounts. Spending patterns on different cards vary greatly, so normalized ratios provide a better signal to the model than the aggregated amounts themselves.

You may be wondering why our batch job is computing features with a 10-minute lookback. Isn’t that only relevant for online inference? We need the 10-minute window on historical transactions to create an accurate training dataset. This is critical for ensuring consistency with the 10-minute streaming window that will be used in near-real time to support online inference.

The resulting training dataset from the processing job can be saved directly as a CSV for model training, or it can be bulk ingested into an offline feature group that can be used for other models and by other data science teams to address a wide variety of other use cases. For example, we can create and populate a feature group called cc-transactions-fg. Our training job can then pull a specific training dataset based on the needs for our specific model, selecting specific date ranges and a subset of features of interest. This approach enables multiple teams to reuse feature groups and maintain fewer feature pipelines, leading to significant cost savings and productivity improvements over time. This example notebook demonstrates the pattern of using Feature Store as a central repository from which data scientists can extract training datasets.

In addition to creating a training dataset, we use the PutRecord API to put the 1-week feature aggregations into the online feature store nightly. The following code demonstrates putting a record into an online feature group given specific feature values, including a record identifier and an event time:

record = [{'FeatureName': 'cc_num', 
              'ValueAsString': str(cc_num)},
             {'FeatureName':'avg_amt_last_1w', 
              'ValueAsString': str(avg_amt_last_1w)},
             {'FeatureName':'num_trans_last_1w', 
              'ValueAsString': str(num_trans_last_1w)}]
event_time_feature = {
                 'FeatureName': 'trans_time',
                 'ValueAsString': str(int(round(time.time())))}
record.append(event_time_feature)
response = feature_store_client.put_record(
    FeatureGroupName=’cc-agg-batch-fg’, Record=record)

ML engineers often build a separate version of feature engineering code for online features based on the original code written by data scientists for model training. This can deliver the desired performance, but is an extra development step, and introduces more chance for training and inference skew. In our use case, we show how using SQL for aggregations can enable a data scientist to provide the same code for both batch and streaming.

Streaming ingestion

Feature Store delivers single-digit millisecond retrieval of pre-calculated features, and it can also play an effective role in solutions requiring streaming ingestion. Our use case demonstrates both. Weekly lookback is handled as a pre-calculated feature group, materialized nightly as shown earlier. Now let’s dive into how we calculate features aggregated on the fly over a 10-minute window and ingest them into the feature store for later online inference.

In our use case, we ingest live credit card transactions to a source MSK topic, and use a Kinesis Data Analytics for Apache Flink application to create aggregate features in a destination MSK topic. The application is written using Apache Flink SQL. Flink SQL makes it simple to develop streaming applications using standard SQL. It’s easy to learn Flink if you have ever worked with a database or SQL-like system by remaining ANSI-SQL 2011 compliant. Apart from SQL, we can build Java and Scala applications in Amazon Kinesis Data Analytics using open-source libraries based on Apache Flink. We then use a Lambda function to read the destination MSK topic and ingest the aggregate features into a SageMaker feature group for inference. Creating the Apache Flink application using Flink’s SQL API is straightforward. We use Flink SQL to aggregate the streaming data in the source MSK topic and store it in a destination MSK topic.

To produce aggregate counts and average amounts looking back over a 10-minute window, we use the following Flink SQL query on the input topic and pipe the results to the destination topic:

SELECT 
 cc_num, 
 COUNT(*) OVER LAST_10_MINUTES as cc_count,
 AVG(amount) OVER LAST_10_MINUTES as avg_amount
FROM 
 cctopic
WINDOW LAST_10_MINUTES AS (
 PARTITION BY cc_num
 ORDER BY proc_ts
 RANGE INTERVAL '10' MINUTE PRECEDING
 );

cc_num	amount	datetime	num_trans_last_10m	avg_amt_last_10m
…1248	50.00	Nov-01,22:01:00	1	74.99
…9843	99.50	Nov-01,22:02:30	1	99.50
…7403	100.00	Nov-01,22:03:48	1	100.00
…1248	200.00	Nov-01,22:03:59	2	125.00
…0732	26.99	Nov01, 22:04:15	1	26.99
…1248	50.00	Nov-01,22:04:28	3	100.00
…1248	500.00	Nov-01,22:05:05	4	200.00

In this example, notice that the final row has a count of four transactions in the last 10 minutes from the credit card ending with 1248, and a corresponding average transaction amount of $200.00. The SQL query is consistent with the one used to drive creation of our training dataset, helping to avoid training and inference skew.

As transactions stream into the Kinesis Data Analytics for Apache Flink aggregation app, the app sends the aggregate results to our Lambda function, as shown in the following diagram. The Lambda function takes these features and populates the cc-agg-fg feature group.

We send the latest feature values to the feature store from Lambda using a simple call to the PutRecord API. The following is the core piece of Python code for storing the aggregate features:

record = [{'FeatureName': 'cc_num', 
           'ValueAsString': str(cc_num)},
          {'FeatureName':'avg_amt_last_10m', 
           'ValueAsString': str(avg_amt_last_10m)},
          {'FeatureName':'num_trans_last_10m', 
           'ValueAsString': str(num_trans_last_10m)},
          {'FeatureName': 'evt_time', 
           'ValueAsString': str(int(round(time.time())))}]
featurestore_runtime.put_record(FeatureGroupName='cc-agg-fg', 
                                Record=record)

We prepare the record as a list of named value pairs, including the current time as the event time. The Feature Store API ensures that this new record follows the schema that we identified when we created the feature group. If a record for this primary key already existed, it is now overwritten in the online store.

Streaming predictions

Now that we have streaming ingestion keeping the feature store up to date with the latest feature values, let’s look at how we make fraud predictions.

We create a second Lambda function that uses the source MSK topic as a trigger. For each new transaction event, the Lambda function first retrieves the batch and streaming features from Feature Store. To detect anomalies in credit card behavior, our model looks for spikes in recent purchase amounts or purchase frequency. The Lambda function computes simple ratios between the 1-week aggregations and the 10-minute aggregations. It then invokes the SageMaker model endpoint using those ratios to make the fraud prediction, as shown in the following diagram.

We use the following code to retrieve feature values on demand from the feature store before calling the SageMaker model endpoint:

featurestore_runtime =  
        boto3.client(service_name='sagemaker-featurestore-runtime')
response = featurestore_runtime.get_record(
		FeatureGroupName=feature_group_name, 
        RecordIdentifierValueAsString=record_identifier_value)

SageMaker also supports retrieving multiple feature records with a single call, even if they are from different feature groups.

Finally, with the model input feature vector assembled, we call the model endpoint to predict if a specific credit card transaction is fraudulent. SageMaker also supports retrieving multiple feature records with a single call, even if they are from different feature groups.

sagemaker_runtime =  
    boto3.client(service_name='runtime.sagemaker')
request_body = ','.join(features)
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',
    Body=request_body)
probability = json.loads(response['Body'].read().decode('utf-8'))

In this example, the model came back with a probability of 98% that the specific transaction was fraudulent, and it was able to use near-real-time aggregated input features based on the most recent 10 minutes of transactions on that credit card.

Test the end-to-end solution

To demonstrate the full end-to-end workflow of our solution, we simply send credit card transactions into our MSK source topic. Our automated Kinesis Data Analytics for Apache Flink aggregation takes over from there, maintaining a near-real-time view of transaction counts and amounts in Feature Store, with a sliding 10-minute lookback window. These features are combined with the 1-week aggregate features that were already ingested to the feature store in batch, letting us make fraud predictions on each transaction.

We send a single transaction from three different credit cards. We then simulate a fraud attack on a fourth credit card by sending many back-to-back transactions in seconds. The output from our Lambda function is shown in the following screenshot. As expected, the first three one-off transactions are predicted as NOT FRAUD. Of the 10 fraudulent transactions, the first is predicted as NOT FRAUD, and the rest are all correctly identified as FRAUD. Notice how the aggregate features are kept current, helping drive more accurate predictions.

Conclusion

We have shown how Feature Store can play a key role in the solution architecture for critical operational workflows that need streaming aggregation and low-latency inference. With an enterprise-ready feature store in place, you can use both batch ingestion and streaming ingestion to feed feature groups, and access feature values on demand to perform online predictions for significant business value. ML features can now be shared at scale across many teams of data scientists and thousands of ML models, improving data consistency, model accuracy, and data scientist productivity. Feature Store is available now, and you can try out this entire example. Let us know what you think.

Special thanks to everyone who contributed to the previous blog post with a similar architecture: Paul Hargis, James Leoni and Arunprasath Shankar.

About the Authors

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in feature stores, computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds six AWS and six other professional certifications. With over 20 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.

How Sportradar used the Deep Java Library to build production-scale ML platforms for increased performance and efficiency

April 19, 2023

by Fred Wu Amazon AWS

This is a guest post co-written with Fred Wu from Sportradar.

Sportradar is the world’s leading sports technology company, at the intersection between sports, media, and betting. More than 1,700 sports federations, media outlets, betting operators, and consumer platforms across 120 countries rely on Sportradar knowhow and technology to boost their business.

Sportradar uses data and technology to:

Keep betting operators ahead of the curve with the products and services they need to manage their sportsbook
Give media companies the tools to engage more with fans
Give teams, leagues, and federations the data they need to thrive
Keep the industry clean by detecting and preventing fraud, doping, and match fixing

This post demonstrates how Sportradar used Amazon’s Deep Java Library (DJL) on AWS alongside Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Simple Storage Service (Amazon S3) to build a production-ready machine learning (ML) inference solution that preserves essential tooling in Java, optimizes operational efficiency, and increases the team’s productivity by providing better performance and accessibility to logs and system metrics.

The DJL is a deep learning framework built from the ground up to support users of Java and JVM languages like Scala, Kotlin, and Clojure. Right now, most deep learning frameworks are built for Python, but this neglects the large number of Java developers and developers who have existing Java code bases they want to integrate the increasingly powerful capabilities of deep learning into. With the DJL, integrating this deep learning is simple.

In this post, the Sportradar team discusses the challenges they encountered and the solutions they created to build their model inference platform using the DJL.

Business requirements

We are the US squad of the Sportradar AI department. Since 2018, our team has been developing a variety of ML models to enable betting products for NFL and NCAA football. We recently developed four more new models.

The fourth down decision models for the NFL and NCAA predict the probabilities of the outcome of a fourth down play. A play outcome could be a field goal attempt, play, or punt.

The drive outcome models for the NFL and NCAA predict the probabilities of the outcome of the current drive. A drive outcome could be an end of half, field goal attempt, touchdown, turnover, turnover on downs, or punt.

Our models are the building blocks of other models where we generate a list of live betting markets, include spread, total, win probability, next score type, next team to score, and more.

The business requirements for our models are as follows:

The model predictor should be able to load the pre-trained model file one time, then make predictions on many plays
We have to generate the probabilities for each play under 50-milisecond latency
The model predictor (feature extraction and model inference) has to be written in Java, so that the other team can import it as a Maven dependency

Challenges with the in-place system

The main challenge we have is how to bridge the gap between model training in Python and model inference in Java. Our data scientists train the model in Python using tools like PyTorch and save the model as PyTorch scripts. Our original plan was to also host the models in Python and utilize gRPC to communicate with another service, which will use the Java gRPC client to send the request.

However, a few issues came with this solution. Mainly, we saw the network overhead between two different services running in separate run environments or pods, which resulted in higher latency. But the maintenance overhead was the main reason we abandoned this solution. We had to build both the gRPC server and the client program separately and keep the protocol buffer files consistent and up to date. Then we needed to Dockerize the application, write a deployment YAML file, deploy the gRPC server to our Kubernetes cluster, and make sure it’s reliable and auto scalable.

Another problem was whenever an error occurred on the gRPC server side, the application client only got a vague error message instead of a detailed error traceback. The client had to reach out to the gRPC server maintainer to learn exactly which part of the code caused the error.

Ideally, we instead want to load the model PyTorch scripts, extract the features from model input, and run model inference entirely in Java. Then we can build and publish it as a Maven library, hosted on our internal registry, which our service team could import into their own Java projects. When we did our research online, the Deep Java Library showed up on the top. After reading a few blog posts and DJL’s official documentation, we were sure DJL would provide the best solution to our problem.

Solution overview

The following diagram compares the previous and updated architecture.

The following diagram outlines the workflow of the DJL solution.

The steps are as follows:

Training the models – Our data scientists train the models using PyTorch and save the models as torch scripts. These models are then pushed to an Amazon Simple Storage Service (Amazon S3) bucket using DVC, a version control tool for ML models.
Implementing feature extraction and feeding ML features – The framework team pulls the models from Amazon S3 into a Java repository where they implement feature extraction and feed ML features into the predictor. They use the DJL PyTorch engine to initialize the model predictor.
Packaging and publishing the inference code and models – The GitLab CI/CD pipeline packages and publishes the JAR file that contains the inference code and models to an internal Apache Archiva registry.
Importing the inference library and making calls – The Java client imports the inference library as a Maven dependency. All inference calls are made via Java function calls within the same Kubernetes pod. Because there are no gRPC calls, the inferencing response time is improved. Furthermore, the Java client can easily roll back the inference library to a previous version if needed. In contrast, the server-side error is not transparent for the client side in gRPC-based solutions, making error tracking difficult.

We have seen a stable inferencing runtime and reliable prediction results. The DJL solution offers several advantages over gRPC-based solutions:

Improved response time – With no gRPC calls, the inferencing response time is improved
Easy rollbacks and upgrades – The Java client can easily roll back the inference library to a previous version or upgrade to a new version
Transparent error tracking – In the DJL solution, the client can receive detailed error trackback messages in case of inferencing errors

Deep Java Library overview

The DJL is a full deep learning framework that supports the deep learning lifecycle from building a model, training it on a dataset, to deploying it in production. It has intuitive helpers and utilities for modalities like computer vision, natural language processing, audio, time series, and tabular data. DJL also features an easy model zoo of hundreds of pre-trained models that can be used out of the box and integrated into existing systems.

It is also a fully Apache-2 licensed open-source project and can be found on GitHub. The DJL was created at Amazon and open-sourced in 2019. Today, DJL’s open-source community is led by Amazon and has grown to include many countries, companies, and educational institutions. The DJL continues to grow in its ability to support different hardware, models, and engines. It also includes support for new hardware like ARM (both in servers like AWS Graviton and laptops with Apple M1) and AWS Inferentia.

The architecture of DJL is engine agnostic. It aims to be an interface describing what deep learning could look like in the Java language, but leaves room for multiple different implementations that could provide different capabilities or hardware support. Most popular frameworks today such as PyTorch and TensorFlow are built using a Python front end that connects to a high-performance C++ native backend. The DJL can use this to connect to these same native backends to take advantage of their work on hardware support and performance.

For this reason, many DJL users also use it for inference only. That is, they will train a model using Python and then load it using the DJL for deployment as part of their existing Java production system. Because the DJL utilizes the same engine that powers Python, it’s able to run without any decrease in performance or loss in accuracy. This is exactly the strategy that we found to support the new models.

The following diagram illustrates the workflow under the hood.

When the DJL loads, it finds all the engine implementations available in the class path using Java’s ServiceLoader. In this case, it detects the DJL PyTorch engine implementation, which will act as the bridge between the DJL API and the PyTorch Native.

The engine then works to load the PyTorch Native. By default, it downloads the appropriate native binary based on your OS, CPU architecture, and CUDA version, making it almost effortless to use. You can also provide the binary using one of the many available native JAR files, which are more reliable for production environments that often have limited network access for security.

Once loaded, the DJL uses the Java Native Interface to translate all the easy high-level functionalities in DJL into the equivalent low-level native calls. Every operation in the DJL API is hand-crafted to best fit the Java conventions and make it easily accessible. This also includes dealing with native memory, which is not supported by the Java Garbage Collector.

Although all these details are within the library, calling it from a user standpoint couldn’t be easier. In the following section, we walk through this process.

How Sportradar implemented DJL

Because we train our models using PyTorch, we use the DJL’s PyTorch engine for the model inference.

Loading the model is incredibly easy. All it takes is to build a criteria describing the model to load and where it is from. Then, we load it and use the model to create a new predictor session. See the following code:

For our model, we also have a custom translator, which we call MyTranslator. We use the translator to encapsulate the preprocessing code that converts from a convenient Java type into the input expected by the model and the postprocessing code that converts from the model output into a convenient output. In our case, we chose to use a float[] as the input type and the built-in DJL classifications as the output type. The following is a snippet of our translator code:

It’s pretty amazing that with just a few lines of code, the DJL loads the PyTorch scripts and our custom translator, and then the predictor is ready to make the predictions.

Conclusion

Sportradar’s product built on the DJL solution went live before the 2022–23 NFL regular season started, and it has been running smoothly since then. In the future, Sportradar plans to re-platform existing models hosted on gRPC servers to the DJL solution.

The DJL continues to grow in many different ways. The most recent release, v0.21.0, has many improvements, including updated engine support, improvements on Spark, Hugging Face batch tokenizers, an NDScope for easier memory management, and enhancements to the time series API. It also has the first major release of DJL Zero, a new API aiming to allow support for both using pre-trained models and training your own custom deep learning models even with zero knowledge of deep learning.

The DJL also features a model server called DJL Serving. It makes it simple to host a model on an HTTP server from any of the 10 supported engines, including the Python engine to support Python code. With v0.21.0 of DJL Serving, it includes faster transformer support, Amazon SageMaker multi-model endpoint support, updates for Stable Diffusion, improvements for DeepSpeed, and updates to the management console. You can now use it to deploy large models with model parallel inference using DeepSpeed and SageMaker.

There is also much upcoming with the DJL. The largest area under development is large language model support for models like ChatGPT or Stable Diffusion. There is also work to support streaming inference requests in DJL Serving. Thirdly, there are improvements to demos and the extension for Spark. Of course, there is also standard continuing work including features, fixes, engine updates, and more.

For more information on the DJL and its other features, see Deep Java Library.

Follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of the DJL!

About the authors

Fred Wu is a Senior Data Engineer at Sportradar, where he leads infrastructure, DevOps, and data engineering efforts for various NBA and NFL products. With extensive experience in the field, Fred is dedicated to building robust and efficient data pipelines and systems to support cutting-edge sports analytics.

Zach Kimberg is a Software Developer in the Amazon AI org. He works to enable the development, training, and production inference of deep learning. There, he helped found and continues to develop the DeepJavaLibrary project.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with the AWS customers to provide guidance and technical assistance helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.