HawkEye 360 uses Amazon SageMaker Autopilot to streamline machine learning model development for maritime vessel risk assessment

This post is cowritten by Ian Avilez and Tim Pavlick from HawkEye 360.

HawkEye 360 is a commercial radio frequency (RF) satellite constellation data analytics provider. Our signals of interest include very high frequency (VHF) push-to-talk radios, maritime radar systems, AIS beacons, satellite mobile comms, and more. Our Mission Space offering, released in February 2021, lets mission analysts intuitively visualize RF signals and analytics so they can identify activity and understand trends. This capability improves maritime situational awareness, helping analysts identify and characterize nefarious behavior such as illegal fishing or ship-to-ship transfer of illicit goods.

The following screenshot shows HawkEye 360’s Mission Space experience.

RF data can be overwhelming to the naked eye without filtering and advanced algorithms to parse through and characterize the vast amount of raw data. HawkEye 360 partnered with the Amazon ML Solutions Lab to build machine learning (ML) capabilities into our analytics. With the guidance of the Amazon ML Solutions Lab, we used Amazon SageMaker Autopilot to rapidly generate high-quality AI models for maritime vessel risk assessment, maintain full visibility and control of model creation, and provide the ability to easily deploy and monitor a model in a production environment.

Hidden patterns and relationships among vessel features

Seagoing vessels are distinguished by several characteristics relating to the vessel itself, its operation and management, and its historical behavior. Knowing which characteristics are indicative of a suspicious vessel isn’t immediately clear. One of HawkEye 360’s missions is to discover hidden patterns and automatically alert analysts to anomalous maritime activity. HawkEye 360 accomplishes this alerting by using a diverse set of variables in combination with proprietary RF geo-analytics. A key focus of these efforts is to identify which vessels are more likely to engage in suspicious maritime activity, such as illegal fishing or ship-to-ship transfer of illicit goods. ML algorithms reveal hidden patterns, where they exist, that would otherwise be lost in the vast sea of complexity.

The following image demonstrates some of the pattern-finding behavior already built into Mission Space, which automatically identifies other instances of a suspicious vessel. Identifying the key features that are most predictive of suspicious behavior allows those features to be displayed directly in Mission Space, enabling users to understand links between bad actors that would otherwise never be seen. Mission Space was purposefully designed to point out these connections to mission analysts.


Challenges detecting anomalous behavior with maritime vessels

HawkEye 360’s data lake includes a large volume of vessel information, history, and analytics variables. With such a wide array of RF data and analytics, some natural data handling issues must be addressed. Sporadic reporting by vessels results in missing values across datasets, and variations among data types must be taken into account. Previously, data exploration and baseline modeling typically took up a large chunk of an analyst’s time. After the data is prepared, a series of automatic experiments is run to narrow down to a set of the most promising AI models and, in a stepwise fashion from there, to select the one most appropriate for the data and the research questions. For HawkEye 360, this automated exploration is key to determining which features, and feature combinations, are critical to predicting how likely a vessel is to engage in suspicious behavior.

We used Autopilot to expedite this process by quickly identifying which features of the data are useful in predicting suspicious behavior. Automation of data exploration and analysis enables our data scientists to spend less time wrangling data and manually engineering features, and expedites the ability to identify the vessel features that are most predictive of suspicious vessel behavior.

How we used Autopilot to quickly generate high-quality ML models

As part of an Autopilot job, several candidate models are rapidly generated for evaluation with a single API call. Autopilot inspected the data and evaluated several models to determine the optimal combination of preprocessing methods, ML algorithms, and hyperparameters. This significantly shortened the model exploration time frame and allowed us to quickly test the suitability of ML for our unique hypotheses.

The following code shows our setup and API call:

# Setup assumed by this snippet (not shown in the original post):
#   sm             - a boto3 SageMaker client, e.g. sm = boto3.client('sagemaker')
#   role           - an IAM role ARN that SageMaker can assume
#   bucket, prefix - the S3 location that holds the training data

# Location of the training data and the target column Autopilot should predict
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'ship_sanctioned_ofac'
    }
  ]

# Where Autopilot writes its candidates, models, and generated notebooks
output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

# Give the job a unique, timestamped name
auto_ml_job_name = 'automl-darkcount-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

# Launch the Autopilot job with a single API call
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)

Autopilot job process

An Autopilot job consists of the following actions:

  • Dividing the data into train and validation sets
  • Analyzing the data to recommend candidate configurations
  • Performing feature engineering to generate optimal transformed features appropriate for the algorithm
  • Tuning hyperparameters to generate a leaderboard of models
  • Surfacing the best candidate model based on the given evaluation metric
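
While the job runs through these stages, you can poll its status with the same client. The following is a minimal sketch that reuses the sm client and job name from the earlier snippet; the polling interval is an arbitrary choice:

# Poll the Autopilot job until it reaches a terminal state
describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
while describe_response['AutoMLJobStatus'] not in ('Completed', 'Failed', 'Stopped'):
    print(describe_response['AutoMLJobStatus'], '-', describe_response['AutoMLJobSecondaryStatus'])
    sleep(60)
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print('Final status: ' + describe_response['AutoMLJobStatus'])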

After training several models, Autopilot ranks the trained candidates based on a given metric (see the following code). For this application, we used an F1 score, which gives equal weight to precision and recall. This is an important consideration when classes are imbalanced, which they are in this dataset.

candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
print("List of model candidates in descending objective metric:")
for candidate in candidates:
    print (str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
    index += 1

The following code shows our output:

List of model candidates in descending objective metric:

1 tuning-job-1-1be4d5a5fb8e42bc84-238-e264d09f 0.9641900062561035
2 tuning-job-1-1be4d5a5fb8e42bc84-163-336eb2e7 0.9641900062561035
3 tuning-job-1-1be4d5a5fb8e42bc84-143-5007f7dc 0.9641900062561035
4 tuning-job-1-1be4d5a5fb8e42bc84-154-cab67dc4 0.9641900062561035
5 tuning-job-1-1be4d5a5fb8e42bc84-123-f76ad56c 0.9641900062561035
6 tuning-job-1-1be4d5a5fb8e42bc84-117-39eac182 0.9633200168609619
7 tuning-job-1-1be4d5a5fb8e42bc84-108-77addf80 0.9633200168609619
8 tuning-job-1-1be4d5a5fb8e42bc84-179-1f831078 0.9633200168609619
9 tuning-job-1-1be4d5a5fb8e42bc84-133-917ccdf1 0.9633200168609619
10 tuning-job-1-1be4d5a5fb8e42bc84-189-102070d9 0.9633200168609619

We can now create a model from the best candidate, which can be quickly deployed into production:

# The leaderboard above is sorted with the best candidate first
best_candidate = candidates[0]

model_name = 'automl-darkcount-25-23-07-39'

# Package the best candidate's inference containers as a deployable SageMaker model
model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

The following code shows our output:

Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-1:278150328949:model/automl-darkcount-25-23-07-39
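
Because the model is registered with SageMaker, deploying it behind a real-time endpoint takes two more calls. The following is a hedged sketch; the endpoint names and instance type are illustrative choices rather than values from the original project:

# Sketch: host the Autopilot model on a real-time endpoint
endpoint_config_name = 'automl-darkcount-config-' + timestamp_suffix
endpoint_name = 'automl-darkcount-endpoint-' + timestamp_suffix

sm.create_endpoint_config(EndpointConfigName=endpoint_config_name,
                          ProductionVariants=[{
                              'ModelName': model_name,
                              'VariantName': 'AllTraffic',
                              'InstanceType': 'ml.m5.large',
                              'InitialInstanceCount': 1}])

sm.create_endpoint(EndpointName=endpoint_name,
                   EndpointConfigName=endpoint_config_name)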

Maintaining full visibility and control

The process to generate a model is completely transparent. Two notebooks are generated for any model Autopilot creates:

  • Data exploration notebook – Describes your dataset and what Autopilot learned about your dataset
  • Model candidate notebook – Lists data transformations used as well as candidate model building pipelines consisting of feature transformers paired with main estimators
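
Both notebooks are written to Amazon S3, and their locations can be retrieved from the job description. A minimal sketch using the same boto3 client as before:

# Sketch: locate the generated notebooks in Amazon S3
artifacts = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']
print('Data exploration notebook: ' + artifacts['DataExplorationNotebookLocation'])
print('Candidate definition notebook: ' + artifacts['CandidateDefinitionNotebookLocation'])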

Conclusion

We used Autopilot to quickly generate many candidate models, determine ML feasibility, and baseline ML performance on the vessel data. By automating manual tasks such as data analysis, feature engineering, model development, and model deployment, Autopilot allowed our data scientists to spend 50% less time developing ML capabilities.

With HawkEye 360’s new RF data analysis application, Mission Space, identifying which vessels have the potential to engage in suspicious activity lets users know where to focus their scarce attention and investigate further. Expediting data understanding and model creation allows cutting-edge insights to be quickly assimilated into Mission Space, which accelerates the evolution of Mission Space’s capabilities, as shown in the following map. A mission analyst identified a specific rendezvous (highlighted in magenta), and Mission Space automatically identified other related rendezvous (in purple).


For more information about HawkEye 360’s Mission Space offering, see Mission Space.

If you’d like assistance in accelerating the use of ML in your products and services, contact the Amazon ML Solutions Lab.


About the Authors

  Tim Pavlick, PhD, is VP of Product at HawkEye 360. He is responsible for the conception, creation, and productization of all HawkEye space innovations. Mission Space is HawkEye 360’s flagship product, incorporating all the data and analytics from the HawkEye portfolio into one intuitive RF experience. Dr. Pavlick’s prior invention contributions include Myca, IBM’s AI Career Coach, Grit PTSD monitor for Veterans, IBM Defense Operations Platform, Smarter Planet Intelligent Operations Center, AI detection of dangerous hate speech on the internet, and the STORES electronic food ordering system for the US military. Dr. Pavlick received his PhD in Cognitive Psychology from the University of Maryland College Park.

Ian Avilez is a Data Scientist with HawkEye 360. He works with customers to highlight the insights that can be gained by combining different datasets and looking at that data in various ways.

Dan Ford is a Data Scientist at the Amazon ML Solution Lab, where he helps AWS National Security customers build state-of-the-art ML solutions.

Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Read More

Protecting people from hazardous areas through virtual boundaries with Computer Vision

As companies welcome more autonomous robots and other heavy equipment into the workplace, we need to ensure equipment can operate safely around human teammates. In this post, we will show you how to build a virtual boundary with computer vision and AWS DeepLens, the AWS deep learning-enabled video camera designed for developers to learn machine learning (ML). Using the machine learning techniques in this post, you can build virtual boundaries for restricted areas that automatically shut down equipment or sound an alert when humans come close.

For this project, you will train a custom object detection model with Amazon SageMaker and deploy the model to an AWS DeepLens device. Object detection is an ML algorithm that takes an image as input and identifies objects and their location within the image. In addition to virtual boundary solutions, you can apply techniques learned in this post when you need to detect where certain objects are inside an image or count the number of instances of a desired object in an image, such as counting items in a storage bin or on a retail shelf.

Solution overview

The walkthrough includes the following steps:

  1. Prepare your dataset to feed into an ML algorithm.
  2. Train a model with Amazon SageMaker.
  3. Test model with custom restriction zones.
  4. Deploy the solution to AWS DeepLens.

We also discuss other real-world use cases where you can apply this solution.

The following diagram illustrates the solution architecture.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

Prepare your dataset to feed into an ML algorithm

This post uses an ML algorithm called an object detection model to build a solution that detects if a person is in a custom restricted zone. You use the publicly available Pedestrian Detection dataset available on Kaggle, which has over 2,000 images. This dataset has labels for human and human-like objects (like mannequins) so the trained model can more accurately distinguish between real humans and cardboard props or statues.

For example, the following images show a construction worker being detected and whether they are in the custom restriction zone (red outline).

To start training your model, first create an S3 bucket to store your training data and model output. For AWS DeepLens projects, the S3 bucket names must start with the prefix deeplens-. You use this data to train a model with SageMaker, a fully managed service that provides the ability to build, train, and deploy ML models quickly.
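
If you prefer to create the bucket from code rather than on the console, the following is a small sketch using boto3; the bucket name is a placeholder, and outside us-east-1 you also need to pass a location constraint:

import boto3

# Sketch: create a DeepLens-compatible bucket (bucket name is a placeholder)
s3 = boto3.client('s3', region_name='us-east-1')
s3.create_bucket(Bucket='deeplens-pedestrian-detector-example')
# In other Regions: s3.create_bucket(Bucket=..., CreateBucketConfiguration={'LocationConstraint': '<region>'})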

Train a model with Amazon SageMaker

You use SageMaker Jupyter notebooks as the development environment to train the model. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. For this post, we provide Train_Object_Detection_People_DeepLens.ipynb, a full notebook for you to follow along.

To create a custom object detection model, you need to use a graphic processing unit (GPU)-enabled training job instance. GPUs are excellent at parallelizing the computations required to train a neural network. Although the notebook itself runs on a single ml.t2.medium instance, the training job uses an ml.p2.xlarge instance. To access a GPU-enabled training job instance, you must submit a request for a service limit increase to the AWS Support Center.

After you receive your limit increase, complete the following steps to create a SageMaker notebook instance:

  1. On the SageMaker console, choose Notebook instances.
  2. Choose Create notebook instance.
  3. For Notebook instance name, enter a name for your notebook instance.
  4. For Instance type, choose t2.medium.

This is the least expensive instance type that notebook instances support, and it suffices for this tutorial.

  1. For IAM role, choose Create a new role.

Make sure this AWS Identity and Access Management (IAM) role has access to the S3 bucket you created earlier (prefix deeplens-).

  1. Choose Create notebook instance. Your notebook instance can take a couple of minutes to start up.
  1. When the status on the notebook instances page changes to InService, choose Open Jupyter to launch your newly created Jupyter notebook instance.
  2. Choose Upload to upload the Train_Object_Detection_people_DeepLens.ipynb file you downloaded earlier.

  1. Open the notebook and follow it through to the end.
  2. If you’re asked about setting the kernel, select conda_mxnet_p36.

The Jupyter notebook contains a mix of text and code cells. To run a piece of code, choose the cell and press Shift+Enter. While the cell is running, an asterisk appears next to the cell. When the cell is complete, an output number and new output cell appear below the original cell.

  1. Download the dataset from the public S3 bucket into the local SageMaker instance and unzip the data. This can be done by following the code in the notebook:
     !aws s3 cp s3://deeplens-public/samples/pedestriansafety/humandetection_data.zip .  
      
    !rm -rf humandetection/  
    !unzip humandetection_data.zip -d humandetection
    

  2. Convert the dataset into a format (RecordIO) that can be fed into the SageMaker algorithm:
     !python $mxnet_path/tools/im2rec.py --pass-through --pack-label $DATA_PATH/train_mask.lst $DATA_PATH/  
    !python $mxnet_path/tools/im2rec.py --pass-through --pack-label $DATA_PATH/val_mask.lst $DATA_PATH/ 
    

  3. Transfer the RecordIO files back to Amazon S3.
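
One way to do this transfer is with the SageMaker Python SDK. The following sketch assumes DATA_PATH, BUCKET, and PREFIX hold the same values used elsewhere in the notebook and that im2rec produced train_mask.rec and val_mask.rec:

import sagemaker

# Sketch: upload the generated RecordIO files to training and validation prefixes in S3
sess = sagemaker.Session()
train_channel = sess.upload_data(path=f'{DATA_PATH}/train_mask.rec', bucket=BUCKET, key_prefix=PREFIX + '/train')
val_channel = sess.upload_data(path=f'{DATA_PATH}/val_mask.rec', bucket=BUCKET, key_prefix=PREFIX + '/validation')
print(train_channel, val_channel)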

Now that you’re done with all the data preparation, you’re ready to train the object detector.

There are many different types of object detection algorithms. For this post, you use the Single Shot MultiBox Detector (SSD) algorithm. SSD offers a good balance of speed and accuracy, making it ideal for running on edge devices such as AWS DeepLens.

As part of the training job, you have a lot of options for hyperparameters that help configure the training behavior (such as number of epochs, learning rate, optimizer type, and mini-batch size). Hyperparameters let you tune training speed and accuracy of your model. For more information about hyperparameters, see Object Detection Algorithm.

  1. Set up your hyperparameters and data channels. Consider using the following example definition of hyperparameters:
     # Assumes training_image, role, s3_output_location, sess (a sagemaker.Session),
     # and n_train_samples are defined earlier in the notebook
     od_model = sagemaker.estimator.Estimator(training_image,  
                                             role,   
                                             train_instance_count=1,   
                                             train_instance_type='ml.p2.xlarge',  
                                             train_volume_size = 50,  
                                             train_max_run = 360000,  
                                             input_mode= 'File',  
                                             output_path=s3_output_location,  
                                             sagemaker_session=sess)  
      
    od_model.set_hyperparameters(base_network='resnet-50',  
                                 use_pretrained_model=1,  
                                 num_classes=2,  
                                 mini_batch_size=32,  
                                 epochs=100,  
                                 learning_rate=0.003,  
                                 lr_scheduler_step='3,6',  
                                 lr_scheduler_factor=0.1,  
                                 optimizer='sgd',  
                                 momentum=0.9,  
                                 weight_decay=0.0005,  
                                 overlap_threshold=0.5,  
                                 nms_threshold=0.45,  
                                 image_shape=300,  
                                 num_training_samples=n_train_samples) 
    

The notebook has some default hyperparameters that have been pre-selected. For pedestrian detection, you train the model for 100 epochs. This training step should take approximately 2 hours using one ml.p2.xlarge instance. You can experiment with different combinations of the hyperparameters, or train for more epochs for performance improvements. For information about the latest pricing, see Amazon SageMaker Pricing.

  1. You can start a training job with a single line of code and monitor the accuracy over time on the SageMaker console:
    od_model.fit(inputs=data_channels, logs=True)  

For more information about how training works, see CreateTrainingJob. The provisioning and data downloading take time, depending on the size of the data. Therefore, it might be a few minutes before you start getting data logs for your training jobs.

You can monitor the progress of your training job through the mean average precision (mAP) metric, which measures how well the model classifies objects and places the correct bounding boxes. The data logs also print the mAP on the validation data, among other losses, once per epoch. This metric is a proxy for how accurately the algorithm detects the correct class and draws an accurate bounding box around it.

When the job is finished, you can find the trained model files in the S3 bucket and folder specified earlier in s3_output_location:

s3_output_location = 's3://{}/{}/output'.format(BUCKET, PREFIX)

For this post, we show results on the validation set at the completion of the 10th epoch and the 100th epoch. At the end of the 10th epoch, we see a validation mAP of approximately 0.027, whereas the 100th epoch was approximately 0.42.

To achieve better detection results, you can tune the hyperparameters by using SageMaker automatic model tuning and train the model for more epochs. You usually stop training when you see diminishing gains in accuracy.
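
As a rough sketch of what that tuning setup could look like with the SageMaker Python SDK (the parameter ranges and job counts here are illustrative, not values from this post):

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Sketch: search learning rate and mini-batch size against the validation mAP metric
tuner = HyperparameterTuner(estimator=od_model,
                            objective_metric_name='validation:mAP',
                            objective_type='Maximize',
                            hyperparameter_ranges={
                                'learning_rate': ContinuousParameter(0.001, 0.01),
                                'mini_batch_size': IntegerParameter(8, 32)},
                            max_jobs=6,
                            max_parallel_jobs=2)
tuner.fit(inputs=data_channels)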

Test model with custom restriction zones

Before you deploy the trained model to AWS DeepLens, you can test it in the cloud by using a SageMaker hosted endpoint. A SageMaker endpoint is a fully managed service that allows you to make real-time inferences via a REST API. SageMaker allows you to quickly deploy new endpoints to test your models so you don’t have to host the model on the local instance that was used to train the model. This allows you to make predictions (or inference) from the model on images that the algorithm didn’t see during training.

You don’t have to host the model on the same instance type that you used to train. Training is a prolonged, compute-heavy job with different compute and memory requirements than hosting typically has, so you can choose any instance type to host the model. In this case, we trained on an ml.p2.xlarge instance, but we host the model on the less expensive CPU instance, ml.m4.xlarge. The following code snippet shows our endpoint deployment.

object_detector = od_model.deploy(initial_instance_count = 1,
                                  instance_type = 'ml.m4.xlarge')

Detection in a custom restriction zone (region of interest)

The format of the output can be represented as [class_index, confidence_score, xmin, ymin, xmax, ymax]. Low-confidence predictions have a higher chance of being false positives or false negatives, so you should typically discard them. You can use the following code to detect whether the bounding box of the person overlaps with the restricted zone.

# Assumes numpy (np) and OpenCV (cv2) are already imported in the notebook
def inRestrictedSection(ImShape = None, R1 = None, restricted_region = None, kclass = None, score = None, threshold = None):
    statement = 'Person Not Detected in Restricted Zone'
    if (kclass == 1) and (score > threshold):
        # Rasterize the person's bounding box
        Im1 = np.zeros((ImShape[0],ImShape[1],3), np.int32)
        cv2.fillPoly(Im1, [R1], 255)
        # Rasterize the restricted zone; default to the whole frame if none is given
        Im2 = np.zeros((ImShape[0],ImShape[1],3), np.int32)
        if restricted_region is None:
            restricted_region = np.array([[0,ImShape[0]],[ImShape[1],ImShape[0]],[ImShape[1],0], [0,0]], np.int32)
        cv2.fillPoly(Im2, [restricted_region], 255)
        # The product is nonzero only where the two polygons overlap
        Im = Im1 * Im2
        if np.sum(np.greater(Im, 0)) > 0:
            statement = 'Person Detected in Restricted Zone'

    return statement

By default, the complete frame is evaluated for human presence. However, you can easily specify the region of interest within which the presence of a person is deemed high risk. To add a custom restriction zone, add the coordinates of the vertices of the region, represented as [X-axis, Y-axis], and create the polygon. The coordinates must be entered in either clockwise or counterclockwise order. See the following code:

restricted_region = None  
#restricted_region = np.array([[0,200],[100,200],[100,0], [10,10]], np.int32)

The following sample code shows pedestrians that are identified within a restricted zone:

# Assumes cv2, numpy (np), json, random, and matplotlib.pyplot (plt) are imported
file_name = 'humandetection/test_images/t1_image.jpg'
img = cv2.imread(file_name)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
thresh = 0.2
height = img.shape[0]
width = img.shape[1]
colors = dict()

# Read the raw JPEG bytes to send to the endpoint
with open(file_name, 'rb') as image:
    f = image.read()
    b = bytearray(f)

# Invoke the SageMaker endpoint and parse the JSON response
results = object_detector.predict(b, initial_args={'ContentType': 'image/jpeg'})
detections = json.loads(results)

object_categories = ['no-person', 'person']

for det in detections['prediction']:
    (klass, score, x0, y0, x1, y1) = det
    if score < thresh:
        continue
    cls_id = int(klass)
    prob = score
    if cls_id not in colors:
        colors[cls_id] = (random.random(), random.random(), random.random())
    # Convert normalized coordinates to pixel coordinates
    xmin = int(x0 * width)
    ymin = int(y0 * height)
    xmax = int(x1 * width)
    ymax = int(y1 * height)

    # Draw the person's bounding box and, if defined, the restricted zone
    R1 = np.array([[xmin,ymin],[xmax,ymin],[xmax,ymax], [xmin,ymax]], np.int32)
    cv2.polylines(img, [R1], True, (255,255,0), thickness = 5)
    if restricted_region is not None:
        cv2.polylines(img, [restricted_region], True, (255,0,0), thickness = 5)

    plt.imshow(img)

    print(inRestrictedSection(img.shape, R1 = R1, restricted_region = restricted_region, kclass = cls_id, score = prob, threshold = 0.2))

The following images show our results.

Deploy the solution to AWS DeepLens

Convert the model for deployment to AWS DeepLens

When deploying a SageMaker-trained SSD model to AWS DeepLens, you must first run deploy.py to convert the model artifact into a deployable model:

!rm -rf incubator-mxnet  
!git clone -b v1.7.x https://github.com/apache/incubator-mxnet  
  
MODEL_PATH = od_model.model_data  
TARGET_PATH ='s3://'+BUCKET+'/'+PREFIX+'/patched/'  
!rm -rf tmp && mkdir tmp  
  
!aws s3 cp $MODEL_PATH tmp  
!tar -xzvf tmp/model.tar.gz -C tmp  
!mv tmp/model_algo_1-0000.params tmp/ssd_resnet50_300-0000.params  
!mv tmp/model_algo_1-symbol.json tmp/ssd_resnet50_300-symbol.json  
!python incubator-mxnet/example/ssd/deploy.py --network resnet50 --data-shape 300 --num-class 2 --prefix tmp/ssd_  
!tar -cvzf ./patched_model.tar.gz -C tmp ./deploy_ssd_resnet50_300-0000.params ./deploy_ssd_resnet50_300-symbol.json ./hyperparams.json  
!aws s3 cp patched_model.tar.gz $TARGET_PATH

Import your model into AWS DeepLens

To run the model on an AWS DeepLens device, you need to create an AWS DeepLens project. Start by importing your model into AWS DeepLens.

  1. On the AWS DeepLens console, under Resources, choose Models.
  2. Choose Import model.

  1. For Import source, select Externally trained model.
  2. Enter the Amazon S3 location of the patched model that you saved from running deploy.py in the step above.
  3. For Model framework, choose MXNet.
  4. Choose Import model.

Create the inference function

The inference function feeds each camera frame into the model to get predictions and runs any custom business logic using the inference results. You use AWS Lambda to create a function that you deploy to AWS DeepLens. The function runs inference locally on the AWS DeepLens device.

First, we need to create a Lambda function to deploy to AWS DeepLens.

  1. Download the inference Lambda function.
  2. On the Lambda console, choose Functions.
  3. Choose Create function.
  4. Select Author from scratch.
  5. For Function name, enter a name.
  6. For Runtime, choose Python 3.7.
  7. For Choose or create an execution role, choose Use an existing role.
  8. Choose service-role/AWSDeepLensLambdaRole.
  9. Choose Create function.

  1. On the function’s detail page, on the Actions menu, choose Upload a .zip file.

  1. Upload the inference Lambda file you downloaded earlier.
  2. Choose Save to save the code you entered.
  3. On the Actions menu, choose Publish new version.

Publishing the function makes it available on the AWS DeepLens console so that you can add it to your custom project.

  1. Enter a version number and choose Publish.

Understanding the inference function

This section walks you through some important parts of the inference function. First, you should pay attention to two specific files:

  • labels.txt – Contains a mapping of the output from the neural network (integers) to human readable labels (string)
  • lambda_function.py – Contains code for the function being called to generate predictions on every camera frame and send back results

In lambda_function.py, you first load and optimize the model. Compared to cloud virtual machines with a GPU, AWS DeepLens has less computing power. AWS DeepLens uses the Intel OpenVino model optimizer to optimize the model trained in SageMaker to run on its hardware. The following code optimizes your model to run locally:

client.publish(topic=iot_topic, payload='Optimizing model...')  
ret, model_path = mo.optimize('deploy_ssd_resnet50_300', INPUT_W, INPUT_H)  
  
# Load the model onto the GPU.  
client.publish(topic=iot_topic, payload='Loading model...')  
model = awscam.Model(model_path, {'GPU': 1})  

Then you run the model frame-per-frame over the images from the camera. See the following code:

while True:  
    # Get a frame from the video stream  
    ret, frame = awscam.getLastFrame()  
    if not ret:  
        raise Exception('Failed to get frame from the stream')  
    # Resize frame to the same size as the training set.  
    frame_resize = cv2.resize(frame, (INPUT_H, INPUT_W))  
    # Run the images through the inference engine and parse the results using  
    # the parser API, note it is possible to get the output of doInference  
    # and do the parsing manually, but since it is a ssd model,  
    # a simple API is provided.  
    parsed_inference_results = model.parseResult(model_type,  
                                                 model.doInference(frame_resize))  

Finally, you send the text prediction results back to the cloud. Viewing the text results in the cloud is a convenient way to make sure that the model is working correctly. Each AWS DeepLens device has a dedicated iot_topic automatically created to receive the inference results. See the following code:

# Send results to the cloud  
client.publish(topic=iot_topic, payload=json.dumps(cloud_output))  

Create a custom AWS DeepLens project

To create a new AWS DeepLens project, complete the following steps:

  1. On the AWS DeepLens console, on the Projects page, choose Create project.
  2. For Project type, select Create a new blank project.
  3. Choose Next.

  1. Name your project yourname-pedestrian-detector-.
  2. Choose Add model.
  3. Select the model you just created.
  4. Choose Add function.
  5. Search for the Lambda function you created earlier by name.
  6. Choose Create project.
  7. On the Projects page, select the project you want to deploy.
  8. Choose Deploy to device.
  9. For Target device, choose your device.
  10. Choose Review.
  11. Review your settings and choose Deploy.

The deployment can take up to 10 minutes to complete, depending on the speed of the network your AWS DeepLens is connected to. When the deployment is complete, you should see a green banner on the page with the message, “Congratulations, your model is now running locally on AWS DeepLens!”

To see the text output, scroll down on the device details page to the Project output section. Follow the instructions in the section to copy the topic and go to the AWS IoT Core console to subscribe to the topic. You should see results as in the following screenshot.

For step-by-step instructions on viewing the video stream or text output, see Viewing results from AWS DeepLens.

Real-world use cases

Now that you have predictions from your model running on AWS DeepLens, let’s convert those predictions into alerts and insights. Some of the most common uses for a project like this include:

  • Understanding how many people on a given day entered a restricted zone so construction sites can identify spots that require more safety signs. This can be done by collecting the results and using them to create a dashboard using Amazon QuickSight. For more details about creating a dashboard using QuickSight, see Build a work-from-home posture tracker with AWS DeepLens and GluonCV.
  • Collecting the output from AWS DeepLens and configuring a Raspberry Pi to sound an alert when someone is walking into a restricted zone. For more details about connecting an AWS DeepLens device to a Raspberry Pi device, see Building a trash sorter with AWS DeepLens.

Conclusion

In this post, you learned how to train an object detection model and deploy it to AWS DeepLens to detect people entering restricted zones. You can use this tutorial as a reference to train and deploy your own custom object detection projects on AWS DeepLens.

For a more detailed walkthrough of this tutorial and other tutorials, samples, and project ideas with AWS DeepLens, see AWS DeepLens Recipes.


About the Authors

Yash Shah is a data scientist in the Amazon ML Solutions Lab, where he works on a range of machine learning use cases from healthcare to manufacturing and retail. He has a formal background in Human Factors and Statistics, and was previously part of the Amazon SCOT team designing products to guide 3P sellers with efficient inventory management.

Phu Nguyen is a Product Manager for AWS Panorama. He builds products that give developers of any skill level an easy, hands-on introduction to machine learning.

Read More

Enable cross-account access for Amazon SageMaker Data Wrangler using AWS Lake Formation

Amazon SageMaker Data Wrangler is the fastest and easiest way for data scientists to prepare data for machine learning (ML) applications. With Data Wrangler, you can simplify the process of feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization through a single visual interface. Data Wrangler comes with 300 built-in data transformation recipes that you can use to quickly normalize, transform, and combine features. With the data selection tool in Data Wrangler, you can quickly select data from different data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift.

AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to Athena tables.

In this post, we demonstrate how to enable cross-account access for Data Wrangler using Athena as a source and Lake Formation as a central data governance capability. As shown in the following architecture diagram, Account A is the data lake account that holds all the ML-ready data derived from ETL pipelines. Account B is the data science account where a team of data scientists uses Data Wrangler to compile and run data transformations. We need to enable cross-account permissions for Data Wrangler in Account B to access the data tables located in Account A’s data lake via Lake Formation permissions.

With this architecture, data scientists and engineers outside the data lake account can access data from the lake and create data transformations via Data Wrangler.

Before you dive into the setup process, ensure that the data to be shared across accounts is crawled and cataloged as detailed in this post. For this walkthrough, we presume this process has been completed and the databases and tables already exist in Lake Formation.

The following are the high-level steps to implement this solution:

  1. In Account A, register your S3 bucket using Lake Formation and create the necessary databases and tables for the data if they don’t already exist.
  2. The Lake Formation administrator can now share datasets from Account A to other accounts. Lake Formation shares these resources using AWS Resource Access Manager (AWS RAM).
  3. In Account B, accept the resource share request using AWS RAM. Create a local resource link for the shared table via Lake Formation and create a local database.
  4. Next, you need to grant permissions for the SageMaker Studio execution role in Account B to access the shared table and the resource link you created in the previous step.
  5. In Data Wrangler, use the local database and the resource link you created in Account B to query the dataset using the Athena connector and perform feature transformations.

Data lake setup using Lake Formation

To get started, create a central data lake in Account A. You can control the access to the data lake with policies and permissions, and define permissions at the database, table, or column level.

To kickstart the setup process, download the titanic dataset .csv file and upload it to your S3 bucket. After you upload the file, you need to register the bucket in Lake Formation. Lake Formation permissions enable fine-grained access control for data in your data lake.

Note: If the titanic dataset has already been cataloged, you can skip the registration step below.

Register your S3 data store in Lake Formation

To register your data store, complete the following steps:

  1. In Account A, sign in to the Lake Formation console.

If this is the first time you’re accessing Lake Formation, you need to add administrators to the account.

  1. In the navigation pane, under Permissions, choose Admins and database creators.
  2. Under Data lake administrators, choose Grant.

You now add AWS Identity and Access Management (IAM) users or roles specific to Account A as data lake administrators.

  1. Under Manage data lake administrators, for IAM users and roles, choose your user or role (for this post, we use user-a).

This can also be the IAM admin role of Account A.

  1. Choose Save.

  1. Make sure the IAMAllowedPrincipals group is not listed under either Data lake administrators or Database creators.

For more information about security settings, see Changing the Default Security Settings for Your Data Lake.

Next, you need to register the S3 bucket as the data lake location.

  1. On the Lake Formation console, under Register and ingest, choose Data lake locations.

This page should display a list of S3 buckets that are marked as data lake storage resources for Lake Formation. A single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources.

  1. Choose Register location.
  2. For Amazon S3 path, enter the path for your bucket.
  3. For IAM role, choose AWSServiceRoleForLakeFormationDataAccess.
  4. Choose Register location.

After this step, you should be able to see your S3 bucket under Data lake locations.

Create a database

This step is optional. Skip this step if the titanic dataset has already been crawled and cataloged. The database and table for the dataset should pre-exist within the data lake.

Complete the following steps to register the database if it does not exist:

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database.
  3. For Database details, select Database.
  4. For Name, enter a name (for example, titanic).
  5. For Location, enter the S3 data lake bucket path.
  6. Deselect Use only IAM access controls for tables in this database.
  7. Choose Create database.

  1. Under Actions, choose Permissions.
  2. Choose View permissions.
  3. Make sure that the IAMAllowedPrincipals group isn’t listed.

If it’s listed, make sure you revoke access to this group.

You should now be able to view the created database listed under Databases.

You should also be able to see the table in the Lake Formation console, under Data catalog in the navigation pane, under Tables. For this demo, we presume the table name is titanic_datalake_bucket_as, as shown below.

Grant table permissions to Account A

To grant table permissions to Account A, complete the following steps:

  1. Sign in to the Lake Formation console with Account A.
  2. Under Data catalog, choose Tables.
  3. Select the newly created table.
  4. On the Actions menu, under Permissions, choose Grant.
  5. Select My account.
  6. For IAM users and roles, choose the users or roles you want to grant access (for this post, we choose user-x, a different user within Account A).

You can also set a column filter.

  1. For Columns, choose Include columns.
  2. For Include columns, choose the first five columns from the titanic_datalake_bucket_as table.
  3. For Table permissions, select Select.
  4. Choose Grant.

  1. Still in Account A, switch to the Athena console.
  2. Run a table preview.

You should be able to see the first five columns of the titanic_datalake_bucket_as table as per the granted permissions in the previous steps.

We have validated local access to the data lake table within Account A via this Athena step. Next, let’s grant access to an external account, in our case, Account B for the same table.

Grant table permissions to Account B

This external account is the account running Data Wrangler. To grant table permissions, complete the following steps:

  1. Staying within account A, on the Actions menu, under Permissions, choose Grant.
  2. Select External account.
  3. For AWS account ID, enter the account ID of Account B.
  4. Choose the same first five columns of the table.
  5. For Table permissions and Grantable permissions, select Select.
  6. Choose Grant.

You must revoke the Super permission from the IAMAllowedPrincipals group for this table before granting it external access. You can do this on the Actions menu under View permissions, then choose IAMAllowedPrincipals and choose Revoke.
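
The same cross-account grant can also be scripted. The following is a minimal boto3 sketch run as a Lake Formation administrator in Account A; the account ID and column names are placeholders for the values used in your environment:

import boto3

# Sketch: grant Account B SELECT (with grant option) on five columns of the shared table
lf = boto3.client('lakeformation')
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '<account-b-id>'},
    Resource={'TableWithColumns': {
        'DatabaseName': 'titanic',
        'Name': 'titanic_datalake_bucket_as',
        'ColumnNames': ['<col1>', '<col2>', '<col3>', '<col4>', '<col5>']}},
    Permissions=['SELECT'],
    PermissionsWithGrantOption=['SELECT'])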

  1. On the AWS RAM console, still in Account A, under Shared by me, choose Shared resources.

We can find a Lake Formation entry on this page.

  1. Switch to Account B.
  2. On the AWS RAM console, under Shared with me, you see an invitation from Lake Formation in Account A.

  1. Accept the invitation by choosing Accept resource share.

After you accept it, on the Resource shares page, you should see the shared Lake Formation entry, which encapsulates the catalog, database, and table information.

On the Lake Formation console in Account B, you can find the shared table owned by Account A on the Tables page. If you don’t see it, you can refresh your screen and the resource should appear shortly.

To use this shared table inside Account B, you need to create a database local to Account B in Lake Formation.

  1. On the Lake Formation console, under Databases, choose Create database.
  2. Name the database local_db.

Next, for the shared titanic table in Lake Formation, you need to create a resource link. Resource links are Data Catalog objects that link to metadata databases and tables, typically to shared databases and tables from other AWS accounts. They help enable cross-account access to data in the data lake.

  1. On the table details page, on the Actions menu, choose Create resource link.

  1. For Resource link name, enter a name (for example, titanic_local).
  2. For Database, choose the local database you created previously.
  3. The values for Shared table and Shared table’s database should match the ones in Account A and be auto-populated.
  4. For Shared table’s owner ID, choose the account ID of Account A.
  5. Choose Create.

  1. In the navigation pane, under Data catalog, choose Settings.
  2. Make sure Use only IAM access control is disabled for new databases and tables.

This is to make sure that Lake Formation manages the database and table permissions.

  1. Switch to the SageMaker console.
  2. In the Studio Control Panel, under Studio Summary, copy the ARN of the execution role.
  3. You need to grant this role permissions to access the local database, the shared table, and the resource link you created previously in Account B’s Lake Formation.
  4. You also need to attach the following custom policy to this role. This policy allows Studio to access data via Lake Formation and allows Account B to get data partitions for querying the titanic dataset from the created tables:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "glue:GetPartitions"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
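
One way to attach this policy to the Studio execution role is with boto3; the role and policy names below are placeholders:

import json
import boto3

# Sketch: attach the inline policy shown above to the Studio execution role
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["lakeformation:GetDataAccess", "glue:GetPartitions"],
        "Resource": ["*"]
    }]
}

iam = boto3.client('iam')
iam.put_role_policy(RoleName='<studio-execution-role-name>',
                    PolicyName='LakeFormationDataAccess',
                    PolicyDocument=json.dumps(policy_document))
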
  1. Switch back to the Lake Formation console.
  2. Here, we need to grant permissions for the SageMaker execution role to access the shared titanic_datalake_bucket_as table.

This is the table that you shared to Account B from Account A via AWS RAM.

  1. In Account B, on the table details page, on the Actions menu, under Permissions, choose Grant.
  2. Grant the role access to the table and five columns.

  1. Finally, grant the SageMaker execution role permissions to access the local titanic table in Account B.

Cross-account data access in Studio

In this final stage, you should be ready to validate the steps deployed so far by testing this in the Data Wrangler interface.

  1. On the Import tab, for Import data, choose Amazon Athena as your data source.

  1. For Data catalog, choose AwsDataCatalog.
  2. For Database, choose the local database you created in Account B (local_db).

You should be able to see the local table (titanic_local) in the right pane.

  1. Run an Athena query as shown in the following screenshot to see the selected columns of the titanic dataset that you gave to the SageMaker execution role in Lake Formation (Account B).
  2. Choose Import dataset.

  1. For Dataset Name, enter a name (for example, titanic-dataset).
  2. Choose Add.

This imports the titanic dataset, and you should be able to see the data flow page with the visual blocks on the Prepare tab.

Conclusion

In this post, we demonstrated how to enable cross-account access for Data Wrangler using Lake Formation and AWS RAM. Following this methodology, organizations can allow multiple data science and engineering teams to access data from a central data lake and build feature pipelines and transformation recipes consistently. For more information about Data Wrangler, see Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning and Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler.

Give Data Wrangler a try and share your feedback and questions in the comments section.


About the Authors

Rizwan Gilani is a Software Development Engineer at Amazon SageMaker. His passion lies with making machine learning more interactive and accessible at scale. Before that, he worked on Amazon Alexa as part of the core team that launched Alexa Communications.

Phi Nguyen is a solutions architect at AWS helping customers with their cloud journey with a special focus on data lake, analytics, semantics technologies and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team or enjoying nature walk with his family.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Read More

AWS and NVIDIA to bring Arm-based instances with GPUs to the cloud

AWS continues to innovate on behalf of our customers. We’re working with NVIDIA to bring an Arm processor-based, NVIDIA GPU accelerated Amazon Elastic Compute Cloud (Amazon EC2) instance to the cloud in the second half of 2021. This instance will feature the Arm-based AWS Graviton2 processor, which was built from the ground up by AWS and optimized for how customers run their workloads in the cloud, eliminating a lot of unneeded components that otherwise might go into a general-purpose processor.

AWS innovation with Arm technology

AWS has continued to pioneer cloud computing for our customers. In 2018, AWS was the first major cloud provider to offer Arm-based instances in the cloud with EC2 A1 instances powered by AWS Graviton processors. These instances are built around Arm cores and make extensive use of AWS custom-built silicon. They’re a great fit for scale-out workloads in which you can share the load across a group of smaller instances.

In 2020, AWS released AWS-designed, Arm-based Graviton2 processors, delivering a major leap in performance and capabilities over first-generation AWS Graviton processors. These processors power EC2 general purpose (M6g, M6gd, T4g), compute-optimized (C6g, C6gd, C6gn), and memory-optimized (R6g, R6gd, X2gd) instances, and provide up to 40% better price performance over comparable current generation x86-based instances for a wide variety of workloads. AWS Graviton2 processors deliver seven times more performance, four times more compute cores, five times faster memory, and twice as much cache compared with first-generation AWS Graviton processors.

Customers including Domo, Formula One, Honeycomb.io, Intuit, LexisNexis Risk Solutions, Nielsen, NextRoll, Redbox, SmugMug, Snap, and Twitter have seen significant performance gains and reduced costs from running AWS Graviton2-based instances in production. AWS Graviton2 processors, based on the 64-bit Arm architecture, are supported by popular Linux operating systems, including Amazon Linux 2, Red Hat, SUSE, and Ubuntu. Many popular applications and services from AWS and ISVs also support AWS Graviton2-based instances. Arm developers can use these instances to build applications natively in the cloud, thereby eliminating the need for emulation and cross-compilation, which are error-prone and time-consuming. Adding NVIDIA GPUs accelerates Graviton2-based instances for diverse cloud workloads, including gaming and other Arm-based workloads like machine learning (ML) inference.

Easily move Android games to the cloud

According to research from App Annie, mobile gaming is now the most popular form of gaming and has overtaken console, PC, and Mac. Additional research from App Annie has shown that up to 10% of all time spent on mobile devices is with games, and game developers need to support and optimize their games for the diverse set of mobile devices being used today and in the future. By leveraging the cloud, game developers can provide a uniform experience across the spectrum of mobile devices and extend battery life due to lower compute and power demands on the mobile device. The AWS Graviton2 instance with NVIDIA GPU acceleration enables game developers to run Android games natively, encode the rendered graphics, and stream the game over networks to a mobile device, all without needing to run emulation software on x86 CPU-based infrastructure.

Cost-effective, GPU-based machine learning inference

In addition to mobile gaming, customers running machine learning models in production are continuously looking for ways to lower costs as ML inference can represent up to 90% of the overall infrastructure spend for running these applications at scale. With this new offering, customers will be able to take advantage of the price/performance benefits of Graviton2 to deploy GPU accelerated deep learning models at a significantly lower cost vs. x86-based instances with GPU acceleration.

AWS and NVIDIA: A long history of collaboration

AWS and NVIDIA have collaborated for over 10 years to continually deliver powerful, cost-effective, and flexible GPU-based solutions to customers including the latest EC2 G4 instances with NVIDIA T4 GPUs launched in 2019 and EC2 P4d instances with NVIDIA A100 GPUs launched in 2020. EC2 P4d instances are deployed in hyperscale clusters called EC2 UltraClusters that are comprised of the highest performance compute, networking, and storage in the cloud. EC2 UltraClusters support 400 Gbps instance networking, Elastic Fabric Adapter (EFA), and NVIDIA GPUDirect RDMA technology to help rapidly train ML models using scale-out and distributed techniques.

In addition to being first in the cloud to offer GPU accelerated instances and first in the cloud to offer NVIDIA V100 GPUs, we’re now working together with NVIDIA to offer new EC2 instances that combine an Arm-based processor with a GPU accelerator in the second half of 2021. To learn more about how AWS and NVIDIA work together to bring innovative technology to customers, visit AWS at NVIDIA GTC 21.


About the Author

Geoff Murase is a Senior Product Marketing Manager for AWS EC2 accelerated computing instances, helping customers meet their compute needs by providing access to hardware-based compute accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). In his spare time, he enjoys playing basketball and biking with his family.

Read More

Detect abnormal equipment behavior and review predictions using Amazon Lookout for Equipment and Amazon A2I

Companies that operate and maintain a broad range of industrial machinery such as generators, compressors, and turbines are constantly working to improve operational efficiency and avoid unplanned downtime due to component failure. They invest heavily in physical sensors (tags), data connectivity, data storage, and data visualization to monitor the condition of their equipment and get real-time alerts for predictive maintenance.

With machine learning (ML), more powerful technologies have become available that can provide data-driven models that learn from an equipment’s historical data. However, implementing such ML solutions is time-consuming and expensive because it involves managing and setting up complex infrastructure and having the right ML skills. Furthermore, ML applications need human oversight to ensure accuracy with sensitive data, help provide continuous improvements, and retrain models with updated predictions. However, you’re often forced to choose between an ML-only or human-only system. Companies are looking for the best of both worlds—integrating ML systems into your workflow while keeping a human eye on the results to achieve higher precision.

In this post, we show you how you can set up Amazon Lookout for Equipment to train an abnormal behavior detection model using a wind turbine dataset for predictive maintenance, use a human in the loop workflow to review the predictions using Amazon Augmented AI (Amazon A2I), and augment the dataset and retrain the model.

Solution overview

Amazon Lookout for Equipment analyzes the data from your sensors, such as pressure, flow rate, RPMs, temperature, and power, to automatically train a specific ML model based on your data, for your equipment, with no ML expertise required. Amazon Lookout for Equipment uses your unique ML model to analyze incoming sensor data in near-real time and accurately identify early warning signs that could lead to machine failures. This means you can detect equipment abnormalities with speed and precision, quickly diagnose issues, take action to reduce expensive downtime, and reduce false alerts.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether running on AWS or not.

To get started with Amazon Lookout for Equipment, we create a dataset, ingest data, train a model, and run inference by setting up a scheduler. After going through these steps, we show you how you can quickly set up a human review process using Amazon A2I and retrain your model with augmented or human reviewed datasets.

In the accompanying Jupyter notebook, we walk you through the following steps:

  1. Create a dataset in Amazon Lookout for Equipment.
  2. Ingest data into the Amazon Lookout for Equipment dataset.
  3. Train a model in Amazon Lookout for Equipment.
  4. Run diagnostics on the trained model.
  5. Create an inference scheduler in Amazon Lookout for Equipment to send a simulated stream of real-time requests.
  6. Set up an Amazon A2I private human loop and review the predictions from Amazon Lookout for Equipment.
  7. Retrain your model based on augmented datasets from Amazon A2I.

Architecture overview

The following diagram illustrates our solution architecture.

The workflow contains the following steps:

  1. The architecture assumes that the inference pipeline is built and sensor data is periodically stored in the S3 path for inference inputs. These inputs are stored in CSV format with corresponding timestamps in the file name.
  2. Amazon Lookout for Equipment wakes up at a prescribed frequency and processes the most recent file from the inference inputs Amazon Simple Storage Service (Amazon S3) path (a scheduler sketch follows this list).
  3. Inference results are stored in the inference outputs S3 path in JSON lines file format. The outputs also contain event diagnostics, which are used for root cause analysis.
  4. When Amazon Lookout for Equipment detects an anomaly, the inference input and outputs are presented to the private workforce for validation via Amazon A2I.
  5. A private workforce investigates and validates the detected anomalies and provides new anomaly labels. These labels are stored in a new S3 path.
  6. Training data is also updated, along with the corresponding new labels, and is staged for subsequent model retraining.
  7. After enough new labels are collected, a new Amazon Lookout for Equipment model is created, trained, and deployed. The retraining cycle can be repeated for continuous model retraining.
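For step 1, each inference input file name embeds the component name and a timestamp, separated by a delimiter. Assuming an underscore delimiter and a yyyyMMddHHmmss timestamp format (both configured later through the scheduler parameters, so illustrative here), a single input file might look like the following:

s3://<your s3 bucket>/data/wind-turbine/inference-a2i/input/Component1_20210329153000.csv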

Prerequisites

Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.

Make sure your SageMaker notebook has the necessary AWS Identity and Access Management (IAM) roles and permissions mentioned in the prerequisite section of the notebook.

  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
cd SageMaker
git clone https://github.com/aws-samples/lookout-for-equipment-demo
  5. First, run the data preparation notebook – 1_data_preparation.ipynb.
  6. Then open the notebook for this post – 3_integrate_l4e_and_a2i.ipynb.

You’re now ready to run the following steps through the notebook cells. Run the setup environment step to set up the necessary Python SDKs and libraries that we use throughout the notebook.

  7. Provide an AWS Region, create an S3 bucket, and provide details of the bucket in the following code cell:
REGION_NAME = '<your region>'
BUCKET = '<your bucket name>'
PREFIX = 'data/wind-turbine'

Analyze the dataset and create component metadata

In this section, we walk you through how to preprocess the existing wind turbine data and ingest it into Amazon Lookout for Equipment. Make sure you run the data preparation notebook before the accompanying notebook for this post so you can follow all the steps. You need a data schema to use your existing historical data with Amazon Lookout for Equipment. The data schema tells Amazon Lookout for Equipment what the data means. Because a data schema describes the data, its structure mirrors that of the data files of the components it describes.

All components must be described in the data schema. The data for each component is contained in a separate CSV file structured as shown in the data schema.

You store the data for each asset’s component in a separate CSV file using the following folder structure:

S3 bucket > Asset_name > Component 1 > Component1.csv
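For orientation, a data schema for this layout lists each component and its columns. The following sketch, expressed as a Python dictionary, uses illustrative component and column names; the accompanying notebook generates the actual schema from the turbine data:

data_schema = {
    "Components": [
        {
            "ComponentName": "Component1",   # illustrative component name
            "Columns": [
                {"Name": "Timestamp", "Type": "DATETIME"},
                {"Name": "Q_avg", "Type": "DOUBLE"},     # reactive power tag
                {"Name": "Ws1_avg", "Type": "DOUBLE"},   # wind speed tag
            ],
        }
    ]
}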

Go to the notebook section Pre-process and Load Datasets and run the following cell to inspect the data:

import pandas as pd
turbine_id = 'R80711'
df = pd.read_csv(f'../data/wind-turbine/interim/{turbine_id}.csv', index_col = 'Timestamp')
df.head()

The following screenshot shows our output.

Now we create a components map to build the dataset structure that Amazon Lookout for Equipment expects for ingestion. Run the notebook cells under the section Create the Dataset Component Map to create the component map and generate a CSV file for ingest.
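For illustration, the component map is a dictionary keyed by component name, listing the Timestamp column followed by the sensor tags for that component. The sketch below uses a single illustrative component with a subset of the turbine tags; the notebook builds the full map for you:

DATASET_COMPONENT_FIELDS_MAP = {
    'Component1': ['Timestamp', 'Q_avg', 'Ws1_avg', 'Ot_avg', 'Nf_avg', 'Ba_avg']
}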

Create the Amazon Lookout for Equipment dataset

We use Amazon Lookout for Equipment Create Dataset APIs to create a dataset and provide the component map we created in the previous step as an input. Run the following notebook cell to create a dataset:

ROLE_ARN = sagemaker.get_execution_role()
# REGION_NAME = boto3.session.Session().region_name
DATASET_NAME = 'wind-turbine-train-dsv2-PR'
MODEL_NAME = 'wind-turbine-PR-v1'

lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)

pp = pprint.PrettyPrinter(depth=5)
pp.pprint(eval(lookout_dataset.dataset_schema))
lookout_dataset.create()

You get the following output:

Dataset "wind-turbine-train-dsv2-PR" does not exist, creating it...


{'DatasetName': 'wind-turbine-train-dsv2-PR',
 'DatasetArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:dataset/wind-turbine-train-dsv2-PR/8325802a-9bb7-48fb-804b-ab9f5b79f49d',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': '52dc754c-84da-4a8c-aaef-1908e4348837',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '52dc754c-84da-4a8c-aaef-1908e4348837',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '203',
   'date': 'Thu, 25 Mar 2021 21:18:29 GMT'},
  'RetryAttempts': 0}}

Alternatively, you can go to the Amazon Lookout for Equipment console to view the dataset.

You can choose View under Data schema to view the schema of the dataset. You can choose Ingest new data to start ingesting data through the console, or you can use the APIs shown in the notebook to do the same using Python Boto3 APIs.
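If you prefer to call the API directly, the following is a minimal Boto3 sketch of the same ingestion step; the S3 prefix is an assumption, and the notebook wraps this call (with status polling) for you:

import uuid
import boto3

l4e = boto3.client('lookoutequipment', region_name=REGION_NAME)

# Start the ingestion job against the dataset created earlier
ingestion_response = l4e.start_data_ingestion_job(
    DatasetName=DATASET_NAME,
    IngestionInputConfiguration={
        'S3InputConfiguration': {
            'Bucket': BUCKET,
            'Prefix': f'{PREFIX}/training_data/'   # assumed location of the ingestion CSV files
        }
    },
    RoleArn=ROLE_ARN,
    ClientToken=str(uuid.uuid4())
)
print(ingestion_response['Status'])   # IN_PROGRESS while the job runs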

Run the notebook cells to ingest the data. When ingestion is complete, you get the following response:

=====Polling Data Ingestion Status=====

2021-03-25 21:18:45 |  IN_PROGRESS
2021-03-25 21:19:46 |  IN_PROGRESS
2021-03-25 21:20:46 |  IN_PROGRESS
2021-03-25 21:21:46 |  IN_PROGRESS
2021-03-25 21:22:46 |  SUCCESS

Now that we have preprocessed and ingested the data into Amazon Lookout for Equipment, we move on to the training steps.

Label your dataset using the SageMaker labeling workforce

If you don’t have an existing labeled dataset available to directly use with Amazon Lookout for Equipment, create a custom labeling workflow. This may be relevant in a use case in which, for example, a company wants to build a remote operating facility where alerts from various operations are sent to the central facility for subject matter experts (SMEs) to review and update. For a sample crowd HTML template for your labeling UI, refer to our GitHub repository.

The following screenshot shows an example of what the sample labeling UI looks like.

For this post, we use the labels that came with the dataset for training. If you want to use the label file you created for your actual training in the next step, you need to copy the label file to an S3 bucket and provide the location in the training configuration.

Create a model in Amazon Lookout for Equipment

We walk you through the following steps in this section:

  • Prepare the model parameters and split data into test and train sets
  • Train the model using Amazon Lookout for Equipment APIs
  • Get diagnostics for the trained model

Prepare the model parameters and split the data

In this step, we split the dataset into test and train sets, prepare labels, and start the training using the notebook. Run the notebook cells under Split train and test to split the dataset 80/20 for training and testing, respectively. Then run the prepare labels code and move on to setting up the training configuration, as shown in the following code:

# Prepare the model parameters:
lookout_model = lookout.LookoutEquipmentModel(model_name=MODEL_NAME,
                                              dataset_name=DATASET_NAME,
                                              region_name=REGION_NAME)

# Set the training / evaluation split date:
lookout_model.set_time_periods(evaluation_start,
                               evaluation_end,
                               training_start,
                               training_end)

# Set the label data location:
lookout_model.set_label_data(bucket=BUCKET, 
                             prefix=PREFIX+'/labelled_data/',
                             access_role_arn=ROLE_ARN)

# This sets up the rate the service will resample the data before 
# training:
lookout_model.set_target_sampling_rate(sampling_rate='PT10M')

In the preceding code, we set up model training parameters such as time periods, label data, and target sampling rate for our model. For more information about these parameters, see CreateModel.
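For reference, the time boundaries above can be derived with a simple 80/20 time-based split over the sensor DataFrame index. The following is only a sketch; the notebook computes the actual values:

import pandas as pd

# Illustrative 80/20 time-based split using the turbine DataFrame loaded earlier
timestamps = pd.to_datetime(df.index).sort_values()
split_point = timestamps[int(len(timestamps) * 0.8)]

training_start = timestamps.min()
training_end = split_point
evaluation_start = split_point
evaluation_end = timestamps.max()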

Train model

After setting these model parameters, you need to run the following train model API to start training your model with your dataset and the training parameters:

lookout_model.train()

You get the following response:

{'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:<accountid>:model/wind-turbine-PR-v1/fac217a9-8855-4931-95f9-dd47f0af1ec5',
 'Status': 'IN_PROGRESS',
 'ResponseMetadata': {'RequestId': '3d385895-c62e-4126-9622-38f0ebed9715',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3d385895-c62e-4126-9622-38f0ebed9715',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '152',
   'date': 'Thu, 25 Mar 2021 21:27:05 GMT'},
  'RetryAttempts': 0}}

Alternatively, you can go to the Amazon Lookout for Equipment console and monitor the training after you create the model.

The sample turbine dataset we provide in our example has millions of data points. Training takes approximately 2.5 hours.

Evaluate the trained model

After a model is trained, Amazon Lookout for Equipment evaluates its performance and displays the results. It provides an overview of the performance and detailed information about the abnormal equipment behavior events and how well the model performed when detecting those. With the data and failure labels that you provided for training and evaluating the model, Amazon Lookout for Equipment reports how many times the model’s predictions were true positives. It also reports the average forewarning time across all true positives. Additionally, it reports the false positive results generated by the model, along with the duration of the non-event.

For more information about performance evaluation, see Evaluating the output.
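You can also pull the evaluation summary programmatically once training finishes. The following Boto3 sketch is not part of the notebook; the metrics are returned as a JSON string in the DescribeModel response:

import json
import boto3

l4e = boto3.client('lookoutequipment', region_name=REGION_NAME)

# Retrieve the trained model description and parse the evaluation metrics
model_description = l4e.describe_model(ModelName=MODEL_NAME)
metrics = json.loads(model_description.get('ModelMetrics', '{}'))
print(json.dumps(metrics, indent=2, default=str))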

Review training diagnostics

Run the following code to generate the training diagnostics. Refer to the accompanying notebook for the complete code block to run for this step.

LookoutDiagnostics = lookout.LookoutEquipmentAnalysis(model_name=MODEL_NAME, tags_df=df, region_name=REGION_NAME)
LookoutDiagnostics.set_time_periods(evaluation_start, evaluation_end, training_start, training_end)
predicted_ranges = LookoutDiagnostics.get_predictions()

The results returned show the percentage contribution of each feature towards the abnormal equipment prediction for the corresponding date range.

Create an inference scheduler in Amazon Lookout for Equipment

In this step, we show you how the CreateInferenceScheduler API creates and starts a scheduler. Note that the scheduler starts incurring cost as soon as it’s running. Scheduling an inference means setting up a continuous, near-real-time inference plan to analyze new measurement data. When setting up the scheduler, you provide an S3 bucket location for the input data, specify the delimiter used between the component name and the timestamp in the input file names, set an offset delay if desired, and set the frequency of inference. You must also provide an S3 bucket location for the output data. Run the following notebook section to create an inference scheduler for the model:

scheduler = lookout.LookoutEquipmentScheduler(
    scheduler_name=INFERENCE_SCHEDULER_NAME,
    model_name=MODEL_NAME_FOR_CREATING_INFERENCE_SCHEDULER,
    region_name=REGION_NAME
)

scheduler_params = {
    'input_bucket': INFERENCE_DATA_SOURCE_BUCKET,
    'input_prefix': INFERENCE_DATA_SOURCE_PREFIX,
    'output_bucket': INFERENCE_DATA_OUTPUT_BUCKET,
    'output_prefix': INFERENCE_DATA_OUTPUT_PREFIX,
    'role_arn': ROLE_ARN_FOR_INFERENCE,
    'upload_frequency': DATA_UPLOAD_FREQUENCY,
    'delay_offset': DATA_DELAY_OFFSET_IN_MINUTES,
    'timezone_offset': INPUT_TIMEZONE_OFFSET,
    'component_delimiter': COMPONENT_TIMESTAMP_DELIMITER,
    'timestamp_format': TIMESTAMP_FORMAT
}

scheduler.set_parameters(**scheduler_params)

After setting the scheduler parameters, the next step is to prepare some sample datasets for inference.

Prepare the inference data

Run through the notebook steps to prepare the inference data. Let’s load the tags description; this dataset comes with a data description file. From here, we can collect the list of components (subsystem column) if required. We use the tag metadata from the data descriptions as a point of reference for our interpretation. We use the tag names to construct a list that Amazon A2I uses. For more details, refer to the section Set up Amazon A2I to review predictions from Amazon Lookout for Equipment in this post.

To build our sample inference dataset, we extract the last 2 hours of data from the evaluation period of the original time series. Specifically, we create three CSV files containing simulated real-time tags for our turbine 10 minutes apart. These are all stored in Amazon S3 in the inference-a2i folder. Now that we’ve prepared the data, create the scheduler by running the following code:

create_scheduler_response = scheduler.create()

You get the following response:

===== Polling Inference Scheduler Status =====

Scheduler Status: PENDING
Scheduler Status: RUNNING

===== End of Polling Inference Scheduler Status =====

Alternatively, on the Amazon Lookout for Equipment console, go to the Inference schedule settings section of your trained model and set up a scheduler by providing the necessary parameters.

Get inference results

Run through the notebook steps under List inference executions to get the run details from the scheduler you created in the previous step. Wait 5–15 minutes for the scheduler to run its first inference. When it’s complete, we can use the ListInferenceExecutions API for our current inference scheduler. The only mandatory parameter is the scheduler name.

You can also choose a time period for which you want to query inference runs. If you don’t specify it, all runs for an inference scheduler are listed. If you want to specify the time range, you can use the following code:

START_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,3,0,0,0)
END_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,5,0,0,0)

This code means that the runs after 2010-01-03 00:00:00 and before 2010-01-05 00:00:00 are listed.

You can also choose to query for runs in a particular status, such as IN_PROGRESS, SUCCESS, and FAILED:

START_TIME_FOR_INFERENCE_EXECUTIONS = None
END_TIME_FOR_INFERENCE_EXECUTIONS = None
EXECUTION_STATUS = None

execution_summaries = []

while len(execution_summaries) == 0:
    execution_summaries = scheduler.list_inference_executions(
        start_time=START_TIME_FOR_INFERENCE_EXECUTIONS,
        end_time=END_TIME_FOR_INFERENCE_EXECUTIONS,
        execution_status=EXECUTION_STATUS
    )
    if len(execution_summaries) == 0:
        print('WAITING FOR THE FIRST INFERENCE EXECUTION')
        time.sleep(60)
        
    else:
        print('FIRST INFERENCE EXECUTED\n')
        break
            
execution_summaries

You get the following response:

[{'ModelName': 'wind-turbine-PR-v1',
  'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:model/wind-turbine-PR-v1/fac217a9-8855-4931-95f9-dd47f0af1ec5',
  'InferenceSchedulerName': 'wind-turbine-scheduler-a2i-PR-v10',
  'InferenceSchedulerArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:inference-scheduler/wind-turbine-scheduler-a2i-PR-v10/e633c39d-a4f9-49f6-8248-7594349db2d0',
  'ScheduledStartTime': datetime.datetime(2021, 3, 29, 15, 35, tzinfo=tzlocal()),
  'DataStartTime': datetime.datetime(2021, 3, 29, 15, 30, tzinfo=tzlocal()),
  'DataEndTime': datetime.datetime(2021, 3, 29, 15, 35, tzinfo=tzlocal()),
  'DataInputConfiguration': {'S3InputConfiguration': {'Bucket': '<your s3 bucket>',
    'Prefix': 'data/wind-turbine/inference-a2i/input/'}},
  'DataOutputConfiguration': {'S3OutputConfiguration': {'Bucket': '<your s3 bucket>',
    'Prefix': 'data/wind-turbine/inference-a2i/output/'}},
  'CustomerResultObject': {'Bucket': '<your s3 bucket>',
   'Key': 'data/wind-turbine/inference-a2i/output/2021-03-29T15:30:00Z/results.jsonl'},
  'Status': 'SUCCESS'}]

Get actual prediction results

After each successful inference, a JSON file is created in the output location of your bucket. Each inference creates a new folder with a single results.jsonl file in it. You can run through this section in the notebook to read these files and display their content.
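The results_df shown next is built in the notebook. As a minimal sketch, the JSON lines outputs could be collected into such a DataFrame as follows (the output prefix is an assumption that matches the scheduler configuration shown earlier):

import json
import boto3
import pandas as pd

s3 = boto3.client('s3')
output_prefix = f'{PREFIX}/inference-a2i/output/'   # assumed inference output prefix
paginator = s3.get_paginator('list_objects_v2')

records = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=output_prefix):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('results.jsonl'):
            body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read().decode('utf-8')
            records.extend(json.loads(line) for line in body.splitlines() if line)

results_df = pd.DataFrame(records)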

results_df

The following screenshot shows the results.

Stop the inference scheduler

Make sure to stop the inference scheduler; we don’t need it for the rest of the steps in this post. However, as part of your solution, the inference scheduler should be running to ensure real-time inference for your equipment continues. Run through this notebook section to stop the inference scheduler.
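For reference, the equivalent Boto3 call is a one-liner (a sketch; the notebook section handles this for you):

import boto3

l4e = boto3.client('lookoutequipment', region_name=REGION_NAME)
# Stop the scheduler so it no longer wakes up (and no longer incurs cost)
l4e.stop_inference_scheduler(InferenceSchedulerName=INFERENCE_SCHEDULER_NAME)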

Set up Amazon A2I to review predictions from Amazon Lookout for Equipment

Now that inference is complete, let’s understand how to set up a UI to review the inference results and update them, so we can send the feedback back to Amazon Lookout for Equipment for retraining the model. In this section, we show how to use the Amazon A2I custom task type to integrate with Amazon Lookout for Equipment through the walkthrough notebook to set up a human in the loop process. It includes the following steps:

  • Create a human task UI
  • Create a workflow definition
  • Send predictions to Amazon A2I human loops
  • Sign in to the worker portal and annotate Amazon Lookout for Equipment inference predictions

Follow the steps provided in the notebook to initialize Amazon A2I APIs. Make sure to set up the bucket name in the initialization block where you want your Amazon A2I output:

a2ibucket = '<your bucket>'

You also need to create a private workforce and provide a work team ARN in the initialize step.

On the SageMaker console, create a private workforce. After you create the private workforce, find the workforce ARN and enter the ARN in the notebook:

WORKTEAM_ARN = 'your private workforce team ARN'

Create the human task UI

You now create a human task UI resource, providing a UI template in Liquid HTML. You can download the provided template and customize it. This template is rendered to the human workers whenever a human loop is required. For over 70 pre-built UIs, see the amazon-a2i-sample-task-uis GitHub repo. We also provide this template in our GitHub repo.

You can use this template to create a task UI either via the console or by running the following code in the notebook:

def create_task_ui():
 
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
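The template variable passed to create_human_task_ui comes from the repo. As a rough, stripped-down illustration only (field names here are assumptions chosen to align with the signal list and the faulty-row parsing shown later), a custom template is a Liquid crowd HTML form along these lines:

# Stripped-down illustrative template; the actual template in the GitHub repo is richer
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <p>Review each reading and mark whether the equipment looks faulty.</p>
  {% for signal in task.input.signal %}
    <p>{{ signal.timestamp }} | wind speed: {{ signal.wind_speed_1 }} | reactive power: {{ signal.reactive_power }}</p>
  {% endfor %}
  <crowd-checkbox name="faulty-1">Equipment looks faulty</crowd-checkbox>
</crowd-form>
"""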

Create a human review workflow definition

Workflow definitions allow you to specify the following:

  • The worker template or human task UI you created in the previous step.
  • The workforce that your tasks are sent to. For this post, it’s the private workforce you created in the prerequisite steps.
  • The instructions that your workforce receives.

This post uses the Create Flow Definition API to create a workflow definition. Run the following cell in the notebook:

create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn=role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and select correct values as indicated",
            "TaskTitle": "Equipment Condition Review"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] 

Send predictions to Amazon A2I human loops

We create an item list from the Pandas DataFrame where we have the Amazon Lookout for Equipment output saved. Run the following notebook cell to create a list of items to send for review:

NUM_TO_REVIEW = 5 # number of line items to review
dftimestamp = sig_full_df['Timestamp'].astype(str).to_list()
dfsig001 = sig_full_df['Q_avg'].astype(str).to_list()
dfsig002 = sig_full_df['Ws1_avg'].astype(str).to_list()
dfsig003 = sig_full_df['Ot_avg'].astype(str).to_list()
dfsig004 = sig_full_df['Nf_avg'].astype(str).to_list()
dfsig046 = sig_full_df['Ba_avg'].astype(str).to_list()
sig_list = [{'timestamp': dftimestamp[x], 'reactive_power': dfsig001[x], 'wind_speed_1': dfsig002[x], 'outdoor_temp': dfsig003[x], 'grid_frequency': dfsig004[x], 'pitch_angle': dfsig046[x]} for x in range(NUM_TO_REVIEW)]
sig_list

Run the following code to create a JSON input for the Amazon A2I loop. This contains the lists that are sent as input to the Amazon A2I UI displayed to the human reviewers.

ip_content = {
    "signal": sig_list,
    "anomaly": ano_list
}
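The ano_list referenced above is built in the notebook from the scheduler output. As a sketch only (the 'prediction' column name is an assumption about results_df), it could be constructed like this:

# Illustrative: pull the anomaly flags for the reviewed line items from the inference results
ano_list = results_df['prediction'].astype(str).to_list()[:NUM_TO_REVIEW]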

Run the following notebook cell to call the Amazon A2I API to start the human loop:

import json
humanLoopName = str(uuid.uuid4())

start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(ip_content)
            }
        )

You can check the status of the human loop by running the next cell in the notebook.

Annotate the results via the worker portal

Run the following notebook cell to get a login link to navigate to the private workforce portal:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

You’re redirected to the Amazon A2I console. Select the human review job and choose Start working. After you review the changes and make corrections, choose Submit.

You can evaluate the results stored in Amazon S3.

Evaluate the results

When the labeling work is complete, your results should be available in the S3 output path specified in the human review workflow definition. The human answers are returned and saved in the JSON file. Run the notebook cell to get the results from Amazon S3:

import re
import pprint

pp = pprint.PrettyPrinter(indent=4)
json_output = ''
for resp in completed_human_loops:
    splitted_string = re.split('s3://' + a2ibucket  + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    print(splitted_string[1])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=a2ibucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
    print('\n')

You get a response with human reviewed answers and flow-definition. Refer to the notebook to get the complete response.

Model retraining based on augmented datasets from Amazon A2I

Now we take the Amazon A2I output, process it, and send it back to Amazon Lookout for Equipment to retrain our model based on the human corrections. Refer to the accompanying notebook for all the steps to complete in this section. Let’s look at the last few entries of our original label file:

labels_df = pd.read_csv(os.path.join(LABEL_DATA, 'labels.csv'), header=None)
labels_df[0] = pd.to_datetime(labels_df[0])
labels_df[1] = pd.to_datetime(labels_df[1])
labels_df.columns = ['start', 'end']
labels_df.tail()

The following screenshot shows the labels file.

Update labels with new date ranges

Now let’s update our existing labels dataset with the new labels we received from the Amazon A2I human review process:

faulty = False
a2i_lbl_df = labels_df
x = json_output['humanAnswers'][0]
row_df = pd.DataFrame(columns=['rownr'])
tslist = {}

# Let's first check if the user marked equipment as faulty and, if so, get those row numbers into a dataframe
for i in json_output['humanAnswers']:
    print("checking equipment review...")
    x = i['answerContent']
    for idx, key in enumerate(x):
        if "faulty" in key:
            if str(x.get(key)).split(':')[1].lstrip().strip('}') == "True": # faulty equipment selected
                    faulty = True
                    row_df.loc[len(row_df.index)] = [key.split('-')[1]] 
                    print("found faulty equipment in row: " + key.split('-')[1])


# Now we will get the date ranges for the faulty choices                     
for idx,k in row_df.iterrows():
    x = json_output['humanAnswers'][0]
    strchk = "TrueStart"+k['rownr']
    endchk = "TrueEnd"+k['rownr']
    for i in x['answerContent']:
        if i == strchk:
            tslist[i] = x['answerContent'].get(i)
        if i == endchk:
            tslist[i] = x['answerContent'].get(i)

            
# And finally let's add it to our new a2i labels dataset
for idx,k in row_df.iterrows():
    x = json_output['humanAnswers'][0]
    strchk = "TrueStart"+k['rownr']
    endchk = "TrueEnd"+k['rownr']
    a2i_lbl_df.loc[len(a2i_lbl_df.index)] = [tslist[strchk], tslist[endchk]]

You get the following response:

checking equipment review...
found faulty equipment in row: 1
found faulty equipment in row: 2

The following screenshot shows the updated labels file.

Let’s upload the updated labels data to a new augmented labels file:

a2i_label_s3_dest_path = f's3://{BUCKET}/{PREFIX}/augmented-labelled-data/labels.csv'
!aws s3 cp $a2i_label_src_fname $a2i_label_s3_dest_path

Update the training dataset with new measurements

We now update our original training dataset with the new measurement range based on what we got back from Amazon A2I. Run the following code to load the original dataset to a new DataFrame that we use to append our augmented data. Refer to the accompanying notebook for all the steps required.

turbine_id = 'R80711'
file = '../data/wind-turbine/final/training-data/'+turbine_id+'/'+turbine_id+'.csv'
newdf = pd.read_csv(file, index_col='Timestamp')
newdf.head()

The following screenshot shows our original training dataset snapshot.

Now we augment the training dataset with the simulated inference data we created earlier, for which the human reviewers indicated faulty equipment during the review. Run the following code to modify the index of the simulated inference dataset so that each reading reflects a 10-minute interval:

sig_full_df = sig_full_df.set_index('Timestamp')
tm = pd.to_datetime('2021-04-05 20:30:00')
print(tm)
new_index = pd.date_range(
    start=tm,
    periods=sig_full_df.shape[0],
    freq='10min'
)
sig_full_df.index = new_index
sig_full_df.index.name = 'Timestamp'
sig_full_df = sig_full_df.reset_index()
sig_full_df['Timestamp'] = pd.to_datetime(sig_full_df['Timestamp'], errors='coerce')

Run the following code to append the simulated inference dataset to the original training dataset:

newdf = newdf.reset_index()
newdf = pd.concat([newdf,sig_full_df])

 

The simulated inference data with the recent timestamp is appended to the end of the training dataset. Now let’s create a CSV file and copy the data to the training channel in Amazon S3:

TRAIN_DATA_AUGMENTED = os.path.join(TRAIN_DATA,'augmented')
os.makedirs(TRAIN_DATA_AUGMENTED, exist_ok=True)
newdf.to_csv('../data/wind-turbine/final/training-data/augmented/'+turbine_id+'.csv')
!aws s3 sync $TRAIN_DATA_AUGMENTED s3://$BUCKET/$PREFIX/training_data/augmented

Now we update the components map with this augmented dataset, reload the data into Amazon Lookout for Equipment, and retrain this training model with this dataset. Refer to the accompanying notebook for the detailed steps to retrain the model.
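For orientation, the retraining pass mirrors the earlier cells, pointed at the augmented training data and the augmented label prefix. The dataset and model names below are illustrative, and the notebook remains the authoritative version:

# Illustrative retraining sketch reusing the helper classes from earlier cells
lookout_dataset_v2 = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME + '-augmented',
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)
lookout_dataset_v2.create()
# ...ingest the CSV files from s3://{BUCKET}/{PREFIX}/training_data/augmented as before, then:

lookout_model_v2 = lookout.LookoutEquipmentModel(model_name=MODEL_NAME + '-retrained',
                                                 dataset_name=DATASET_NAME + '-augmented',
                                                 region_name=REGION_NAME)
lookout_model_v2.set_time_periods(evaluation_start, evaluation_end, training_start, training_end)
lookout_model_v2.set_label_data(bucket=BUCKET,
                                prefix=PREFIX + '/augmented-labelled-data/',
                                access_role_arn=ROLE_ARN)
lookout_model_v2.set_target_sampling_rate(sampling_rate='PT10M')
lookout_model_v2.train()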

Conclusion

In this post, we walked you through how to use Amazon Lookout for Equipment to train a model to detect abnormal equipment behavior with a wind turbine dataset, review diagnostics from the trained model, review the predictions from the model with a human in the loop using Amazon A2I, augment our original training dataset, and retrain our model with the feedback from the human reviews.

With Amazon Lookout for Equipment and Amazon A2I, you can set up a continuous prediction, review, train, and feedback loop to audit predictions and improve the accuracy of your models.

Please let us know what you think of this solution and how it applies to your industrial use case. Check out the GitHub repo for full resources to this post. Visit the webpages to learn more about Amazon Lookout for Equipment and Amazon Augmented AI. We look forward to hearing from you. Happy experimentation!


About the Authors 

Dastan Aitzhanov is a Solutions Architect in Applied AI with Amazon Web Services. He specializes in architecting and building scalable cloud-based platforms with an emphasis on machine learning, internet of things, and big data-driven applications. When not working, he enjoys going camping, skiing, and spending time in the great outdoors with his family.

 

 

Prem Ranga is an Enterprise Solutions Architect based out of Atlanta, GA. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

 

Mona Mona is a Senior AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with public sector customers, and helps them adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.

 

 

Baris Yasin is a Solutions Architect at AWS. He’s passionate about AI/ML & Analytics technologies and helping startup customers solve challenging business and technical problems with AWS.

 


Acoustic anomaly detection using Amazon Lookout for Equipment

As the modern factory becomes more connected, manufacturers are increasingly using a range of inputs (such as process data, audio, and visual) to increase their operational efficiency. Companies use this information to monitor equipment performance and anticipate failures using predictive maintenance techniques powered by machine learning (ML) and artificial intelligence (AI). Although traditional sensors built into the equipment can be informative, audio and visual inspection can also provide insights into the health of the asset. However, leveraging this data and gaining actionable insights can be highly manual and resource prohibitive.

Koch Ag & Energy Solutions, LLC (KAES) took the opportunity to collaborate with Amazon ML Solutions Lab to learn more about alternative acoustic anomaly detection solutions and to get another set of eyes on their existing solution.

The ML Solutions Lab team used the existing data collected by KAES equipment in the field for an in-depth acoustic data exploration. In collaboration with the lead data scientist at KAES, the ML Solutions Lab team engaged with an internal team at Amazon that had participated in the Detection and Classification of Acoustic Scenes and Events 2020 competition and won high marks for their efforts. After reviewing the documentation from Giri et al. (2020), the team presented some very interesting insights into the acoustic data:

  • Industrial data is relatively stationary, so the recorded audio window size can be longer in duration.
  • Inference intervals could be increased from 1 second to 10–30 seconds.
  • The sampling rates for the recorded sounds could be lowered and still retain the pertinent information.

Furthermore, the team investigated two different approaches for feature engineering that KAES hadn’t previously explored. The first was an average-spectral featurizer; the second was an advanced deep learning-based featurizer (a VGGish network). For this effort, the team didn’t need the VGGish class predictions. Instead, they removed the top-level classifier layer and kept the network as a feature extractor. With this feature extraction approach, the network converts audio input into a high-level 128-dimensional embedding, which can be fed as input to another ML model. Compared to raw audio features, such as waveforms and spectrograms, this deep learning embedding is more semantically meaningful. The ML Solutions Lab team also designed an optimized API for processing all the audio files, which decreases the I/O time by more than 90% and the overall processing time by around 70%.
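As an illustration of the first approach only (this is not KAES’s implementation), an average-spectral featurizer can be sketched in a few lines with librosa: compute a log-mel spectrogram and average it over time so that each recording becomes a single fixed-length vector.

import librosa

def average_spectral_features(wav_path, sr=16000, n_mels=64):
    """Illustrative average-spectral featurizer: log-mel spectrogram averaged over time."""
    audio, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel.mean(axis=1)   # one n_mels-dimensional vector per recording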

Anomaly detection with Amazon Lookout for Equipment

To implement these solutions, the ML Solutions Lab team used Amazon Lookout for Equipment, a new service that helps to enable predictive maintenance. Amazon Lookout for Equipment uses AI to learn the normal operating patterns of industrial equipment and alert users to abnormal equipment behavior. Amazon Lookout for Equipment helps organizations take action before machine failures occur and avoid unplanned downtime.

Successfully implementing predictive maintenance depends on using the data collected from industrial equipment sensors, under their unique operating conditions, and then applying sophisticated ML techniques to build a custom model that can detect abnormal machine conditions before machine failures occur.

Amazon Lookout for Equipment analyzes the data from industrial equipment sensors to automatically train a specific ML model for that equipment with no ML expertise required. It learns the multivariate relationships between the sensors (tags) that define the normal operating modes of the equipment. With this service, you can reduce the number of manual data science steps and resource hours to develop a model. Furthermore, Amazon Lookout for Equipment uses the unique ML model to analyze incoming sensor data in near-real time to accurately identify early warning signs that could lead to machine failures with little or no manual intervention. This enables detecting equipment abnormalities with speed and precision, quickly diagnosing issues, taking action to reduce expensive downtime, and reducing false alerts.

With KAES, the ML Solutions Lab team developed a proof of concept pipeline that demonstrated the data ingestion steps for both sound and machine telemetry. The team used the telemetry data to identify the machine operating states and inform which audio data was relevant for training. For example, a pump at low speed has a certain auditory signature, whereas a pump at high speed may have a different auditory signature. The relationship between measurements like RPMs (speed) and the sound are key to understanding machine performance and health. The ML training time decreased from around 6 hours to less than 20 minutes when using Amazon Lookout for Equipment, which enabled faster model explorations.

This pipeline can serve as the foundation to build and deploy anomaly detection models for new assets. After sufficient data is ingested into the Amazon Lookout for Equipment platform, inference can begin and anomaly detections can be identified.

“We needed a solution to detect acoustic anomalies and potential failures of critical manufacturing machinery,” says Dave Kroening, IT Leader at KAES. “Within a few weeks, the experts at the ML Solutions Lab worked with our internal team to develop an alternative, state-of-the-art, deep neural net embedding sound featurization technique and a prototype for acoustic anomaly detection. We were very pleased with the insight that the ML Solutions Lab team provided us regarding our data and educating us on the possibilities of using Amazon Lookout for Equipment to build and deploy anomaly detection models for new assets.”

By merging the sound data with the machine telemetry data and then using Amazon Lookout for Equipment, we can derive important relationships between the telemetry data and the acoustic signals. We can learn the normal healthy operating conditions and healthy sounds in varying operating modes.

If you’d like help accelerating the use of ML in your products and services, please contact the ML Solutions Lab.


About the Authors

Michael Robinson is a Lead Data Scientist at Koch Ag & Energy Solutions, LLC (KAES). His work focuses on computer vision, acoustic, and data engineering. He leverages technical knowledge to solve unique challenges for KAES. In his spare time, he enjoys golfing, photography and traveling.

 

 

Dave Kroening is an IT Leader with Koch Ag & Energy Solutions, LLC (KAES). His work focuses on building out a vision and strategy for initiatives that can create long term value. This includes exploring, assessing, and developing opportunities that have a potential to disrupt the Operating capability within KAES. He and his team also help to discover and experiment with technologies that can create a competitive advantage. In his spare time he enjoys spending time with his family, snowboarding, and racing.

 

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies. Mehdi attended MIT as a postdoctoral researcher and obtained his Ph.D. in Engineering from UCF.

 

 

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

 

 

Yunzhi Shi is a data scientist at the Amazon ML Solutions Lab where he helps AWS customers address business problems with AI and cloud capabilities. Recently, he has been building computer vision, search, and forecast solutions for customers from various industrial verticals. Yunzhi obtained his Ph.D. in Geophysics from the University of Texas at Austin.

 

 

Dan Volk is a Data Scientist at Amazon ML Solutions Lab, where he helps AWS customers across various industries accelerate their AI and cloud adoption. Dan has worked in several fields including manufacturing, aerospace, and sports and holds a Masters in Data Science from UC Berkeley.

 

 

 

Brant Swidler is the Technical Product Manager for Amazon Lookout for Equipment. He focuses on leading product development including data science and engineering efforts. Brant comes from an Industrial background in the oil and gas industry and has a B.S. in Mechanical and Aerospace Engineering from Washington University in St. Louis and an MBA from the Tuck school of business at Dartmouth.
