Getting started with Amazon SageMaker Feature Store

In a machine learning (ML) journey, one crucial step before building any ML model is to transform your data and design features from it so that the data is machine-readable. This step is known as feature engineering. It can include one-hot encoding categorical variables, converting text values to a vectorized representation, aggregating log data to a daily summary, and more. The quality of your features directly influences your model's predictive power, and feature engineering often takes several iterations before a model reaches an ideal level of accuracy. Data scientists and developers can easily spend 60% of their time designing and creating features, and the challenges go beyond writing and testing the feature engineering code. Features built at different times and by different teams aren't consistent. Extensive and repetitive feature engineering work is often needed when productionizing new features. It's also difficult to track feature versions, and up-to-date features aren't easily accessible.

To address these challenges, Amazon SageMaker Feature Store provides a fully managed central repository for ML features, making it easy to securely store and retrieve features without the heavy lifting of managing the infrastructure. It lets you define groups of features, use batch ingestion and streaming ingestion, and retrieve the latest feature values with low latency.

For an introduction to Feature Store and a basic use case using a credit card transaction dataset for fraud detection, see New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store. For further exploration of its features, see Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time.

For this post, we focus on the integration of Feature Store with other Amazon SageMaker features to help you get started quickly. The associated sample notebook and the following video demonstrate how you can apply these concepts to the development of an ML model to predict the risk of heart failure.

The components of Feature Store

Feature Store is a centralized hub for features and associated metadata. Features are defined and stored in a collection called a feature group. You can visualize a feature group as a table in which each column is a feature and each row has a unique identifier. In principle, a feature group is composed of features and the values specific to each feature. A feature group's definition consists of the following (illustrated in the code sketch after this list):

  • Feature definitions – Each consists of a feature name and a data type.
  • A record identifier name – Each feature group is defined with a record identifier name. It should be a unique ID to identify each instance of the data, for example, primary key, customer ID, transaction ID, and so on.
  • Configurations for its online and offline store – You can create an online or offline store. The online store is used for low-latency, real-time inference use cases, and the offline store is used for training and batch inference.
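
The following is a minimal sketch of defining and creating a feature group with these elements using the SageMaker Python SDK. The DataFrame columns, bucket, and role ARN are placeholder assumptions, not values from this post:

import time
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

# Hypothetical processed features; column names are placeholders
df = pd.DataFrame({
    "patient_id": ["p-001", "p-002"],          # record identifier
    "heart_rate": [72.0, 88.0],                # a numeric feature
    "event_time": [time.time(), time.time()],  # event time feature
})
df["patient_id"] = df["patient_id"].astype("string")  # object dtype isn't supported for definitions

feature_group = FeatureGroup(name="heart-failure-features", sagemaker_session=Session())
feature_group.load_feature_definitions(data_frame=df)  # infer feature definitions from the DataFrame

feature_group.create(
    s3_uri="s3://YOUR_BUCKET/feature-store/",   # offline store location
    record_identifier_name="patient_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/YOUR_SAGEMAKER_ROLE",
    enable_online_store=True,                   # also create an online store
)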

The following diagram shows how you can use Feature Store as part of your ML pipeline. First, you read in your raw data and transform it to features ready for exploration and modeling. Then you can create a feature store, configure it to an online or offline store, or both. Next you can ingest data via streaming to the online and offline store, or in batches directly to the offline store. After your feature store is set up, you can create a model using data from your offline store and access it for real time inference or batch inference.

For more hands-on experience, follow the notebook example for a step-by-step guide to build a feature store, train a model for fraud detection, and access the feature store for inference.

Export data from Data Wrangler to Feature Store

Because Feature Store can ingest data in batches, you can author features using Amazon SageMaker Data Wrangler, create feature groups in Feature Store, and ingest features in batches using a SageMaker Processing job with a notebook exported from Data Wrangler. This mode allows for batch ingestion into the offline store. It also supports ingestion into the online store if the feature group is configured for both online and offline use.

To start off, after you complete your data transformation steps and analysis, you can conveniently export your data preparation workflow into a notebook with one click. When you export your flow steps, you have the option of exporting your processing code to a notebook that pushes your processed features to Feature Store.

Choose Export step and Feature Store to automatically create your notebook. This notebook re-creates the transformation steps from your flow, creates a feature group, and adds features to an offline or online feature store, allowing you to easily rerun your steps.

This notebook defines the schema explicitly, instead of auto-detecting the data types for each column of the data, in the following format:

column_schema = [
    {
        "name": "Height",
        "type": "long"
    },
    {
        "name": "Sum",
        "type": "string"
    },
    {
        "name": "Time",
        "type": "string"
    }
]

For more information on how to load the schema, map it, and add it as a FeatureDefinition that you can use to create the FeatureGroup, see Export to the SageMaker Feature Store.
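
As a rough illustration of that mapping (the covered type names are assumptions; see the linked documentation for the full mapping), the schema above can be converted to FeatureDefinition objects like this:

from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum

# column_schema as defined above; map the exported types to Feature Store types
type_map = {
    "long": FeatureTypeEnum.INTEGRAL,
    "float": FeatureTypeEnum.FRACTIONAL,
    "string": FeatureTypeEnum.STRING,
}
feature_definitions = [
    FeatureDefinition(feature_name=col["name"], feature_type=type_map[col["type"]])
    for col in column_schema
]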

Additionally, you must specify a record identifier name and event time feature name in the following code:

  • The record_identifier_name is the name of the feature whose value uniquely identifies a record defined in the feature store.
  • An EventTime is a point in time when a new event occurs that corresponds to the creation or update of a record in a feature group. All records in the feature group must have a corresponding EventTime.

The notebook creates both an offline store and an online store by default, with the following configuration set to True:

online_store_config = {
    "EnableOnlineStore": True
}

You can disable the online store by setting EnableOnlineStore to False in this configuration.

When you run the notebook, it creates a feature group and a processing job that ingests your data at scale. The offline store is located in an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account. Because Feature Store is integrated with Amazon SageMaker Studio, you can visualize the feature store by choosing Components and registries in the navigation pane, choosing Feature Store on the drop-down menu, and then finding your feature store on the list. You can check feature definitions, manage feature group tags, and generate queries for the offline store.
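
If you want to ingest additional records yourself from a notebook, outside the exported processing job, the SDK's ingest call is a lightweight alternative. A minimal sketch, reusing the feature_group and DataFrame from the earlier sketch:

# Writes the DataFrame rows to the feature group using parallel workers
feature_group.ingest(data_frame=df, max_workers=3, wait=True)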

Build a training set from an offline store

Now that you have created a feature store from your processed data, you can build a training dataset from your offline store by using services such as Amazon Athena, AWS Glue, or Amazon EMR. Because Feature Store automatically builds an AWS Glue Data Catalog when you create feature groups, the following example uses that auto-built Data Catalog to create a training dataset with feature values from the feature group.

First, create an Athena query for your feature group with the following code. The table_name is the AWS Glue table that is automatically generated by Feature Store.

sample_query = your_feature_group.athena_query()
data_table = sample_query.table_name

You can then write a SQL query against your feature group and run it with the .run() command, specifying the S3 location where the dataset should be saved. You can modify the query to include any operations your data needs, such as joins, filters, and ordering. You can further process the output DataFrame until it's ready for modeling, then upload it to your S3 bucket so that your SageMaker training job can read the input directly from Amazon S3.

# define your Athena query
query_string = 'SELECT * FROM "'+data_table+'"'

# run Athena query. The output is loaded to a Pandas dataframe.
dataset = pd.DataFrame()
sample_query.run(query_string=query_string, output_location='s3://'+default_s3_bucket_name+'/query_results/')
sample_query.wait()
dataset = sample_query.as_dataframe()

Access your Feature Store for inference

After you build a model from the training set, you can access your online store conveniently to fetch a record and make predictions using the deployed model. Feature Store can be especially useful in supplementing data for inference requests because of the low-latency GetRecord functionality. In this example, you can use the following code to query the online feature group to build an inference request:

selected_id = str(194)

# Helper to parse the feature value from the record.

def get_feature_value(record, feature_name):
    return str(list(filter(lambda r: r['FeatureName'] == feature_name, record))[0]['ValueAsString'])

fs_response = featurestore_runtime.get_record(
                                               FeatureGroupName=your_feature_group_name,
                                               RecordIdentifierValueAsString=selected_id)
selected_record = fs_response['Record']
inference_request = [
    get_feature_value(selected_record, 'feature1'),
    get_feature_value(selected_record, 'feature2'),
    ....
    get_feature_value(selected_record, 'feature10')
]

You can then call the deployed model predictor to generate a prediction for the selected record:

results = predictor.predict(','.join(inference_request), 
                            initial_args = {"ContentType": "text/csv"})
prediction = json.loads(results)

Integrate Feature Store in a SageMaker pipeline

Feature Store also integrates with Amazon SageMaker Pipelines, so you can add feature search, discovery, and reuse to your automated ML workflows. The following code shows how to configure the ProcessingOutput to write the output directly to your feature group instead of Amazon S3, so that you can maintain your model features in a feature store:

flow_step_outputs = []
flow_output = sagemaker.processing.ProcessingOutput(
    output_name=customers_output_name,
    feature_store_output=sagemaker.processing.FeatureStoreOutput(
        feature_group_name=your_feature_group_name), 
    app_managed=True)
flow_step_outputs.append(flow_output)

example_flow_step = ProcessingStep(
    name='SampleProcessingStep', 
    processor=flow_processor, # Your flow processor defined at the beginning of your pipeline
    inputs=flow_step_inputs, # Your processing and feature engineering steps, can be Data Wrangler flows
    outputs=flow_step_outputs)

Conclusion

In this post, we explored how Feature Store can be a powerful tool in your ML journey. You can easily export your data processing and feature engineering results to a feature group and build your feature store. After your feature store is all set up, you can explore and build training sets from your offline store, taking advantage of its integration with other AWS analytics services such as Athena, AWS Glue, and Amazon EMR. After you train and deploy a model, you can fetch records from your online store for real-time inference. Lastly, you can add a feature store as a part of a complete SageMaker pipeline in your ML workflow. Feature Store makes it easy to store and retrieve features as needed in ML development.

Give it a try, and let us know what you think!


About the Author

As a data scientist and consultant, Zoe Ma has helped bring the latest tools and technologies and data-driven insights to businesses and enterprises. In her free time, she loves painting and crafting and enjoys all water sports.

Courtney McKay is a consultant. She is passionate about helping customers drive measurable ROI with AI/ML tools and technologies. In her free time, she enjoys camping, hiking and gardening.

Read More

Run ML inference on AWS Snowball Edge with Amazon SageMaker Edge Manager and AWS IoT Greengrass

You can use AWS Snowball Edge devices in locations like cruise ships, oil rigs, and factory floors with limited to no network connectivity for a wide range of machine learning (ML) applications such as surveillance, facial recognition, and industrial inspection. However, given the remote and disconnected nature of these devices, deploying and managing ML models at the edge is often difficult. With AWS IoT Greengrass and Amazon SageMaker Edge Manager, you can perform ML inference on locally generated data on Snowball Edge devices using cloud-trained ML models. You not only benefit from the low latency and cost savings of running local inference, but also reduce the time and effort required to get ML models to production. You can do all this while continuously monitoring and improving model quality across your Snowball Edge device fleet.

In this post, we talk about how you can use AWS IoT Greengrass version 2.0 or higher and Edge Manager to optimize, secure, monitor, and maintain a simple TensorFlow classification model to classify shipping containers (connex) and people.

Getting started

To get started, order a Snowball Edge device (for more information, see Creating an AWS Snowball Edge Job). You can order a Snowball Edge device with an AWS IoT Greengrass validated AMI on it.

After you receive the device, you can use AWS OpsHub for Snow Family or the Snowball Edge client to unlock the device. You can start an Amazon Elastic Compute Cloud (Amazon EC2) instance with the latest AWS IoT Greengrass installed or use the commands on AWS OpsHub for Snow Family.

Launch and set up an AMI with the following requirements, or provide an AMI reference on the Snowball console before ordering so the device ships with all libraries and data in the AMI:

  • The ML framework of your choice, such as TensorFlow, PyTorch, or MXNet
  • Docker (if you intend to use it)
  • AWS IoT Greengrass
  • Any other libraries you may need

Prepare the AMI at the time of ordering the Snowball Edge device on the AWS Snow Family console. For instructions, see Using Amazon EC2 Compute Instances. You also have the option to update the AMI after the Snowball Edge is deployed to your edge location.

Install the latest AWS IoT Greengrass on Snowball Edge

To install AWS IoT Greengrass on your device, complete the following steps:

  1. Install the latest AWS IoT Greengrass on your Snowball Edge device. Make sure the dev tools are enabled (--deploy-dev-tools true) so that the Greengrass CLI is available. See the following code:
sudo -E java -Droot="/greengrass/v2" -Dlog.store=FILE  -jar ./MyGreengrassCore/lib/Greengrass.jar  --aws-region region  --thing-name MyGreengrassCore  --thing-group-name MyGreengrassCoreGroup  --tes-role-name GreengrassV2TokenExchangeRole  --tes-role-alias-name GreengrassCoreTokenExchangeRoleAlias  --component-default-user ggc_user:ggc_group  --provision true  --setup-system-service true  --deploy-dev-tools true

We reference the --thing-name you chose here when we set up Edge Manager.

  2. Run the following command to test your installation:
aws greengrassv2 help
  3. On the AWS IoT console, validate that the Snowball Edge device has successfully registered with AWS IoT Greengrass.

Optimize ML models with Edge Manager

We use Edge Manager to deploy and manage the model on Snowball Edge.

  1. Install the Edge Manager agent on Snowball Edge using the latest AWS IoT Greengrass.
  2. Train and store your ML model.

You can train your ML model using any framework of your choice and save it to an Amazon Simple Storage Service (Amazon S3) bucket. In the following screenshot, we use TensorFlow to train a multi-label model to classify connex and people in an image. The model used here is saved to an S3 bucket by first creating a .tar file.
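
As a minimal sketch of that packaging step (the file names and bucket are placeholder assumptions), you can tar the saved model and upload it to Amazon S3 as follows:

import tarfile
import boto3

# Package the saved model file into a .tar.gz archive before uploading
with tarfile.open("connexmodel.tar.gz", "w:gz") as tar:
    tar.add("model.tflite")

# Upload the archive to the bucket and prefix used for compilation
boto3.client("s3").upload_file("connexmodel.tar.gz", "YOUR_BUCKET", "tfconnexmodel/connexmodel.tar.gz")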

After the model is saved (TensorFlow Lite in this case), you can start an Amazon SageMaker Neo compilation job of the model and optimize the ML model for Snowball Edge Compute (SBE_C).

  1. On the SageMaker console, under Inference in the navigation pane, choose Compilation jobs.
  2. Choose Create compilation job.

  3. Give your job a name and create or use an existing role.

 If you’re creating a new AWS Identity and Access Management (IAM) role, ensure that SageMaker has access to the bucket in which the model is saved.

  4. In the Input configuration section, for Location of model artifacts, enter the path to model.tar.gz where you saved the file (in this case, s3://feidemo/tfconnexmodel/connexmodel.tar.gz).
  5. For Data input configuration, enter the ML model's input layer (its name and its shape). In this case, it's called keras_layer_input and its shape is [1,224,224,3], so we enter {"keras_layer_input":[1,224,224,3]}.

  6. For Machine learning framework, choose TFLite.

  7. For Target device, choose sbe_c.
  8. Leave Compiler options blank.
  9. For S3 Output location, enter the same location as where your model is saved, with the prefix (folder) output. For example, we enter s3://feidemo/tfconnexmodel/output.

  10. Choose Submit to start the compilation job (an equivalent API call is sketched below).
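
The same compilation job can be created programmatically. The following is a minimal sketch with boto3; the job name, role ARN, and timeout are placeholder assumptions, while the S3 paths, input shape, framework, and target device mirror the console steps above:

import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="connex-tflite-sbe",
    RoleArn="arn:aws:iam::123456789012:role/YOUR_SAGEMAKER_NEO_ROLE",
    InputConfig={
        "S3Uri": "s3://feidemo/tfconnexmodel/connexmodel.tar.gz",
        "DataInputConfig": '{"keras_layer_input":[1,224,224,3]}',
        "Framework": "TFLITE",
    },
    OutputConfig={
        "S3OutputLocation": "s3://feidemo/tfconnexmodel/output",
        "TargetDevice": "sbe_c",      # Snowball Edge Compute
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)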

Now you create a model deployment package to be used by Edge Manager.

  1. On the SageMaker console, under Edge Manager, choose Edge packaging jobs.
  2. Choose Create Edge packaging job.
  3. In the Job properties section, enter the job details.
  4. In the Model source section, for Compilation job name, enter the name you provided for the Neo compilation job.
  5. Choose Next.

  6. In the Output configuration section, for S3 bucket URI, enter where you want to store the package in Amazon S3.
  7. For Component name, enter a name for your AWS IoT Greengrass component.

This step creates an AWS IoT Greengrass model component where the model is downloaded from Amazon S3 and uncompressed to local storage on Snowball Edge.

  8. Create a device fleet to manage a group of devices, in this case just one (SBE).
  9. For IAM role, enter the role generated by AWS IoT Greengrass earlier (--tes-role-name).

Make sure it has the required permissions by going to the IAM console, searching for the role, and adding the required policies to it.

  10. Register the Snowball Edge device to the fleet you created.

  11. In the Device source section, enter the device name. The IoT name needs to match the name you used earlier, in this case --thing-name MyGreengrassCore.

You can register additional Snowball devices on the SageMaker console to add them to the device fleet, which allows you to group and manage these devices together.
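
If you prefer to script the fleet setup, the following is a minimal boto3 sketch. The fleet name, bucket, device name, and role ARN are placeholders, and the IoT thing name matches the --thing-name used earlier:

import boto3

sm = boto3.client("sagemaker")

# Create the device fleet; Edge Manager writes captured data to the S3 output location
sm.create_device_fleet(
    DeviceFleetName="snowball-fleet",
    RoleArn="arn:aws:iam::123456789012:role/GreengrassV2TokenExchangeRole",
    OutputConfig={"S3OutputLocation": "s3://YOUR_BUCKET/edge-manager/"},
)

# Register the Snowball Edge device to the fleet
sm.register_devices(
    DeviceFleetName="snowball-fleet",
    Devices=[{"DeviceName": "sbe-device-01", "IotThingName": "MyGreengrassCore"}],
)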

Deploy ML models to Snowball Edge using AWS IoT Greengrass

In the previous sections, you unlocked and configured your Snowball Edge device. The ML model is now compiled and optimized for performance on Snowball Edge. An Edge Manager package is created with the compiled model and the Snowball device is registered to a fleet. In this section, you look at the steps involved in deploying the ML model for inference to Snowball Edge with the latest AWS IoT Greengrass.

Components

AWS IoT Greengrass allows you to deploy to edge devices as a combination of components and associated artifacts. Components are JSON documents that contain the metadata and the lifecycle: what to install, what to deploy, and when. Components also define which operating system to target and which artifacts to use when running on different OS options.

Artifacts

Artifacts can be code files, models, or container images. For example, a component can be defined to install a pandas Python library and run a code file that will transform the data, or to install a TensorFlow library and run the model for inference. The following are example artifacts needed for an inference application deployment:

  • gRPC proto and Python stubs (this can be different based on your model and framework)
  • Python code to load the model and perform inference

These two items are uploaded to an S3 bucket.

Deploy the components

The deployment needs the following components:

  • Edge Manager agent (available in public components at GA)
  • Model
  • Application

Complete the following steps to deploy the components:

  1. On the AWS IoT console, under Greengrass, choose Components, and create the application component.
  2. Find the Edge Manager agent component in the public components list and deploy it.
  3. Deploy a model component created by Edge Manager, which is used as a dependency in the application component.
  4. Deploy the application component to the edge device by going to the list of AWS IoT Greengrass deployments and creating a new deployment.

If you have an existing deployment, you can revise it to add the application component.

Now you can test your component.

  1. In the prediction or inference code deployed with the application component, add logic to read files locally on the Snowball Edge device (for example, from an incoming folder) and move the predictions or processed files to a processed folder.
  2. Log in to the device to see if the predictions have been made.
  3. Set up the code to run in a loop, checking the incoming folder for new files, processing them, and moving them to the processed folder (a minimal sketch of such a loop follows this list).
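
The following is a minimal sketch of that loop. The folder paths are assumptions, and predict is a hypothetical placeholder for the call to the Edge Manager agent that runs the compiled model:

import os
import shutil
import time

INCOMING = "/data/incoming"    # assumed local folder for new files
PROCESSED = "/data/processed"  # assumed local folder for processed files

def predict(path):
    """Hypothetical placeholder for the gRPC call to the Edge Manager agent."""

while True:
    for name in os.listdir(INCOMING):
        src = os.path.join(INCOMING, name)
        predict(src)                                      # run inference on the new file
        shutil.move(src, os.path.join(PROCESSED, name))   # move it so it isn't processed twice
    time.sleep(5)                                         # poll the incoming folder periodically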

The following screenshot is an example setup of files before deployment inside the Snowball Edge.

After deployment, all the test images have classes of interest and therefore are moved to the processed folder.

Clean up

To clean up everything or reimplement this solution from scratch, stop all the EC2 instances by invoking the TerminateInstance API against EC2-compatible endpoints running on your Snowball Edge device. To return your Snowball Edge device, see Powering Off the Snowball Edge and Returning the Snowball Edge Device.

Conclusion

This post walked you through how to order a Snowball Edge device with an AMI of your choice, compile a model for the edge using SageMaker Neo, package that model using Edge Manager, and create and run components with artifacts to perform ML inference on Snowball Edge using the latest AWS IoT Greengrass. With Edge Manager, you can deploy and update your ML models on a fleet of Snowball Edge devices, and monitor performance at the edge with saved input and prediction data on Amazon S3. You can also run these components as long-running AWS Lambda functions that can spin up a model and wait for data to do inference.

You can combine several features of AWS IoT Greengrass to create an MQTT client and use a pub/sub model to invoke other services or microservices. The possibilities are endless.

By running ML inference on Snowball Edge with Edge Manager and AWS IoT Greengrass, you can optimize, secure, monitor, and maintain ML models on fleets of Snowball Edge devices. Thanks for reading and please do not hesitate to leave questions or comments in the comments section.

To learn more, see the AWS Snow Family, AWS IoT Greengrass, and Edge Manager documentation.


About the Authors

Raj Kadiyala is an AI/ML Tech Business Development Manager in AWS WWPS Partner Organization. Raj has over 12 years of experience in Machine Learning and likes to spend his free time exploring machine learning for practical every day solutions and staying active in the great outdoors of Colorado.

Nida Beig is a Sr. Product Manager – Tech at Amazon Web Services where she works on the AWS Snow Family team. She is passionate about understanding customer needs, and using technology as a conductor of transformative thinking to deliver consumer products. Besides work, she enjoys traveling, hiking, and running.

Read More

Run your TensorFlow job on Amazon SageMaker with a PyCharm IDE

As more machine learning (ML) workloads go into production, many organizations must bring ML workloads to market quickly and increase productivity in the ML model development lifecycle. However, the ML model development lifecycle is significantly different from an application development lifecycle. This is due in part to the amount of experimentation required before finalizing a version of a model. Amazon SageMaker, a fully managed ML service, enables organizations to put ML ideas into production faster and improve data scientist productivity by up to 10 times. Your team can quickly and easily train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production-ready environments.

Amazon SageMaker Studio offers an integrated development environment (IDE) for ML. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity. Within Studio, you can also use Studio notebooks, which are collaborative notebooks (the view is an extension of the JupyterLab interface). You can launch quickly because you don’t need to set up compute instances and file storage beforehand. SageMaker Studio provides persistent storage, which enables you to view and share notebooks even if the instances that the notebooks run on are shut down. For more details, see Use Amazon SageMaker Studio Notebooks.

Many data scientists and ML researchers prefer to use a local IDE such as PyCharm or Visual Studio Code for Python code development while still using SageMaker to train the model, tune hyperparameters with SageMaker hyperparameter tuning jobs, compare experiments, and deploy models to production-ready environments. In this post, we show how you can use SageMaker to manage your training jobs and experiments on AWS, using the Amazon SageMaker Python SDK with your local IDE. For this post, we use PyCharm for our IDE, but you can use your preferred IDE with no code changes.

The code used in this post is available on GitHub.

Prerequisites

To run training jobs on a SageMaker managed environment, you need the following:

  • An AWS account configured with the AWS Command Line Interface (AWS CLI) to have sufficient permissions to run SageMaker training jobs
  • Docker configured (SageMaker local mode) and the SageMaker Python SDK installed on your local computer
  • (Optional) Studio set up for experiment tracking and the Amazon SageMaker Experiments Python SDK

Setup

To get started, complete the following steps:

  1. Create a new user with programmatic access that enables an access key ID and secret access key for the AWS CLI.
  2. Attach the permissions AmazonSageMakerFullAccess and AmazonS3FullAccess.
  3. Limit the permissions to specific Amazon Simple Storage Service (Amazon S3) buckets if possible.
  4. You also need a SageMaker execution role with the AmazonSageMakerFullAccess and AmazonS3FullAccess permissions. SageMaker uses this role to perform operations on your behalf on the AWS hardware that is managed by SageMaker.
  5. Install the AWS CLI on your local computer and run a quick configuration with aws configure:
$ aws configure
AWS Access Key ID [None]: AKIAI*********EXAMPLE
AWS Secret Access Key [None]: wJal********EXAMPLEKEY
Default region name [None]: eu-west-1
Default output format [None]: json

For more information, see Configuring the AWS CLI.

  6. Install Docker and your preferred local Python IDE. For this post, we use PyCharm.
  7. Make sure that you have all the required Python libraries to run your code locally.
  8. Add the SageMaker Python SDK to your local library. You can use pip install sagemaker or create a virtual environment with venv for your project, then install SageMaker within the virtual environment. For more information, see Use Version 2.x of the SageMaker Python SDK.

Develop your ML algorithms on your local computer

Many data scientists use a local IDE for ML algorithm development, such as PyCharm. In this post, the algorithm Python script tf_code/tf_script.py is a simple file that uses TensorFlow Keras to create a feedforward neural network. You can run the Python script locally as you usually do.
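
The following is a minimal sketch of the kind of feedforward network such a script might define; the layer sizes and input dimension are illustrative assumptions, not the contents of the actual tf_code/tf_script.py:

import tensorflow as tf

def build_model(input_dim: int = 20) -> tf.keras.Model:
    # A small fully connected (feedforward) network for binary classification
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model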

Make your TensorFlow code SageMaker compatible

To make your code compatible for SageMaker, you must follow certain rules for reading input data and writing output model and other artifacts. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables. For more information, see SageMaker Toolkits Containers Structure.

The following code shows some important environment variables used by SageMaker for managing the infrastructure.

The following uses the input data location SM_CHANNEL_{channel_name}:

SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation
SM_CHANNEL_TESTING=/opt/ml/input/data/testing

The following code uses the model output location to save the model artifact:

SM_MODEL_DIR=/opt/ml/model

The following code uses the output artifact location to write non-model training artifacts (such as evaluation results):

SM_OUTPUT_DATA_DIR=/opt/ml/output

You can pass these SageMaker environment variables as arguments so you can still run the training script outside of SageMaker:

import argparse
import os

# SageMaker default SM_MODEL_DIR=/opt/ml/model
if os.getenv("SM_MODEL_DIR") is None:
    os.environ["SM_MODEL_DIR"] = os.getcwd() + '/model'

# SageMaker default SM_OUTPUT_DATA_DIR=/opt/ml/output
if os.getenv("SM_OUTPUT_DATA_DIR") is None:
    os.environ["SM_OUTPUT_DATA_DIR"] = os.getcwd() + '/output'

# SageMaker default SM_CHANNEL_TRAINING=/opt/ml/input/data/training
if os.getenv("SM_CHANNEL_TRAINING") is None:
    os.environ["SM_CHANNEL_TRAINING"] = os.getcwd() + '/data'

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--output_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))

Test your ML algorithms on a local computer with the SageMaker SDK local mode

The SageMaker Python SDK supports local mode, which allows you to create estimators and deploy them to your local environment. This is a great way to test your deep learning scripts before running them in the SageMaker managed training or hosting environments. Local mode is supported for framework images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and images you supply yourself. See the following code for ./sm_local.py:

sagemaker_role = 'arn:aws:iam::707*******22:role/RandomRoleNameHere'
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

def sagemaker_estimator(sagemaker_role, code_entry, code_dir, hyperparameters):
    sm_estimator = TensorFlow(entry_point=code_entry,
                              source_dir=code_dir,
                              role=sagemaker_role,
                              instance_type='local',
                              instance_count=1,
                              model_dir='/opt/ml/model',
                              hyperparameters=hyperparameters,
                              output_path='file://{}/model/'.format(os.getcwd()),
                              framework_version='2.2',
                              py_version='py37',
                              script_mode=True)
    return sm_estimator

With SageMaker local mode, the managed TensorFlow image from the service account is downloaded to your local computer and runs in Docker. This Docker image is the same as in the SageMaker managed training or hosting environments, so you can debug your code locally and iterate faster.
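
A usage sketch with the helper above might look like the following; it reuses sagemaker_estimator and sagemaker_role from the snippet, while the entry point, source directory, hyperparameters, and data path are assumptions:

# Build a local-mode estimator and train against data on the local file system
estimator = sagemaker_estimator(
    sagemaker_role,
    code_entry="tf_script.py",
    code_dir="./tf_code",
    hyperparameters={"epochs": 10},
)
estimator.fit({"training": "file://./data"})  # local mode reads the training channel from a local path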

The following diagram outlines how a Docker image runs in your local machine with SageMaker local mode.

The service account TensorFlow Docker image is now running on your local computer.

On newer versions of macOS, when you debug your code with SageMaker local mode, you might need to grant Docker Full Disk Access in System Preferences under Security & Privacy; otherwise, a PermissionError occurs.

Run your ML algorithms on an AWS managed environment with the SageMaker SDK

After you create the training job, SageMaker launches the ML compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts and other output in the S3 bucket you specified for that purpose.
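
A minimal sketch of launching such a managed training job follows; it reuses sagemaker_role from the earlier snippet, and the instance type, S3 paths, and hyperparameters are assumptions for illustration:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="tf_script.py",
    source_dir="./tf_code",
    role=sagemaker_role,
    instance_type="ml.m5.xlarge",   # a managed training instance instead of 'local'
    instance_count=1,
    framework_version="2.2",
    py_version="py37",
    hyperparameters={"epochs": 10},
    output_path="s3://YOUR_BUCKET/model/",
)
estimator.fit({"training": "s3://YOUR_BUCKET/data/"})  # training data read from Amazon S3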

The following diagram outlines how a Docker image runs in an AWS managed environment.

On the SageMaker console, you can see that your training job launched, together with all training job related metadata, including metrics for model accuracy, input data location, output data configuration, and hyperparameters. This helps you manage and track all your SageMaker training jobs.

Deploy your trained ML model on a SageMaker endpoint for real-time inference

For this step, we use the ./sm_deploy.py script.

When your trained model seems satisfactory, you might want to test the real-time inference against an HTTPS endpoint, or with batch prediction. With the SageMaker SDK, you can easily set up the inference environment to test your inference code and assess model performance regarding accuracy, latency, and throughput.
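
A minimal sketch of deploying a trained estimator (for example, the one from the managed training sketch above) to a real-time endpoint follows; the instance settings and payload are assumptions, and the payload shape depends on your model:

# Deploy the trained model behind an HTTPS endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

# Send a test request; the input shape must match what the model expects
print(predictor.predict({"instances": [[0.1] * 20]}))

# Remove the endpoint when you're done testing to avoid ongoing charges
predictor.delete_endpoint()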

SageMaker provides model hosting services for model deployment, as shown in the following diagram. It provides an HTTPS endpoint where your ML model is available to perform inference.

The persistent endpoint deployed with SageMaker hosting services appears on the SageMaker console.

Organize, track, and compare your ML trainings with Amazon SageMaker Experiments

Finally, if you have lots of experiments with different preprocessing configurations, different hyperparameters, or even different ML algorithms to test, we suggest you use Amazon SageMaker Experiments to help you group and organize your ML iterations.

Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. Experiments is integrated with Studio, providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models.
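
A minimal sketch of tying a training job to an experiment with the sagemaker-experiments package follows; the experiment and trial names are placeholders, and estimator is assumed to be defined as in the earlier sketches:

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Create an experiment and a trial to group this training run
experiment = Experiment.create(experiment_name="tf-feedforward-experiment",
                               description="Hyperparameter exploration")
trial = Trial.create(trial_name="trial-epochs-10",
                     experiment_name=experiment.experiment_name)

# Associate the training job with the trial so it shows up in Studio's Experiments view
estimator.fit(
    {"training": "s3://YOUR_BUCKET/data/"},
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
)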

Conclusion

In this post, we showed how you can use SageMaker with your local IDE, such as PyCharm. With SageMaker, data scientists can take advantage of this fully managed service to build, train, and deploy ML models quickly, without having to worry about the underlying infrastructure needs.

To fully achieve operational excellence, your organization needs a well-architected ML workload solution, which includes versioning ML inputs and artifacts, tracking data and model lineage, automating ML deployment pipelines, continuously monitoring and measuring ML workloads, establishing a model retraining strategy, and more. For more information about SageMaker features, see the Amazon SageMaker Developer Guide.

SageMaker is generally available worldwide. For a list of the supported AWS Regions, see the AWS Region Table for all AWS global infrastructure.


About the Author

Yanwei Cui, PhD, is a Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Read More

How Cortica used Amazon HealthLake to get deeper insights to improve patient care

This is a guest post by Ernesto DiMarino, who is Head of Enterprise Applications and Data at Cortica.

Cortica is on a mission to revolutionize healthcare for children with autism and other neurodevelopmental differences. Cortica was founded to fix the fragmented journey families typically navigate while seeking diagnoses and therapies for their children. To bring their vision to life, Cortica seamlessly blends neurology, research-based therapies, and technology into comprehensive care programs for the children they serve. This coordinated approach leads to best-in-class member satisfaction and empowers families to achieve long-lasting, transformative results.

In this post, we discuss how Cortica used Amazon HealthLake to create a data analytics hub that stores a patient's medical history, medication history, behavioral assessments, lab reports, and genetic variants in the Fast Healthcare Interoperability Resources (FHIR) standard format. They created a composite view of the patient's health journey and applied advanced analytics to understand trends in patient progression with Cortica's treatment approach.

Unifying our data

The challenges faced by Cortica's team of three data engineers are no different than those of any other healthcare enterprise. Cortica has two EHRs (electronic health record systems), six specialties, 420 providers, and a few home-grown data-capturing questionnaires, one of which has 842 questions. With multiple vendors providing systems and data solutions, Cortica finds itself in an all-too-common situation in the healthcare industry: volumes of data in multiple formats and complexity in matching patients from system to system. Cortica looked to solve some of this complexity by setting up a data lake on AWS.

Cortica’s team imported all data into an Amazon Simple Storage Service (Amazon S3) data lake using Python extract, transform, and load (ETL), orchestrating it with Apache Airflow. Additionally, they maintain a Kimball model star schema for financial and operational analytics. The data sizes are a respectable 16 terabytes of data. Most of the file formats delivered to the data lake are in CSV, PDF, and Parquet, all of which the data lake is well equipped to manage. However, the data lake solution is only part of the story. To truly derive value from the data, Cortica needed a standardized model to deal with the healthcare languages and vocabularies, as well as the many industry standardized code sets.

Deriving deeper value from data

Although the data lake and star schema data model work well for some financial and operational analytics, the Cortica team found that it was challenging to dive deeper into the data for meaningful insights to share with patients and their caregivers. Some of the questions they wanted to answer included:

  • How can Cortica present to caregivers a composite view of the patient’s healthcare journey with Cortica?
  • How can they show that patients are getting better over time using data from standardized assessments, medical notes, and goals tracking data?
  • How do patients with specific comorbidities progress to their goals compared to patients without comorbidities?
  • Can Cortica show how patients have better outcomes through the unique multispecialty approach?
  • Can Cortica partner with industry researchers sharing de-identified data to help further treatment for autism and other neurodevelopmental differences?

Before implementing the data lake, staff would read through PDFs, Excel, and vendor systems to create Excel files to capture the data points of interest. Interrogating the EHRs and manually transcribing documents and notes into a large spreadsheet for analysis would take months of work. This process wasn’t scalable and made it difficult to reproduce analytics and insights.

With the data lake, Cortica found that they still lacked the ability to quickly access the volumes of data, as well as join the various datasets together to make complex analysis. Because healthcare data is so driven by medical terminologies, they needed a solution that could help unify data from different healthcare fields to present a clear patient journey through the different specialties Cortica offers. To quickly derive this deeper value, they chose Amazon HealthLake to help provide this added layer of meaning to the data.

Cortica’s solution

Cortica adopted Amazon HealthLake to help standardize data and scale insights. By implementing the FHIR standard, Amazon HealthLake provided a faster path to standardized data with a far less complex maintenance burden. The team was able to quickly load a basic set of resources into Amazon HealthLake, which allowed them to create a proof of concept (POC) for starting to answer the bigger set of questions focused on their patient population. In a 3-day process, they developed a POC for understanding their patients' journey from the perspective of their behavior therapy goals and medical comorbidities. Most of that time, about two days, was spent fine-tuning the queries in Amazon QuickSight and building visualizations of the data. From a data-to-visual perspective, the data was ready in hours, not months. The following diagram illustrates their pipeline.

Getting to insights faster

Cortica was able to quickly see, across their patient population, the length of time it took for patients to attain their goals. The team could then break it down by age-phenotype (a designated age grouping for comparing Cortica's population). They saw the groupings of patients that were meeting their goals at 4-, 6-, 9-, and 12-month intervals. They further sliced and diced the visuals by layering in a variety of categories such as goal status. Until now, staff and clinicians were only able to look at an individual's data rather than population data; they couldn't get these types of insights. The manual clinician chart abstraction process for this goal analysis would have taken months to complete.

The following charts show two visualizations of their goals.

As a fast follow to this POC, Cortica wanted to see how medical comorbidities impacted goal attainment. The specific medical comorbidities of interest were seizures, constipation, and sleep disturbances, because these are commonly found within this patient population. Data for the FHIR Condition resource was loaded into the pipeline, and the team was able to identify cohorts by comorbidity and quickly visualize the information. In a few minutes, they had visualizations running and could see the impact that these comorbidities had on goal attainment (see the following example diagram).

With Amazon HealthLake, the Cortica team can spend more time analyzing and understanding data patterns rather than figuring out where data comes from, formatting it, and joining it into a usable state. The value that Amazon brings to any healthcare organization is the ability to quickly move data, conform data, and start visualizing. With FHIR as the data model, a small non-technical team can ask an organization's integration team to provide a flat file feed of the FHIR resources of interest to an S3 bucket. This data is easily loaded into Amazon HealthLake data stores via the AWS Command Line Interface (AWS CLI), AWS Management Console, or API. Next, they can query the data with Amazon Athena, which exposes it to SQL-based tools, and use QuickSight for visualization. Both clinical and non-technical teams can use this solution to start deriving value from data locked within medical records systems.
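
As a minimal sketch of the API route (the data store ID, bucket paths, KMS key, and role ARN are placeholder assumptions), an import from Amazon S3 can be started with boto3 as follows:

import boto3

healthlake = boto3.client("healthlake")

# Start importing FHIR resources staged in Amazon S3 into the HealthLake data store
healthlake.start_fhir_import_job(
    JobName="condition-resource-import",
    DatastoreId="YOUR_DATASTORE_ID",
    InputDataConfig={"S3Uri": "s3://YOUR_BUCKET/fhir/conditions/"},
    JobOutputDataConfig={
        "S3Configuration": {
            "S3Uri": "s3://YOUR_BUCKET/fhir/import-output/",
            "KmsKeyId": "YOUR_KMS_KEY_ARN",
        }
    },
    DataAccessRoleArn="arn:aws:iam::123456789012:role/YOUR_HEALTHLAKE_ROLE",
)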

Conclusion

The tools available through AWS such as Amazon HealthLake, Amazon SageMaker, Athena, Amazon Comprehend Medical, and QuickSight are speeding up the ability to learn more about the patient population Cortica cares for in an actionable timeframe. Analysis that took months to complete can now be completed in days, and in some cases hours. AWS tools can enhance analysis by adding layers of richness to the data in minutes and provide different views of the same analysis. Furthermore, analysis that required chart abstraction can now be done through automated data pipelines, processing hundreds or thousands of documents to derive insights from notes, which were previously only available to a few clinicians.

Cortica is entering a new era of data analytics, one in which the data pipeline and process doesn’t require data engineers and technical staff. What is unknown can be learned from the data, ultimately bringing Cortica closer to its mission of revolutionizing the pediatric healthcare space and empowering families to achieve long-lasting, transformative results.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Ernesto DiMarino is Head of Enterprise Applications and Data at Cortica.

Satadal Bhattacharjee is Sr Manager, Product Management, who leads products at AWS Health AI. He works backwards from healthcare customers to help them make sense of their data by developing services such as Amazon HealthLake and Amazon Comprehend Medical.

Read More

Attendee matchmaking at virtual events with Amazon Personalize

Amazon Personalize enables developers to build applications with the same machine learning (ML) technology used by Amazon.com for real-time personalized recommendations—no ML expertise required. Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing. Besides applications in retail and ecommerce, other common use cases for Amazon Personalize include recommending videos, blog posts, or newsfeeds based on users’ activity history.

What if you wanted to recommend users of common interest to connect with each other? As the pandemic pushes many of our normal activities virtual, connecting with people is a greater challenge than ever before. This post discusses how 6Connex turned this challenge into an opportunity by harnessing Amazon Personalize to elevate their “user matchmaking” feature.

6Connex and event AI

6Connex is an enterprise virtual venue and hybrid events system. Their cloud-based product portfolio includes virtual environments, learning management, and webinars. Attendee experience is one of the most important metrics for their success.

Attendees have better experiences when they are engaged not only with the event’s content, organizers, and sponsors, but also when making connections with other attendees. Engagement metrics are measured and reported for each attendee activity on the platform, as well as feedback from post-event surveys. The goal is to make the events system more attendee-centric by not only providing personalized content and activity recommendations, but also making matchmaking suggestions for attendees based on similar interests and activity history. By adding event AI features to their platform, 6Connex fosters more meaningful connections between attendees, and keeps their attendees more engaged with a personalized event journey.

Implementation and solution architecture

6Connex built their matchmaking solution using the related items recipe (SIMS) of Amazon Personalize. The SIMS algorithm uses collaborative filtering to recommend items that are similar to a given item. The novelty of 6Connex's approach lies in the reverse mapping of users and items: in this solution, event attendees are the items in Amazon Personalize terms, and content, meeting rooms, and so on are the users.

When a platform user joins a meeting room or views a piece of content, an interaction is created. To increase the accuracy of interaction types, also known as event_type, you can add logic to only count as an interaction when a user stays in a meeting room for at least a certain amount of time. This eliminates accidental clicks and cases when users join but quickly leave a room due to lack of interest.

As many users interact with the platform during a live event, interactions are streamed in real time from the platform via Amazon Kinesis Data Streams. AWS Lambda functions are used for data transformation before streaming data directly to Amazon Personalize through an event tracker. This mechanism also enables Amazon Personalize to adjust to changing user interest over time, allowing recommendations to adapt in real time.

After a model is trained in Amazon Personalize, a fully managed inference endpoint (campaign) is created to serve real-time recommendations for 6Connex's platform. To answer the question "for each attendee, who are similar attendees?", 6Connex's client-side application queries the GetRecommendations API with the current user (represented as an itemId). The API response provides recommended connections that Amazon Personalize has identified as similar.
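
A minimal sketch of that query with boto3 follows; the campaign ARN and IDs are placeholders, and, per the reverse mapping above, the item here is an attendee:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

# Ask the SIMS campaign for attendees similar to the current attendee
response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/attendee-sims",
    itemId="attendee-12345",   # the current attendee
    numResults=10,
)
similar_attendees = [item["itemId"] for item in response["itemList"]]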

Due to its deep learning capabilities, Amazon Personalize requires at least 1,000 interaction data points before training the model. At the start of a live event, there aren't enough interactions, so a rules engine provides the initial recommendations until 1,000 data points have been gathered. The following table shows the three main phases of an event during which connection recommendations are generated.

Rule-based recommendations

  • Event tracker interaction events < 1,000
  • Use rule engine
  • Cache results for 10 minutes

Amazon Personalize real-time recommendations during live sessions

  • Initial data is loaded and model is trained
  • Data is ingested in real time via Kinesis Data Streams
  • Regular training occurs across the day
  • Recommendation results are cached

Amazon Personalize batch recommendations for on-demand users

  • Main event live sessions are over but the event is still open for a period of time
  • Daily batch recommendations are retrieved and loaded into DynamoDB

For a high-level example architecture, see the following diagram.

The following are the steps involved in the solution architecture:

  1. 6Connex web application calls the GetRecommendations API to retrieve recommended connections.
  2. A matchmaking Lambda function retrieves recommendations.
  3. Until the training threshold of 1,000 interaction data points is reached, the matchmaking function uses a simple rules engine to provide recommendations.
  4. Recommendations are generated from Amazon Personalize and stored in Amazon ElastiCache. The reason for caching recommendations is to improve response performance while reducing the number of queries on the Amazon Personalize API. When new recommendations are requested, or when the cache expires (expiration is set to every 15 minutes), recommendations are pulled from Amazon Personalize.
  5. New user interactions are ingested in real time via Kinesis Data Streams.
  6. A Lambda function consumes data from the data stream, performs data transformation, persists the transformed data to Amazon Simple Storage Service (Amazon S3) and related metadata to Amazon DynamoDB, and sends the records to Amazon Personalize via the PutEvents API (see the sketch after this list).
  7. AWS Step Functions orchestrates the process for creating solutions, training, retraining, and several other workflows. More details on the Step Functions workflow are in the next section.
  8. Amazon EventBridge schedules regular retraining events during the virtual events. We also use EventBridge to trigger batch recommendations after the virtual events are over and when the contents are served to end users on demand.
  9. Recommendations are stored in DynamoDB for use during the on-demand period and also for future analysis.
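
The following is a minimal sketch of the PutEvents call from step 6; the tracking ID, IDs, and event type are placeholders, and, per the reverse mapping described earlier, the user here is a meeting room and the item is an attendee:

import time
import boto3

personalize_events = boto3.client("personalize-events")

# Stream a single interaction to the Amazon Personalize event tracker
personalize_events.put_events(
    trackingId="YOUR_EVENT_TRACKER_ID",
    userId="meeting-room-42",       # the "user" is a meeting room in this solution
    sessionId="session-abc",
    eventList=[{
        "eventType": "joined",      # an assumed event_type name
        "itemId": "attendee-12345", # the "item" is an attendee
        "sentAt": int(time.time()),
    }],
)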

Adoption of MLOps

It was crucial for 6Connex to quickly shift from a rules-based recommender engine to personalized recommendations using Amazon Personalize. To accelerate this shift and hydrate the interactions dataset, 6Connex infers interactions not only from content engagement, but also from other sources such as pre-event questionnaires. This is an important development that shortened the time until users start receiving ML-based recommendations.

More importantly, the adoption of Amazon Personalize MLOps enabled 6Connex to automate and accelerate the transition from rule-based recommendations to personalized recommendations using Amazon Personalize. After the minimum threshold for data is met, Step Functions loads data into Amazon Personalize and manages the training process.

The following diagram shows the MLOps pipeline for the initial loading of data, training solutions, and deploying campaigns.

6Connex created their MLOps solution based on the Amazon Personalize MLOps reference solution to automate this process. Several Step Functions workflows offload long-running processes such as loading batch recommendations into DynamoDB, retraining Amazon Personalize solutions, and cleaning up after an event is complete.

With Amazon Personalize and MLOps pipelines, 6Connex brought an AI solution to market in less than half the time it would have taken to develop and deploy their own ML infrastructure. Moreover, these solutions reduced the cost of acquiring data science and ML expertise. As a result, 6Connex realized a competitive advantage through AI-based personalized recommendations for each individual user.

Based on the success of this engagement, 6Connex plans to expand its usage of Amazon Personalize to provide content-based recommendations in the near future. 6Connex is looking forward to expanding the partnership not only in ML, but also in data analytics and business intelligence to serve the fast-growing hybrid event market.

Conclusion

With a well-designed MLOps pipeline and some creativity, 6Connex built a robust recommendation engine using Amazon Personalize in a short amount of time.

Do you have a use case for a recommendation engine but are short on time or ML expertise? You can get started with Amazon Personalize using the Developer Guide, as well as a myriad of hands-on resources such as the Amazon Personalize Samples GitHub repo.

If you have any questions on this matchmaking solution, please leave a comment!


About the Author

Shu Jackson is a Senior Solutions Architect with AWS. Shu works with startup customers helping them design and build solutions in the cloud, with a focus on AI/ML.

Luis Lopez Soria is a Sr AI/ML specialist solutions architect working with the Amazon Machine Learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.

Read More

Accurately predicting future sales at Clearly using Amazon Forecast

This post was cowritten by Ziv Pollak, Machine Learning Team Lead, and Alex Thoreux, Web Analyst at Clearly.

A pioneer in online shopping, Clearly launched their first site in 2000. Since then, they’ve grown to become one of the biggest online eyewear retailers in the world, providing customers across Canada, the US, Australia and New Zealand with glasses, sunglasses, contact lenses, and other eye health products. Through their Mission to eliminate poor vision, Clearly strives to make eyewear affordable and accessible for everyone. Creating an optimized platform is a key part of this wider vision.

Predicting future sales is one of the biggest challenges every retail organization has – but it’s also one of the most important pieces of insight. Having a clear and reliable picture of predicted sales for the next day or week allows your company to adjust its strategy and increase the chances of meeting its sales and revenue goals.

We’ll talk about how Clearly built an automated and orchestrated forecasting pipeline using AWS Step Functions, and used Amazon Forecast APIs to train a machine learning (ML) model and predict sales on a daily basis for the upcoming weeks and months.

With a solution that also collects metrics and logs, provides auditing, and is invoked automatically, Clearly was able to create a serverless, well-architected solution in just a few weeks.

The challenge: Detailed sales forecasting

With a reliable sales forecast, we can improve our marketing strategy, decision-making process, and spend, to ensure successful operations of the business.

In addition, when a divergence between the predicted and actual sales numbers occurs, it's a clear indicator that something is wrong, such as an issue with the website or promotions that may not be working properly. From there, we can troubleshoot the issues and address them in a timely manner.

For forecasting sales, our existing solution was based on senior members of the marketing team building manual predictions. Historical data was loaded into an Excel sheet and predictions were made using basic forecasting functionality and macros. These manual predictions were nearly 90% accurate and took 4–8 hours to complete, which was a good starting point, but still not accurate enough to confidently guide the marketing team’s next steps.

In addition, testing “what-if” future scenarios was difficult to implement because we could only perform the predictions for the following months, without further granularity such as weeks and days.

Having a detailed sales forecast allows us to identify situations where money is being lost due to outages or other technical or business issues. With a reliable and accurate forecast, when we see that actual sales aren’t meeting expected sales, we know there is an issue.

Another major challenge we faced was the lack of a tenured ML team – all members had been with the company less than a year when the project kicked off.

Overview of solution: Forecast

Amazon Forecast is a fully managed service that uses ML to deliver highly accurate forecasts. After we provided the data, Forecast automatically examined it, identified what was meaningful, and produced a forecasting model capable of making predictions for our different product lines and geographical locations to deliver the most accurate daily forecasts. The following diagram illustrates our forecasting pipeline.

To operationalize the flow, we applied the following workflow:

  1. Amazon EventBridge calls the orchestration pipeline daily to retrieve the predictions.
  2. AWS Step Functions manages the orchestration pipeline.
  3. An AWS Lambda function calls Amazon Athena APIs to retrieve and prepare the training data, stored on Amazon Simple Storage Service (Amazon S3).
  4. An orchestrated pipeline of Lambda functions uses Forecast to create the datasets, train the predictors, and generate the forecasted revenue. The forecasted data is saved in an S3 bucket.
  5. Amazon Simple Notification Service (Amazon SNS) notifies users when a problem occurs during the forecasting process or when the process completes successfully.
  6. Business analysts build dashboards on Amazon QuickSight, which queries the forecast data from Amazon S3 using Athena.
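
To make step 4 concrete, the following is a minimal sketch (not Clearly’s production code) of the Forecast API calls a Lambda function in that step might make. The resource names, ARNs, and S3 paths are hypothetical placeholders.

```python
# A minimal sketch of the Forecast calls behind step 4. All names, ARNs, and
# S3 paths are hypothetical. In the real pipeline each call runs in its own
# Lambda task, and Step Functions polls the matching Describe API until the
# resource is ACTIVE before the next step runs.
import boto3

forecast = boto3.client("forecast")

# 1. Import the prepared training data from Amazon S3 into an existing dataset.
forecast.create_dataset_import_job(
    DatasetImportJobName="daily-revenue-import",
    DatasetArn="arn:aws:forecast:us-east-1:123456789012:dataset/daily_revenue",
    DataSource={"S3Config": {
        "Path": "s3://example-bucket/forecast/training/daily_revenue.csv",
        "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3AccessRole",
    }},
    TimestampFormat="yyyy-MM-dd",
)

# 2. Train a predictor; AutoML lets Forecast choose the best algorithm.
predictor = forecast.create_predictor(
    PredictorName="daily-revenue-predictor",
    ForecastHorizon=30,  # predict 30 days ahead
    PerformAutoML=True,
    InputDataConfig={"DatasetGroupArn":
        "arn:aws:forecast:us-east-1:123456789012:dataset-group/daily_revenue"},
    FeaturizationConfig={"ForecastFrequency": "D"},
)

# 3. Generate the forecast and export it to S3 for Athena and QuickSight.
#    (The state machine waits for the predictor to become ACTIVE first.)
fcst = forecast.create_forecast(
    ForecastName="daily-revenue-forecast",
    PredictorArn=predictor["PredictorArn"],
)
forecast.create_forecast_export_job(
    ForecastExportJobName="daily-revenue-export",
    ForecastArn=fcst["ForecastArn"],
    Destination={"S3Config": {
        "Path": "s3://example-bucket/forecast/output/",
        "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3AccessRole",
    }},
)
```

Each of these calls is asynchronous; in the actual state machine, a separate Lambda task polls the corresponding Describe API (for example, describe_predictor) until the resource reports ACTIVE before the next step runs.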

We chose to work with Forecast for a few reasons:

  • Forecast is based on the same technology used at Amazon.com, so we have a lot of confidence in the tool’s capabilities.
  • The ease of use and implementation allowed us to quickly confirm we have the needed dataset to produce accurate results.
  • Because the Clearly ML team was less than 1 year old, a fully managed service allowed us to deliver this project without needing deep technical ML skills and knowledge.

Data sources

Finding the data to use for this forecast, while making sure it was clear and reliable, was the most important element in our ability to generate accurate predictions. We ended up using the following datasets, training the model on 3 years of daily data:

  • Web traffic.
  • Number of orders.
  • Average order value.
  • Conversion rate.
  • New customer revenue.
  • Marketing spend.
  • Marketing return on advertisement spend.
  • Promotions.

To create the dataset, we went through many iterations, changing the number of data sources until the predictions reached our benchmark of at least 95% accuracy.
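
As an illustration of this data preparation, the following sketch shows one way daily metrics could be shaped into the target time series file that Forecast imports. The file and column names are hypothetical, and the exact schema depends on the dataset domain you choose.

```python
# A rough sketch (hypothetical file and column names) of shaping daily metrics
# into a Forecast target time series: one row per item and day, with a
# timestamp, an item identifier, and the target value, and no header row.
import pandas as pd

daily = pd.read_csv("daily_metrics.csv", parse_dates=["date"])

# Treat each line of business (for example, contacts and glasses) as an item
# so Forecast can produce per-line as well as overall predictions.
target = (
    daily.rename(columns={"date": "timestamp",
                          "line_of_business": "item_id",
                          "revenue": "target_value"})
         [["item_id", "timestamp", "target_value"]]
         .sort_values(["item_id", "timestamp"])
)
target["timestamp"] = target["timestamp"].dt.strftime("%Y-%m-%d")
target.to_csv("target_time_series.csv", index=False, header=False)

# Additional signals such as marketing spend or promotions can be exported the
# same way as a related time series file.
```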

Dashboard and results

Writing the prediction results into our existing data lake allows us to use QuickSight to build metrics and dashboards for the senior-level managers. This enables them to understand and use these results when making decisions on the next steps needed to meet our monthly marketing targets.

We were able to present the forecast results on two levels, starting with overall business performance and then drilling down into performance for each line of business (contacts and glasses). For these three cases (overall, contacts, and glasses), we presented the following information:

  • Predicted revenue vs. target – This allows the marketing team to understand how we’re expected to perform this month, compared to our target, if they take no additional actions. For example, if we see that the projected sales don’t meet our marketing goals, we need to launch a new marketing campaign. The following screenshot shows an example analysis with a value of -17.47%, representing the expected total monthly revenue vs. the target.
  • Revenue performance compared to predictions over the last month – This graph shows that the actual revenue falls within the forecasted range, which means the predictions are accurate. The following example graph shows high bound, revenue, and low bound values.
  • Month to date revenue compared to weekly and monthly forecasts – The following example screenshot shows text automatically generated by QuickSight that indicates revenue-related KPIs.
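
Because the forecast export lands in Amazon S3, the dashboards can be fed by standard Athena SQL. The following is a minimal example, with hypothetical database, table, and bucket names and assuming the default p10/p50/p90 quantiles in the export, of issuing such a query programmatically; QuickSight datasets run equivalent SQL.

```python
# A minimal example of querying exported forecast data in S3 through Athena.
# Database, table, and bucket names are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT item_id, "date", p10, p50, p90
        FROM daily_revenue_forecast
        WHERE "date" >= '2021-06-01'
        ORDER BY "date"
    """,
    QueryExecutionContext={"Database": "forecast_results"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```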

Thanks to Forecast, Clearly now has an automated pipeline that generates forecasts for daily and weekly scenarios, reaching or surpassing our accuracy benchmark of 97%, an increase of 7.78% over the manual process, which was limited to longer periods.

Now, data gathering and preparation for daily forecasts of weekly and monthly revenue take only about 15 minutes, with the forecasting process itself taking close to 15 minutes to complete each day. This is a huge improvement over the 4–8 hours required by the manual process, which could only produce predictions for the whole month.

With more granularity and better accuracy, our marketing team has better tools to act faster on discrepancies and to create prediction scenarios for campaigns that could achieve better revenue results.

Conclusion

Effective and accurate prediction of customer future behavior is one of the biggest challenges in ML in retail today, and having a good understanding of our customers and their behavior is vital for business success. Forecast provided a fully managed ML solution to easily create an accurate and reliable prediction with minimal overhead. The biggest benefit we get with these predictions is that we have accurate visibility of what the future will look like and can change it if it doesn’t meet our targets.

In addition, Forecast allows us to predict what-if scenarios and their impact on revenue. For example, we can project the overall revenue until the end of the month, and with some data manipulation we can also predict what will happen if we launch a BOGO (buy one, get one free) campaign next Tuesday.

“With leading ecommerce tools like Virtual Try On, combined with our unparalleled customer service, we strive to help everyone see clearly in an affordable and effortless manner—which means constantly looking for ways to innovate, improve, and streamline processes. Effective and accurate prediction of customer future behavior is one of the biggest challenges in machine learning in retail today. In just a few weeks, Amazon Forecast helped us accurately and reliably forecast sales for the upcoming week with over 97% accuracy, and with over 90% accuracy when predicting sales for the following month.”

– Dr. Ziv Pollak, Machine Learning Team Leader.

For more information about how to get started building your own MLOps pipelines with Forecast, see Building AI-powered forecasting automation with Amazon Forecast by applying MLOps, and for other use cases, visit the AWS Machine Learning Blog.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Dr. Ziv Pollak is an experienced technical leader who transforms the way organizations use machine learning to increase revenue, reduce costs, improve customer service, and ensure business success. He is currently leading the Machine Learning team at Clearly.

Alex Thoreux is a Jr Web Analyst at Clearly who built the forecasting pipeline, as well as other ML applications for Clearly.

Fernando Rocha is a Specialist SA. As Clearly’s Solutions Architect, he helps them build analytics and machine learning solutions on AWS.

Read More

Announcing model improvements and lower annotation limits for Amazon Comprehend custom entity recognition

Amazon Comprehend is a natural language processing (NLP) service that provides APIs to extract key phrases, contextual entities, events, sentiment, and more from unstructured text. Entities refer to things in your document such as people, places, organizations, credit card numbers, and so on. But what if you want to add entity types unique to your business, like proprietary part codes or industry-specific terms? Custom entity recognition (CER) in Amazon Comprehend enables you to train models with entities that are unique to your business in just a few easy steps. You can identify almost any kind of entity, simply by providing a sufficient amount of training data.

Training an entity recognizer from the ground up requires extensive knowledge of machine learning (ML) and a complex process for model optimization. Amazon Comprehend makes this easy for you using a technique called transfer learning to help build your custom model. Internally, Amazon Comprehend uses base models that have been trained on data collected by Amazon Comprehend and optimized for the purposes of entity recognition. With this in place, all you need to supply is the data. ML model accuracy is typically dependent on both the volume and quality of data. Getting good quality annotation data is a laborious process.

Until today, you could train an Amazon Comprehend custom entity recognizer with only 1,000 documents and 200 annotations per entity. Today, we’re announcing that we have improved underlying models for the Amazon Comprehend custom entity API by reducing the minimum requirements to train the model. Now, with as few as 250 documents and 100 annotations per entity (also referred to as shots), you can train Amazon Comprehend CER models to predict entities with greater accuracy. To take advantage of the updated performance offered by the new CER model framework, you can simply retrain and deploy improved models.
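
For example, retraining a recognizer is a single API call. The following minimal sketch uses the CreateEntityRecognizer API with hypothetical names, role ARN, entity type, and S3 locations; the training data itself still needs to meet the new minimums described above.

```python
# A minimal sketch (hypothetical names, role, entity type, and S3 paths) of
# training a custom entity recognizer with the new, lower minimums of 250
# documents and 100 annotations per entity.
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_entity_recognizer(
    RecognizerName="device-parts-recognizer-v2",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "EntityTypes": [{"Type": "PART_CODE"}],
        "Documents": {"S3Uri": "s3://example-bucket/cer/documents.txt"},
        "Annotations": {"S3Uri": "s3://example-bucket/cer/annotations.csv"},
    },
)
print(response["EntityRecognizerArn"])
```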

To illustrate the model improvements, we compare the results of the previous models with those of the new release. We selected a diverse set of publicly available entity recognition datasets across different domains and languages to showcase the model improvements. In this post, we walk you through the results from our training and inference process for both the previous CER model version and the new CER model.

Datasets

When you train an Amazon Comprehend CER model, you provide the entities that you want the custom model to recognize, and the documents with text containing these entities. You can train Amazon Comprehend CER models using entity lists or annotations. Entity lists are CSV files that contain the text (a word or words) of an entity example from the training document along with a label, which is the entity type that the text is categorized as. With annotations, you can provide the positional offset of entities in a sentence along with the entity type being represented. When you use the entire sentence, you’re providing the contextual reference for the entities, which increases the accuracy of the model you’re training.
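
For reference, the two labeling formats look roughly like the following. The column headers follow the Amazon Comprehend documentation, while the example rows, file names, and offsets are illustrative only (offsets are zero-based, and the end offset is taken here as the position just after the entity’s last character).

```
Entity list (entity_list.csv):

Text,Type
Jo Brown,PERSON
ACME Corp,ORGANIZATION

Annotations (annotations.csv), where line 0 of documents.txt reads "Jo Brown approved the order.":

File,Line,Begin Offset,End Offset,Type
documents.txt,0,0,8,PERSON
```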

We selected the annotations option for labeling our entities because the datasets we selected already contained the annotations for each of the entity types represented. In this section, we discuss the datasets we selected and what they describe.

CoNLL

The Conference on Computational Natural Language Learning (CoNLL) provides datasets for language-independent (doesn’t use language-specific resources for performing the task) named entity recognition with entities provided in English, Spanish, and German. Four types of named entities are provided in the dataset: persons, locations, organizations, and names of miscellaneous entities that don’t belong to the previous three types.

We used the CoNLL-2003 dataset for English and the CoNLL-2002 dataset for Spanish for our entity recognition training. We ran some basic transformations to convert the annotation data to the format required by Amazon Comprehend CER, and converted the entity types from their semantic notation to the actual words they represent, such as person, organization, location, and miscellaneous.
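
The following is a rough sketch of that transformation, assuming the standard CoNLL layout of one token and its NER tag per line with blank lines between sentences. The file names, the output layout (one sentence per line in documents.txt), and the exclusive end-offset convention are assumptions to verify against the current Amazon Comprehend documentation.

```python
# A sketch of converting CoNLL-style tagged tokens into the offset-based
# annotations CSV used by Amazon Comprehend CER. File names are hypothetical.
import csv

TAG_MAP = {"PER": "PERSON", "ORG": "ORGANIZATION",
           "LOC": "LOCATION", "MISC": "MISCELLANEOUS"}

def conll_sentences(path):
    """Yield sentences as lists of (token, ner_tag) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
                continue
            parts = line.split()
            if parts[0] == "-DOCSTART-":
                continue
            sentence.append((parts[0], parts[-1]))  # token and its NER tag
    if sentence:
        yield sentence

def convert(conll_path, docs_path, annotations_path):
    with open(docs_path, "w", encoding="utf-8") as docs, \
         open(annotations_path, "w", newline="", encoding="utf-8") as ann:
        writer = csv.writer(ann)
        writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
        for line_no, sentence in enumerate(conll_sentences(conll_path)):
            text, spans, current = "", [], None
            for token, tag in sentence:
                if text:
                    text += " "
                begin = len(text)
                text += token
                end = len(text)  # exclusive end offset (position after last char)
                entity_type = TAG_MAP.get(tag[2:], tag[2:]) if "-" in tag else None
                if tag.startswith("B-"):
                    if current:
                        spans.append(current)
                    current = [begin, end, entity_type]
                elif tag.startswith("I-"):
                    if current and current[2] == entity_type:
                        current[1] = end
                    else:
                        # Some CoNLL distributions use IOB1, where I- can open an entity.
                        if current:
                            spans.append(current)
                        current = [begin, end, entity_type]
                else:  # "O" tag closes any open entity span
                    if current:
                        spans.append(current)
                        current = None
            if current:
                spans.append(current)
            docs.write(text + "\n")  # one sentence per document line
            for begin, end, entity_type in spans:
                # The File column must match the documents file name in S3.
                writer.writerow(["documents.txt", line_no, begin, end, entity_type])

convert("conll2003_train.txt", "documents.txt", "annotations.csv")
```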

SNIPS

The SNIPS dataset was created in 2017 as part of benchmarking tests for natural language understanding (NLU) by Snips. The results from these tests are available in the 2018 paper “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces” by Coucke, et al. We used the GetWeather and the AddToPlaylist datasets for our experiments. The entities for the GetWeather dataset we considered are timerange, city, state, condition_description, country, and condition_temperature. For AddToPlaylist, we considered the entities artist, playlist_owner, playlist, music_item, and entity_name.

Sampling configuration

The following list represents the dataset configuration for our tests. Each entry represents an Amazon Comprehend CER model that was trained, deployed, and used for entity prediction with our test dataset, and shows the dataset, its published year and language, the number of documents sampled for training, the number of entities sampled with the annotations per entity (shots), and the number of documents sampled for blind test inference (never seen during training).

  • SNIPS-AddToPlaylist (2017, English) – 254 training documents; 5 entities (artist: 101, playlist_owner: 148, playlist: 254, music_item: 100, entity_name: 100); 100 blind test documents.
  • SNIPS-GetWeather (2017, English) – 600 training documents; 6 entities (timeRange: 281, city: 211, state: 111, condition_description: 121, country: 117, condition_temperature: 115); 200 blind test documents.
  • SNIPS-GetWeather (2017, English) – 1000 training documents; 6 entities (timeRange: 544, city: 428, state: 248, condition_description: 241, country: 230, condition_temperature: 228); 200 blind test documents.
  • SNIPS-GetWeather (2017, English) – 2000 training documents; 6 entities (timeRange: 939, city: 770, state: 436, condition_description: 401, country: 451, condition_temperature: 431); 200 blind test documents.
  • CoNLL (2003, English) – 350 training documents; 3 entities (Location: 183, Organization: 111, Person: 229); 200 blind test documents.
  • CoNLL (2003, English) – 600 training documents; 3 entities (Location: 384, Organization: 210, Person: 422); 200 blind test documents.
  • CoNLL (2003, English) – 1000 training documents; 4 entities (Location: 581, Miscellaneous: 185, Organization: 375, Person: 658); 200 blind test documents.
  • CoNLL (2003, English) – 2000 training documents; 4 entities (Location: 1133, Miscellaneous: 499, Organization: 696, Person: 1131); 200 blind test documents.
  • CoNLL (2002, Spanish) – 380 training documents; 4 entities (Location: 208, Miscellaneous: 103, Organization: 404, Person: 207); 200 blind test documents.
  • CoNLL (2002, Spanish) – 600 training documents; 4 entities (Location: 433, Miscellaneous: 220, Organization: 746, Person: 436); 200 blind test documents.
  • CoNLL (2002, Spanish) – 1000 training documents; 4 entities (Location: 578, Miscellaneous: 266, Organization: 929, Person: 538); 200 blind test documents.
  • CoNLL (2002, Spanish) – 2000 training documents; 4 entities (Location: 1184, Miscellaneous: 490, Organization: 1726, Person: 945); 200 blind test documents.

For more details on how to format data to create annotations and entity lists for Amazon Comprehend CER, see Training Custom Entity Recognizers. We created a benchmarking approach based on the sampling configuration for our tests, and we discuss the results in the following sections.

Benchmarking process

As shown in the sampling configuration in the preceding section, we trained a total of 12 models, with four models each for CoNLL English and Spanish datasets with varying document and annotation configurations, three models for the SNIPS-GetWeather dataset, again with varying document and annotation configurations, and one model with the SNIPS-AddToPlaylist dataset, primarily to test the new minimums of 250 documents and 100 annotations per entity.

Two inputs are required to train an Amazon Comprehend CER model: entity representations and the documents containing these entities. For an example of how to train your own CER model, refer to Setting up human review of your NLP-based entity recognition models with Amazon SageMaker Ground Truth, Amazon Comprehend, and Amazon A2I. We measure the accuracy of our models using metrics such as F1 score, precision, and recall for the test set at training and for the blind test set at inference. We run subsequent inference on these models using a blind test dataset of documents that we set aside from our original datasets.

Precision indicates how many times the model makes a correct entity identification compared to the number of attempted identifications. Recall indicates how many times the model makes a correct entity identification compared to the number of instances in which the entity is actually present, as defined by the total number of correct identifications (true positives) and missed identifications (false negatives). The F1 score combines the precision and recall metrics, and measures the overall accuracy of the model for custom entity recognition. To learn more about these metrics, refer to Custom Entity Recognizer Metrics.
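
In other words, with true positives (TP), false positives (FP), and false negatives (FN) counted from strict span matches, the metrics reduce to the standard formulas, as in this small illustration:

```python
# Precision, recall, and F1 from strict span-match counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

# Example: 90 correctly predicted entity spans, 10 spurious, 20 missed.
p, r = precision(90, 10), recall(90, 20)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))  # 0.9 0.818 0.857
```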

Amazon Comprehend CER provides support for both real-time endpoints and batch inference requirements. We used the asynchronous batch inference API for our experiments. Finally, we calculated the F1 score, precision, and recall for the inference by comparing what the model predicted with what was originally annotated for the test documents. The metrics are calculated using a strict match on the span offsets; a partial match isn’t counted and doesn’t receive partial credit.
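
The batch inference itself is started with the asynchronous StartEntitiesDetectionJob API. The following is a minimal sketch with hypothetical ARNs, role, and S3 paths.

```python
# A minimal sketch (hypothetical ARNs and paths) of running blind test
# documents through a trained custom entity recognizer as a batch job.
import boto3

comprehend = boto3.client("comprehend")

job = comprehend.start_entities_detection_job(
    JobName="cer-blind-test-inference",
    EntityRecognizerArn=(
        "arn:aws:comprehend:us-east-1:123456789012:"
        "entity-recognizer/device-parts-recognizer-v2"
    ),
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "S3Uri": "s3://example-bucket/cer/blind-test/documents.txt",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://example-bucket/cer/blind-test/output/"},
)
print(job["JobId"], job["JobStatus"])
```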

Results

The following results are from the experiments we ran using the sampling configuration and the benchmarking process explained previously.

Previous limits vs. new limits

The limits have been reduced from 1,000 documents and 200 annotations per entity for CER training in the previous model to 250 documents and 100 annotations per entity in the improved model.

The following results show the absolute improvement in F1 scores measured at training between the old and new models. The new model improves the accuracy of your entity recognition models even when you have a lower count of training documents.

  • CoNLL-2003-EN-600 – previous F1: 85; new F1: 96.2; gain: 11.2 points
  • CoNLL-2003-EN-1000 – previous F1: 80.8; new F1: 91.5; gain: 10.7 points
  • CoNLL-2003-EN-2000 – previous F1: 92.2; new F1: 94.1; gain: 1.9 points
  • CoNLL-2003-ES-600 – previous F1: 81.3; new F1: 86.5; gain: 5.2 points
  • CoNLL-2003-ES-1000 – previous F1: 85.3; new F1: 92.7; gain: 7.4 points
  • CoNLL-2003-ES-2000 – previous F1: 86.1; new F1: 87.2; gain: 1.1 points
  • SNIPS-Weather-600 – previous F1: 74.7; new F1: 92.1; gain: 17.4 points
  • SNIPS-Weather-1000 – previous F1: 93.1; new F1: 94.8; gain: 1.7 points
  • SNIPS-Weather-2000 – previous F1: 92.1; new F1: 95.9; gain: 3.8 points

Next, we report the evaluation on a blind test set that was split from each dataset before the training process. The previous model was trained with at least 200 annotations per entity, and the new (improved) model with approximately 100 annotations per entity.

  • CoNLL-2003 – English (3 entities) – previous model: F1 84.9 at training, 79.4 on the blind test set; new model: F1 90.2 at training, 87.9 on the blind test set; blind test gain: 8.5 points.
  • CoNLL-2003 – Spanish (4 entities) – previous model: F1 85.8 at training, 76.3 on the blind test set; new model: F1 90.4 at training, 81.8 on the blind test set; blind test gain: 5.5 points.
  • SNIPS-Weather (6 entities) – previous model: F1 74.74 at training, 80.64 on the blind test set; new model: F1 92.14 at training, 93.6 on the blind test set; blind test gain: 12.96 points.

Overall, we observe an improvement in F1 scores with the new model even with half the number of annotations provided, as seen in the preceding results.

Continued improvement with more data

In addition to the improved F1 scores at lower limits, we noticed a trend where the new model’s accuracy measured with the blind test dataset continued to improve as we trained with increased annotations. For this test, we considered the SNIPS GetWeather and AddToPlaylist datasets.

The following graph shows a distribution of absolute blind test F1 scores for models trained with different datasets and annotation counts.

We generated the following metrics during training and inference for the SNIPS-AddToPlaylist model trained with 250 documents in the new Amazon Comprehend CER model.

SNIPS-AddToPlaylist metrics at training time

SNIPS-AddToPlaylist inference metrics with blind test dataset

Conclusion

In our experiments with the model improvements in Amazon Comprehend CER, we observe accuracy improvements with fewer annotations and lower document volumes. Now, we consistently see increased accuracy across multiple datasets even with half the number of data samples. We continue to see improvements to the F1 score as we trained models with different dataset sampling configurations, including multi-lingual models. With this updated model, Amazon Comprehend makes it easy to train custom entity recognition models. Limits have been lowered to 100 annotations per entity and 250 documents for training while offering improved accuracy with your models. You can start training custom entity models on the Amazon Comprehend console or through the API.


About the Authors

Prem Ranga is an Enterprise Solutions Architect based out of Houston, Texas. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an autonomous vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

Chethan Krishna is a Senior Partner Solutions Architect in India. He works with Strategic AWS Partners to establish a robust cloud competency, adopt AWS best practices, and solve customer challenges. He is a builder and enjoys experimenting with AI/ML, IoT, and Analytics.

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.

Read More