Amazon Forecast now provides estimated run time for forecast creation jobs, enabling you to manage your time efficiently

Amazon Forecast now displays the estimated time it takes to complete an in-progress workflow for importing your data, training the predictor, and generating the forecast. You can now manage your time more efficiently and better plan for your next workflow around the estimated time remaining for your in-progress workflow. Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.

Previously, you had no clear insight into how long a workflow would take to complete, which forced you to proactively monitor each stage, whether it was importing your data, training the predictor, or generating the forecast. This made it difficult to plan for subsequent steps, which was especially frustrating because the time required to import data, train a predictor, and create forecasts can vary widely depending on the size and characteristics of your data.

Now, you have visibility into how long a workflow may take, which is especially useful when you run your forecast workloads manually or are experimenting. Knowing how long each workflow will take allows you to focus on other tasks and return to your forecast journey later. Additionally, the displayed estimate refreshes automatically, which keeps your expectations accurate as the workflow progresses.

In this post, we walk through the Forecast console experience of reading the estimated time to workflow completion. To check the estimated time through the APIs, refer to DescribeDatasetImportJob, DescribePredictor, and DescribeForecast.
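For example, the following boto3 sketch reads the estimate for a dataset import job; the same pattern applies to describe_predictor and describe_forecast. The EstimatedTimeRemainingInMinutes field name and the placeholder ARN are assumptions here, so confirm them against the API reference.

import boto3

forecast = boto3.client("forecast")

# Describe an in-progress dataset import job and print its status along with
# the estimated time remaining (field name assumed; see the API reference)
response = forecast.describe_dataset_import_job(
    DatasetImportJobArn="arn:aws:forecast:us-east-1:123456789012:dataset-import-job/my_dataset/my_import"  # placeholder ARN
)
print(response["Status"], response.get("EstimatedTimeRemainingInMinutes"))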

If you want to build automated workflows for Forecast, we recommend following the steps outlined in Create forecasting systems faster with automated workflows and notifications in Amazon Forecast, which walks through integrating Forecast with Amazon EventBridge to build event-driven Forecast workflows. EventBridge removes the need to manually check the estimated time for a workflow to complete, because it starts your desired next workflow automatically.

Check the estimated time to completion of your dataset import workflow

After you create a new dataset import job, you can see the Create pending status for the newly created job. When the status changes to Create in progress, you can see the estimated time remaining in the Status column of the Datasets imports section. This estimated time refreshes automatically until the status changes to Active.

On the details page of the newly created dataset import job, when the status is Create in progress, the Estimated time remaining field shows the remaining time for the import job to complete and Actual import time shows -. This section refreshes automatically with the estimated time to completion. After the import job is complete and the status becomes Active, the Actual import time shows the total time of the import.

Check the estimated time to completion of your predictor training workflow

After you create a new predictor, you first see the Create pending status for the newly created job. When the status changes to Create in progress, you see the estimated time remaining in the Status column in the Predictors section. This estimated time refreshes automatically until the status changes to Active.

On the details page of the newly created predictor job, when the status is Create in progress, the Estimated time remaining field shows the remaining time for the predictor training to complete and the actual time field shows -. This section refreshes automatically with the estimated time to completion. After training is complete and the status becomes Active, the actual time field shows the total time taken to create the predictor.

Check the estimated time to completion of your forecast creation workflow

After you create a new forecast, you first see the Create pending status for the newly created job. When the status changes to Create in progress, you see the estimated time remaining in the Status column. This estimated time refreshes automatically until it changes to Active.

On the details page of the newly created forecast job, when the status is Create in progress, the Estimated time remaining field shows the remaining time for the forecast job to complete and the actual time field shows -. This section refreshes automatically with the estimated time to completion. After the forecast job is complete and the status changes to Active, the actual time field shows the total time taken to create the forecast.

Conclusion

You can now see how long a workflow will take when you initiate it in Forecast, which can help you manage your time more efficiently. The new field appears automatically in the responses of the Describe* calls, without requiring any setup.

To learn more about this capability, see DescribeDatasetImportJob, DescribePredictor, and DescribeForecast. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Alex Kim is a Sr. Product Manager for Amazon Forecast. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.

 

 

 

Ranjith Kumar Bodla is an SDE in the Amazon Forecast team. He works as a backend developer within a distributed environment with a focus on AI/ML and leadership. During his spare time, he enjoys playing table tennis, traveling, and reading.

 

 

 

Gautam Puri is a Software Development Engineer on the Amazon Forecast team. His focus area is on building distributed systems that solve machine learning problems. In his free time, he enjoys hiking and basketball.

 

 

 

Shannon Killingsworth is a UX Designer for Amazon Forecast and Amazon Personalize. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.

 

Read More

Build an event-based tracking solution using Amazon Lookout for Vision

Amazon Lookout for Vision is a machine learning (ML) service that spots defects and anomalies in visual representations using computer vision (CV). With Amazon Lookout for Vision, manufacturing companies can increase quality and reduce operational costs by quickly identifying differences in images of objects at scale.

Many enterprise customers want to identify missing components in products, damage to vehicles or structures, irregularities in production lines, minuscule defects in silicon wafers, and other similar problems. Amazon Lookout for Vision uses ML to see and understand images from any camera as a person would, but with an even higher degree of accuracy and at a much larger scale. Amazon Lookout for Vision eliminates the need for costly and inconsistent manual inspection, while improving quality control, defect and damage assessment, and compliance. In minutes, you can begin using Amazon Lookout for Vision to automate inspection of images and objects—with no ML expertise required.

In this post, we look at how we can automate detecting anomalies in silicon wafers and notifying operators in real time.

Solution overview

Keeping track of the quality of products in a manufacturing line is a challenging task. Some process steps take images of the product that humans then review in order to assure good quality. Thanks to artificial intelligence, you can automate these anomaly detection tasks, but human intervention may be necessary after anomalies are detected. A standard approach is sending emails when problematic products are detected. These emails might be overlooked, which could cause a loss in quality in a manufacturing plant.

In this post, we automate the process of detecting anomalies in silicon wafers and notifying operators in real time using automated phone calls. The following diagram illustrates our architecture. We deploy a static website using AWS Amplify, which serves as the entry point for our application. Whenever a new image is uploaded via the UI (1), an AWS Lambda function invokes the Amazon Lookout for Vision model (2) and predicts whether this wafer is anomalous or not. The function stores each uploaded image to Amazon Simple Storage Service (Amazon S3) (3). If the wafer is anomalous, the function sends the confidence of the prediction to Amazon Connect and calls an operator (4), who can take further action (5).
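The Lambda code ships with the solution's GitHub repository; the following Python sketch only illustrates what steps 2-4 could look like, with the environment variable names and the request payload shape being assumptions for illustration.

import base64
import json
import os
import boto3

s3 = boto3.client("s3")
lookout = boto3.client("lookoutvision")
connect = boto3.client("connect")

def handler(event, context):
    # Decode the uploaded image sent by the static website (payload shape assumed)
    image_bytes = base64.b64decode(json.loads(event["body"])["image"].split(",")[-1])

    # Store the uploaded image for later review
    s3.put_object(Bucket=os.environ["BUCKET"], Key="uploads/wafer.jpg", Body=image_bytes)

    # Ask the Lookout for Vision model whether the wafer is anomalous
    result = lookout.detect_anomalies(
        ProjectName=os.environ["PROJECT_NAME"],
        ModelVersion=os.environ["MODEL_VERSION"],
        Body=image_bytes,
        ContentType="image/jpeg",
    )["DetectAnomalyResult"]

    if result["IsAnomalous"]:
        # Call the operator through the imported Amazon Connect contact flow
        connect.start_outbound_voice_contact(
            DestinationPhoneNumber=os.environ["DEST_NUMBER"],
            ContactFlowId=os.environ["FLOW_ID"],
            InstanceId=os.environ["INSTANCE_ID"],
            SourcePhoneNumber=os.environ["SOURCE_NUMBER"],
            Attributes={"Confidence": str(round(result["Confidence"] * 100, 2))},
        )

    return {
        "statusCode": 200,
        "body": json.dumps({"IsAnomalous": result["IsAnomalous"], "Confidence": result["Confidence"]}),
    }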

Setting up Amazon Connect and the associated contact flow

To configure Amazon Connect and the contact flow, you complete the following high-level steps:

  1. Create an Amazon Connect instance.
  2. Set up the contact flow.
  3. Claim your phone number.

Create an Amazon Connect instance

The first step is to create an Amazon Connect instance. For the rest of the setup, we use the default values, but don’t forget to create an administrator login.

Instance creation can take a few minutes, after which we can log in to the Amazon Connect instance using the admin account we created.

Setting up the contact flow

In this post, we have a predefined contact flow that we can import. For more information about importing an existing contact flow, see Import/export contact flows.

  1. Choose the file contact-flow/wafer-anomaly-detection from the GitHub repo.
  2. Choose Import.

The imported contact flow looks similar to the following screenshot.

  3. On the flow details page, expand Show additional flow information.

Here you can find the ARN of the contact flow.

  4. Record the contact flow ID and contact center ID, which you need later.

Claim your phone number

Claiming a number is easy and takes just a few clicks. Make sure to choose the previously imported contact flow while claiming the number.

If no numbers are available in the country of your choice, raise a support ticket.

Contact flow overview

The following screenshot shows our contact flow.

The contact flow performs the following functions:

  • Enable logging
  • Set the output Amazon Polly voice (for this post, we use the Kendra voice)
  • Get customer input using DTMF (only keys 1 and 2 are valid).
  • Based on the user’s input, the flow does one of the following:
    • Prompt a goodbye message stating no action will be taken and exit
    • Prompt a goodbye message stating an action will be taken and exit
    • Fail and deliver a fallback block stating that the machine will shut down and exit

Optionally, you can enhance your system with an Amazon Lex bot.

Deploy the solution

Now that you have set up Amazon Connect, deployed your contact flow, and noted the information you need for the rest of the deployment, we can deploy the remaining components. In the cloned GitHub repository, edit the build.sh script and run it from the command line:

#Global variables
ApplicationRegion="YOUR_REGION"
S3SourceBucket="YOUR_S3_BUCKET-sagemaker"
LookoutProjectName="YOUR_PROJECT_NAME"
FlowID="YOUR_FLOW_ID"
InstanceID="YOUR_INSTANCE_ID"
SourceNumber="YOUR_CLAIMED_NUMBER"
DestNumber="YOUR_MOBILE_PHONE_NUMBER"
CloudFormationStack="YOUR_CLOUD_FORMATION_STACK_NAME"

Provide the following information:

  • Your Region
  • The S3 bucket name you want to use (make sure the name includes the word sagemaker).
  • The name of the Amazon Lookout for Vision project you want to use
  • The ID of your contact flow
  • Your Amazon Connect instance ID
  • The number you’ve claimed in Amazon Connect in E.164 format (for example, +132398765)
  • A name for the AWS CloudFormation stack you create by running this script

This script then performs the following actions:

  • Create an S3 bucket for you
  • Build the .zip files for your Lambda function
  • Upload the CloudFormation template and the Lambda function to your new S3 bucket
  • Create the CloudFormation stack

After the stack is deployed, you can find the following resources created on the AWS CloudFormation console.

You can see that an Amazon SageMaker notebook called amazon-lookout-vision-create-project is also created.

Build, train, and deploy the Amazon Lookout for Vision model

In this section, we see how to build, train, and deploy the Amazon Lookout for Vision model using the open-source Python SDK. For more information about the Amazon Lookout for Vision Python SDK, see this blog post.

You can build the model via the AWS Management Console. For programmatic deployment, complete the following steps:

  1. On the SageMaker console, on the Notebook instances page, access the SageMaker notebook instance that was created earlier by choosing Open Jupyter.

In the instance, you can find the GitHub repository of the Amazon Lookout for Vision Python SDK automatically cloned.

  2. Navigate into the amazon-lookout-for-vision-python-sdk/example folder.

The folder contains an example notebook that walks you through building, training, and deploying a model. Before you get started, you need to upload the images you'll use to train the model to your notebook instance.

  3. In the example/ folder, create two new folders named good and bad.
  4. Navigate into both folders and upload your images accordingly.

Example images are in the downloaded GitHub repository.

  5. After you upload the images, open the lookout_for_vision_example.ipynb notebook.

The notebook walks you through the process of creating your model. One important step you should do first is provide the following information:

# Training & Inference
input_bucket = "YOUR_S3_BUCKET_FOR_TRAINING"
project_name = "YOUR_PROJECT_NAME"
model_version = "1" # leave this as one if you start right at the beginning

# Inference
output_bucket = "YOUR_S3_BUCKET_FOR_INFERENCE" # can be same as input_bucket
input_prefix = "YOUR_KEY_TO_FILES_TO_PREDICT/" # used in batch_predict
output_prefix = "YOUR_KEY_TO_SAVE_FILES_AFTER_PREDICTION/" # used in batch_predict

You can ignore the inference section, but feel free to also play around with this part of the notebook. Because you’re just getting started, you can leave model_version set to “1”.

For input_bucket and project_name, use the S3 bucket and Amazon Lookout for Vision project name that are provided as part of the build.sh script. You can then run each cell in the notebook, which successfully deploys the model.

You can view the training metrics using the SDK, but you can also find them on the console. To do so, open your project, navigate to the models, and choose the model you’ve trained. The metrics are available on the Performance metrics tab.
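If you prefer to pull the metrics programmatically, a small boto3 sketch like the following can retrieve them; the project name and model version are placeholders, and the exact response keys should be confirmed against the Lookout for Vision API reference.

import boto3

lookout = boto3.client("lookoutvision")

# Fetch the status and performance metrics of a trained model version
description = lookout.describe_model(ProjectName="YOUR_PROJECT_NAME", ModelVersion="1")["ModelDescription"]
performance = description.get("Performance", {})
print("Status:", description["Status"])
print("F1:", performance.get("F1Score"), "Precision:", performance.get("Precision"), "Recall:", performance.get("Recall"))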

You’re now ready to deploy a static website that can call your model on demand.

Deploy the static website

Your first step is to add the endpoint of your Amazon API Gateway to your static website’s source code.

  1. On the API Gateway console, find the REST API called LookoutVisionAPI.
  2. Open the API and choose Stages.
  3. On the stage's drop-down menu (for this post, dev), choose the POST method.
  4. Copy the value for Invoke URL.

We add the URL to the HTML source code.

  5. Open the file html/index.html.

At the end of the file, you can find a section that uses jQuery to trigger an AJAX request. One key is called url, which has an empty string as its value.

  6. Enter the URL you copied as your new url value and save the file.

The code should look similar to the following:

$.ajax({
    type: 'POST',
    url: 'https://<API_Gateway_ID>.execute-api.<AWS_REGION>.amazonaws.com/dev/amazon-lookout-vision-api',
    data: JSON.stringify({coordinates: coordinates, image: reader.result}),
    cache: false,
    contentType: false,
    processData: false,
    success:function(data) {
        var anomaly = data["IsAnomalous"]
        var confidence = data["Confidence"]
        text = "Anomaly:" + anomaly + "<br>" + "Confidence:" + confidence + "<br>";
        $("#json").html(text);
    },
    error: function(data){
        console.log("error");
        console.log(data);
}});
  7. Convert the index.html file to a .zip file.
  8. On the AWS Amplify console, choose the app ObjectTracking.

The front-end environment page of your app opens automatically.

  9. Select Deploy without Git provider.

You can enhance this piece to connect AWS Amplify to Git and automate your whole deployment.

  10. Choose Connect branch.

  11. For Environment name, enter a name (for this post, we enter dev).
  12. For Method, select Drag and drop.
  13. Choose Choose files to upload the index.html.zip file you created.
  14. Choose Save and deploy.

After the deployment is successful, you can use your web application by choosing the domain displayed in AWS Amplify.

Detect anomalies

Congratulations! You just built a solution to automate the detection of anomalies in silicon wafers and alert an operator to take appropriate action. The data we use for Amazon Lookout for Vision is a wafer map taken from Wikipedia. A few “bad” spots have been added to mimic real-world scenarios in semiconductor manufacturing.

After deploying the solution, you can run a test to see how it works. When you open the AWS Amplify domain, you see a website that lets you upload an image. For this post, we present the result of detecting a bad wafer with a so-called donut pattern. After you upload the image, it’s displayed on your website.

If the image is detected as an anomaly, Amazon Connect calls your phone number and you can interact with the service.

Conclusion

In this post, we used Amazon Lookout for Vision to automate the detection of anomalies in silicon wafers and alert an operator in real time using Amazon Connect so they can take action as needed.

This solution isn’t bound to just wafers. You can extend it to object tracking in transportation, products in manufacturing, and other endless possibilities.


About the Authors

Tolla Cherwenka is an AWS Global Solutions Architect who is certified in data and analytics. She uses an art-of-the-possible approach to work backwards from business goals and develop transformative event-driven data architectures that enable data-driven decisions. She is passionate about creating prescriptive solutions for refactoring mission-critical monolithic workloads to microservices, as well as supply chain and connected factory solutions that leverage IoT, machine learning, big data, and analytics services.

 

 Michael Wallner is a Global Data Scientist with AWS Professional Services and is passionate about enabling customers on their AI/ML journey in the cloud to become AWSome. Besides having a deep interest in Amazon Connect he likes sports and enjoys cooking.

 

 

Krithivasan Balasubramaniyan is a Principal Consultant at Amazon Web Services. He enables global enterprise customers in their digital transformation journey and helps architect cloud native solutions.

 

Read More

Quality Assessment for SageMaker Ground Truth Video Object Tracking Annotations using Statistical Analysis

Data quality is an important topic for virtually all teams and systems deriving insights from data, especially teams and systems using machine learning (ML) models. Supervised ML is the task of learning a function that maps an input to an output based on examples of input-output pairs. For a supervised ML algorithm to effectively learn this mapping, the input-output pairs must be accurately labeled, which makes data labeling a crucial step in any supervised ML task.

Supervised ML is commonly used in the computer vision space. You can train an algorithm to perform a variety of tasks, including image classification, bounding box detection, and semantic segmentation, among many others. Computer vision annotation tools, like those available in Amazon SageMaker Ground Truth (Ground Truth), simplify the process of creating labels for computer vision algorithms and encourage best practices, resulting in high-quality labels.

To ensure quality, humans must be involved at some stage to either annotate or verify the assets. However, human labelers are often expensive, so it’s important to use them cost-effectively. There is no industry-wide standard for automatically monitoring the quality of annotations during the labeling process of images (or videos or point clouds), so human verification is the most common solution.

The process for human verification of labels involves expert annotators (verifiers) verifying a sample of the data labeled by a primary annotator where the experts correct (overturn) any errors in the labels. You can often find candidate samples that require label verification by using ML methods. In some scenarios, you need the same images, videos, or point clouds to be labeled and processed by multiple labelers to determine ground truth when there is ambiguity. Ground Truth accomplishes this through annotation consolidation to get agreement on what the ground truth is based on multiple responses.

In computer vision, we often deal with tasks that contain a temporal dimension, such as video and LiDAR sensors capturing sequential frames. Labeling this kind of sequential data is complex and time consuming. The goal of this blog post is to reduce the total number of frames that need human review by performing automated quality checks in multi-object tracking (MOT) time series data like video object tracking annotations while maintaining data quality at scale. The quality initiative in this blog post proposes science-driven methods that take advantage of the sequential nature of these inputs to automatically identify potential outlier labels. These methods enable you to a) objectively track data labeling quality for Ground Truth video, b) use control mechanisms to achieve and maintain quality targets, and c) optimize costs to obtain high-quality data.

We will walk through an example situation in which a large video dataset has been labeled by primary human annotators for an ML system and demonstrate how to perform automatic quality assurance (QA) to identify samples that may not be labeled properly. How can this be done without overwhelming a team's limited resources? We'll show you how by using Ground Truth and Amazon SageMaker.

Background

Data annotation is, typically, a manual process in which the annotator follows a set of guidelines and operates in a “best-guess” manner. Discrepancies in labeling criteria between annotators can have an effect on label quality, which may impact algorithm inference performance downstream.

For sequential inputs like video at a high frame rate, it can be assumed that a frame at time t will be very similar to a frame at time t+1. This extends to the labeled objects in the frames and allows large deviations between labels across consecutive frames to be considered outliers, which can be identified with statistical metrics. Auditors can be directed to pay special attention to these outlier frames in the verification process.

A common theme in feedback from customers is the desire to create a standard methodology and framework to monitor annotations from Ground Truth and identify frames with low-quality annotations for auditing purposes. We propose this framework to allow you to measure the quality on a certain set of metrics and take action — for example, by sending those specific frames for relabeling using Ground Truth or Amazon Augmented AI (Amazon A2I).

The following glossary defines terms frequently used in this post:

  • Annotation: The process whereby a human manually captures metadata related to a task. An example would be drawing the outline of the products in a still image.
  • SageMaker Ground Truth: Ground Truth handles the scheduling of various annotation tasks and collecting the results. It also supports defining labor pools and labor requirements for performing the annotation tasks.
  • IoU: The intersection over union (IoU) ratio measures the overlap between two regions of interest in an image. This measures how good our object detector prediction is against the ground truth (the real object boundary).
  • Detection rate: The number of detected boxes divided by the number of ground truth boxes.
  • Annotation pipeline: The complete end-to-end process of capturing a dataset for annotation, submitting the dataset for annotation, performing the annotation, performing quality checks, and adjusting incorrect annotations.
  • Source data: The MOT17 dataset.
  • Target data: The unified ground truth dataset.

Evaluation metrics

Quality validation of annotations using statistical approaches is an exciting open area of research, and the following quality metrics are often used to perform statistical validation.

Intersection over union (IoU)

IoU measures the overlap between two bounding boxes: the area of their intersection divided by the area of their union. A high IoU combined with a low Hausdorff distance indicates that a source bounding box corresponds well with a target bounding box in geometric space. These parameters may also indicate a skew in imagery. A low IoU may indicate quality conflicts between bounding boxes.

IoU(bp, bgt) = area(bp ∩ bgt) / area(bp ∪ bgt)

In the preceding equation, bp is the predicted bounding box and bgt is the ground truth bounding box.
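The rolling IoU code later in this post assumes a helper named bb_int_over_union (defined in the accompanying notebook); a minimal version could look like the following sketch, with boxes given as [x_min, y_min, x_max, y_max].

def bb_int_over_union(boxA, boxB):
    # coordinates of the intersection rectangle
    x1, y1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    x2, y2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # areas of the two boxes and of their union
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    union = areaA + areaB - inter
    return inter / union if union > 0 else 0.0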

Center Loss

Center loss is the distance between bounding box centers:

center_loss = sqrt((xp - xgt)^2 + (yp - ygt)^2)

In the preceding equation, (xp, yp) is the center of the predicted bounding box and (xgt, ygt) is the center of the ground truth bounding box.

IoU distribution

If the mean, median, and mode of an object's IoU are drastically different from those of other objects, we may want to flag the object in question for manual auditing. We can use visualizations like heat maps for a quick understanding of object-level IoU variance.

MOT17 Dataset

The Multi Object Tracking Benchmark is a commonly used benchmark for multiple target tracking evaluation. It offers a variety of datasets for training and evaluating multi-object tracking models. For this post, we use the MOT17 dataset as our source data, which is based around detecting and tracking a large number of vehicles.

Solution

To run and customize the code used in this blog post, use the notebook Ground_Truth_Video_Quality_Metrics.ipynb in the Amazon SageMaker Examples tab of a notebook instance, under Ground Truth Labeling Jobs. You can also find the notebook on GitHub.

Download MOT17 dataset

Our first step is to download the data, which takes a few minutes, unzip it, and send it to Amazon Simple Storage Service (Amazon S3) so we can launch audit jobs. See the following code:

# Grab our data this will take ~5 minutes
!wget https://motchallenge.net/data/MOT17.zip -O /tmp/MOT17.zip
    
# unzip our data
!unzip -q /tmp/MOT17.zip -d MOT17
!rm /tmp/MOT17.zip

View MOT17 annotations

Now let’s look at what the existing MOT17 annotations look like.

In the following image, we have a scene with a large number of cars and pedestrians on a street. The labels include both bounding box coordinates as well as unique IDs for each object, or in this case cars, being tracked.

Evaluate our labels

For demonstration purposes, we’ve labeled three vehicles in one of the videos and inserted a few labeling anomalies into the annotations. Although human labelers tend to be accurate, they’re subject to conditions like distraction and fatigue, which can affect label quality. If we use automated methods to identify annotator mistakes and send directed recommendations for frames and objects to fix, we can make the label auditing process more accurate and efficient. If a labeler only has to focus on a few frames instead of a deep review of the entire scene, they can drastically improve speed and reduce cost.

Analyze our tracking data

Let’s put our tracking data into a form that’s easier to analyze.

We use a function to take the output JSON from Ground Truth and turn our tracking output into a dataframe. We can use this to plot values and metrics that will help us understand how the object labels move through our frames. See the following code:

# generate dataframes
lab_frame_real = create_annot_frame(tlabels['tracking-annotations'])
lab_frame_real.head()

Plot progression

Let’s start with some simple plots. The following plots illustrate how the coordinates of a given object progress through the frames of your video. Each bounding box has a left and top coordinate, representing the top-left point of the bounding box. We also have height and width values that let us determine the other three points of the box.

In the following plots, the blue lines represent the progression of our four values (top coordinate, left coordinate, width, and height) through the video frames and the orange lines represent a rolling average of the values from the previous five frames. Because a video is a sequence of frames, if we have a video that has five frames per second or more, the objects within the video (and the bounding boxes drawn around them) should have some amount of overlap between frames. In our video, we have vehicles driving at a normal pace so our plots should show a relatively smooth progression.

We can also plot the deviation between the rolling average and the actual values of bounding box coordinates. We’ll likely want to look at frames where the actual value deviates substantially from the rolling average.
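As a rough sketch of that check, the following function flags frames where a coordinate drifts away from its five-frame rolling average by more than a pixel threshold; the column name and threshold here are illustrative assumptions.

# Flag frames where a coordinate deviates sharply from its rolling average.
# Assumes a pandas dataframe with one row per frame, like lab_frame_real above.
def rolling_deviation_frames(annot_frame, col='top', window=5, pixel_thresh=20):
    rolling = annot_frame[col].rolling(window, min_periods=1).mean()
    deviation = (annot_frame[col] - rolling).abs()
    return annot_frame.index[deviation > pixel_thresh].tolist()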

Plot box sizes

Let’s combine the width and height values to look at how the size of the bounding box for a given object progresses through the scene. For Vehicle 1, we intentionally reduced the size of the bounding box on frame 139 and restored it on frame 141. We also removed a bounding box on frame 217. We can see both of these flaws reflected in our size progression plots.

Box size differential

Let’s now look at how the size of the box changes from frame to frame by plotting the actual size differential. This allows us to get a better idea of the magnitude of these changes. We can also normalize the magnitude of the size changes by dividing the size differentials by the sizes of the boxes. This lets us express the differential as a percentage change from the original size of the box. This makes it easier to set thresholds beyond which we can classify this frame as potentially problematic for this object bounding box. The following plots visualize both the absolute size differential and the size differential as a percentage. We can also add lines representing where the bounding box changed by more than 20% in size from one frame to the next.

View the frames with the largest size differential

Now that we have the indexes for the frames with the largest size differential, we can view them in sequence. If we look at the following frames, we can see for Vehicle 1 we were able to identify frames where our labeler made a mistake. Frame 217 was flagged because there was a large difference between frame 216 and the subsequent frame, frame 217.

Rolling IoU

IoU is a commonly used evaluation metric for object detection. We calculate it by dividing the area of overlap between two bounding boxes by the area of union for two bounding boxes. Although it’s typically used to evaluate the accuracy of a predicted box against a ground truth box, we can use it to evaluate how much overlap a given bounding box has from one frame of a video to the next.

Because our frames differ, we don't expect a given bounding box for a single object to have 100% overlap with the corresponding bounding box from the next frame. However, depending on the frames per second for the video, there often is only a small amount of change from one frame to the next because the time elapsed between frames is only a fraction of a second. For higher FPS video, we can expect a substantial amount of overlap between frames. The MOT17 videos are all shot at 25 FPS, so these videos qualify. Operating with this assumption, we can use IoU to identify outlier frames where we see substantial differences between a bounding box in one frame and the next. See the following code:

# calculate rolling intersection over union
def calc_frame_int_over_union(annot_frame, obj, i):
    lframe_len = max(annot_frame['frameid'])
    annot_frame = annot_frame[annot_frame.obj==obj]
    annot_frame.index = list(np.arange(len(annot_frame)))
    coord_vec = np.zeros((lframe_len+1,4))
    coord_vec[annot_frame['frameid'].values, 0] = annot_frame['left']
    coord_vec[annot_frame['frameid'].values, 1] = annot_frame['top']
    coord_vec[annot_frame['frameid'].values, 2] = annot_frame['width']
    coord_vec[annot_frame['frameid'].values, 3] = annot_frame['height']
    boxA = [coord_vec[i,0], coord_vec[i,1], coord_vec[i,0] + coord_vec[i,2], coord_vec[i,1] + coord_vec[i,3]]
    boxB = [coord_vec[i+1,0], coord_vec[i+1,1], coord_vec[i+1,0] + coord_vec[i+1,2], coord_vec[i+1,1] + coord_vec[i+1,3]]
    return bb_int_over_union(boxA, boxB)
# create list of objects
objs = list(np.unique(label_frame.obj))
# iterate through our objects to get rolling IoU values for each
iou_dict = {}
for obj in objs:
    iou_vec = np.ones(len(np.unique(label_frame.frameid)))
    ious = []
    for i in label_frame[label_frame.obj==obj].frameid[:-1]:
        iou = calc_frame_int_over_union(label_frame, obj, i)
        ious.append(iou)
        iou_vec[i] = iou
    iou_dict[obj] = iou_vec
    
fig, ax = plt.subplots(nrows=1,ncols=3, figsize=(24,8), sharey=True)
ax[0].set_title(f'Rolling IoU {objs[0]}')
ax[0].set_xlabel('frames')
ax[0].set_ylabel('IoU')
ax[0].plot(iou_dict[objs[0]])
ax[1].set_title(f'Rolling IoU {objs[1]}')
ax[1].set_xlabel('frames')
ax[1].set_ylabel('IoU')
ax[1].plot(iou_dict[objs[1]])
ax[2].set_title(f'Rolling IoU {objs[2]}')
ax[2].set_xlabel('frames')
ax[2].set_ylabel('IoU')
ax[2].plot(iou_dict[objs[2]])

The following plots show our results:

Identify and visualize low overlap frames

Now that we have calculated our intersection over union for our objects, we can identify objects below an IoU threshold we set. Let’s say we want to identify frames where the bounding box for a given object has less than 50% overlap. We can use the following code:

## ID problem indices
iou_thresh = 0.5
vehicle = 1 # because index starts at 0, 0 -> vehicle:1, 1 -> vehicle:2, etc.
# use np.where to identify frames below our threshold.
inds = np.where(np.array(iou_dict[objs[vehicle]]) < iou_thresh)[0]
worst_ind = np.argmin(np.array(iou_dict[objs[vehicle]]))
print(objs[vehicle],'worst frame:', worst_ind)

Visualize low overlap frames

Now that we have identified our low overlap frames, let’s view them. We can see for Vehicle:2, there is an issue on frame 102, compared to frame 101.

The annotator made a mistake and the bounding box for Vehicle:2 does not go low enough and clearly needs to be extended.

Thankfully our IoU metric was able to identify this!

Embedding comparison

The two preceding methods work because they’re simple and are based on the reasonable assumption that objects in high FPS video don’t move too much from frame to frame. They can be considered more classical methods of comparison. Can we improve upon them? Let’s try something more experimental.

One deep learning method we can use to identify outliers is to generate embeddings for our bounding box crops with an image classification model like ResNet and compare these across frames. Convolutional neural network image classification models have a final fully connected layer that uses a softmax or scaling activation function to output probabilities. If we remove the final layer of our network, our predictions will instead be the image embedding, which is essentially the neural network's representation of the image. If we isolate our objects by cropping our images, we can compare the representations of these objects across frames to see if we can identify any outliers.

We can use a ResNet18 model from PyTorch Hub that was trained on ImageNet. Because ImageNet is a very large and generic dataset, the network over time was able to learn information about images that allows it to classify them into different categories. While a neural network more finely tuned on vehicles would likely perform better, a network trained on a large dataset like ImageNet should have learned enough information to give us some indication of whether images are similar.
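A minimal sketch of that idea, assuming a single HxWx3 crop array as input, could look like the following.

import numpy as np
import torch
import torch.nn as nn
from PIL import Image

# Load ResNet18 from PyTorch Hub and drop the final classification layer,
# leaving a network that outputs a 512-dimensional embedding per image
model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True)
model.eval()
embedder = nn.Sequential(*list(model.children())[:-1])

def embed_crop(crop_array):
    # crop_array: HxWx3 uint8 array for a single object crop
    crop = np.array(Image.fromarray(crop_array).resize((224, 224)), dtype=np.float32)
    tensor = torch.tensor(crop).permute(2, 0, 1).unsqueeze(0)  # 1x3x224x224
    with torch.no_grad():
        return embedder(tensor).squeeze()  # embedding vector for the crop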

The following code shows our crops:

def plot_crops(obj = 'Vehicle:1', start=0):
    fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(20,12))
    for i,a in enumerate(ax):
        a.imshow(img_crops[i+start][obj])
        a.set_title(f'Frame {i+start}')
plot_crops(start=1)

The following image compares the crops in each frame:

Let’s compute the distance between our sequential embeddings for a given object:

def compute_dist(img_embeds, dist_func=distance.euclidean, obj='Vehicle:1'):
    dists = []
    inds = []
    for i in img_embeds:
        if (i>0)&(obj in list(img_embeds[i].keys())):
            if (obj in list(img_embeds[i-1].keys())):
                dist = dist_func(img_embeds[i-1][obj],img_embeds[i][obj]) # distance  between frame at t0 and t1
                dists.append(dist)
                inds.append(i)
    return dists, inds
obj = 'Vehicle:2'
dists, inds = compute_dist(img_embeds, obj=obj)
    
# look for distances that are 2 standard deviation greater than the mean distance
prob_frames = np.where(dists>(np.mean(dists)+np.std(dists)*2))[0]
prob_inds = np.array(inds)[prob_frames]
print(prob_inds)
print('The frame with the greatest distance is frame:', inds[np.argmax(dists)])

Let’s look at the crops for our problematic frames. We can see we were able to catch the issue on frame 102 where the bounding box was off-center.

Combine the metrics

Now that we have explored several methods for identifying anomalous and potentially problematic frames, let’s combine them and identify all of those outlier frames (see the following code). Although we might have a few false positives, these tend to be areas with a lot of action that we might want our annotators to review regardless.

def get_problem_frames(lab_frame, flawed_labels, size_thresh=.25, iou_thresh=.4, embed=False, imgs=None, verbose=False, embed_std=2):
    """
    Function for identifying potentially problematic frames using bounding box size, rolling IoU, and optionally embedding comparison.
    """
    if embed:
        model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True)
        model.eval()
        modules=list(model.children())[:-1]
        model=nn.Sequential(*modules)
        
    frame_res = {}
    for obj in list(np.unique(lab_frame.obj)):
        frame_res[obj] = {}
        lframe_len = max(lab_frame['frameid'])
        ann_subframe = lab_frame[lab_frame.obj==obj]
        size_vec = np.zeros(lframe_len+1)
        size_vec[ann_subframe['frameid'].values] = ann_subframe['height']*ann_subframe['width']
        size_diff = np.array(size_vec[:-1])- np.array(size_vec[1:])
        norm_size_diff = size_diff/np.array(size_vec[:-1])
        norm_size_diff[np.where(np.isnan(norm_size_diff))[0]] = 0
        norm_size_diff[np.where(np.isinf(norm_size_diff))[0]] = 0
        frame_res[obj]['size_diff'] = [int(x) for x in size_diff]
        frame_res[obj]['norm_size_diff'] = [float(x) for x in norm_size_diff]  # keep fractional values
        try:
            problem_frames = [int(x) for x in np.where(np.abs(norm_size_diff)>size_thresh)[0]]
            if verbose:
                worst_frame = np.argmax(np.abs(norm_size_diff))
                print('Worst frame for', obj, 'is:', worst_frame)
        except:
            problem_frames = []
        frame_res[obj]['size_problem_frames'] = problem_frames
        iou_vec = np.ones(len(np.unique(lab_frame.frameid)))
        for i in lab_frame[lab_frame.obj==obj].frameid[:-1]:
            iou = calc_frame_int_over_union(lab_frame, obj, i)
            iou_vec[i] = iou
            
        frame_res[obj]['iou'] = iou_vec.tolist()
        inds = [int(x) for x in np.where(iou_vec<iou_thresh)[0]]
        frame_res[obj]['iou_problem_frames'] = inds
        
        if embed:
            img_crops = {}
            img_embeds = {}
            for j,img in tqdm(enumerate(imgs)):
                img_arr = np.array(img)
                img_embeds[j] = {}
                img_crops[j] = {}
                for i,annot in enumerate(flawed_labels['tracking-annotations'][j]['annotations']):
                    try:
                        crop = img_arr[annot['top']:(annot['top']+annot['height']),annot['left']:(annot['left']+annot['width']),:]                    
                        new_crop = np.array(Image.fromarray(crop).resize((224,224)))
                        img_crops[j][annot['object-name']] = new_crop
                        new_crop = np.transpose(new_crop, (2, 0, 1))[np.newaxis, :]  # convert to NCHW layout
                        torch_arr = torch.tensor(new_crop, dtype=torch.float)
                        with torch.no_grad():
                            emb = model(torch_arr)
                        img_embeds[j][annot['object-name']] = emb.squeeze()
                    except:
                        pass
                    
            dists, dist_inds = compute_dist(img_embeds, obj=obj)
            # look for distances that are embed_std+ standard deviations greater than the mean distance
            prob_frames = np.where(np.array(dists)>(np.mean(dists)+np.std(dists)*embed_std))[0]
            frame_res[obj]['embed_prob_frames'] = np.array(dist_inds)[prob_frames].tolist()
        
    return frame_res
    
# if you want to add in embedding comparison, set embed=True
num_images_to_validate = 300
embed = False
frame_res = get_problem_frames(label_frame, flawed_labels, size_thresh=.25, iou_thresh=.5, embed=embed, imgs=imgs[:num_images_to_validate])
        
prob_frame_dict = {}
all_prob_frames = []
for obj in frame_res:
    prob_frames = list(frame_res[obj]['size_problem_frames'])
    prob_frames.extend(list(frame_res[obj]['iou_problem_frames']))
    if embed:
        prob_frames.extend(list(frame_res[obj]['embed_prob_frames']))
    all_prob_frames.extend(prob_frames)
    
prob_frame_dict = [int(x) for x in np.unique(all_prob_frames)]
prob_frame_dict

Launch a directed audit job

Now that we’ve identified our problematic annotations, we can launch a new audit labeling job to review identified outlier frames. We can do this via the SageMaker console, but when we want to launch jobs in a more automated fashion, using the boto3 API is very helpful.

Generate manifests

SageMaker Ground Truth operates using manifests. When using a modality like image classification, a single image corresponds to a single entry in a manifest, and a given manifest contains paths for all of the images to be labeled. For videos, because we have multiple frames per video and can have multiple videos in a single manifest, the manifest instead references a JSON sequence file for each video that contains all the paths for its frames. This allows a single manifest to contain multiple videos for a single job, as in the following code:

# create manifest
man_dict = {}
for vid in all_vids:
    source_ref = f"s3://{bucket}/tracking_manifests/{vid.split('/')[-1]}_seq.json"
    annot_labels = f"s3://{bucket}/tracking_manifests/SeqLabel.json"
    manifest = {
        "source-ref": source_ref,
        'Person':annot_labels, 
        "Person-metadata":{"class-map": {"1": "Pedestrian"},
                         "human-annotated": "yes",
                         "creation-date": "2020-05-25T12:53:54+0000",
                         "type": "groundtruth/video-object-tracking"}
    }
    man_dict[vid] = manifest
    
# save videos as individual jobs
for vid in all_vids:
    with open(f"tracking_manifests/{vid.split('/')[-1]}.manifest", 'w') as f:
        json.dump(man_dict[vid],f)
        
# put multiple videos in a single manifest, with each job as a line
# with open(f"/home/ec2-user/SageMaker/tracking_manifests/MOT17.manifest", 'w') as f:
#     for vid in all_vids:    
#         f.write(json.dumps(man_dict[vid]))
#         f.write('\n')
        
print('Example manifest: ', manifest)

The following is our manifest file:

Example manifest:  {'source-ref': 's3://smgt-qa-metrics-input-322552456788-us-west-2/tracking_manifests/MOT17-13-SDP_seq.json', 'Person': 's3://smgt-qa-metrics-input-322552456788-us-west-2/tracking_manifests/SeqLabel.json', 'Person-metadata': {'class-map': {'1': 'Vehicle'}, 'human-annotated': 'yes', 'creation-date': '2020-05-25T12:53:54+0000', 'type': 'groundtruth/video-object-tracking'}}
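For reference, the frame sequence file that source-ref points to lists the individual frames of a single video. The following sketch shows the general shape we'd expect such a file to take; the exact field names are an assumption here, so confirm them against the Ground Truth video frame input manifest documentation.

# Assumed shape of the *_seq.json frame sequence file referenced by "source-ref";
# confirm the field names against the Ground Truth documentation before relying on them
seq = {
    "seq-no": 1,
    "prefix": f"s3://{bucket}/tracking_manifests/MOT17-13-SDP/",
    "number-of-frames": 750,
    "frames": [{"frame-no": i + 1, "frame": f"{i + 1:06d}.jpg"} for i in range(750)],
}
with open("tracking_manifests/MOT17-13-SDP_seq.json", "w") as f:
    json.dump(seq, f)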

Launch jobs

We can use this template for launching labeling jobs (see the following code). For the purposes of this post, we already have labeled data, so this isn’t necessary, but if you want to label the data yourself, you can do so using a private workteam.

# generate jobs
job_names = []
outputs = []
# for vid in all_vids:
LABELING_JOB_NAME = f"mot17-tracking-adjust-{int(time.time())}"
task = 'AdjustmentVideoObjectTracking'
job_names.append(LABELING_JOB_NAME)
INPUT_MANIFEST_S3_URI = f's3://{bucket}/tracking_manifests/MOT20-01.manifest'
createLabelingJob_request = {
  "LabelingJobName": LABELING_JOB_NAME,
  "HumanTaskConfig": {
    "AnnotationConsolidationConfig": {
      "AnnotationConsolidationLambdaArn": f"arn:aws:lambda:us-east-1:432418664414:function:ACS-{task}"
    }, # changed us-west-2 to us-east-1
    "MaxConcurrentTaskCount": 200,
    "NumberOfHumanWorkersPerDataObject": 1,
    "PreHumanTaskLambdaArn": f"arn:aws:lambda:us-east-1:432418664414:function:PRE-{task}",
    "TaskAvailabilityLifetimeInSeconds": 864000,
    "TaskDescription": f"Please draw boxes around vehicles, with a specific focus on the following frames {prob_frame_dict}",
    "TaskKeywords": [
      "Image Classification",
      "Labeling"
    ],
    "TaskTimeLimitInSeconds": 7200,
    "TaskTitle": LABELING_JOB_NAME,
    "UiConfig": {
      "HumanTaskUiArn": f'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectTracking'
    },
    "WorkteamArn": WORKTEAM_ARN
  },
  "InputConfig": {
    "DataAttributes": {
      "ContentClassifiers": [
        "FreeOfPersonallyIdentifiableInformation",
        "FreeOfAdultContent"
      ]
    },
    "DataSource": {
      "S3DataSource": {
        "ManifestS3Uri": INPUT_MANIFEST_S3_URI
      }
    }
  },
  "LabelAttributeName": "Person-ref",
  "LabelCategoryConfigS3Uri": LABEL_CATEGORIES_S3_URI,
  "OutputConfig": {
    "S3OutputPath": f"s3://{bucket}/gt_job_results"
  },
  "RoleArn": role,
  "StoppingConditions": {
    "MaxPercentageOfInputDatasetLabeled": 100
  }
}
print(createLabelingJob_request)
out = sagemaker_cl.create_labeling_job(**createLabelingJob_request)
outputs.append(out)
print(out)

Conclusion

In this post, we introduced how to measure the quality of sequential annotations, namely video multi-frame object tracking annotations, using statistical analysis and various quality metrics (IoU, rolling IoU, and embedding comparisons). In addition, we walked through how to flag frames that aren't labeled properly using these quality metrics and send those frames for verification or audit jobs using SageMaker Ground Truth, generating a new version of the dataset with more accurate annotations. You can perform quality checks on annotations for video data using this approach, or on 3D point cloud data using similar approaches such as 3D IoU, in an automated manner at scale while reducing the number of frames that require human audit.

Try out the notebook and add your own quality metrics for different task types supported by SageMaker Ground Truth. With this process in place, you can generate high-quality datasets for a wide range of business use cases in a cost-effective manner without compromising the quality of annotations.

For more information about labeling with Ground Truth, see Easily perform bulk label quality assurance using Amazon SageMaker Ground Truth.

References

  1. https://en.wikipedia.org/wiki/Hausdorff_distance
  2. https://aws.amazon.com/blogs/machine-learning/easily-perform-bulk-label-quality-assurance-using-amazon-sagemaker-ground-truth/

About the Authors

 Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.

 

 

Isaac Privitera is a Machine Learning Specialist Solutions Architect and helps customers design and build enterprise-grade computer vision solutions on AWS. Isaac has a background in using machine learning and accelerated computing for computer vision and signals analysis. Isaac also enjoys cooking, hiking, and keeping up with the latest advancements in machine learning in his spare time.

Read More

It’s here! Join us for Amazon SageMaker Month, 30 days of content, discussion, and news

Want to accelerate machine learning (ML) innovation in your organization? Join us for 30 days of new Amazon SageMaker content designed to help you build, train, and deploy ML models faster. On April 20, we're kicking off 30 days of hands-on workshops, Twitch sessions, Slack chats, and partner perspectives. Our goal is to connect you with AWS experts so you can learn hints and tips for success with ML. These experts include Greg Coquillio, the second-most influential speaker according to LinkedIn Top Voices 2020: Data Science & AI, and Julien Simon, the number one AI evangelist according to AI magazine.

We built SageMaker from the ground up to provide every developer and data scientist with the ability to build, train, and deploy ML models quickly and at lower cost by providing the tools required for every step of the ML development lifecycle in one integrated, fully managed service. We have launched over 50 SageMaker capabilities in the past year alone, all aimed at making this process easier for our customers. The customer response to what we’re building has been incredible, making SageMaker one of the fastest growing services in AWS history.

To help you dive deep into these SageMaker innovations, we're dedicating April 20 – May 21, 2021 to SageMaker education. Here are some must-dos to add to your calendar:

Besides these virtual hands-on opportunities, we will have regular blog posts from AWS experts and our partners, including Snowflake, Tableau, Genesys, and DOMO. Bookmark the SageMaker Month webpage or sign up to our weekly newsletters so you don’t miss any of the planned activities.

But we aren’t stopping there!

To coincide with SageMaker Month, we launched new Savings Plans. The SageMaker Savings Plans offer a flexible, usage-based pricing model for SageMaker. The goal of the savings plans is to offer you the flexibility to save up to 64% on SageMaker ML instance usage in exchange for a commitment of consistent usage for a 1 or 3-year term. For more information, read the launch blog. Further, to help you save even more, we also just announced a price drop on several instance families in SageMaker.

The SageMaker Savings Plans are on top of the productivity and cost-optimizing capabilities already available in SageMaker Studio. You can improve your data science team’s productivity up to 10 times using SageMaker Studio. SageMaker Studio provides a single web-based visual interface where you can perform all your ML development steps. SageMaker Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. You can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production all in one place, which boosts productivity.

You can also optimize costs through capabilities such as Managed Spot Training, in which you use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for your SageMaker training jobs (see Optimizing and Scaling Machine Learning Training with Managed Spot Training for Amazon SageMaker), and Amazon Elastic Inference, which allows you to attach just the right amount of GPU-powered inference acceleration to any SageMaker instance type.

We are also excited to see continued customer momentum with SageMaker. Just in the first quarter of 2021, we launched 15 new SageMaker case studies and references, spanning a wide range of industries and including SNCF, Mueller, Bundesliga, University of Oxford, and Latent Space. Some highlights include:

  • The data science team at SNCF reduced model training time from 3 days to 10 hours.
  • Mueller Water Products automated the daily collection of more than 5 GB of data and used ML to improve leak-detection performance.
  • Latent Space scaled model training beyond 1 billion parameters.

We would love for you to join the thousands of customers who are seeing success with Amazon SageMaker. We want to add you to our customer reference list, and we can’t wait to work with you this month!


About the Author

Kimberly Madia is a Principal Product Marketing Manager with AWS Machine Learning. Her goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

Read More

Enforce VPC rules for Amazon Comprehend jobs and CMK encryption for custom models

You can now control the Amazon Virtual Private Cloud (Amazon VPC) and encryption settings for your Amazon Comprehend APIs using AWS Identity and Access Management (IAM) condition keys, and encrypt your Amazon Comprehend custom models using customer managed keys (CMK) via AWS Key Management Service (AWS KMS). IAM condition keys enable you to further refine the conditions under which an IAM policy statement applies. You can use the new condition keys in IAM policies when granting permissions to create asynchronous jobs and to create custom classification or custom entity training jobs.

Amazon Comprehend now supports five new condition keys:

  • comprehend:VolumeKmsKey
  • comprehend:OutputKmsKey
  • comprehend:ModelKmsKey
  • comprehend:VpcSecurityGroupIds
  • comprehend:VpcSubnets

The keys allow you to ensure that users can only create jobs that meet your organization’s security posture, such as jobs that are connected to the allowed VPC subnets and security groups. You can also use these keys to enforce encryption settings for the storage volumes where the data is pulled down for computation and on the Amazon Simple Storage Service (Amazon S3) bucket where the output of the operation is stored. If users try to use an API with VPC settings or encryption parameters that aren’t allowed, Amazon Comprehend rejects the operation synchronously with a 403 Access Denied exception.

Solution overview

The following diagram illustrates the architecture of our solution.

We want to enforce a policy to do the following:

  • Make sure that all custom classification training jobs are specified with VPC settings
  • Have encryption enabled for the classifier training job, the classifier output, and the Amazon Comprehend model

This way, when someone starts a custom classification training job, the training data that is pulled in from Amazon S3 is copied to the storage volumes in your specified VPC subnets and is encrypted with the specified VolumeKmsKey. The solution also makes sure that the results of the model training are encrypted with the specified OutputKmsKey. Finally, the Amazon Comprehend model itself is encrypted with the AWS KMS key specified by the user when it’s stored within the VPC. The solution uses three different keys for the data, output, and the model, respectively, but you can choose to use the same key for all three tasks.

Additionally, this new functionality enables you to audit model usage in AWS CloudTrail by tracking the model encryption key usage.
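As an illustration of how the three keys come together, a boto3 call to create the classifier could look like the following sketch; all ARNs, bucket names, subnets, and security groups are placeholders.

import boto3

comprehend = boto3.client('comprehend')

# Create a custom classifier with volume, output, and model encryption keys plus VPC settings
response = comprehend.create_document_classifier(
    DocumentClassifierName='testModel',
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::111122223333:role/testDataAccessRole',
    InputDataConfig={'S3Uri': 's3://S3Bucket/docclass/filename'},
    OutputDataConfig={
        'S3Uri': 's3://S3Bucket/output/',
        'KmsKeyId': 'arn:aws:kms:us-east-1:111122223333:key/output-key-id',
    },
    VolumeKmsKeyId='arn:aws:kms:us-east-1:111122223333:key/volume-key-id',
    ModelKmsKeyId='arn:aws:kms:us-east-1:111122223333:key/model-key-id',
    VpcConfig={
        'SecurityGroupIds': ['sg-11a111111a1exmaple'],
        'Subnets': ['subnet-11aaa111111example'],
    },
)
print(response['DocumentClassifierArn'])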

Encryption with IAM policies

The following policy makes sure that users must specify VPC subnets and security groups for VPC settings and AWS KMS keys for both the classifier and output:

{
   "Version": "2012-10-17",
   "Statement": [{
    "Action": ["comprehend:CreateDocumentClassifier"],
    "Effect": "Allow",
    "Resource": "*",
    "Condition": {
      "Null": {
        "comprehend:VolumeKmsKey": "false",
        "comprehend:OutputKmsKey": "false",
        "comprehend:ModelKmsKey": "false",
        "comprehend:VpcSecurityGroupIds": "false",
        "comprehend:VpcSubnets": "false"
      }
    }
  }]
}

For example, in the following code, User 1 provides both the VPC settings and the encryption keys, and can successfully complete the operation:

aws comprehend create-document-classifier \
--region region \
--document-classifier-name testModel \
--language-code en \
--input-data-config S3Uri=s3://S3Bucket/docclass/filename \
--data-access-role-arn arn:aws:iam::[your account number]:role/testDataAccessRole \
--volume-kms-key-id arn:aws:kms:region:[your account number]:alias/ExampleAlias \
--model-kms-key-id arn:aws:kms:region:[your account number]:alias/ExampleAlias \
--output-data-config S3Uri=s3://S3Bucket/output/filename,KmsKeyId=arn:aws:kms:region:[your account number]:alias/ExampleAlias \
--vpc-config SecurityGroupIds=sg-11a111111a1example,Subnets=subnet-11aaa111111example

User 2, on the other hand, doesn’t provide any of these required settings and isn’t allowed to complete the operation:

aws comprehend create-document-classifier \
--region region \
--document-classifier-name testModel \
--language-code en \
--input-data-config S3Uri=s3://S3Bucket/docclass/filename \
--data-access-role-arn arn:aws:iam::[your account number]:role/testDataAccessRole \
--output-data-config S3Uri=s3://S3Bucket/output/filename

In the preceding code examples, as long as the VPC settings and the encryption keys are set, you can run the custom classifier training job. Leaving the VPC and encryption settings in their default state results in a 403 Access Denied exception.

In the next example, we enforce an even stricter policy, in which we have to set the VPC and encryption settings to also include specific subnets, security groups, and KMS keys. This policy applies these rules for all Amazon Comprehend APIs that start new asynchronous jobs, create custom classifiers, and create custom entity recognizers. See the following code:

{
   "Version": "2012-10-17",
   "Statement": [{
    "Action":
     [
    "comprehend:CreateDocumentClassifier",
    "comprehend:CreateEntityRecognizer",
    "comprehend:Start*Job"
    ],
    "Effect": "Allow",
    "Resource": "*",
    "Condition": {
      "ArnEquals": {
        "comprehend:VolumeKmsKey": "arn:aws:kms:region:[your account number]:key/key_id",
        "comprehend:ModelKmsKey": "arn:aws:kms:region:[your account number]:key/key_id1",
        "comprehend:OutputKmsKey": "arn:aws:kms:region:[your account number]:key/key_id2"
      },
      "ForAllValues:StringLike": {
        "comprehend:VpcSecurityGroupIds": [
          "sg-11a111111a1exmaple"
        ],
        "comprehend:VpcSubnets": [
          "subnet-11aaa111111example"
        ]
      }
    }
  }]
}

In the next example, we first create a custom classifier on the Amazon Comprehend console without specifying the encryption option. Because we have the IAM conditions specified in the policy, the operation is denied.

When you enable classifier encryption, Amazon Comprehend encrypts the data in the storage volume while your job is being processed. You can use an AWS KMS customer managed key from your own account or from a different account. You can specify the encryption settings for the custom classifier job as in the following screenshot.

Output encryption enables Amazon Comprehend to encrypt the output results from your analysis. Similar to Amazon Comprehend job encryption, you can use an AWS KMS customer managed key from your own account or from a different account.

Because our policy also enforces the jobs to be launched with VPC and security group access enabled, you can specify these settings in the VPC settings section.

Amazon Comprehend API operations and IAM condition keys

For the Amazon Comprehend API operations and the IAM condition keys that each supports as of this writing, see Actions, resources, and condition keys for Amazon Comprehend.

Model encryption with a CMK

Along with encrypting your training data, you can now encrypt your custom models in Amazon Comprehend using a CMK. In this section, we go into more detail about this feature.

Prerequisites

You need to add an IAM policy to allow a principal to use or manage CMKs. CMKs are specified in the Resource element of the policy statement. When writing your policy statements, it’s a best practice to limit CMKs to those that the principals need to use, rather than give the principals access to all CMKs.

In the following example, we use an AWS KMS key (1234abcd-12ab-34cd-56ef-1234567890ab) to encrypt an Amazon Comprehend custom model.

When you use AWS KMS encryption, kms:CreateGrant and kms:RetireGrant permissions are required for model encryption.

For example, the following IAM policy statement in the dataAccessRole provided to Amazon Comprehend allows the principal to call these AWS KMS operations only on the CMKs listed in the Resource element of the policy statement:

{"Version": "2012-10-17",
  "Statement": {"Effect": "Allow",
    "Action": [
      "kms:CreateGrant",
      "kms:RetireGrant",
      "kms:GenerateDataKey",
      "kms:Decrypt"
    ],
    "Resource": [
      "arn:aws:kms:us-west-2:[your account number]:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    ]
  }
}

Specifying CMKs by key ARN, which is a best practice, makes sure that the permissions are limited only to the specified CMKs.

Enable model encryption

As of this writing, custom model encryption is available only via the AWS Command Line Interface (AWS CLI). The following example creates a custom classifier with model encryption:

aws comprehend create-document-classifier \
--document-classifier-name my-document-classifier \
--data-access-role-arn arn:aws:iam::[your account number]:role/mydataaccessrole \
--language-code en --region us-west-2 \
--model-kms-key-id arn:aws:kms:us-west-2:[your account number]:key/[your key Id] \
--input-data-config S3Uri=s3://path-to-data/multiclass_train.csv
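After the training job starts, you can verify that the classifier is using your CMK by describing it and checking the ModelKmsKeyId field in the response. The classifier ARN below is a placeholder for the ARN returned by the create call:

# Describe the classifier; the response includes the ModelKmsKeyId used to encrypt the model
aws comprehend describe-document-classifier \
--region us-west-2 \
--document-classifier-arn arn:aws:comprehend:us-west-2:[your account number]:document-classifier/my-document-classifier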

The next example trains a custom entity recognizer with model encryption:

aws comprehend create-entity-recognizer \
--recognizer-name my-entity-recognizer \
--data-access-role-arn arn:aws:iam::[your account number]:role/mydataaccessrole \
--language-code "en" --region us-west-2 \
--model-kms-key-id arn:aws:kms:us-west-2:[your account number]:key/[your key Id] \
--input-data-config '{
      "EntityTypes": [{"Type": "PERSON"}, {"Type": "LOCATION"}],
      "Documents": {
            "S3Uri": "s3://path-to-data/documents"
      },
      "Annotations": {
          "S3Uri": "s3://path-to-data/annotations"
      }
}'

Finally, you can also create an endpoint for your encrypted custom model. The data access role you pass must be able to use the CMK so that Amazon Comprehend can read the encrypted model:

aws comprehend create-endpoint \
--endpoint-name myendpoint \
--model-arn arn:aws:comprehend:us-west-2:[your account number]:document-classifier/my-document-classifier \
--data-access-role-arn arn:aws:iam::[your account number]:role/mydataaccessrole \
--desired-inference-units 1 --region us-west-2
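When the endpoint status changes to In service, you can run real-time inference against the encrypted model. The following sketch classifies a short sample text with the endpoint created above; the endpoint ARN and text are placeholders:

# Classify a sample document in real time using the endpoint
aws comprehend classify-document \
--region us-west-2 \
--endpoint-arn arn:aws:comprehend:us-west-2:[your account number]:document-classifier-endpoint/myendpoint \
--text "Sample text to classify with the encrypted custom model"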

Conclusion

You can now enforce security controls, such as encryption and VPC settings, for your Amazon Comprehend jobs using IAM condition keys. The IAM condition keys are available in all AWS Regions where Amazon Comprehend is available. You can also encrypt your Amazon Comprehend custom models using customer managed keys.

To learn more about the new condition keys and view policy examples, see Using IAM condition keys for VPC settings and Resource and Conditions for Amazon Comprehend APIs. To learn more about using IAM condition keys, see IAM JSON policy elements: Condition.


About the Authors

Sam Palani is an AI/ML Specialist Solutions Architect at AWS. He enjoys working with customers to help them architect machine learning solutions at scale. When not helping customers, he enjoys reading and exploring the outdoors.

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

Read More

AWS launches free digital training courses to empower business leaders with ML knowledge

Today, we’re pleased to launch Machine Learning Essentials for Business and Technical Decision Makers, a series of three free, on-demand, digital-training courses from AWS Training and Certification. These courses are intended to empower business leaders and technical decision makers with the foundational knowledge needed to begin shaping a machine learning (ML) strategy for their organization, even if they have no prior ML experience. Each 30-minute course includes real-world examples from Amazon’s 20+ years of experience scaling ML within its own operations, as well as lessons learned through countless successful customer implementations. These new courses are based on content delivered through the AWS Machine Learning Embark program, an exclusive, hands-on ML accelerator that brings together executives and technologists at an organization to solve business problems with ML via a holistic learning experience. After completing the three courses, business leaders and technical decision makers will be better able to assess their organization’s readiness, identify the areas of the business where ML will be most impactful, and define concrete next steps.

Last year, Amazon announced that we’re committed to helping 29 million individuals around the world grow their tech skills with free cloud computing skills training by 2025. The new Machine Learning Essentials for Business and Technical Decision Makers series presents one more step in this direction, with three courses:

  • Machine Learning: The Art of the Possible is the first course in the series. Using clear language and specific examples, this course helps you understand the fundamentals of ML, common use cases, and even potential challenges.
  • Planning a Machine Learning Project – the second course – breaks down how you can help your organization plan for an ML project. Starting with the process of assessing whether ML is the right fit for your goals and progressing through the key questions you need to ask during deployment, this course helps you understand important issues, such as data readiness, project timelines, and deployment.
  • Building a Machine Learning Ready Organization – the final course – offers insights into how to prepare your organization to successfully implement ML, from data-strategy evaluation, to culture, to starting an ML pilot, and more.

Democratizing access to free ML training

ML has the potential to transform nearly every industry, but most organizations struggle to adopt and implement ML at scale. Recent Gartner research shows that only 53% of ML projects make it from prototype to production. The most common barriers we see today are business and culture related. For instance, organizations often struggle to identify the right use cases to start their ML journey; this is often exacerbated by a shortage of skilled talent to execute on an organization’s ML ambitions. In fact, as an additional Gartner study shows, “skills of staff” is the number one challenge or barrier to the adoption of artificial intelligence (AI) and ML. Business leaders play a critical role in addressing these challenges by driving a culture of continuous learning and innovation; however, many lack the resources to develop their own knowledge of ML and its use cases.

With the new Machine Learning Essentials for Business and Technical Decision Makers course, we’re making a portion of the AWS Machine Learning Embark curriculum available globally as free, self-paced, digital-training courses.

The AWS Machine Learning Embark program has already helped many organizations harness the power of ML at scale. For example, the Met Office (the UK’s national weather service) is a great example of how organizations can accelerate their team’s ML knowledge using the program. As a research- and science-based organization, the Met Office develops custom weather-forecasting and climate-projection models that rely on very large observational data sets that are constantly being updated. As one of its many data-driven challenges, the Met Office was looking to develop an approach using ML to investigate how the Earth’s biosphere could alter in response to climate change. The Met Office partnered with the Amazon ML Solutions Lab through the AWS Machine Learning Embark program to explore novel approaches to solving this. “We were excited to work with colleagues from the AWS ML Solutions Lab as part of the Embark program,” said Professor Albert Klein-Tank, head of the Met Office’s Hadley Centre for Climate Science and Services. “They provided technical skills and experience that enabled us to explore a complex categorization problem that offers improved insight into how Earth’s biosphere could be affected by climate change. Our climate models generate huge volumes of data, and the ability to extract added value from it is essential for the provision of advice to our government and commercial stakeholders. This demonstration of the application of machine learning techniques to research projects has supported the further development of these skills across the Met Office.”

In addition to giving access to ML Embark content through the Machine Learning Essentials for Business and Technical Decision Makers, we’re also expanding the availability of the full ML Embark program through key strategic AWS Partners, including Slalom Consulting. We’re excited to jointly offer this exclusive program to all enterprise customers looking to jump-start their ML journey.

We invite you to expand your ML knowledge and help lead your organization to innovate with ML. Learn more and get started today.


About the Author

Michelle K. Lee is vice president of the Machine Learning Solutions Lab at AWS.

Read More