Prevent account takeover at login with the new Account Takeover Insights model in Amazon Fraud Detector
Digital is the new normal, and there’s no going back. Every year, consumers visit, on average, 191 websites or services requiring a user name and password, and the digital footprint is expected to grow exponentially. So much exposure naturally brings added risks like account takeover (ATO).
Each year, bad actors compromise billions of accounts through stolen credentials, phishing, social engineering, and multiple forms of ATO. To put it into perspective: account takeover fraud increased by 90% to an estimated $11.4 billion in 2021 compared with 2020. Beyond the financial impact, ATOs damage the customer experience, threaten brand loyalty and reputation, and strain fraud teams as they manage chargebacks and customer claims.
Many companies, even those with sophisticated fraud teams, use rules-based solutions to detect compromised accounts because they’re simple to create. To bolster their defenses and reduce friction for legitimate users, businesses are increasingly investing in AI and machine learning (ML) to detect account takeovers.
AWS can help you improve your fraud mitigation with solutions like Amazon Fraud Detector. This fully managed AI service allows you to identify potentially fraudulent online activities by enabling you to train custom ML fraud detection models without ML expertise.
This post discusses how to create a real-time detector endpoint using the new Account Takeover Insights (ATI) model in Amazon Fraud Detector.
Overview of solution
Amazon Fraud Detector relies on specific models with tailored algorithms, enrichments, and feature transformations to detect fraudulent events across multiple use cases. The newly launched ATI model is a low-latency fraud detection ML model designed to detect potentially compromised accounts and ATO fraud. The ATI model detects up to four times more ATO fraud than traditional rules-based account takeover solutions while minimizing the level of friction for legitimate users.
The ATI model is trained using a dataset containing your business’s historical login events. Event labels are optional for model training because the ATI model uses an innovative approach to unsupervised learning. The model differentiates events generated by the actual account owner (legit events) from those generated by bad actors (anomalous events).
Amazon Fraud Detector derives the user’s past behavior by continuously aggregating the data provided. Examples of user behavior include the number of times the user signed in from a specific IP address. With these additional enrichments and aggregates, Amazon Fraud Detector can generate strong model performance from a small set of inputs from your login events.
For a real-time prediction, you call the GetEventPrediction API after a user presents valid login credentials to quantify the risk of ATO. In response, you receive a model score between 0–1000, where 0 indicates low fraud risk and 1000 indicates high fraud risk, along with an outcome based on a set of business rules you define. You can then take the appropriate action on your end: approve the login, deny the login, or challenge the user by enforcing an additional identity verification.
You can also use the ATI model to asynchronously evaluate account logins and take action based on the outcome, such as adding the account to an investigation queue so a human reviewer can determine if further action should be taken.
The following steps outline the process of training an ATI model and publishing a detector endpoint to generate fraud predictions:
- Prepare and validate the data.
- Define the entity, event type, and event variables, and optionally the event labels.
- Upload event data.
- Initiate model training.
- Evaluate the model.
- Create a detector endpoint and define business rules.
- Get real-time predictions.
Prerequisites
Before getting started, complete the following prerequisite steps:
- Have an AWS account.
- Create an Amazon Simple Storage Service (Amazon S3) bucket and upload the training dataset to the bucket. You can find the synthetic training dataset in our GitHub repo.
- Copy the dataset location under S3 URI. The S3 URI is the Amazon S3 location of your dataset. You use it later to import the model variables.
Prepare and validate the data
Amazon Fraud Detector requires that you provide your user account login data in a CSV file encoded in the UTF-8 format. For the ATI, you must provide certain event metadata and event variables in the header line of your CSV file.
The required event metadata is as follows:
- EVENT_ID – A unique identifier for the login event.
- ENTITY_TYPE – The entity that performs the login event, such as a merchant or a customer.
- ENTITY_ID – An identifier for the entity performing the login event.
- EVENT_TIMESTAMP – The timestamp when the login event occurred. The timestamp format must be in ISO 8601 standard in UTC.
- EVENT_LABEL (optional) – A label that classifies the event as fraudulent or legitimate. You can use any labels, such as fraud, legit, 1, or 0.
Event metadata must be in uppercase letters. Labels aren’t required for login events. However, we recommend including EVENT_LABEL metadata and providing labels for your login events if available. If you provide labels, Amazon Fraud Detector uses them to automatically calculate an Account Takeover Discovery Rate and display it in the model performance metrics.
The ATI model has both required and optional variables. Event variable names must be in lowercase letters.
The following table summarizes the mandatory variables.
Category | Variable type | Description |
---|---|---|
IP address | IP_ADDRESS | The IP address used in the login event |
Browser and device | USERAGENT | The browser, device, and OS used in the login event |
Valid credentials | VALIDCRED | Indicates if the credentials that were used for login are valid |
The following table summarizes the optional variables.
Category | Type | Description |
---|---|---|
Browser and device | FINGERPRINT | The unique identifier for a browser or device fingerprint |
Session ID | SESSION_ID | The identifier for an authentication session |
Label | EVENT_LABEL | A label that classifies the event as fraudulent or legitimate (such as fraud, legit, 1, or 0) |
Timestamp | LABEL_TIMESTAMP | The timestamp when the label was last updated; required if EVENT_LABEL is provided |
You can provide additional variables. However, Amazon Fraud Detector won’t include these variables for training an ATI model.
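The following minimal sketch writes a CSV file in the expected layout. The sample values are hypothetical; the header names are the required event metadata (uppercase) and ATI event variables (lowercase) described above.

```python
import csv

# Required event metadata (uppercase), optional label columns, and required ATI variables (lowercase).
header = [
    "EVENT_ID", "ENTITY_TYPE", "ENTITY_ID", "EVENT_TIMESTAMP",
    "EVENT_LABEL", "LABEL_TIMESTAMP",
    "ip_address", "useragent", "validcred",
]
# Hypothetical sample row for a single login event.
row = [
    "login-0001", "user_account", "user-123", "2022-06-01T10:15:30Z",
    "legit", "2022-06-01T10:15:30Z",
    "192.0.2.10", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "true",
]

with open("ati_training_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(row)
```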
Dataset preparation
As you start to prepare your login data, you must meet the following requirements:
- Provide at least 1,500 entities (individual user accounts), each with at least two associated login events
- Your dataset must cover at least 30 days of login events
The following configurations are optional:
- Your dataset can include examples of unsuccessful login events
- You can optionally label these unsuccessful logins as fraudulent or legitimate
- You can prepare historical data with login events spanning more than 6 months and include 100,000 entities
We provide a sample dataset for testing purposes that you can use to get started.
Data validation
Before creating your ATI model, Amazon Fraud Detector checks if the metadata and variables you included in your dataset for training the model meet the size and format requirements. For more information, see Dataset validation. If the dataset doesn’t pass validation, a model isn’t created. For details on common dataset errors, see Common event dataset errors.
Define the entity, event type, and event variables
In this section, we walk through the steps to create an entity, event type, and event variables. Optionally, you can also define event labels.
Define the entity
The entity defines who is performing the event. To create an entity, complete the following steps:
- On the Amazon Fraud Detector console, in the navigation pane, choose Entities.
- Choose Create.
- Enter an entity name and optional description.
- Choose Create entity.
Define the event and event variables
An event is a business activity evaluated for fraud risk; this event is performed by the entity we just created. The event type defines the structure for an event sent to Amazon Fraud Detector, including variables of the event, the entity performing the event, and, if available, the labels that classify the event.
To create an event, complete the following steps:
- On the Amazon Fraud Detector console, in the navigation pane, choose Events.
- Choose Create.
- For Name, enter a name for your event type.
- For Entity, choose the entity created in the previous step.
Define the event variables
For event variables, complete the following steps:
- To define this event’s variables, choose Select variables from a training dataset.
- For the AWS Identity and Access Management role, choose Create IAM role.
- In the Create IAM role section, enter the specific bucket name where you uploaded your training data.
The name of the S3 bucket must be the bucket where you uploaded your dataset; otherwise, you get an access denied exception error.
- Choose Create role.
- For Data location, enter the path to your training data (this is the S3 URI you copied during the prerequisite steps), and choose Upload.
Amazon Fraud Detector extracts the headers from your training dataset and creates a variable for each header. Make sure to assign the variable to the correct variable type. As part of the model training process, Amazon Fraud Detector uses the variable type associated with the variable to perform variable enrichment and feature engineering. For more details about variable types, see Variable types.
Define event labels (optional)
Labels are used to categorize individual events as either fraud or legitimate. Event labels are optional for model training because the ATI model uses an innovative approach to unsupervised learning. The model differentiates events generated by the actual account owner (legit events) from those generated by abusive actors (anomalous events). We recommend that you include EVENT_LABEL metadata and provide labels for your login events if available. If you provide labels, Amazon Fraud Detector uses them to automatically calculate an Account Takeover Discovery Rate and display it in the model performance metrics.
To define event labels, complete the following steps:
- Define two labels (for this post, 1 and 0).
- Choose Create event type.
Upload event data
In this section, we walk through the steps to upload event data to the service for model training.
ATI models are trained on a dataset stored internally in Amazon Fraud Detector. By storing event data in Amazon Fraud Detector, you can train models that use auto-computed variables to improve performance, simplify model retraining, and update fraud labels to close the machine learning feedback loop. See Stored events for more information on storing your event dataset with Amazon Fraud Detector.
After you define your event, navigate to the Stored events tab. On the Stored events tab, you can see information about your dataset, such as the number of events stored and the total size of the dataset in MB. Because you just created this event type, there are no stored events yet. On this page, you can turn event ingestion on or off. When event ingestion is on, you can upload historical event data to Amazon Fraud Detector and automatically store event data from predictions in real time.
The easiest way to store historical data is by uploading a CSV file and importing the events. Alternatively, you can stream the data into Amazon Fraud Detector using the SendEvent API (see our GitHub repository for sample notebooks; a minimal sketch of the call also appears after the import steps below). To import the events from a CSV file, complete the following steps:
- Under Import events data, choose New import.
You likely need to create a new IAM role. The import events feature requires both read and write access to Amazon S3.
- Create a new IAM role and provide the S3 buckets for input and output files.
The IAM role you create grants Amazon Fraud Detector access to these buckets to read input files and store output files. If you don’t plan to store output files in a separate bucket, enter the same bucket name for both.
- Choose Create role.
- Enter the location of the CSV file that contains your event data. This should be the S3 URI you copied earlier.
- Choose Start to start importing the events.
The import time varies based on the number of events you’re importing. For a dataset with 20,000 events, the process takes around 12 minutes, and after you refresh the page, the status changes to Completed. If the status changes to Error, choose the job name to see why the import failed.
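If you choose the SendEvent route mentioned earlier instead of a CSV import, the following is a minimal sketch of the call. The event type, entity type, and variable names are assumptions and must match the event type you defined.

```python
import boto3
from datetime import datetime, timezone

frauddetector = boto3.client("frauddetector")

response = frauddetector.send_event(
    eventId="login-0001",
    eventTypeName="account_login",  # hypothetical event type name
    eventTimestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    eventVariables={
        "ip_address": "192.0.2.10",
        "useragent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "validcred": "true",
    },
    entities=[{"entityType": "user_account", "entityId": "user-123"}],  # hypothetical entity type
)
print(response["ResponseMetadata"]["HTTPStatusCode"])
```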
Initiate model training
After successfully importing the events, you have all the pieces to initiate model training. To train a model, complete the following steps:
- On the Amazon Fraud Detector console, in the navigation pane, choose Models.
- Choose Add model and select Create model.
- For Model name, enter the desired name for your model.
- For Model type, select Account Takeover Insights.
- For Event type, choose the event type you created earlier.
- Under Historical event data, you can specify the date range of events to train the model if needed.
- Choose Next.
- For this post, you configure training by identifying the variables used as inputs to the model.
- After evaluating the variables, choose Next.
It’s a best practice to include all the available variables, even if you’re unsure about their value to the model. After the model is trained, Amazon Fraud Detector provides a ranked list of each variable’s impact on the model performance, so you can know whether to include that variable in future model training. If labels are provided, Amazon Fraud Detector uses them to evaluate and display model performance in terms of the model’s discovery rate.
If labels aren’t provided, Amazon Fraud Detector uses negative sampling to provide examples of anomalous login attempts that help the model distinguish between legitimate and fraudulent activities. This produces precise risk scores and improves the model’s ability to avoid incorrectly flagging legitimate activities.
After reviewing the model configured in the first two steps, choose Create and train the model.
You can see the model in Training status on the console page. Creating and training the model takes approximately 45 minutes to complete. When training is complete, you can check model performance by choosing the model version.
Evaluate model performance and deploy the model
In this section, we walk through the steps to review and evaluate the model performance.
Amazon Fraud Detector validates model performance using 15% of your data that wasn’t used to train the model and provides performance metrics. You need to consider these metrics and your business objectives to define a threshold that aligns with your business model. For further details on the metrics and how to determine thresholds, see Model performance metrics.
ATI is an anomaly detection model rather than a classification model; therefore, the evaluation metrics differ from classification models. When your ATI model has finished training, you can see the Anomaly Separation Index (ASI), a holistic measure of the model’s ability to identify high-risk anomalous logins. An ASI of 75% or more is considered good, 90% or more is considered high, and below 75% is considered poor.
To assist in choosing the right balance, Amazon Fraud Detector provides the following metrics to evaluate ATI model performance:
- Anomaly Separation Index (ASI) – Summarizes the overall ability of the model to separate anomalous activities from the expected behavior of users. A model with no separability power will have the lowest possible ASI score of 0.5. In contrast, the model with a high separability power will have the highest possible ASI score of 1.0.
- Challenge Rate (CR) – The percentage of login events, at the selected score threshold, that the model would recommend challenging in the form of a one-time password, multi-factor authentication, identity verification, investigation, and so on.
- Anomaly Discovery Rate (ADR) – Quantifies the percentage of anomalies the model can detect at the selected score threshold. A lower score threshold increases the percentage of anomalies captured by the model. Still, it would also require challenging a more significant percentage of login events, leading to higher customer friction.
- ATO Discovery Rate (ATODR) – Quantifies the percentage of account compromise events that the model can detect at the selected score threshold. This metric is only available if 50 or more entities with at least one labeled ATO event are present in the ingested dataset.
In the following example, we have an ASI of 0.96 (high), which indicates a high ability to separate anomalous activities from the normal behavior of users. By writing a rule using a model score threshold of 500, you challenge or create friction on 6% of all login activities while catching 96% of anomalous activities.
Another important metric is the model variable importance. Variable importance gives you an understanding of how the different variables relate to the model performance. You can have two types of variables: raw and aggregate variables. Raw variables are the ones that were defined based on the dataset, whereas aggregate variables are a combination of multiple variables that are enriched and have an aggregated importance value.
For more information about variable importance, see Model variable importance.
A variable (raw or aggregate) with a much higher importance value than the rest could indicate that the model is overfitting, whereas variables with relatively low values could just be noise.
After reviewing the model performance and deciding what model score thresholds align with your business model, you can deploy the model version. For that, on the Actions menu, choose Deploy model version. With the model deployed, we create a detector endpoint and perform real-time prediction.
Create a detector endpoint and define business rules
Amazon Fraud Detector uses detector endpoints to generate fraud predictions. A detector contains the detection logic, such as trained models and business rules, for a specific event you want to evaluate for fraud. Detection logic uses rules to tell Amazon Fraud Detector how to interpret the data associated with the model.
To create a detector, complete the following steps:
- On the Amazon Fraud Detector console, in the navigation pane, choose Detectors.
- Choose Create detector.
- For Detector name, enter a name.
- Optionally, describe your detector.
- For Event type, choose the same event type as the model created earlier.
- Choose Next.
- On the Add model (optional) page, choose Add model.
- To add a model, choose the model you trained and published during the model training steps and choose the active version.
- Choose Add model.
As part of the next step, you create the business rules that define an outcome. A rule is a condition that tells Amazon Fraud Detector how to interpret variable values during a fraud prediction. A rule consists of one or more variables, a logic expression, and one or more outcomes. An outcome is the result of a fraud prediction and is returned if the rule matches during an evaluation.
- Define decline_rule as $your_model_name_insightscore >= 950 with outcome deny_login.
- Define friction_rule as $your_model_name_insightscore >= 855 and $your_model_name_insightscore < 950 with outcome challenge_login.
- Define approve_rule as $your_model_name_insightscore < 855 with outcome approve_login.
Outcomes are strings returned in the GetEventPrediction API response. You can use outcomes to trigger actions in calling applications and downstream systems, or simply to identify which logins are likely to be fraudulent or legitimate.
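If you prefer to create these rules programmatically rather than in the console, the following minimal sketch uses the CreateRule API. The detector ID, model variable name, and outcome names are assumptions that must match the resources you created earlier, and the outcomes must already exist.

```python
import boto3

frauddetector = boto3.client("frauddetector")

detector_id = "account_takeover_detector"   # hypothetical detector ID
score = "$your_model_name_insightscore"     # replace with your model's insight score variable

rules = [
    ("decline_rule", f"{score} >= 950", ["deny_login"]),
    ("friction_rule", f"{score} >= 855 and {score} < 950", ["challenge_login"]),
    ("approve_rule", f"{score} < 855", ["approve_login"]),
]

# The outcomes (deny_login, challenge_login, approve_login) must be created beforehand.
for rule_id, expression, outcomes in rules:
    frauddetector.create_rule(
        ruleId=rule_id,
        detectorId=detector_id,
        expression=expression,
        language="DETECTORPL",
        outcomes=outcomes,
    )
```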
- On the Add Rules page, choose Next after you finish adding all your rules.
- In the Configure rule execution section, choose the mode for your rules engine.
The Amazon Fraud Detector rules engine has two modes: first matched or all matched. First matched mode is for sequential rule runs, returning the outcome for the first condition met. The other mode is all matched, which evaluates all rules and returns outcomes from all the matching rules. In this example, we use the first matched mode for our detector.
After this process, you’re ready to create your detector and run some tests.
- To run a test, go to your newly created detector and choose the detector version you want to use.
- Provide the variable values as requested and choose Run test.
As a result of the test, you receive the risk score and the outcome based on your business rules.
You can also search past predictions by going to the left panel and choosing Search past predictions. The prediction is based on each variable’s contribution to the overall likelihood of a fraudulent event. The following screenshot is an example of a past prediction showing the input variables and how they influenced the fraud prediction score.
Get real-time predictions
To get real-time predictions and integrate Amazon Fraud Detector into your workflow, we need to publish the detector endpoint. Complete the following steps:
- Go to the newly created detector and choose the detector version, which will be version 1.
- On the Actions menu, choose Publish.
You can perform real-time predictions with the published detector by calling the GetEventPrediction API. The following is sample Python code for calling the GetEventPrediction API.
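A minimal sketch follows; the detector ID, event type, entity type, and variable names are assumptions that must match the resources you created earlier.

```python
import boto3
from datetime import datetime, timezone

frauddetector = boto3.client("frauddetector")

response = frauddetector.get_event_prediction(
    detectorId="account_takeover_detector",  # hypothetical detector ID
    detectorVersionId="1",
    eventId="login-0002",
    eventTypeName="account_login",           # hypothetical event type name
    eventTimestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    entities=[{"entityType": "user_account", "entityId": "user-123"}],
    eventVariables={
        "ip_address": "192.0.2.10",
        "useragent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "validcred": "true",
    },
)

# The model score (0-1000) and the matched rule outcomes drive the login decision.
print(response["modelScores"][0]["scores"])
print(response["ruleResults"][0]["outcomes"])
```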
Conclusion
Amazon Fraud Detector relies on specific models with tailored algorithms, enrichments, and feature transformations to detect fraudulent events across multiple use cases. In this post, you learned how to ingest data, train and deploy a model, write business rules, and publish a detector to generate real-time fraud prediction on potentially compromised accounts.
Visit Amazon Fraud Detector to learn more about the service, or visit our GitHub repo for code samples, notebooks, and synthetic datasets.
About the authors
Marcel Pividal is a Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for Fintechs, Payment Providers, Pharma, and government agencies. His current areas of focus are Risk Management, Fraud Prevention, and Identity Verification.
Mike Ames is a data scientist turned identity verification solution specialist with extensive experience developing machine learning and AI solutions to protect organizations from fraud, waste, and abuse. In his spare time, you can find him hiking, mountain biking, or playing Frisbee with his dog Max.
Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services
Content moderation is the process of screening and monitoring user-generated content online. To provide a safe environment for both users and brands, platforms must moderate content to ensure that it falls within preestablished guidelines of acceptable behavior that are specific to the platform and its audience.
When a platform moderates content, acceptable user-generated content (UGC) can be created and shared with other users. Inappropriate, toxic, or banned behaviors can be prevented, blocked in real time, or removed after the fact, depending on the content moderation tools and procedures the platform has in place.
You can use Amazon Rekognition Content Moderation to detect content that is inappropriate, unwanted, or offensive, to create a safer user experience, provide brand safety assurances to advertisers, and comply with local and global regulations.
In this post, we discuss the key elements needed to evaluate the performance of a content moderation service in terms of various accuracy metrics, and provide an example using the Amazon Rekognition Content Moderation APIs.
What to evaluate
When evaluating a content moderation service, we recommend the following steps.
Before you can evaluate the performance of the API on your use cases, you need to prepare a representative test dataset. The following are some high-level guidelines:
- Collection – Take a large enough random sample (images or videos) of the data you eventually want to run through Amazon Rekognition. For example, if you plan to moderate user-uploaded images, you can take a week’s worth of user images for the test. We recommend choosing a set that has enough images without getting too large to process (such as 1,000–10,000 images), although larger sets are better.
- Definition – Use your application’s content guidelines to decide which types of unsafe content you’re interested in detecting from the Amazon Rekognition moderation concepts taxonomy. For example, you may be interested in detecting all types of explicit nudity and graphic violence or gore.
- Annotation – Now you need human-generated ground truth for your test set using the chosen labels, so that you can compare machine predictions against it. This means that each image is annotated for the presence or absence of your chosen concepts. To annotate your image data, you can use Amazon SageMaker Ground Truth (GT) to manage image annotation. Refer to the Ground Truth documentation for image labeling, consolidating annotations, and processing annotation output.
Get predictions on your test dataset with Amazon Rekognition
Next, you want to get predictions on your test dataset.
The first step is to decide on a minimum confidence score (a threshold value, such as 50%) at which you want to measure results. Our default threshold is set to 50, which offers a good balance between retrieving large amounts of unsafe content without incurring too many false predictions on safe content. However, your platform may have different business needs, so you should customize this confidence threshold as needed. You can use the MinConfidence parameter in your API requests to balance detection of content (recall) against the accuracy of detection (precision). If you reduce MinConfidence, you are likely to detect most of the inappropriate content, but are also likely to pick up content that is not actually inappropriate. If you increase MinConfidence, you are likely to ensure that all your detected content is truly inappropriate, but some content may not be tagged. We suggest experimenting with a few MinConfidence values on your dataset and quantitatively selecting the best value for your data domain.
Next, run each sample (image or video) of your test set through the Amazon Rekognition moderation API (DetectModerationLabels).
Measure model accuracy on images
You can assess the accuracy of a model by comparing human-generated ground truth annotations with the model predictions. You repeat this comparison for every image independently and then aggregate over the whole test set:
- Per-image results – A model prediction is defined as the pair {label_name, confidence_score}, where the confidence score is greater than or equal to the threshold you selected earlier. For each image, a prediction is considered correct when it matches the ground truth (GT). A prediction is one of the following options:
- True Positive (TP): both prediction and GT are “unsafe”
- True Negative (TN): both prediction and GT are “safe”
- False Positive (FP): the prediction says “unsafe”, but the GT is “safe”
- False Negative (FN): the prediction is “safe”, but the GT is “unsafe”
- Aggregated results over all images – Next, you can aggregate these predictions into dataset-level results:
- False positive rate (FPR) – This is the percentage of safe images in the test set that are wrongly flagged by the model as containing unsafe content: FPR = FP / (TN + FP).
- False negative rate (FNR) – This is the percentage of unsafe images in the test set that are missed by the model: FNR = FN / (FN + TP).
- True positive rate (TPR) – Also called recall, this computes the percentage of unsafe content (ground truth) that is correctly discovered or predicted by the model: TP / (TP + FN) = 1 – FNR.
- Precision – This computes the percentage of correct predictions (unsafe content) with regards to the total number of predictions made: TP / (TP+FP).
Let’s explore an example. Let’s assume that your test set contains 10,000 images: 9,950 safe and 50 unsafe. The model correctly predicts 9,800 out of 9,950 images as safe and 45 out of 50 as unsafe:
- TP = 45
- TN = 9800
- FP = 9950 – 9800 = 150
- FN = 50 – 45 = 5
- FPR = 150 / (9800 + 150) = 0.015 = 1.5%
- FNR = 5 / (5 + 45) = 0.1 = 10%
- TPR/Recall = 45 / (45 + 5) = 0.9 = 90%
- Precision = 45 / (45 + 150) = 0.23 = 23%
Measure model accuracy on videos
If you want to evaluate the performance on videos, a few additional steps are necessary:
- Sample a subset of frames from each video. We suggest sampling uniformly with a rate of 0.3–1 frames per second (fps). For example, if a video is encoded at 24 fps and you want to sample one frame every 3 seconds (0.3 fps), you need to select one every 72 frames.
- Run these sampled frames through Amazon Rekognition content moderation. You can either use our video API, which already samples frames for you (at a rate of 3 fps), or use the image API, in which case you want to sample more sparsely. We recommend the latter option, given the redundancy of information in videos (consecutive frames are very similar).
- Compute the per-frame results as explained in the previous section (per-image results).
- Aggregate results over the whole test set. Here you have two options, depending on the type of outcome that matters for your business:
- Frame-level results – This considers all the sampled frames as independent images and aggregates the results exactly as explained earlier for images (FPR, FNR, recall, precision). If some videos are considerably longer than others, they will contribute more frames to the total count, making the comparison unbalanced. In that case, we suggest changing the initial sampling strategy to a fixed number of frames per video. For example, you could uniformly sample 50–100 frames per video (assuming videos are at least 2–3 minutes long).
- Video-level results – For some use cases, it doesn’t matter whether the model is capable of correctly predicting 50% or 99% of the frames in a video. Even a single wrong unsafe prediction on a single frame could trigger a downstream human evaluation, and only videos with 100% correct predictions are truly considered correct. If this is your use case, we suggest you compute FPR/FNR/TPR over the frames of each video and categorize each video as shown in the following table:
Video ID | Results Aggregated Over All the Frames of the Video | Per-Video Categorization |
---|---|---|
. | Total FP = 0 and Total FN = 0 | Perfect predictions |
. | Total FP > 0 | False Positive (FP) |
. | Total FN > 0 | False Negative (FN) |
After you have computed these for each video independently, you can then compute all the metrics we introduced earlier:
- The percentage of videos that are wrongly flagged (FP) or missed (FN)
- Precision and recall
Measure performance against goals
Finally, you need to interpret these results in the context of your goals and capabilities.
First, consider your business needs in regards to the following:
- Data – Learn about your data (daily volume, type of data, and so on) and the distribution of your unsafe vs. safe content. For example, is it balanced (50/50), skewed (10/90), or very skewed (1/99, meaning that only 1% is unsafe)? Understanding such distribution can help you define your actual metric goals. For example, the amount of safe content is often an order of magnitude larger than the amount of unsafe content (very skewed), making this almost an anomaly detection problem. Within this scenario, the number of false positives may outnumber the number of true positives, and you can use your data information (distribution skewness, volume of data, and so on) to decide the FPR you can work with.
- Metric goals – What are the most critical aspects of your business? Lowering the FPR often comes at the cost of a higher FNR (and vice versa) and it’s important to find the right balance that works for you. If you can’t miss any unsafe content, you likely want close to 0% FNR (100% recall). However, this will incur the largest number of false positives, and you need to decide the target (maximum) FPR you can work with, based on your post-prediction pipeline. You may want to allow some level of false negatives to be able to find a better balance and lower your FPR: for example, accepting a 5% FNR instead of 0% could reduce the FPR from 2% to 0.5%, considerably reducing the number of flagged contents.
Next, ask yourself what mechanisms you will use to parse the flagged images. Even though the API may not provide 0% FPR and FNR, it can still bring huge savings and scale (for example, by only flagging 3% of your images, you have already filtered out 97% of your content). When you pair the API with some downstream mechanisms, like a human workforce that reviews the flagged content, you can easily reach your goals (for example, 0.5% flagged content). Note how this pairing is considerably cheaper than having to do a human review on 100% of your content.
When you have decided on your downstream mechanisms, we suggest you evaluate the throughput that you can support. For example, if you have a workforce that can only verify 2% of your daily content, then your target goal from our content moderation API is a flag rate (FPR+TPR) of 2%.
Finally, if obtaining ground truth annotations is too hard or too expensive (for example, your volume of data is too large), we suggest annotating the small number of images flagged by the API. Although this doesn’t allow for FNR evaluations (because your data doesn’t contain any false negatives), you can still measure TPR and FPR.
In the following section, we provide a solution for image moderation evaluation. You can take a similar approach for video moderation evaluation.
Solution overview
The following diagram illustrates the various AWS services you can use to evaluate the performance of Amazon Rekognition content moderation on your test dataset.
The content moderation evaluation has the following steps:
- Upload your evaluation dataset into Amazon Simple Storage Service (Amazon S3).
- Use Ground Truth to assign ground truth moderation labels.
- Generate the predicted moderation labels using the Amazon Rekognition pre-trained moderation API with a few threshold values (for example, 70%, 75%, and 80%).
- Assess the performance for each threshold by computing true positives, true negatives, false positives, and false negatives. Determine the optimum threshold value for your use case.
- Optionally, you can tailor the size of the workforce based on true and false positives, and use Amazon Augmented AI (Amazon A2I) to automatically send all flagged content to your designated workforce for a manual review.
The following sections provide the code snippets for steps 1, 2, and 3. For complete end-to-end source code, refer to the provided Jupyter notebook.
Prerequisites
Before you get started, complete the following steps to set up the Jupyter notebook:
- Create a notebook instance in Amazon SageMaker.
- When the notebook is active, choose Open Jupyter.
- On the Jupyter dashboard, choose New, and choose Terminal.
- In the terminal, enter the following code:
- Open the notebook for this post: content-moderation-evaluation/Evaluating-Amazon-Rekognition-Content-Moderation-Service.ipynb.
- Upload your evaluation dataset to Amazon Simple Storage Service (Amazon S3).
We will now go through steps 2 through 4 in the Jupyter notebook.
Use Ground Truth to assign moderation labels
To assign labels in Ground Truth, complete the following steps:
- Create a manifest input file for your Ground Truth job and upload it to Amazon S3.
- Create the labeling configuration, which contains all moderation labels that are needed for the Ground Truth labeling job. To check the limit for the number of label categories you can use, refer to Label Category Quotas. In this example, we use five labels (refer to the hierarchical taxonomy used in Amazon Rekognition for more details) plus one label (Safe_Content) that marks content as safe; a sketch of this configuration appears after this list.
- Create a custom worker task template to provide the Ground Truth workforce with labeling instructions and upload it to Amazon S3. The Ground Truth label job is defined as an image classification (multi-label) task. Refer to the source code for instructions to customize the instruction template.
- Decide which workforce you want to use to complete the Ground Truth job. You have two options (refer to the source code for details):
- Use a private workforce in your own organization to label the evaluation dataset.
- Use a public workforce to label the evaluation dataset.
- Create and submit a Ground Truth labeling job. You can also adjust the labeling job parameters to meet your specific business requirements (a sketch of the call appears after this list). Refer to the source code for complete instructions on creating and configuring the Ground Truth job.
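The following is a minimal sketch of the two pieces referenced above: uploading the label category configuration and submitting the labeling job with the SageMaker CreateLabelingJob API. The bucket name, label choices, IAM role, workteam ARN, and built-in Lambda ARNs are placeholders that you must replace with values for your own account and Region.

```python
import json
import boto3

s3 = boto3.client("s3")
sagemaker = boto3.client("sagemaker")

bucket = "my-evaluation-bucket"  # placeholder bucket

# Example label choices from the Rekognition moderation taxonomy, plus Safe_Content.
labels = ["Explicit Nudity", "Suggestive", "Violence", "Visually Disturbing", "Drugs", "Safe_Content"]
label_category_config = {
    "document-version": "2018-11-28",
    "labels": [{"label": label} for label in labels],
}
s3.put_object(
    Bucket=bucket,
    Key="ground-truth/label-categories.json",
    Body=json.dumps(label_category_config),
)

# Submit the labeling job (image classification, multi-label).
sagemaker.create_labeling_job(
    LabelingJobName="content-moderation-evaluation",
    LabelAttributeName="moderation-labels",
    InputConfig={
        "DataSource": {"S3DataSource": {"ManifestS3Uri": f"s3://{bucket}/ground-truth/input.manifest"}}
    },
    OutputConfig={"S3OutputPath": f"s3://{bucket}/ground-truth/output/"},
    RoleArn="arn:aws:iam::111122223333:role/GroundTruthExecutionRole",  # placeholder
    LabelCategoryConfigS3Uri=f"s3://{bucket}/ground-truth/label-categories.json",
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/my-team",  # placeholder
        "UiConfig": {"UiTemplateS3Uri": f"s3://{bucket}/ground-truth/instructions.template"},
        # Built-in pre- and post-processing Lambdas for multi-label image classification;
        # these ARNs are Region-specific placeholders.
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-ImageMultiClassMultiLabel",
        "TaskTitle": "Image moderation labeling",
        "TaskDescription": "Select all moderation labels that apply to the image",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 600,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-ImageMultiClassMultiLabel"
        },
    },
)
```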
After the job is submitted, the response includes the labeling job’s Amazon Resource Name (ARN). Wait for the labeling job on the evaluation dataset to complete successfully, then continue to the next step.
Use the Amazon Rekognition moderation API to generate predicted moderation labels
The following code snippet shows how to use the Amazon Rekognition moderation API to generate moderation labels:
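A minimal sketch follows, assuming your evaluation images are stored in Amazon S3; the bucket, object key, and MinConfidence value are placeholders to adjust for your dataset.

```python
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_moderation_labels(
    Image={"S3Object": {"Bucket": "my-evaluation-bucket", "Name": "images/sample-0001.jpg"}},  # placeholders
    MinConfidence=50,  # the threshold you chose earlier
)

# Each detected label includes its name, parent category, and confidence score.
for label in response["ModerationLabels"]:
    print(label["Name"], label["ParentName"], round(label["Confidence"], 2))
```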
Assess the performance
You first retrieved ground truth moderation labels from the Ground Truth labeling job results for the evaluation dataset, and then you ran the Amazon Rekognition moderation API to get predicted moderation labels for the same dataset. Because this is a binary classification problem (safe vs. unsafe content), we calculate true positives, true negatives, false positives, and false negatives (assuming unsafe content is positive), along with the corresponding evaluation metrics: FPR, FNR, recall, and precision. The following code snippet shows how to calculate those metrics.
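The following is a minimal sketch, assuming you have already aligned the per-image ground truth and predicted labels as boolean flags (True meaning unsafe); the sample lists are illustrative only.

```python
# Hypothetical aligned per-image flags: True = unsafe, False = safe.
ground_truth = [True, False, False, True, False]   # from the Ground Truth labeling job
predictions  = [True, False, True,  False, False]  # from detect_moderation_labels

tp = sum(gt and pred for gt, pred in zip(ground_truth, predictions))
tn = sum(not gt and not pred for gt, pred in zip(ground_truth, predictions))
fp = sum(not gt and pred for gt, pred in zip(ground_truth, predictions))
fn = sum(gt and not pred for gt, pred in zip(ground_truth, predictions))

fpr = fp / (tn + fp)        # false positive rate
fnr = fn / (fn + tp)        # false negative rate
recall = tp / (tp + fn)     # true positive rate
precision = tp / (tp + fp)

print(f"FPR={fpr:.2%} FNR={fnr:.2%} Recall={recall:.2%} Precision={precision:.2%}")
```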
Conclusion
This post discusses the key elements needed to evaluate the performance aspect of your content moderation service in terms of various accuracy metrics. However, accuracy is only one of the many dimensions that you need to evaluate when choosing a particular content moderation service. It’s critical that you include other parameters, such as the service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing. To learn more about content moderation in Amazon Rekognition, visit Amazon Rekognition Content Moderation.
About the authors
Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.
Davide Modolo is an Applied Science Manager at AWS AI Labs. He has a PhD in computer vision from the University of Edinburgh (UK) and is passionate about developing new scientific solutions for real-world customer problems. Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.
Jian Wu is a Senior Enterprise Solutions Architect at AWS. He’s been with AWS for 6 years working with customers of all sizes. He is passionate about helping customers to innovate faster via the adoption of the Cloud and AI/ML. Prior to joining AWS, Jian spent 10+ years focusing on software development, system implementation and infrastructure management. Aside from work, he enjoys staying active and spending time with his family.
Face-off Probability, part of NHL Edge IQ: Predicting face-off winners in real time during televised games
Face-off Probability is the National Hockey League’s (NHL) first advanced statistic using machine learning (ML) and artificial intelligence. It uses real-time Player and Puck Tracking (PPT) data to show viewers which player is likely to win a face-off before the puck is dropped, and provides broadcasters and viewers the opportunity to dive deeper into the importance of face-off matches and the differences in player abilities. Based on 10 years of historical data, hundreds of thousands of face-offs were used to engineer over 70 features fed into the model to provide real-time probabilities. Broadcasters can now discuss how a key face-off win by a player led to a goal or how the chances of winning a face-off decrease as a team’s face-off specialist is waived out of a draw. Fans can see visual, real-time predictions that show them the importance of a key part of the game.
In this post, we focus on how the ML model for Face-off Probability was developed and the services used to put the model into production. We also share the key technical challenges that were solved during construction of the Face-off Probability model.
How it works
Imagine the following scenario: It’s a tie game between two NHL teams that will determine who moves forward. We’re in the third period with 1:22 left to play. Two players from opposite teams line up to take the draw at the face-off dot closest to one of the nets. The linesman notices a defensive player encroaching into the face-off circle and waives them out of the face-off due to the violation. A less experienced defensive player comes in to take the draw as the replacement. The attacking team wins the face-off, gets possession of the puck, and immediately scores to take the lead. The score holds up for the remaining minute of the game and decides who moves forward. Which player was favored to win the face-off before the initial duo was changed? How much did the defensive team’s probability of winning the face-off decrease because of the violation that forced a different player to take the draw? Face-off Probability, the newest NHL Edge IQ statistic powered by AWS, can now answer these questions.
When there is a stoppage in play, Face-off Probability generates predictions for who will win the upcoming face-off based on the players on the ice, the location of the face-off, and the current game situation. The predictions are generated throughout the stoppage until the game clock starts running again. Predictions occur at sub-second latency and are triggered any time there is a change in the players involved in the face-off.
Overcoming key obstacles for face-off probability
Predicting face-off probability in real-time broadcasts can be broken down into two specific sub-problems:
- Modeling the face-off event as an ML problem, understanding the requirements and limitations, preparing the data, engineering the data signals, exploring algorithms, and ensuring reliability of results
- Detecting a face-off event during the game from a stream of PPT events, collecting parameters needed for prediction, calling the model, and submitting results to broadcasters
Predicting the probability of a player winning a face-off in real time on a televised broadcast has several technical challenges that had to be overcome. These included determining the features required and modeling methods to predict an event that has a large amount of uncertainty, and determining how to use streaming PPT sensor data to identify where a face-off is occurring, the players involved, and the probability of each player winning the face-off, all within hundreds of milliseconds.
Building an ML model for difficult-to-predict events
Predicting events such as face-off winning probabilities during a live game is a complex task that requires a significant amount of quality historic data and data streaming capabilities. To identify and understand the important signals in such a rich data environment, the development of ML models requires extensive subject matter expertise. The Amazon Machine Learning Solutions Lab partnered with NHL hockey and data experts to work backward from their target goal of enhancing their fan experience. By continuously listening to NHL’s expertise and testing hypotheses, AWS’s scientists engineered over 100 features that correlate to the face-off event. In particular, the team classified this feature set into one of three categories:
- Historical statistics on player performances such as the number of face-offs a player has taken and won in the last five seasons, the number of face-offs the player has taken and won in previous games, a player’s winning percentages over several time windows, and the head-to-head winning percentage for each player in the face-off
- Player characteristics such as height, weight, handedness, and years in the league
- In-game situational data that might affect a player’s performance, such as the score of the game, the elapsed time in the game to that point, where the face-off is located, the strength of each team, and which player has to put their stick down for the face-off first
AWS’s ML scientists considered the problem as a binary classification problem: either the home player wins the face-off or the away player wins the face-off. With data from more than 200,000 historical face-offs, they used a LightGBM model to predict which of the two players involved with a face-off event is likely to win.
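As a rough illustration of that setup (not the NHL’s actual training code), the following sketch trains a binary LightGBM classifier on a table of engineered face-off features; the feature names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Hypothetical engineered features for historical face-offs.
X = pd.DataFrame({
    "home_player_career_win_pct": rng.uniform(0.3, 0.7, n),
    "away_player_career_win_pct": rng.uniform(0.3, 0.7, n),
    "head_to_head_win_pct": rng.uniform(0.0, 1.0, n),
    "home_player_is_lefty": rng.integers(0, 2, n),
    "score_differential": rng.integers(-3, 4, n),
    "seconds_elapsed": rng.integers(0, 3600, n),
})
y = rng.integers(0, 2, n)  # 1 = home player wins the face-off (synthetic labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(objective="binary", n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# The raw win probabilities are what a broadcast graphic would ultimately display.
print(model.predict_proba(X_test)[:5, 1])
```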
Determining if a face-off is about to occur and which players are involved
When a whistle blows and the play is stopped, Face-off Probability begins to make predictions. However, Face-off Probability has to first determine where the face-off is occurring and which player from each team is involved in the face-off. The data stream indicates events as they occur but doesn’t provide information on when an event is likely to occur in the future. As such, the sensor data of the players on the ice is needed to determine if and where a face-off is about to happen.
The PPT system produces real-time locations and velocities for players on the ice at up to 60 events per second. These locations and velocities were used to determine where the face-off is happening on the ice and if it’s likely to happen soon. By knowing how close the players are to known face-off locations and how stationary the players were, Face-off Probability was able to determine that a face-off was likely to occur and the two players that would be involved in the face-off.
Determining the correct cut-off distance for proximity to a face-off location and the corresponding cut-off velocity for stationary players was accomplished using a decision tree model. With PPT data from the 2020-2021 season, we built a model to predict the likelihood that a face-off is occurring at a specified location given the average distance of each team to the location and the velocities of the players. The decision tree provided the cut-offs for each metric, which we included as rules-based logic in the streaming application.
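The following sketch illustrates the idea with scikit-learn on synthetic data (not the production code): a shallow decision tree is fit on the average distance to a face-off dot and the average player speed, and its split thresholds become the rules-based cut-offs used in the streaming application.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic examples: [average distance to the face-off dot (ft), average player speed (ft/s)]
faceoff_soon = np.column_stack([rng.uniform(0, 10, 500), rng.uniform(0, 2, 500)])
no_faceoff = np.column_stack([rng.uniform(10, 80, 500), rng.uniform(2, 25, 500)])

X = np.vstack([faceoff_soon, no_faceoff])
y = np.array([1] * 500 + [0] * 500)  # 1 = a face-off is about to occur at this location

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned split thresholds become simple if/else cut-offs in the streaming application.
print(tree.tree_.feature, tree.tree_.threshold)
```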
With the correct face-off location determined, the player from each team taking the face-off was calculated by taking the player closest to the known location from each team. This provided the application with the flexibility to identify the correct players while also being able to adjust to a new player having to take a face-off if a current player is waived out due to an infraction. Making and updating the prediction for the correct player was a key focus for the real-time usability of the model in broadcasts, which we describe further in the next section.
Model development and training
To develop the model, we used more than 200,000 historical face-off data points, along with the custom engineered feature set designed by working with the subject matter experts. We looked at features like in-game situations, historical performance of the players taking the face-off, player-specific characteristics, and head-to-head performances of the players taking the face-off, both in the current season and for their careers. Collectively, this resulted in over 100 features created using a combination of available and derived techniques.
To assess different features and how they might influence the model, we conducted extensive feature analysis as part of the exploratory phase. We used a mix of univariate and multivariate tests. For multivariate tests, we used decision tree visualization techniques for interpretability. To assess statistical significance, we used chi-squared and Kolmogorov-Smirnov (KS) tests to test for dependence or distribution differences.
We explored classification techniques and models with the expectation that the raw probabilities would be treated as the predictions. In terms of algorithms, we explored nearest neighbors, decision trees, neural networks, and collaborative filtering, while trying different sampling strategies (filtering, random, stratified, and time-based sampling), and we evaluated performance on area under the curve (AUC) and calibration distribution along with Brier score loss. In the end, we found that the LightGBM model worked best, with well-calibrated accuracy metrics.
To evaluate the performance of the models, we used multiple techniques. We used a test set that the trained model was never exposed to. Additionally, the teams conducted extensive manual assessments of the results, looking at edge cases and trying to understand the nuances of how the model looked to determine why a certain player should have won or lost a face-off event.
With information collected from manual reviewers, we would adjust the features when required, or run iterations on the model to see if the performance of the model was as expected.
Deploying Face-off Probability for real-time use during national television broadcasts
One of the goals of the project was not just to predict the winner of the face-off, but to build a foundation for solving a number of similar problems in a real-time and cost-efficient way. That goal helped determine which components to use in the final architecture.
The first important component is Amazon Kinesis Data Streams, a serverless streaming data service that acts as a decoupler between the specific implementation of the PPT data provider and consuming applications, thereby protecting the latter from disruptive changes in the former. It also offers enhanced fan-out, which provides the ability to connect up to 20 parallel consumers while maintaining low latency (around 70 milliseconds) and a dedicated throughput of 2 MB/s per shard for each of them simultaneously.
PPT events don’t come for all players at once, but arrive discretely for each player as well as other events in the game. Therefore, to implement the upcoming face-off detection algorithm, the application needs to maintain a state.
The second important component of the architecture is Amazon Kinesis Data Analytics for Apache Flink. Apache Flink is a distributed streaming, high-throughput, low-latency data flow engine that provides a convenient and easy way to use the Data Stream API, and it supports stateful processing functions, checkpointing, and parallel processing out of the box. This helps speed up development and provides access to low-level routines and components, which allows for a flexible design and implementation of applications.
Kinesis Data Analytics provides the underlying infrastructure for your Apache Flink applications. It eliminates a need to deploy and configure a Flink cluster on Amazon Elastic Compute Cloud (Amazon EC2) or Kubernetes, which reduces maintenance complexity and costs.
The third crucial component is Amazon SageMaker. Although we used SageMaker to build a model, we also needed to make a decision at the early stages of the project: should scoring be implemented inside the face-off detecting application itself and complicate the implementation, or should the face-off detecting application call SageMaker remotely and sacrifice some latency due to communication over the network? To make an informed decision, we performed a series of benchmarks to verify SageMaker latency and scalability, and validated that average latency was less than 100 milliseconds under the load, which was within our expectations.
With the main parts of high-level architecture decided, we started to work on the internal design of the face-off detecting application. A computation model of the application is depicted in the following diagram.
The compute model of the face-off detecting application can be modeled as a simple finite-state machine, where each incoming message transitions the system from one state to another while performing some computation along with that transition. The application maintains several data structures to keep track of the following:
- Changes in the game state – The current period number, status and value of the game clock, and score
- Changes in the player’s state – If the player is currently on the ice or on the bench, the current coordinates on the field, and the current velocity
- Changes in the player’s personal face-off stats – The success rate of one player vs. another, and so on
The algorithm checks each location update event of a player to decide whether a face-off prediction should be made and whether the result should be submitted to broadcasters. Taking into account that each player location is updated roughly every 80 milliseconds and that players move much more slowly during game pauses than during play, we can conclude that the situation between two updates doesn’t change drastically. If the application called SageMaker for predictions and sent predictions to broadcasters every time a new location update event was received and all conditions were satisfied, SageMaker and the broadcasters would be overwhelmed with duplicate requests.
To avoid all this unnecessary noise, the application keeps track of a combination of parameters for which predictions were already made, along with the result of the prediction, and caches them in memory to avoid expensive duplicate requests to SageMaker. Also, it keeps track of what predictions were already sent to broadcasters and makes sure that only new predictions are sent or the previously sent ones are sent again only if necessary. Testing showed that this approach reduces the amount of outgoing traffic by more than 100 times.
Another optimization technique that we used was grouping requests to SageMaker and performing them asynchronously in parallel. For example, if we have four new combinations of face-off parameters for which we need to get predictions from SageMaker, we know that each request will take less than 100 milliseconds. If we perform each request synchronously one by one, the total response time will be under 400 milliseconds. But if we group all four requests, submit them asynchronously, and wait for the result for the entire group before moving forward, we effectively parallelize requests and the total response time will be under 100 milliseconds, just like for only one request.
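The following sketch illustrates the caching and parallel-request pattern in Python (the production application implements it inside the Flink job); the endpoint name and payload format are assumptions.

```python
import json
import boto3
from concurrent.futures import ThreadPoolExecutor

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "faceoff-probability-endpoint"  # hypothetical endpoint name
prediction_cache = {}  # key: face-off parameter combination, value: prediction

def predict(params: dict) -> dict:
    """Return a cached prediction, or call the SageMaker endpoint for a new combination."""
    key = json.dumps(params, sort_keys=True)
    if key not in prediction_cache:
        response = runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=json.dumps(params),
        )
        prediction_cache[key] = json.loads(response["Body"].read())
    return prediction_cache[key]

def predict_batch(param_combinations: list) -> list:
    """Submit uncached requests in parallel so the group finishes in roughly one request's latency."""
    with ThreadPoolExecutor(max_workers=max(1, len(param_combinations))) as pool:
        return list(pool.map(predict, param_combinations))
```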
Summary
NHL Edge IQ, powered by AWS, is bringing fans closer to the action with advanced analytics and new ML stats. In this post, we showed insights into the building and deployment of the new Face-off Probability model, the first on-air ML statistic for the NHL. Be sure to keep an eye out for the probabilities generated by Face-off Probability in upcoming NHL games.
To find full examples of building custom training jobs for SageMaker, visit Bring your own training-completed model with SageMaker by building a custom container. For examples of using Amazon Kinesis for streaming, refer to Learning Amazon Kinesis Development.
To learn more about the partnership between AWS and the NHL, visit NHL Innovates with AWS Cloud Services. If you’d like to collaborate with experts to bring ML solutions to your organization, contact the Amazon ML Solutions Lab.
About the Authors
Ryan Gillespie is a Sr. Data Scientist with AWS Professional Services. He has a MSc from Northwestern University and a MBA from the University of Toronto. He has previous experience in the retail and mining industries.
Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.
Alexander Egorov is a Principal Data Architect, specializing in streaming technologies. He helps organizations to design and build platforms for processing and analyzing streaming data in real time.
Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.
Erick Martinez is a Sr. Media Application Architect with 25+ years of experience, with a focus on Media and Entertainment. He is experienced in all aspects of the systems development lifecycle, including discovery, requirements gathering, design, implementation, testing, deployment, and operation.
Johns Hopkins & Amazon announce fellows, faculty research awards
Inaugural recipients named as part of the JHU + Amazon Initiative for Interactive AI (AI2AI).Read More
Amazon releases dataset for complex, multilingual question answering
Dataset that requires question-answering models to look up multiple facts and perform comparisons bridges a significant gap in the field.Read More
Redact sensitive data from streaming data in near-real time using Amazon Comprehend and Amazon Kinesis Data Firehose
Near-real-time delivery of data and insights enables businesses to rapidly respond to their customers’ needs. Real-time data can come from a variety of sources, including social media, IoT devices, infrastructure monitoring, call center monitoring, and more. Due to the breadth and depth of data being ingested from multiple sources, businesses look for solutions to protect their customers’ privacy and keep sensitive data from being accessed by end systems. You previously had to rely on personally identifiable information (PII) rules engines that could flag false positives or miss data, or you had to build and maintain custom machine learning (ML) models to identify PII in your streaming data. You also needed to implement and maintain the infrastructure necessary to support these engines or models.
To help streamline this process and reduce costs, you can use Amazon Comprehend, a natural language processing (NLP) service that uses ML to find insights and relationships like people, places, sentiments, and topics in unstructured text. You can now use Amazon Comprehend ML capabilities to detect and redact PII in customer emails, support tickets, product reviews, social media, and more. No ML experience is required. For example, you can analyze support tickets and knowledge articles to detect PII entities and redact the text before you index the documents. After that, documents are free of PII entities and users can consume the data. Redacting PII entities helps you protect your customers’ privacy and comply with local laws and regulations.
In this post, you learn how to implement Amazon Comprehend into your streaming architectures to redact PII entities in near-real time using Amazon Kinesis Data Firehose with AWS Lambda.
This post is focused on redacting data from select fields that are ingested into a streaming architecture using Kinesis Data Firehose, where you want to create, store, and maintain additional derivative copies of the data for consumption by end-users or downstream applications. If you’re using Amazon Kinesis Data Streams or have additional use cases outside of PII redaction, refer to Translate, redact and analyze streaming data using SQL functions with Amazon Kinesis Data Analytics, Amazon Translate, and Amazon Comprehend, where we show how you can use Amazon Kinesis Data Analytics Studio powered by Apache Zeppelin and Apache Flink to interactively analyze, translate, and redact text fields in streaming data.
Solution overview
The following figure shows an example architecture for performing PII redaction of streaming data in real time, using Amazon Simple Storage Service (Amazon S3), Kinesis Data Firehose data transformation, Amazon Comprehend, and AWS Lambda. Additionally, we use the AWS SDK for Python (Boto3) for the Lambda functions. As indicated in the diagram, the S3 raw bucket contains non-redacted data, and the S3 redacted bucket contains redacted data after using the Amazon Comprehend DetectPiiEntities API within a Lambda function.
Costs involved
In addition to Kinesis Data Firehose, Amazon S3, and Lambda costs, this solution will incur usage costs from Amazon Comprehend. The amount you pay is a factor of the total number of records that contain PII and the characters that are processed by the Lambda function. For more information, refer to Amazon Kinesis Data Firehose pricing, Amazon Comprehend Pricing, and AWS Lambda Pricing.
As an example, let’s assume you have 10,000 log records, and the key value you want to redact PII from is 500 characters long. Out of the 10,000 log records, 50 are identified as containing PII. The cost details are as follows:
Contains PII Cost:
- Size of each key value = 500 characters (1 unit = 100 characters)
- Number of units (100 characters) per record (minimum is 3 units) = 5
- Total units = 10,000 (records) x 5 (units per record) x 1 (Amazon Comprehend requests per record) = 50,000
- Price per unit = $0.000002
- Total cost for identifying log records with PII using ContainsPiiEntities API = $0.1 [50,000 units x $0.000002]
Redact PII Cost:
- Total units containing PII = 50 (records) x 5 (units per record) x 1 (Amazon Comprehend requests per record) = 250
- Price per unit = $0.0001
- Total cost for identifying location of PII using DetectPiiEntities API = [number of units] x [cost per unit] = 250 x $0.0001 = $0.025
Total Cost for identification and redaction:
- Total cost: $0.1 (validation if field contains PII) + $0.025 (redact fields that contain PII) = $0.125
Deploy the solution with AWS CloudFormation
For this post, we provide an AWS CloudFormation streaming data redaction template, which provides the full details of the implementation to enable repeatable deployments. Upon deployment, this template creates two S3 buckets: one to store the raw sample data ingested from the Amazon Kinesis Data Generator (KDG), and one to store the redacted data. Additionally, it creates a Kinesis Data Firehose delivery stream with DirectPUT as input, and a Lambda function that calls the Amazon Comprehend ContainsPiiEntities and DetectPiiEntities APIs to identify and redact PII data. The Lambda function relies on user input in the environment variables to determine which key values need to be inspected for PII.
The Lambda function in this solution limits payload sizes to 100 KB. If a payload is provided where the text is greater than 100 KB, the Lambda function skips it.
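The following minimal sketch shows what such a Firehose transformation Lambda function can look like; it is illustrative rather than the exact code in the template, and the redaction format (replacing each PII span with its entity type) is an assumption:

import base64
import json
import os

import boto3

comprehend = boto3.client("comprehend")
KEYS = [k.strip() for k in os.environ.get("keys", "").split(",")]

def redact(text):
    # Only call the more granular DetectPiiEntities API when the cheaper
    # ContainsPiiEntities check reports PII in the text.
    if not comprehend.contains_pii_entities(Text=text, LanguageCode="en")["Labels"]:
        return text
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Replace each detected span with its entity type, working backward so earlier offsets stay valid.
    for e in sorted(entities, key=lambda x: x["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + f'<{e["Type"]}>' + text[e["EndOffset"] :]
    return text

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        for key in KEYS:
            if key in payload and len(payload[key]) <= 100_000:   # skip very large text, per the 100 KB limit
                payload[key] = redact(payload[key])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}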
To deploy the solution, complete the following steps:
- Launch the CloudFormation stack in US East (N. Virginia) (us-east-1).
- Enter a stack name, and leave other parameters at their defaults.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
Deploy resources manually
If you prefer to build the architecture manually instead of using AWS CloudFormation, complete the steps in this section.
Create the S3 buckets
Create your S3 buckets with the following steps:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Choose Create bucket.
- Create one bucket for your raw data and one for your redacted data.
- Note the names of the buckets you just created.
Create the Lambda function
To create and deploy the Lambda function, complete the following steps:
- On the Lambda console, choose Create function.
- Choose Author from scratch.
- For Function name, enter AmazonComprehendPII-Redact.
- For Runtime, choose Python 3.9.
- For Architecture, select x86_64.
- For Execution role, select Create a new role with Lambda permissions.
- After you create the function, enter the following code:
- Choose Deploy.
- In the navigation pane, choose Configuration.
- Navigate to Environment variables.
- Choose Edit.
- For Key, enter keys.
- For Value, enter the key values you want to redact PII from, separated by a comma and space. For example, enter Tweet1, Tweet2 if you’re using the sample test data provided in the next section of this post.
- Choose Save.
- Navigate to General configuration.
- Choose Edit.
- Change the value of Timeout to 1 minute.
- Choose Save.
- Navigate to Permissions.
- Choose the role name under Execution Role.
You’re redirected to the AWS Identity and Access Management (IAM) console.
- For Add permissions, choose Attach policies.
- Enter Comprehend into the search bar and choose the ComprehendFullAccess policy.
- Choose Attach policies.
Create the Firehose delivery stream
To create your Firehose delivery stream, complete the following steps:
- On the Kinesis Data Firehose console, choose Create delivery stream.
- For Source, select Direct PUT.
- For Destination, select Amazon S3.
- For Delivery stream name, enter ComprehendRealTimeBlog.
- Under Transform source records with AWS Lambda, select Enabled.
- For AWS Lambda function, enter the ARN for the function you created, or browse to the function AmazonComprehendPII-Redact.
- For Buffer Size, set the value to 1 MB.
- For Buffer Interval, leave it as 60 seconds.
- Under Destination Settings, select the S3 bucket you created for the redacted data.
- Under Backup Settings, select the S3 bucket that you created for the raw records.
- Under Permission, either create or update an IAM role, or choose an existing role with the proper permissions.
- Choose Create delivery stream.
Deploy the streaming data solution with the Kinesis Data Generator
You can use the Kinesis Data Generator (KDG) to ingest sample data to Kinesis Data Firehose and test the solution. To simplify this process, we provide a Lambda function and CloudFormation template to create an Amazon Cognito user and assign appropriate permissions to use the KDG.
- On the Amazon Kinesis Data Generator page, choose Create a Cognito User with CloudFormation. You’re redirected to the AWS CloudFormation console to create your stack.
- Provide a user name and password for the user with which you log in to the KDG.
- Leave the other settings at their defaults and create your stack.
- On the Outputs tab, choose the KDG UI link.
- Enter your user name and password to log in.
Send test records and validate redaction in Amazon S3
To test the solution, complete the following steps:
- Log in to the KDG URL you created in the previous step.
- Choose the Region where the AWS CloudFormation stack was deployed.
- For Stream/delivery stream, choose the delivery stream you created (if you used the template, it has the format accountnumber-awscomprehend-blog).
- Leave the other settings at their defaults.
- For the record template, you can create your own tests or use the following template. If you’re using the provided sample data for testing, make sure that you have updated the environment variables in the AmazonComprehendPII-Redact Lambda function to Tweet1, Tweet2. If deployed via CloudFormation, update the environment variables to Tweet1, Tweet2 within the created Lambda function. The sample test data is below:
- Choose Send Data, and allow a few seconds for records to be sent to your stream.
- After a few seconds, stop the KDG generator and check your S3 buckets for the delivered files.
The following is an example of the raw data in the raw S3 bucket:
The following is an example of the redacted data in the redacted S3 bucket:
The sensitive information has been removed from the redacted messages, providing confidence that you can share this data with end systems.
Cleanup
When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. If you followed the manual steps, you will need to manually delete the two buckets, the AmazonComprehendPII-Redact function, the ComprehendRealTimeBlog stream, the log group for the ComprehendRealTimeBlog stream, and any IAM roles that were created.
Conclusion
This post showed you how to integrate PII redaction into your near-real-time streaming architecture and reduce data processing time by performing redaction in flight. In this scenario, you provide the redacted data to your end-users and a data lake administrator secures the raw bucket for later use. You could also build additional processing with Amazon Comprehend to identify tone or sentiment, identify entities within the data, and classify each message.
We provided individual steps for each service as part of this post, and also included a CloudFormation template that allows you to provision the required resources in your account. This template should be used for proof of concept or testing scenarios only. Refer to the developer guides for Amazon Comprehend, Lambda, and Kinesis Data Firehose for any service limits.
To get started with PII identification and redaction, see Personally identifiable information (PII). With the example architecture in this post, you could integrate any of the Amazon Comprehend APIs with near-real-time data using Kinesis Data Firehose data transformation. To learn more about what you can build with your near-real-time data with Kinesis Data Firehose, refer to the Amazon Kinesis Data Firehose Developer Guide. This solution is available in all AWS Regions where Amazon Comprehend and Kinesis Data Firehose are available.
About the authors
Joe Morotti is a Solutions Architect at Amazon Web Services (AWS), helping Enterprise customers across the Midwest US. He has held a wide range of technical roles and enjoys showing customers the art of the possible. In his free time, he enjoys spending quality time with his family exploring new places and overanalyzing his sports team’s performance.
Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing tennis, binge-watching TV shows, and playing tabla.
Reduce cost and development time with Amazon SageMaker Pipelines local mode
Creating robust and reusable machine learning (ML) pipelines can be a complex and time-consuming process. Developers usually test their processing and training scripts locally, but the pipelines themselves are typically tested in the cloud. Creating and running a full pipeline during experimentation adds unwanted overhead and cost to the development lifecycle. In this post, we detail how you can use Amazon SageMaker Pipelines local mode to run ML pipelines locally to reduce both pipeline development and run time while reducing cost. After the pipeline has been fully tested locally, you can easily rerun it with Amazon SageMaker managed resources with just a few lines of code changes.
Overview of the ML lifecycle
One of the main drivers for new innovations and applications in ML is the availability and amount of data along with cheaper compute options. In several domains, ML has proven capable of solving problems previously unsolvable with classical big data and analytical techniques, and the demand for data science and ML practitioners is increasing steadily. From a very high level, the ML lifecycle consists of many different parts, but the building of an ML model usually consists of the following general steps:
- Data cleansing and preparation (feature engineering)
- Model training and tuning
- Model evaluation
- Model deployment (or batch transform)
In the data preparation step, data is loaded, massaged, and transformed into the type of inputs, or features, the ML model expects. Writing the scripts to transform the data is typically an iterative process, where fast feedback loops are important to speed up development. It’s normally not necessary to use the full dataset when testing feature engineering scripts, which is why you can use the local mode feature of SageMaker Processing. This allows you to run locally and update the code iteratively, using a smaller dataset. When the final code is ready, it’s submitted to the remote processing job, which uses the complete dataset and runs on SageMaker managed instances.
The development process is similar to the data preparation step for both model training and model evaluation steps. Data scientists use the local mode feature of SageMaker Training to iterate quickly with smaller datasets locally, before using all the data in a SageMaker managed cluster of ML-optimized instances. This speeds up the development process and eliminates the cost of running ML instances managed by SageMaker while experimenting.
As an organization’s ML maturity increases, you can use Amazon SageMaker Pipelines to create ML pipelines that stitch together these steps, creating more complex ML workflows that process, train, and evaluate ML models. SageMaker Pipelines is a fully managed service for automating the different steps of the ML workflow, including data loading, data transformation, model training and tuning, and model deployment. Until recently, you could develop and test your scripts locally but had to test your ML pipelines in the cloud. This made iterating on the flow and form of ML pipelines a slow and costly process. Now, with the added local mode feature of SageMaker Pipelines, you can iterate and test your ML pipelines similarly to how you test and iterate on your processing and training scripts. You can run and test your pipelines on your local machine, using a small subset of data to validate the pipeline syntax and functionalities.
SageMaker Pipelines
SageMaker Pipelines provides a fully automated way to run simple or complex ML workflows. With SageMaker Pipelines, you can create ML workflows with an easy-to-use Python SDK, and then visualize and manage your workflow using Amazon SageMaker Studio. Your data science teams can be more efficient and scale faster by storing and reusing the workflow steps you create in SageMaker Pipelines. You can also use pre-built templates that automate the infrastructure and repository creation to build, test, register, and deploy models within your ML environment. These templates are automatically available to your organization, and are provisioned using AWS Service Catalog products.
SageMaker Pipelines brings continuous integration and continuous deployment (CI/CD) practices to ML, such as maintaining parity between development and production environments, version control, on-demand testing, and end-to-end automation, which helps you scale ML throughout your organization. DevOps practitioners know that some of the main benefits of using CI/CD techniques include an increase in productivity via reusable components and an increase in quality through automated testing, which leads to faster ROI for your business objectives. These benefits are now available to MLOps practitioners by using SageMaker Pipelines to automate the training, testing, and deployment of ML models. With local mode, you can now iterate much more quickly while developing scripts for use in a pipeline. Note that local pipeline instances can’t be viewed or run within the Studio IDE; however, additional viewing options for local pipelines will be available soon.
The SageMaker SDK provides a general purpose local mode configuration that allows developers to run and test supported processors and estimators in their local environment. You can use local mode training with multiple AWS-supported framework images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) as well as images you supply yourself.
SageMaker Pipelines, which builds a Directed Acyclic Graph (DAG) of orchestrated workflow steps, supports many activities that are part of the ML lifecycle. In local mode, the following steps are supported:
- Processing job steps – A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation
- Training job steps – An iterative process that teaches a model to make predictions by presenting examples from a training dataset
- Hyperparameter tuning jobs – An automated way to evaluate and select the hyperparameters that produce the most accurate model
- Conditional run steps – A step that provides a conditional run of branches in a pipeline
- Model step – Using CreateModel arguments, this step can create a model for use in transform steps or later deployment as an endpoint
- Transform job steps – A batch transform job that generates predictions from large datasets, and runs inference when a persistent endpoint isn’t needed
- Fail steps – A step that stops a pipeline run and marks the run as failed
Solution overview
Our solution demonstrates the essential steps to create and run SageMaker Pipelines in local mode, which means using local CPU, RAM, and disk resources to load and run the workflow steps. Your local environment could be running on a laptop, using popular IDEs like VSCode or PyCharm, or it could be hosted by SageMaker using classic notebook instances.
Local mode allows data scientists to stitch together steps, which can include processing, training, and evaluation jobs, and run the entire workflow locally. When you’re done testing locally, you can rerun the pipeline in a SageMaker managed environment by replacing the LocalPipelineSession object with PipelineSession, which brings consistency to the ML lifecycle.
For this notebook sample, we use a standard publicly available dataset, the UCI Machine Learning Abalone Dataset. The goal is to train an ML model to determine the age of an abalone snail from its physical measurements. At the core, this is a regression problem.
All of the code required to run this notebook sample is available on GitHub in the amazon-sagemaker-examples repository. In this notebook sample, each pipeline workflow step is created independently and then wired together to create the pipeline. We create the following steps:
- Processing step (feature engineering)
- Training step (model training)
- Processing step (model evaluation)
- Condition step (model accuracy)
- Create model step (model)
- Transform step (batch transform)
- Register model step (model package)
- Fail step (run failed)
The following diagram illustrates our pipeline.
Prerequisites
To follow along in this post, you need the following:
- An AWS account
- Valid AWS Identity and Access Management (IAM) credentials to access Amazon Simple Storage Service (Amazon S3), datasets, and more
- docker-compose installed in the environment you’re using (for instructions, refer to Install Docker Compose)
After these prerequisites are in place, you can run the sample notebook as described in the following sections.
Build your pipeline
In this notebook sample, we use SageMaker Script Mode for most of the ML processes, which means that we provide the actual Python code (scripts) to perform the activity and pass a reference to this code. Script Mode provides great flexibility to control the behavior within the SageMaker processing by allowing you to customize your code while still taking advantage of SageMaker pre-built containers like XGBoost or Scikit-Learn. The custom code is written to a Python script file using cells that begin with the magic command %%writefile, like the following:
%%writefile code/evaluation.py
The primary enabler of local mode is the LocalPipelineSession object, which is instantiated from the Python SDK. The following code segments show how to create a SageMaker pipeline in local mode. Although you can configure a local data path for many of the local pipeline steps, Amazon S3 is the default location to store the data output by the transformation. The new LocalPipelineSession object is passed to the Python SDK in many of the SageMaker workflow API calls described in this post. Notice that you can use the local_pipeline_session variable to retrieve references to the S3 default bucket and the current Region name.
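A minimal sketch of this setup (illustrative, not the exact notebook code; role is assumed to hold an appropriate IAM execution role ARN and is reused in the later sketches):

from sagemaker.workflow.pipeline_context import LocalPipelineSession

local_pipeline_session = LocalPipelineSession()

# The session exposes the default S3 bucket and current Region, which later steps use for inputs and outputs.
default_bucket = local_pipeline_session.default_bucket()
region = local_pipeline_session.boto_region_name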
Before we create the individual pipeline steps, we set some parameters used by the pipeline. Some of these parameters are string literals, whereas others are created as special enumerated types provided by the SDK. The enumerated typing ensures that valid settings are provided to the pipeline, such as this one, which is passed to the ConditionLessThanOrEqualTo step further down:
mse_threshold = ParameterFloat(name="MseThreshold", default_value=7.0)
To create a data processing step, which is used here to perform feature engineering, we use the SKLearnProcessor to load and transform the dataset. We pass the local_pipeline_session variable to the class constructor, which instructs the workflow step to run in local mode:
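A sketch of that constructor call (the framework version and instance settings are illustrative; with the local session, the step runs in a local container):

from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",        # illustrative framework version
    instance_type="ml.m5.xlarge",     # with a local session, the step runs in a local container
    instance_count=1,
    role=role,                        # assumed IAM execution role ARN
    sagemaker_session=local_pipeline_session,
)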
Next, we create our first actual pipeline step, a ProcessingStep object, as imported from the SageMaker SDK. The processor arguments are returned from a call to the SKLearnProcessor run() method. This workflow step is combined with other steps towards the end of the notebook to indicate the order of operation within the pipeline.
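Sketched below (the script path, channel names, and input_data_uri are illustrative placeholders):

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

processor_args = sklearn_processor.run(
    inputs=[ProcessingInput(source=input_data_uri, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",     # illustrative script path
)

step_process = ProcessingStep(name="AbaloneProcess", step_args=processor_args)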
Next, we provide code to establish a training step by first instantiating a standard estimator using the SageMaker SDK. We pass the same local_pipeline_session variable to the estimator, named xgb_train, as the sagemaker_session argument. Because we want to train an XGBoost model, we must generate a valid image URI by specifying the following parameters, including the framework and several version parameters:
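A sketch of that setup (the version numbers and output path are illustrative):

from sagemaker import image_uris
from sagemaker.estimator import Estimator

image_uri = image_uris.retrieve(
    framework="xgboost",
    region=local_pipeline_session.boto_region_name,
    version="1.0-1",                  # illustrative version parameters
    py_version="py3",
    instance_type="ml.m5.xlarge",
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=f"s3://{default_bucket}/AbaloneTrain",
    role=role,
    sagemaker_session=local_pipeline_session,
)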
We can optionally call additional estimator methods, for example set_hyperparameters(), to provide hyperparameter settings for the training job. Now that we have an estimator configured, we’re ready to create the actual training step. Once again, we import the TrainingStep class from the SageMaker SDK library:
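A sketch of the training step, wiring in the output of the processing step (channel names are illustrative):

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        )
    }
)

step_train = TrainingStep(name="AbaloneTrain", step_args=train_args)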
Next, we build another processing step to perform model evaluation. This is done by creating a ScriptProcessor instance and passing the local_pipeline_session object as a parameter:
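A sketch of the evaluation step; the property file is an illustrative way to expose the evaluation metrics to the condition step later on:

from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

script_eval = ScriptProcessor(
    image_uri=image_uri,              # reuse the XGBoost image to run the evaluation script
    command=["python3"],
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=local_pipeline_session,
)

# Makes the metrics written to evaluation.json available to the condition step.
evaluation_report = PropertyFile(name="EvaluationReport", output_name="evaluation", path="evaluation.json")

eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        )
    ],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
    code="code/evaluation.py",
)

step_eval = ProcessingStep(name="AbaloneEval", step_args=eval_args, property_files=[evaluation_report])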
To enable deployment of the trained model, either to a SageMaker real-time endpoint or to a batch transform, we need to create a Model object by passing the model artifacts, the proper image URI, and optionally our custom inference code. We then pass this Model object to a ModelStep, which is added to the local pipeline. See the following code:
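A sketch of those two objects (names and instance type are illustrative):

from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep

model = Model(
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=local_pipeline_session,
)

step_create_model = ModelStep(name="AbaloneCreateModel", step_args=model.create(instance_type="ml.m5.large"))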
Next, we create a batch transform step where we submit a set of feature vectors and perform inference. We first need to create a Transformer object and pass the local_pipeline_session parameter to it. Then we create a TransformStep, passing the required arguments, and add this to the pipeline definition:
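A sketch of the transform step; batch_data_uri is an assumed S3 location holding the feature vectors:

from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=f"s3://{default_bucket}/AbaloneTransform",
    sagemaker_session=local_pipeline_session,
)

step_transform = TransformStep(
    name="AbaloneTransform",
    step_args=transformer.transform(data=batch_data_uri, content_type="text/csv"),
)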
Finally, we want to add a branch condition to the workflow so that we only run batch transform if the results of model evaluation meet our criteria. We can indicate this conditional by adding a ConditionStep with a particular condition type, like ConditionLessThanOrEqualTo. We then enumerate the steps for the two branches, essentially defining the if/else or true/false branches of the pipeline. The if_steps provided in the ConditionStep (step_create_model, step_transform) are run whenever the condition evaluates to True.
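A simplified sketch of the condition (the JSON path inside evaluation.json and the step names are illustrative, and the register model step is omitted for brevity):

from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import JsonGet

cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value",   # illustrative path inside evaluation.json
    ),
    right=mse_threshold,
)

step_fail = FailStep(name="AbaloneFail", error_message="Model evaluation did not meet the MSE threshold")

step_cond = ConditionStep(
    name="AbaloneMSECond",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_transform],
    else_steps=[step_fail],
)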
The following diagram illustrates this conditional branch and the associated if/else steps. Only one branch is run, based on the outcome of the model evaluation step as compared in the condition step.
Now that we have all our steps defined, and the underlying class instances created, we can combine them into a pipeline. We provide some parameters, and crucially define the order of operation by simply listing the steps in the desired order. Note that the TransformStep isn’t shown here because it’s the target of the conditional step, and was provided as a step argument to the ConditionStep earlier.
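A sketch of the assembled pipeline (the pipeline name is illustrative):

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="LocalModePipeline",
    parameters=[mse_threshold],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=local_pipeline_session,
)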
To run the pipeline, you must call two methods: pipeline.upsert(), which uploads the pipeline to the underlying service, and pipeline.start(), which starts running the pipeline. You can use various other methods to interrogate the run status, list the pipeline steps, and more. Because we used the local mode pipeline session, these steps are all run locally on your processor. The cell output beneath the start method shows the output from the pipeline:
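A minimal sketch of those calls, continuing from the pipeline object above:

pipeline.upsert(role_arn=role)   # register (or update) the pipeline definition
execution = pipeline.start()     # run every step locally with this session

execution.describe()             # interrogate the overall run status
execution.list_steps()           # list the individual pipeline steps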
You should see a message at the bottom of the cell output similar to the following:
Pipeline execution d8c3e172-089e-4e7a-ad6d-6d76caf987b7 SUCCEEDED
Revert to managed resources
After we’ve confirmed that the pipeline runs without errors and we’re satisfied with the flow and form of the pipeline, we can recreate the pipeline but with SageMaker managed resources and rerun it. The only change required is to use the PipelineSession object instead of LocalPipelineSession:
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.pipeline_context import PipelineSession
local_pipeline_session = LocalPipelineSession()
pipeline_session = PipelineSession()
This informs the service to run each step referencing this session object on SageMaker managed resources. Given the small change, we illustrate only the required code changes in the following code cell, but the same change would need to be implemented on each cell using the local_pipeline_session object. The changes are, however, identical across all cells because we’re only substituting the local_pipeline_session object with the pipeline_session object.
After the local session object has been replaced everywhere, we recreate the pipeline and run it with SageMaker managed resources:
Clean up
If you want to keep the Studio environment tidy, you can use the following methods to delete the SageMaker pipeline and the model. The full code can be found in the sample notebook.
Conclusion
Until recently, you could use the local mode feature of SageMaker Processing and SageMaker Training to iterate on your processing and training scripts locally, before running them on all the data with SageMaker managed resources. With the new local mode feature of SageMaker Pipelines, ML practitioners can now apply the same method when iterating on their ML pipelines, stitching the different ML workflows together. When the pipeline is ready for production, running it with SageMaker managed resources requires just a few lines of code changes. This reduces pipeline run time during development, leading to faster development cycles, while reducing the cost of SageMaker managed resources.
To learn more, visit Amazon SageMaker Pipelines or Use SageMaker Pipelines to Run Your Jobs Locally.
About the authors
Paul Hargis has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to make the most of them. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience for international shoppers. Paul likes to help customers expand their machine learning initiatives to solve real-world problems.
Niklas Palm is a Solutions Architect at AWS in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about serverless technologies along with IoT and machine learning. Outside of work, Niklas is an avid cross-country skier and snowboarder as well as a master egg boiler.
Kirit Thadaka is an ML Solutions Architect working in the SageMaker Service SA team. Prior to joining AWS, Kirit worked in early-stage AI startups followed by some time consulting in various roles in AI research, MLOps, and technical leadership.
Amazon and USC name three new ML Fellows
PhD students will receive fellowship funding and be mentored by Amazon scientists.Read More
Create high-quality data for ML models with Amazon SageMaker Ground Truth
Machine learning (ML) has improved business across industries in recent years—from the recommendation system on your Prime Video account, to document summarization and efficient search with Alexa’s voice assistance. However, the question remains of how to incorporate this technology into your business. Unlike traditional rule-based methods, ML automatically infers patterns from data so as to perform your task of interest. Although this bypasses the need to curate rules for automation, it also means that ML models can only be as good as the data on which they’re trained. However, data creation is often a challenging task. At the Amazon Machine Learning Solutions Lab, we’ve repeatedly encountered this problem and want to ease this journey for our customers. If you want to offload this process, you can use Amazon SageMaker Ground Truth Plus.
By the end of this post, you’ll be able to achieve the following:
- Understand the business processes involved in setting up a data acquisition pipeline
- Identify AWS Cloud services for supporting and expediting your data labeling pipeline
- Run a data acquisition and labeling task for custom use cases
- Create high-quality data following business and technical best practices
Throughout this post, we focus on the data creation process and rely on AWS services to handle the infrastructure and process components. Namely, we use Amazon SageMaker Ground Truth to handle the labeling infrastructure pipeline and user interface. This service uses a point-and-go approach to collect your data from Amazon Simple Storage Service (Amazon S3) and set up a labeling workflow. For labeling, it provides you with the built-in flexibility to acquire data labels using your private team, an Amazon Mechanical Turk force, or your preferred labeling vendor from AWS Marketplace. Lastly, you can use AWS Lambda and Amazon SageMaker notebooks to process, visualize, or quality control the data—either pre- or post-labeling.
Now that all of the pieces have been laid down, let’s start the process!
The data creation process
Contrary to common intuition, the first step for data creation is not data collection. Working backward from the users to articulate the problem is crucial. For example, what do users care about in the final artifact? Where do experts believe the signals relevant to the use case reside in the data? What information about the use case environment could be provided to the model? If you don’t know the answers to those questions, don’t worry. Give yourself some time to talk with users and field experts to understand the nuances. This initial understanding will orient you in the right direction and set you up for success.
For this post, we assume that you have covered this initial process of user requirement specification. The next three sections walk you through the subsequent process of creating quality data: planning, source data creation, and data annotation. Piloting loops at the data creation and annotation steps are vital for ensuring the efficient creation of labeled data. This involves iterating between data creation, annotation, quality assurance, and updating the pipeline as necessary.
The following figure provides an overview of the steps required in a typical data creation pipeline. You can work backward from the use case to identify the data that you need (Requirements Specification), build a process to obtain the data (Planning), implement the actual data acquisition process (Data Collection and Annotation), and assess the results. Pilot runs, highlighted with dashed lines, let you iterate on the process until a high-quality data acquisition pipeline has been developed.

Overview of steps required in a typical data creation pipeline.
Planning
A standard data creation process can be time-consuming and a waste of valuable human resources if conducted inefficiently. Why would it be time-consuming? To answer this question, we must understand the scope of the data creation process. To assist you, we have collected a high-level checklist and description of key components and stakeholders that you must consider. Answering these questions can be difficult at first. Depending on your use case, only some of these may be applicable.
- Identify the legal point of contact for required approvals – Using data for your application can require license or vendor contract review to ensure compliance with company policies and use cases. It’s important to identify your legal support throughout the data acquisition and annotation steps of the process.
- Identify the security point of contact for data handling – Leakage of purchased data might result in serious fines and repercussions for your company. It’s important to identify your security support throughout the data acquisition and annotation steps to ensure secure practices.
- Detail use case requirements and define source data and annotation guidelines – Creating and annotating data is difficult due to the high specificity required. Stakeholders, including data generators and annotators, must be completely aligned to avoid wasting resources. To this end, it’s common practice to use a guidelines document that specifies every aspect of the annotation task: exact instructions, edge cases, an example walkthrough, and so on.
- Align on expectations for collecting your source data – Consider the following:
- Conduct research on potential data sources – For example, public datasets, existing datasets from other internal teams, self-collected, or purchased data from vendors.
- Perform quality assessment – Create an analysis pipeline with relation to the final use case.
- Align on expectations for creating data annotations – Consider the following:
- Identify the technical stakeholders – This is usually an individual or team in your company capable of using the technical documentation regarding Ground Truth to implement an annotation pipeline. These stakeholders are also responsible for quality assessment of the annotated data to make sure that it meets the needs of your downstream ML application.
- Identify the data annotators – These individuals use predetermined instructions to add labels to your source data within Ground Truth. They may need to possess domain knowledge depending on your use case and annotation guidelines. You can use a workforce internal to your company, or pay for a workforce managed by an external vendor.
- Ensure oversight of the data creation process – As you can see from the preceding points, data creation is a detailed process that involves numerous specialized stakeholders. Therefore, it’s crucial to monitor it end to end toward the desired outcome. Having a dedicated person or team oversee the process can help you ensure a cohesive, efficient data creation process.
Depending on the route that you decide to take, you must also consider the following:
- Create the source dataset – This refers to instances when existing data isn’t suitable for the task at hand, or legal constraints prevent you from using it. Internal teams or external vendors (next point) must then be used to create the data. This is often the case for highly specialized domains or areas with little public research, such as a physician’s common questions, garment lay down, or sports expertise. The creation effort can be internal or external.
- Research vendors and conduct an onboarding process – When external vendors are used, a contracting and onboarding process must be set in place between both entities.
In this section, we reviewed the components and stakeholders that we must consider. However, what does the actual process look like? In the following figure, we outline a process workflow for data creation and annotation. The iterative approach uses small batches of data called pilots to decrease turnaround time, detect errors early on, and avoid wasting resources in the creation of low-quality data. We describe these pilot rounds later in this post. We also cover some best practices for data creation, annotation, and quality control.
The following figure illustrates the iterative development of a data creation pipeline. Vertically, we find the data sourcing block (green) and the annotation block (blue). Both blocks have independent pilot rounds (Data creation/Annotation, QAQC, and Update). Increasingly higher sourced data is created and can be used to construct increasingly higher-quality annotations.

Overview of iterative development in a data creation pipeline.
Source data creation
The input creation process revolves around staging your items of interest, which depend on your task type. These could be images (newspaper scans), videos (traffic scenes), 3D point clouds (medical scans), or simply text (subtitle tracks, transcriptions). In general, when staging your task-related items, make sure of the following:
- Reflect the real-world use case for the eventual AI/ML system – The setup for collecting images or videos for your training data should closely match the setup for your input data in the real-world application. This means having consistent placement surfaces, lighting sources, or camera angles.
-
Account for and minimize variability sources – Consider the following:
- Develop best practices for maintaining data collection standards – Depending on the granularity of your use case, you may need to specify requirements to guarantee consistency among your data points. For example, if you’re collecting image or video data from single camera points, you may need to make sure of the consistent placement of your objects of interest, or require a quality check for the camera before a data capture round. This can avoid issues like camera tilt or blur, and minimize downstream overheads like removing out-of-frame or blurry images, as well as needing to manually center the image frame on your area of interest.
- Pre-empt test time sources of variability – If you anticipate variability in any of the attributes mentioned so far during test time, make sure that you can capture those variability sources during training data creation. For example, if you expect your ML application to work in multiple different light settings, you should aim to create training images and videos at various light settings. Depending on the use case, variability in camera positioning can also influence the quality of your labels.
-
Incorporate prior domain knowledge when available – Consider the following:
- Inputs on sources of error – Domain practitioners can provide insights into sources of error based on their years of experience. They can provide feedback on the best practices for the previous two points: What settings reflect the real-world use case best? What are the possible sources of variability during data collection, or at the time of use?
- Domain-specific data collection best practices – Although your technical stakeholders may already have a good idea of the technical aspects to focus on in the images or videos collected, domain practitioners can provide feedback on how best to stage or collect the data such that these needs are met.
Quality control and quality assurance of the created data
Now that you have set up the data collection pipeline, it might be tempting to go ahead and collect as much data as possible. Wait a minute! We must first check if the data collected through the setup is suitable for your real-world use case. We can use some initial samples and iteratively improve the setup through the insights that we gain from analyzing that sample data. Work closely with your technical, business, and annotation stakeholders during the pilot process. This will make sure that your resultant pipeline meets business needs while generating ML-ready labeled data with minimal overhead.
Annotations
The annotation of inputs is where we add the magic touch to our data—the labels! Depending on your task type and data creation process, you may need manual annotators, or you can use off-the-shelf automated methods. The data annotation pipeline itself can be a technically challenging task. Ground Truth eases this journey for your technical stakeholders with its built-in repertoire of labeling workflows for common data sources. With a few additional steps, it also enables you to build custom labeling workflows beyond preconfigured options.
Ask yourself the following questions when developing a suitable annotation workflow:
- Do I need a manual annotation process for my data? In some cases, automated labeling services may be sufficient for the task at hand. Reviewing the documentation and available tools can help you identify if manual annotation is necessary for your use case (for more information, see What is data labeling?). The data creation process can allow for varying levels of control regarding the granularity of your data annotation. Depending on this process, you can also sometimes bypass the need for manual annotation. For more information, refer to Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model.
- What forms my ground truth? In most cases, the ground truth will come from your annotation process—that’s the whole point! In others, the user may have access to ground truth labels. This can significantly speed up your quality assurance process, or reduce the overhead required for multiple manual annotations.
- What is the upper bound for the amount of deviation from my ground truth state? Work with your end-users to understand the typical errors around these labels, the sources of such errors, and the desired reduction in errors. This will help you identify which aspects of the labeling task are most challenging or are likely to have annotation errors.
- Are there preexisting rules used by the users or field practitioners to label these items? Use and refine these guidelines to build a set of instructions for your manual annotators.
Piloting the input annotation process
When piloting the input annotation process, consider the following:
- Review the instructions with the annotators and field practitioners – Instructions should be concise and specific. Ask for feedback from your users (Are the instructions accurate? Can we revise any instructions to make sure that they are understandable by non-field practitioners?) and annotators (Is everything understandable? Is the task clear?). If possible, add an example of good and bad labeled data to help your annotators identify what is expected, and what common labeling errors might look like.
- Collect data for annotations – Review the data with your customer to make sure that it meets the expected standards, and to align on expected outcomes from the manual annotation.
- Provide examples to your pool of manual annotators as a test run – What is the typical variance among the annotators in this set of examples? Study the variance for each annotation within a given image to identify the consistency trends among annotators. Then compare the variances across the images or video frames to identify which labels are challenging to place.
Quality control of the annotations
Annotation quality control has two main components: assessing consistency between the annotators, and assessing the quality of the annotations themselves.
You can assign multiple annotators to the same task (for example, three annotators label the key points on the same image), and measure the average value alongside the standard deviation of these labels among the annotators. Doing so helps you identify any outlier annotations (incorrect label used, or label far away from the average annotation), which can guide actionable outcomes, such as refining your instructions or providing further training to certain annotators.
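As a simple illustration of this check, the following sketch computes the per-key-point spread across annotators and flags outliers; the array shapes, values, and tolerance are illustrative assumptions:

import numpy as np

# annotations[i, j] = (x, y) placed by annotator i for key point j on the same image.
annotations = np.array([
    [[101, 200], [152, 248], [301, 400]],   # annotator 1
    [[103, 198], [150, 251], [305, 397]],   # annotator 2
    [[ 99, 203], [149, 250], [420, 395]],   # annotator 3 misplaces key point 3
], dtype=float)

mean_per_label = annotations.mean(axis=0)                # average position per key point
spread_per_label = annotations.std(axis=0).mean(axis=1)  # average standard deviation per key point

# Flag key points whose spread exceeds a tolerance (in pixels) for review.
tolerance = 10.0
for idx, spread in enumerate(spread_per_label):
    if spread > tolerance:
        print(f"Key point {idx}: mean {np.round(mean_per_label[idx], 1)}, "
              f"spread {spread:.1f} px exceeds the tolerance")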
Assessing the quality of the annotations themselves is tied to annotator variability and, when available, to domain experts or ground truth information. Are there certain labels (across all of your images) where the average variance between annotators is consistently high? Are any labels far off from your expectations of where they should be, or what they should look like?
Based on our experience, a typical quality control loop for data annotation can look like this:
- Iterate on the instructions or image staging based on results from the test run – Are any objects occluded, or does image staging not match the expectations of annotators or users? Are the instructions misleading, or did you miss any labels or common errors in your exemplar images? Can you refine the instructions for your annotators?
- If you are satisfied that you have addressed any issues from the test run, do a batch of annotations – For testing the results from the batch, follow the same quality assessment approach of assessing inter-annotator and inter-image label variabilities.
Conclusion
This post serves as a guide for business stakeholders to understand the complexities of data creation for AI/ML applications. The processes described also serve as a guide for technical practitioners to generate quality data while optimizing business constraints such as personnel and costs. If not done well, a data creation and labeling pipeline can take upwards of 4–6 months.
With the guidelines and suggestions outlined in this post, you can preempt roadblocks, reduce time to completion, and minimize the costs in your journey toward creating high-quality data.
About the authors
Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.
Boris Aronchik is a Manager in the Amazon AI Machine Learning Solutions Lab, where he leads a team of ML scientists and engineers to help AWS customers realize business goals leveraging AI/ML solutions.
Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.
Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.