Detect abnormal equipment behavior and review predictions using Amazon Lookout for Equipment and Amazon A2I

Detect abnormal equipment behavior and review predictions using Amazon Lookout for Equipment and Amazon A2I

Companies that operate and maintain a broad range of industrial machinery such as generators, compressors, and turbines are constantly working to improve operational efficiency and avoid unplanned downtime due to component failure. They invest heavily in physical sensors (tags), data connectivity, data storage, and data visualization to monitor the condition of their equipment and get real-time alerts for predictive maintenance.

With machine learning (ML), more powerful technologies have become available that can provide data-driven models that learn from an equipment’s historical data. However, implementing such ML solutions is time-consuming and expensive because it involves managing and setting up complex infrastructure and having the right ML skills. Furthermore, ML applications need human oversight to ensure accuracy with sensitive data, help provide continuous improvements, and retrain models with updated predictions. However, you’re often forced to choose between an ML-only or human-only system. Companies are looking for the best of both worlds—integrating ML systems into your workflow while keeping a human eye on the results to achieve higher precision.

In this post, we show you how you can set up Amazon Lookout for Equipment to train an abnormal behavior detection model using a wind turbine dataset for predictive maintenance, use a human in the loop workflow to review the predictions using Amazon Augmented AI (Amazon A2I), and augment the dataset and retrain the model.

Solution overview

Amazon Lookout for Equipment analyzes the data from your sensors, such as pressure, flow rate, RPMs, temperature, and power, to automatically train a specific ML model based on your data, for your equipment, with no ML expertise required. Amazon Lookout for Equipment uses your unique ML model to analyze incoming sensor data in near-real time and accurately identify early warning signs that could lead to machine failures. This means you can detect equipment abnormalities with speed and precision, quickly diagnose issues, take action to reduce expensive downtime, and reduce false alerts.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether running on AWS or not.

To get started with Amazon Lookout for Equipment, we create a dataset, ingest data, train a model, and run inference by setting up a scheduler. After going through these steps, we show you how you can quickly set up a human review process using Amazon A2I and retrain your model with augmented or human reviewed datasets.

In the accompanying Jupyter notebook, we walk you through the following steps:

  1. Create a dataset in Amazon Lookout for Equipment.
  2. Ingest data into the Amazon Lookout for Equipment dataset.
  3. Train a model in Amazon Lookout for Equipment.
  4. Run diagnostics on the trained model.
  5. Create an inference scheduler in Amazon Lookout for Equipment to send a simulated stream of real-time requests.
  6. Set up an Amazon A2I private human loop and review the predictions from Amazon Lookout for Equipment.
  7. Retrain your model based on augmented datasets from Amazon A2I.

Architecture overview

The following diagram illustrates our solution architecture.


The workflow contains the following steps:

  1. The architecture assumes that the inference pipeline is built and sensor data is periodically stored in the S3 path for inference inputs. These inputs are stored in CSV format with corresponding timestamps in the file name.
  2. Amazon Lookout for Equipment wakes up at a prescribed frequency and processes the most recent file from the inference inputs Amazon Simple Storage Service (Amazon S3) path.
  3. Inference results are stored in the inference outputs S3 path in JSON lines file format. The outputs also contain event diagnostics, which are used for root cause analysis.
  4. When Amazon Lookout for Equipment detects an anomaly, the inference input and outputs are presented to the private workforce for validation via Amazon A2I.
  5. A private workforce investigates and validates the detected anomalies and provides new anomaly labels. These labels are stored in a new S3 path.
  6. Training data is also updated, along with the corresponding new labels, and is staged for subsequent model retraining.
  7. After enough new labels are collected, a new Amazon Lookout for Equipment model is created, trained, and deployed. The retraining cycle can be repeated for continuous model retraining.


Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.

Make sure your SageMaker notebook has the necessary AWS Identity and Access Management (IAM) roles and permissions mentioned in the prerequisite section of the notebook.

  1. When the notebook is active, choose Open Jupyter.
  2. On the Jupyter dashboard, choose New, and choose Terminal.
  3. In the terminal, enter the following code:
cd SageMaker
git clone
  1. First run the data preparation notebook – 1_data_preparation.ipynb
  2. Then open the notebook for this blog – 3_integrate_l4e_and_a2i.ipynb

You’re now ready to run the following steps through the notebook cells. Run the setup environment step to set up the necessary Python SDKs and libraries that we use throughout the notebook.

  1. Provide an AWS Region, create an S3 bucket, and provide details of the bucket in the following code cell:
REGION_NAME = '<your region>'
BUCKET = '<your bucket name>'
PREFIX = 'data/wind-turbine'

Analyze the dataset and create component metadata

In this section, we walk you through how you can preprocess the existing wind turbine data and ingest it for Amazon Lookout for Equipment. Please make sure to run the data preparation notebook prior to running the accompanying notebook for the blog to follow through all the steps in this post. You need a data schema for using your existing historical data with Amazon Lookout for Equipment. The data schema tells Amazon Lookout for Equipment what the data means. Because a data schema describes the data, its structure mirrors that of the data files of the components it describes.

All components must be described in the data schema. The data for each component is contained in a separate CSV file structured as shown in the data schema.

You store the data for each asset’s component in a separate CSV file using the following folder structure:

S3 bucket > Asset_name > Component 1 > Component1.csv

Go to the notebook section Pre-process and Load Datasets and run the following cell to inspect the data:

import pandas as pd
turbine_id = 'R80711'
df = pd.read_csv(f'../data/wind-turbine/interim/{turbine_id}.csv', index_col = 'Timestamp')

The following screenshot shows our output.

Now we create components map to create a dataset expected by Amazon Lookout for Equipment for ingest. Run the notebook cells under the section Create the Dataset Component Map to create a component map and generate a CSV file for ingest.

Create the Amazon Lookout for Equipment dataset

We use Amazon Lookout for Equipment Create Dataset APIs to create a dataset and provide the component map we created in the previous step as an input. Run the following notebook cell to create a dataset:

ROLE_ARN = sagemaker.get_execution_role()
# REGION_NAME = boto3.session.Session().region_name
DATASET_NAME = 'wind-turbine-train-dsv2-PR'
MODEL_NAME = 'wind-turbine-PR-v1'

lookout_dataset = lookout.LookoutEquipmentDataset(

pp = pprint.PrettyPrinter(depth=5)

You get the following output:

Dataset "wind-turbine-train-dsv2-PR" does not exist, creating it...

{'DatasetName': 'wind-turbine-train-dsv2-PR',
 'DatasetArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:dataset/wind-turbine-train-dsv2-PR/8325802a-9bb7-48fb-804b-ab9f5b79f49d',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': '52dc754c-84da-4a8c-aaef-1908e4348837',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '52dc754c-84da-4a8c-aaef-1908e4348837',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '203',
   'date': 'Thu, 25 Mar 2021 21:18:29 GMT'},
  'RetryAttempts': 0}}

Alternatively, you can go to the Amazon Lookout for Equipment console to view the dataset.

You can choose View under Data schema to view the schema of the dataset. You can choose Ingest new data to start ingesting data through the console, or you can use the APIs shown in the notebook to do the same using Python Boto3 APIs.

Run the notebook cells to ingest the data. When ingestion is complete, you get the following response:

=====Polling Data Ingestion Status=====

2021-03-25 21:18:45 |  IN_PROGRESS
2021-03-25 21:19:46 |  IN_PROGRESS
2021-03-25 21:20:46 |  IN_PROGRESS
2021-03-25 21:21:46 |  IN_PROGRESS
2021-03-25 21:22:46 |  SUCCESS

Now that we have preprocessed the data and ingested the data into Amazon Lookout for Equipment, we move on to the training steps. Alternatively, you can choose Ingest data.

Label your dataset using the SageMaker labeling workforce

If you don’t have an existing labeled dataset available to directly use with Amazon Lookout for Equipment, create a custom labeling workflow. This may be relevant in a use case in which, for example, a company wants to build a remote operating facility where alerts from various operations are sent to the central facility for the SMEs to review and update. For a sample crowd HTML template for your labeling UI, refer to our GitHub repository.

The following screenshot shows an example of what the sample labeling UI looks like.

For this post, we use the labels that came with the dataset for training. If you want to use the label file you created for your actual training in the next step, you need to copy the label file to an S3 bucket and provide the location in the training configuration.

Create a model in Amazon Lookout for Equipment

We walk you through the following steps in this section:

  • Prepare the model parameters and split data into test and train sets
  • Train the model using Amazon Lookout for Equipment APIs
  • Get diagnostics for the trained model

Prepare the model parameters and split the data

In this step, we split the datasets into test and train, prepare labels, and start the training using the notebook. Run the notebook code Split train and test to split the dataset into an 80/20 split for training and testing, respectively. Then run the prepare labels code and move on to setting up training config, as shown in the following code:

# Prepare the model parameters:
lookout_model = lookout.LookoutEquipmentModel(model_name=MODEL_NAME,

# Set the training / evaluation split date:

# Set the label data location:

# This sets up the rate the service will resample the data before 
# training:

In the preceding code, we set up model training parameters such as time periods, label data, and target sampling rate for our model. For more information about these parameters, see CreateModel.

Train model

After setting these model parameters, you need to run the following train model API to start training your model with your dataset and the training parameters:


You get the following response:

{'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:<accountid>:model/wind-turbine-PR-v1/fac217a9-8855-4931-95f9-dd47f0af1ec5',
 'Status': 'IN_PROGRESS',
 'ResponseMetadata': {'RequestId': '3d385895-c62e-4126-9622-38f0ebed9715',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3d385895-c62e-4126-9622-38f0ebed9715',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '152',
   'date': 'Thu, 25 Mar 2021 21:27:05 GMT'},
  'RetryAttempts': 0}}

Alternatively, you can go to the Amazon Lookout for Equipment console and monitor the training after you create the model.

The sample turbine dataset we provide in our example has millions of data points. Training takes approximately 2.5 hours.

Evaluate the trained model

After a model is trained, Amazon Lookout for Equipment evaluates its performance and displays the results. It provides an overview of the performance and detailed information about the abnormal equipment behavior events and how well the model performed when detecting those. With the data and failure labels that you provided for training and evaluating the model, Amazon Lookout for Equipment reports how many times the model’s predictions were true positives. It also reports the average forewarning time across all true positives. Additionally, it reports the false positive results generated by the model, along with the duration of the non-event.

For more information about performance evaluation, see Evaluating the output.

Review training diagnostics

Run the following code to generate the training diagnostics. Refer to the accompanying notebook for the complete code block to run for this step.

LookoutDiagnostics = lookout.LookoutEquipmentAnalysis(model_name=MODEL_NAME, tags_df=df, region_name=REGION_NAME)
LookoutDiagnostics.set_time_periods(evaluation_start, evaluation_end, training_start, training_end)
predicted_ranges = LookoutDiagnostics.get_predictions()

The results returned show the percentage contribution of each feature towards the abnormal equipment prediction for the corresponding date range.

Create an inference scheduler in Amazon Lookout for Equipment

In this step, we show you how the CreateInferenceScheduler API creates a scheduler and starts it—this starts costing you right away. Scheduling an inference is setting up a continuous real-time inference plan to analyze new measurement data. When setting up the scheduler, you provide an S3 bucket location for the input data, assign it a delimiter between separate entries in the data, set an offset delay if desired, and set the frequency of inference. You must also provide an S3 bucket location for the output data. Run the following notebook section to run inference on the model to create an inference scheduler:

scheduler = lookout.LookoutEquipmentScheduler(

scheduler_params = {
'upload_frequency': DATA_UPLOAD_FREQUENCY,
'timezone_offset': INPUT_TIMEZONE_OFFSET,
'timestamp_format': TIMESTAMP_FORMAT


After you create an inference scheduler, the next step is to create some sample datasets for inference.

Prepare the inference data

Run through the notebook steps to prepare the inference data. Let’s load the tags description; this dataset comes with a data description file. From here, we can collect the list of components (subsystem column) if required. We use the tag metadata from the data descriptions as a point of reference for our interpretation. We use the tag names to construct a list that Amazon A2I uses. For more details, refer to the section Set up Amazon A2I to review predictions from Amazon Lookout for Equipment in this post.

To build our sample inference dataset, we extract the last 2 hours of data from the evaluation period of the original time series. Specifically, we create three CSV files containing simulated real-time tags for our turbine 10 minutes apart. These are all stored in Amazon S3 in the inference-a2i folder. Now that we’ve prepared the data, create the scheduler by running the following code:

create_scheduler_response = scheduler.create()

You get the following response:

===== Polling Inference Scheduler Status =====

Scheduler Status: PENDING
Scheduler Status: RUNNING

===== End of Polling Inference Scheduler Status =====

Alternatively, on the Amazon Lookout for Equipment console, go to the Inference schedule settings section of your trained model and set up a scheduler by providing the necessary parameters.

Get inference results

Run through the notebook steps List inference executions to get the run details from the schedule you created in the previous step. Wait 5–15 minutes for the scheduler to run its first inference. When it’s complete, we can use the ListInferenceExecution API for our current inference scheduler. The only mandatory parameter is the scheduler name.

You can also choose a time period for which you want to query inference runs. If you don’t specify it, all runs for an inference scheduler are listed. If you want to specify the time range, you can use the following code:

START_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,3,0,0,0)END_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,5,0,0,0)

This code means that the runs after 2010-01-03 00:00:00 and before 2010-01-05 00:00:00 are listed.

You can also choose to query for runs in a particular status, such as IN_PROGRESS, SUCCESS, and FAILED:


execution_summaries = []

while len(execution_summaries) == 0:
    execution_summaries = scheduler.list_inference_executions(
    if len(execution_summaries) == 0:

You get the following response:

{'ModelName': 'wind-turbine-PR-v1',
  'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:model/wind-turbine-PR-v1/fac217a9-8855-4931-95f9-dd47f0af1ec5',
  'InferenceSchedulerName': 'wind-turbine-scheduler-a2i-PR-v10',
  'InferenceSchedulerArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:inference-scheduler/wind-turbine-scheduler-a2i-PR-v10/e633c39d-a4f9-49f6-8248-7594349db2d0',
  'ScheduledStartTime': datetime.datetime(2021, 3, 29, 15, 35, tzinfo=tzlocal()),
  'DataStartTime': datetime.datetime(2021, 3, 29, 15, 30, tzinfo=tzlocal()),
  'DataEndTime': datetime.datetime(2021, 3, 29, 15, 35, tzinfo=tzlocal()),
  'DataInputConfiguration': {'S3InputConfiguration': {'Bucket': '<your s3 bucket>',
    'Prefix': 'data/wind-turbine/inference-a2i/input/'}},
  'DataOutputConfiguration': {'S3OutputConfiguration': {'Bucket': '<your s3 bucket>',
    'Prefix': 'data/wind-turbine/inference-a2i/output/'}},
  'CustomerResultObject': {'Bucket': '<your s3 bucket>',
   'Key': 'data/wind-turbine/inference-a2i/output/2021-03-29T15:30:00Z/results.jsonl'},
  'Status': 'SUCCESS'}]

Get actual prediction results

After each successful inference, a JSON file is created in the output location of your bucket. Each inference creates a new folder with a single results.jsonl file in it. You can run through this section in the notebook to read these files and display their content.


The following screenshot shows the results.

Stop the inference scheduler

Make sure to stop the inference scheduler; we don’t need it for the rest of the steps in this post. However, as part of your solution, the inference scheduler should be running to ensure real-time inference for your equipment continues. Run through this notebook section to stop the inference scheduler.

Set up Amazon A2I to review predictions from Amazon Lookout for Equipment

Now that inference is complete, let’s understand how to set up a UI to review the inference results and update it, so we can send it back to Amazon Lookout for Equipment for retraining the model. In this section, we show how to use the Amazon A2I custom task type to integrate with Amazon Lookout for Equipment through the walkthrough notebook to set up a human in the loop process. It includes the following steps:

  • Create a human task UI
  • Create a workflow definition
  • Send predictions to Amazon A2I human loops
  • Sign in to the worker portal and annotate Amazon Lookout for Equipment inference predictions

Follow the steps provided in the notebook to initialize Amazon A2I APIs. Make sure to set up the bucket name in the initialization block where you want your Amazon A2I output:

a2ibucket = '<your bucket>'

You also need to create a private workforce and provide a work team ARN in the initialize step.

On the SageMaker console, create a private workforce. After you create the private workforce, find the workforce ARN and enter the ARN in the notebook:

WORKTEAM_ARN = 'your private workforce team ARN'

Create the human task UI

You now create a human task UI resource, giving a UI template in liquid HTML. You can download the provided template and customize it. This template is rendered to the human workers whenever a human loop is required. For over 70 pre-built UIs, see the amazon-a2i-sample-task-uis GitHub repo. We also provide this template in our GitHub repo.

You can use this template to create a task UI either via the console or by running the following code in the notebook:

def create_task_ui():
    response = sagemaker_client.create_human_task_ui(
        UiTemplate={'Content': template})
    return response

Create a human review workflow definition

Workflow definitions allow you to specify the following:

  • The worker template or human task UI you created in the previous step.
  • The workforce that your tasks are sent to. For this post, it’s the private workforce you created in the prerequisite steps.
  • The instructions that your workforce receives.

This post uses the Create Flow Definition API to create a workflow definition. Run the following cell in the notebook:

create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and select correct values as indicated",
            "TaskTitle": "Equipment Condition Review"
            "S3OutputPath" : OUTPUT_PATH
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] 

Send predictions to Amazon A2I human loops

We create an item list from the Pandas DataFrame where we have the Amazon Lookout for Equipement output saved. Run the following notebook cell to create a list of items to send for review:

NUM_TO_REVIEW = 5 # number of line items to review
dftimestamp = sig_full_df['Timestamp'].astype(str).to_list()
dfsig001 = sig_full_df['Q_avg'].astype(str).to_list()
dfsig002 = sig_full_df['Ws1_avg'].astype(str).to_list()
dfsig003 = sig_full_df['Ot_avg'].astype(str).to_list()
dfsig004 = sig_full_df['Nf_avg'].astype(str).to_list()
dfsig046 = sig_full_df['Ba_avg'].astype(str).to_list()
sig_list = [{'timestamp': dftimestamp[x], 'reactive_power': dfsig001[x], 'wind_speed_1': dfsig002[x], 'outdoor_temp': dfsig003[x], 'grid_frequency': dfsig004[x], 'pitch_angle': dfsig046[x]} for x in range(NUM_TO_REVIEW)]

Run the following code to create a JSON input for the Amazon A2I loop. This contains the lists that are sent as input to the Amazon A2I UI displayed to the human reviewers.

ip_content = {"signal": sig_list,
'anomaly': ano_list

Run the following notebook cell to call the Amazon A2I API to start the human loop:

import json
humanLoopName = str(uuid.uuid4())

start_loop_response = a2i.start_human_loop(
                "InputContent": json.dumps(ip_content)

You can check the status of human loop by running the next cell in the notebook.

Annotate the results via the worker portal

Run the following notebook cell to get a login link to navigate to the private workforce portal:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

You’re redirected to the Amazon A2I console. Select the human review job and choose Start working. After you review the changes and make corrections, choose Submit.

You can evaluate the results store in Amazon S3.

Evaluate the results

When the labeling work is complete, your results should be available in the S3 output path specified in the human review workflow definition. The human answers are returned and saved in the JSON file. Run the notebook cell to get the results from Amazon S3:

import re
import pprint

pp = pprint.PrettyPrinter(indent=4)
json_output = ''
for resp in completed_human_loops:
    splitted_string = re.split('s3://' + a2ibucket  + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=a2ibucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)

You get a response with human reviewed answers and flow-definition. Refer to the notebook to get the complete response.

Model retraining based on augmented datasets from Amazon A2I

Now we take the Amazon A2I output, process it, and send it back to Amazon Lookout for Equipment to retrain our model based on the human corrections. Refer to the accompanying notebook for all the steps to complete in this section. Let’s look at the last few entries of our original label file:

labels_df = pd.read_csv(os.path.join(LABEL_DATA, 'labels.csv'), header=None)
labels_df[0] = pd.to_datetime(labels_df[0])
labels_df[1] = pd.to_datetime(labels_df[1])
labels_df.columns = ['start', 'end']

The following screenshot shows the labels file.

Update labels with new date ranges

Now let’s update our existing labels dataset with the new labels we received from the Amazon A2I human review process:

faulty = False
a2i_lbl_df = labels_df
x = json_output['humanAnswers'][0]
row_df = pd.DataFrame(columns=['rownr'])
tslist = {}

# Let's first check if the users mark equipment as faulty and if so get those row numbers into a dataframe            
for i in json_output['humanAnswers']:
    print("checking equipment review...")
    x = i['answerContent']
    for idx, key in enumerate(x):
        if "faulty" in key:
            if str(x.get(key)).split(':')[1].lstrip().strip('}') == "True": # faulty equipment selected
                    faulty = True
                    row_df.loc[len(row_df.index)] = [key.split('-')[1]] 
                    print("found faulty equipment in row: " + key.split('-')[1])

# Now we will get the date ranges for the faulty choices                     
for idx,k in row_df.iterrows():
    x = json_output['humanAnswers'][0]
    strchk = "TrueStart"+k['rownr']
    endchk = "TrueEnd"+k['rownr']
    for i in x['answerContent']:
        if i == strchk:
            tslist[i] = x['answerContent'].get(i)
        if i == endchk:
            tslist[i] = x['answerContent'].get(i)

# And finally let's add it to our new a2i labels dataset
for idx,k in row_df.iterrows():
    x = json_output['humanAnswers'][0]
    strchk = "TrueStart"+k['rownr']
    endchk = "TrueEnd"+k['rownr']
    a2i_lbl_df.loc[len(a2i_lbl_df.index)] = [tslist[strchk], tslist[endchk]]

You get the following response:

checking equipment review...
found faulty equipment in row: 1
found faulty equipment in row: 2

The following screenshot shows the updated labels file.

Let’s upload the updated labels data to a new augmented labels file:

a2i_label_s3_dest_path = f's3://{BUCKET}/{PREFIX}/augmented-labelled-data/labels.csv'
!aws s3 cp $a2i_label_src_fname $a2i_label_s3_dest_path

Update the training dataset with new measurements

We now update our original training dataset with the new measurement range based on what we got back from Amazon A2I. Run the following code to load the original dataset to a new DataFrame that we use to append our augmented data. Refer to the accompanying notebook for all the steps required.

turbine_id = 'R80711'
file = '../data/wind-turbine/final/training-data/'+turbine_id+'/'+turbine_id+'.csv'
newdf = pd.read_csv(file, index_col='Timestamp')

The following screenshot shows our original training dataset snapshot.

Now we use the updated training dataset with the simulated inference data we created earlier, in which the human reviewers indicated that they found faulty equipment when running the inference. Run the following code to modify the index of the simulated inference dataset to reflect a 10-minute duration for each reading:

sig_full_df = sig_full_df.set_index('Timestamp')
tm = pd.to_datetime('2021-04-05 20:30:00')
new_index = pd.date_range(
sig_full_df.index = new_index = 'Timestamp'
sig_full_df = sig_full_df.reset_index()
sig_full_df['Timestamp'] = pd.to_datetime(sig_full_df['Timestamp'], errors='coerce')

Run the following code to append the simulated inference dataset to the original training dataset:

newdf = newdf.reset_index()
newdf = pd.concat([newdf,sig_full_df])


The simulated inference data with the recent timestamp is appended to the end of the training dataset. Now let’s create a CSV file and copy the data to the training channel in Amazon S3:

TRAIN_DATA_AUGMENTED = os.path.join(TRAIN_DATA,'augmented')
os.makedirs(TRAIN_DATA_AUGMENTED, exist_ok=True)
!aws s3 sync $TRAIN_DATA_AUGMENTED s3://$BUCKET/$PREFIX/training_data/augmented

Now we update the components map with this augmented dataset, reload the data into Amazon Lookout for Equipment, and retrain this training model with this dataset. Refer to the accompanying notebook for the detailed steps to retrain the model.


In this post, we walked you through how to use Amazon Lookout for Equipment to train a model to detect abnormal equipment behavior with a wind turbine dataset, review diagnostics from the trained model, review the predictions from the model with a human in the loop using Amazon A2I, augment our original training dataset, and retrain our model with the feedback from the human reviews.

With Amazon Lookout for Equipment and Amazon A2I, you can set up a continuous prediction, review, train, and feedback loop to audit predictions and improve the accuracy of your models.

Please let us know what you think of this solution and how it applies to your industrial use case. Check out the GitHub repo for full resources to this post. Visit the webpages to learn more about Amazon Lookout for Equipment and Amazon Augmented AI. We look forward to hearing from you. Happy experimentation!

About the Authors 

 Dastan Aitzhanov is a Solutions Architect in Applied AI with Amazon Web Services. He specializes in architecting and building scalable cloud-based platforms with an emphasis on machine learning, internet of things, and big data-driven applications. When not working, he enjoys going camping, skiing, and spending time in the great outdoors with his family



Prem Ranga is an Enterprise Solutions Architect based out of Atlanta, GA. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.


Mona Mona is a Senior AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with public sector customers, and helps them adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.



Baris Yasin is a Solutions Architect at AWS. He’s passionate about AI/ML & Analytics technologies and helping startup customers solve challenging business and technical problems with AWS.


Read More

Acoustic anomaly detection using Amazon Lookout for Equipment

Acoustic anomaly detection using Amazon Lookout for Equipment

As the modern factory becomes more connected, manufacturers are increasingly using a range of inputs (such as process data, audio, and visual) to increase their operational efficiency. Companies use this information to monitor equipment performance and anticipate failures using predictive maintenance techniques powered by machine learning (ML) and artificial intelligence (AI). Although traditional sensors built into the equipment can be informative, audio and visual inspection can also provide insights into the health of the asset. However, leveraging this data and gaining actionable insights can be highly manual and resource prohibitive.

Koch Ag & Energy Solutions, LLC (KAES) took the opportunity to collaborate with Amazon ML Solutions Lab to learn more about alternative acoustic anomaly detection solutions and to get another set of eyes on their existing solution.

The ML Solutions Lab team used the existing data collected by KAES equipment in the field for an in-depth acoustic data exploration. In collaboration with the lead data scientist at KAES, the ML Solutions Lab team engaged with an internal team at Amazon that had participated in the Detection and Classification of Acoustic Scenes and Events 2020 competition and won high marks for their efforts. After reviewing the documentation from Giri et al. (2020), the team presented some very interesting insights into the acoustic data:

  • Industrial data is relatively stationary, so the recorded audio window size can be longer in duration
  • Inference intervals could be increased from 1 second to 10–30 seconds.
  • The sampling rates for the recorded sounds could be lowered and still retain the pertinent information

Furthermore, the team investigated two different approaches for feature engineering that KAES hadn’t previously explored. The first was an average-spectral featurizer; the second was an advanced deep learning based (VGGish network) featurizer. For this effort, the team didn’t need to use the classifier for the VGGish classes. Instead, they removed the top-level classifier layer and kept the network as a feature extractor. With this feature extraction approach, the network can convert audio input into high-level 128-dimensional embedding, which can be fed as input to another ML model. Compared to raw audio features, such as waveforms and spectrograms, this deep learning embedding is more semantically meaningful. The ML Solutions Lab team also designed an optimized API for processing all the audio files, which decreases the I/O time by more than 90%, and the overall processing time by around 70%.

Anomaly detection with Amazon Lookout for Equipment

To implement these solutions, the ML Solutions Lab team used Amazon Lookout for Equipment, a new service that helps to enable predictive maintenance. Amazon Lookout for Equipment uses AI to learn the normal operating patterns of industrial equipment and alert users to abnormal equipment behavior. Amazon Lookout for Equipment helps organizations take action before machine failures occur and avoid unplanned downtime.

Successfully implementing predictive maintenance depends on using the data collected from industrial equipment sensors, under their unique operating conditions, and then applying sophisticated ML techniques to build a custom model that can detect abnormal machine conditions before machine failures occur.

Amazon Lookout for Equipment analyzes the data from industrial equipment sensors to automatically train a specific ML model for that equipment with no ML expertise required. It learns the multivariate relationships between the sensors (tags) that define the normal operating modes of the equipment. With this service, you can reduce the number of manual data science steps and resource hours to develop a model. Furthermore, Amazon Lookout for Equipment uses the unique ML model to analyze incoming sensor data in near-real time to accurately identify early warning signs that could lead to machine failures with little or no manual intervention. This enables detecting equipment abnormalities with speed and precision, quickly diagnosing issues, taking action to reduce expensive downtime, and reducing false alerts.

With KAES, the ML Solutions Lab team developed a proof of concept pipeline that demonstrated the data ingestion steps for both sound and machine telemetry. The team used the telemetry data to identify the machine operating states and inform which audio data was relevant for training. For example, a pump at low speed has a certain auditory signature, whereas a pump at high speed may have a different auditory signature. The relationship between measurements like RPMs (speed) and the sound are key to understanding machine performance and health. The ML training time decreased from around 6 hours to less than 20 minutes when using Amazon Lookout for Equipment, which enabled faster model explorations.

This pipeline can serve as the foundation to build and deploy anomaly detection models for new assets. After sufficient data is ingested into the Amazon Lookout for Equipment platform, inference can begin and anomaly detections can be identified.

“We needed a solution to detect acoustic anomalies and potential failures of critical manufacturing machinery,” says Dave Kroening, IT Leader at KAES. “Within a few weeks, the experts at the ML Solutions Lab worked with our internal team to develop an alternative, state-of-the-art, deep neural net embedding sound featurization technique and a prototype for acoustic anomaly detection. We were very pleased with the insight that the ML Solutions Lab team provided us regarding our data and educating us on the possibilities of using Amazon Lookout for Equipment to build and deploy anomaly detection models for new assets.”

By merging the sound data with the machine telemetry data and then using Amazon Lookout for Equipment, we can derive important relationships between the telemetry data and the acoustic signals. We can learn the normal healthy operating conditions and healthy sounds in varying operating modes.

If you’d like help accelerating the use of ML in your products and services, please contact the ML Solutions Lab.

About the Authors

Michael Robinson is a Lead Data Scientist at Koch Ag & Energy Solutions, LLC (KAES). His work focuses on computer vision, acoustic, and data engineering. He leverages technical knowledge to solve unique challenges for KAES. In his spare time, he enjoys golfing, photography and traveling.



Dave Kroening is an IT Leader with Koch Ag & Energy Solutions, LLC (KAES). His work focuses on building out a vision and strategy for initiatives that can create long term value. This includes exploring, assessing, and developing opportunities that have a potential to disrupt the Operating capability within KAES. He and his team also help to discover and experiment with technologies that can create a competitive advantage. In his spare time he enjoys spending time with his family, snowboarding, and racing.


Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies. Mehdi attended MIT as a postdoctoral researcher and obtained his Ph.D. in Engineering from UCF.



Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.



Yunzhi Shi is a data scientist at the Amazon ML Solutions Lab where he helps AWS customers address business problems with AI and cloud capabilities. Recently, he has been building computer vision, search, and forecast solutions for customers from various industrial verticals. Yunzhi obtained his Ph.D. in Geophysics from the University of Texas at Austin.



Dan Volk is a Data Scientist at Amazon ML Solutions Lab, where he helps AWS customers across various industries accelerate their AI and cloud adoption. Dan has worked in several fields including manufacturing, aerospace, and sports and holds a Masters in Data Science from UC Berkeley.




Brant Swidler is the Technical Product Manager for Amazon Lookout for Equipment. He focuses on leading product development including data science and engineering efforts. Brant comes from an Industrial background in the oil and gas industry and has a B.S. in Mechanical and Aerospace Engineering from Washington University in St. Louis and an MBA from the Tuck school of business at Dartmouth.

Read More

Win a digital car and personalize your racer profile on the AWS DeepRacer console

Win a digital car and personalize your racer profile on the AWS DeepRacer console

AWS DeepRacer is the fastest way to get rolling with machine learning, giving developers the chance to learn ML hands-on with a 1/18th scale autonomous car, 3D virtual racing simulator, and the world’s largest global autonomous car racing league. With the 2021 AWS DeepRacer League Virtual Circuit now underway, developers have five times more opportunities to win physical prizes, such as exclusive AWS DeepRacer merchandise, AWS DeepRacer Evo devices, and even an expenses paid trip to AWS re:Invent 2021 to compete in the AWS DeepRacer Championship Cup.

To win physical prizes, show us your skills by racing in one of the AWS monthly qualifiers, becoming a Pro by finishing in the top 10% of an Open race leaderboard, or qualifying for the championship by winning a monthly Pro division finale. To make ML more fun and accessible to every developer, the AWS DeepRacer League is taking prizing a step further and introducing new digital car customizations for every participant in the league. For each month that you participate, you’ll earn a reward exclusive to that race and division. After all, if your ML model is getting rewarded, shouldn’t you get rewarded too?

Digital rewards: Collect them all and showcase your collection

Digital rewards are unique cars, paint jobs, and body kits that are stored in a new section of the AWS DeepRacer console: your racer profile. Unlocking a new reward is like giving your model the West Coast Customs car treatment. While X to the Z might be famous for kitting out your ride in the streets, A to the Z is here to hook you up in the virtual simulator!

No two rewards are exactly alike, and each month will introduce new rewards to be earned in each racing division to add to your collection. You’ll need to race every month to collect all of the open division digital rewards in your profile. If you advance to the Pro division, you’ll unlock twice the rewards, with an additional Pro division reward each month that’s only available to the fastest in the league.

If you participated in the March 2021 races, you’ll see some special deliveries dropping into your racer profile starting today. Open division racers will receive the white box van, and Pro division racers will receive both the white box van and the Pro Exclusive AWS DeepRacer Van. Despite their size, they’re just as fast as any other vehicle you race on the console—they’re merely skins and don’t change the agent’s capabilities or performance.

But that’s not all—AWS DeepRacer will keep the rewards coming with surprise limited edition rewards throughout the season for specific racing achievements and milestones. But you’ll have to keep racing to find them! When it’s time to celebrate, the confetti will fall! The next time you log in and access your racer profile, you’ll see the celebration to commemorate your achievement. After a new digital reward is added to your racer profile, you can choose the reward to open your garage and start personalizing your car and action space, or assign it to any existing model in your garage using the Mod vehicle feature.

When you select that model to race in the league, you’ll see your customized vehicle in your race evaluation video. You can also head over to the race leaderboard to watch other racer’s evaluations and check out which customizations they’re using for their models to size up the competition.

Customize your racer profile and avatar

While new digital rewards allow you to customize your car on the track, the new Your racer profile page allows you to customize your personal appearance across the AWS DeepRacer console. With the new avatar tool, you can select from a variety of options to replicate your real-life style or try out a completely new appearance to showcase your personality. Your racer profile also allows you to designate your country, which adds a flag to your avatar and in-console live races, giving you the opportunity to represent your region and see where other competitors are racing from all over the globe.

Your avatar appears on each race leaderboard page for you to see, and if you’re on top of the leaderboard, everyone can see your avatar in first position, claiming victory! If you qualify to participate in a live race such as the monthly Pro division finale, your avatar is also featured each time you’re on the track. In addition to housing the avatar tool and your digital rewards, the Your racer profile page also provides useful stats such as your division, the number of races you have won, and how long you have been racing in the AWS DeepRacer League.

Get rolling today

The April 2021 races are just getting underway in the Virtual Circuit. Head over to the AWS DeepRacer League today to get rolling, or sign in to the AWS DeepRacer console to start customizing your avatar and collecting digital rewards!

To see the new avatars in action, tune into the AWS DeepRacer League LIVE Pro Finale at 5:30pm PST, the second Thursday of every month on the AWS Twitch channel. The first race will take place on April 8th.

About the Author

Joe Fontaine is the Marketing Program Manager for AWS AI/ML Developer Devices. He is passionate about making machine learning more accessible to all through hands-on educational experiences. Outside of work he enjoys freeride mountain biking, aerial cinematography, and exploring the wilderness with his dogs. He is proud to be a recent inductee to the “rad dads” club.

Read More

Improve operational efficiency with integrated equipment monitoring with TensorIoT powered by AWS

Improve operational efficiency with integrated equipment monitoring with TensorIoT powered by AWS

Machine downtime has a dramatic impact on your operational efficiency. Unexpected machine downtime is even worse. Detecting industrial equipment issues at an early stage and using that data to inform proper maintenance can give your company a significant increase in operational efficiency.

Customers see value in detecting abnormal behavior in industrial equipment to improve maintenance lifecycles. However, implementing advanced maintenance approaches has multiple challenges. One major challenge is the plethora of data recorded from sensors and log information, as well as managing equipment and site metadata. These different forms of data may either be inaccessible or spread across disparate systems that can impede access and processing. After this data is consolidated, the next step is gaining insights to prioritize the most operationally efficient maintenance strategy.

A range of data processing tools exist today, but most require significant manual effort to implement or maintain, which acts as a barrier to use. Furthermore, managing advanced analytics such as machine learning (ML) requires either in-house or external data scientists to manage models for each type of equipment. This can lead to a high cost of implementation and can be daunting for operators that manage hundreds or thousands of sensors in a refinery or hundreds of turbines on a wind farm.

Real-time data capture and monitoring of your IoT assets with TensorIoT

TensorIoT, an AWS Advanced Consulting Partner, is no stranger to the difficulties companies face when looking to harness their data to improve their business practices. TensorIoT creates products and solutions to help companies benefit from the power of ML and IoT.

“Regardless of size or industry, companies are seeking to achieve greater situational awareness, gain actionable insight, and make more confident decisions,” says John Traynor, TensorIoT VP of Products.

For industrial customers, TensorIoT is adept at integrating sensors and machine data with AWS tools into a holistic system that keeps operators informed about the status of their equipment at all times. TensorIoT uses AWS IoT Greengrass with AWS IoT SiteWise and other AWS Cloud services to help clients collect data from both direct equipment measurements and add-on sensors through connected devices to measure factors such as humidity, temperature, pressure, power, and vibration, giving a holistic view of machine operation. To help businesses gain increased understanding of their data and processes, TensorIoT created SmartInsights, a product that incorporates data from multiple sources for analysis and visualization. Clear visualization tools combined with advanced analytics means that the assembled data is easy to understand and actionable for users. This is seen in the following screenshot, which shows the specific site where an anomaly occurred and a ranking based on production or process efficiency.

TensorIoT built the connectivity to get the data ingestion into Amazon Lookout for Equipment (an industrial equipment monitoring service that detects abnormal equipment behavior) for analysis, and then used SmartInsights as the visualization tool for users to act on the outcome. Whether an operational manager wants to visualize the health of the asset or provide an automated push notification sent to maintenance teams such as an alarm or Amazon Simple Notification Service (Amazon SNS) message, SmartInsights keeps industrial sites and factory floors operating at peak performance for even the most complex device hierarchies. Powered by AWS, TensorIoT helps companies rapidly and precisely detect equipment abnormalities, diagnose issues, and take immediate action to reduce expensive downtime.

Simplify machine learning with Amazon Lookout for Equipment

ML offers industrial companies the ability to automatically discover new insights from data that is being collected across systems and equipment types. In the past, however, industrial ML-enabled solutions such as equipment condition monitoring have been reserved for the most critical or expensive assets, due to the high cost of developing and managing the required models. Traditionally, a data scientist needed to go through dozens of steps to build an initial model for industrial equipment monitoring that can detect abnormal behavior. Amazon Lookout for Equipment automates these traditional data science steps to open up more opportunities for a broader set of equipment than ever before. Amazon Lookout for Equipment reduces the heavy lifting to create ML algorithms so you can take advantage of industrial equipment monitoring to identify anomalies, and gain new actionable insights that help you improve your operations and avoid downtime.

Historically, ML models can also be complex to manage due to changing or new operations. Amazon Lookout for Equipment is making it easier and faster to get feedback from the engineers closest to the equipment by enabling direct feedback and iteration of these models. That means that a maintenance engineer can prioritize which insights are the most important to detect based on current operations, such as process, signal, or equipment issues. Amazon Lookout for Equipment enables the engineer to label these events to continue to refine and prioritize so the insights stay relevant over the life of the asset.

Combining TensorIoT and Amazon Lookout for Equipment has never been easier

To delve deeper into how to visualize near real-time insights gained from Amazon Lookout for Equipment, let’s explore the process. It’s important to have historic and failure data so we can train the model to learn what patterns occur before failure. When trained, the model can create inferences about pending events from new, live data from that equipment. This, historically, is a time-consuming barrier to adoption because each piece of equipment requires separate training due to its unique operation and is solved through Amazon Lookout for Equipment and visualized by SmartInsights.

For our example, we start by identifying a suitable dataset where we have sensor and other operational data from a piece of equipment, as well as historic data about when the equipment has been operating outside of specifications or has failed, if available.

To demonstrate how to use Amazon Lookout for Equipment and visualize results in near real time in SmartInsights, we used a publicly available set of wind turbine data. Our dataset from the La Haute Borne wind farm spanned several hundred thousand rows and over 100 columns of data from a variety of sensors on the equipment. Data included the rotor speed, pitch angle, generator bearing temperatures, gearbox bearing temperatures, oil temperature, multiple power measurements, wind speed and direction, outdoor temperature, and more. The maximum, average, and other statistical characteristics were also stored for each data point.

The following table is a subset of the columns used in our analysis.

Variable_name Variable_long_name Unit_long_name
Turbine Wind_turbine_name
Time Date_time
Ba Pitch_angle deg
Cm Converter_torque Nm
Cosphi Power_factor
Db1t Generator_bearing_1_temperature deg_C
Db2t Generator_bearing_2_temperature deg_C
DCs Generator_converter_speed rpm
Ds Generator_speed rpm
Dst Generator_stator_temperature deg_C
Gb1t Gearbox_bearing_1_temperature deg_C
Gb2t Gearbox_bearing_2_temperature deg_C
Git Gearbox_inlet_temperature deg_C
Gost Gearbox_oil_sump_temperature deg_C
Na_c Nacelle_angle_corrected deg
Nf Grid_frequency Hz
Nu Grid_voltage V
Ot Outdoor_temperature deg_C
P Active_power kW
Pas Pitch_angle_setpoint
Q Reactive_power kVAr
Rbt Rotor_bearing_temperature deg_C
Rm Torque Nm
Rs Rotor_speed rpm
Rt Hub_temperature deg_C
S Apparent_power kVA
Va Vane_position deg
Va1 Vane_position_1 deg
Va2 Vane_position_2 deg
Wa Absolute_wind_direction deg
Wa_c Absolute_wind_direction_corrected deg
Ws Wind_speed m/s
Ws1 Wind_speed_1 m/s
Ws2 Wind_speed_2 m/s
Ya Nacelle_angle deg
Yt Nacelle_temperature deg_C

Using Amazon Lookout for Equipment consists of three stages: ingestion, training, and inference (or detection). After the model is trained with available historical data, inference can happen automatically on a selected time interval, such as every 5 minutes or 1 hour.

First, let’s look at the Amazon Lookout for Equipment side of the process. In this example, we trained using historic data and evaluated the model against 1 year of historic data. Based on these results, 148 of the 150 events were detected with an average forewarning time of 18 hours.

For each of the events, a diagnostic of the key contributing sensors is given to support evaluation of the root cause, as shown in the following screenshot.

SmartInsights provides visualization of data from each asset and incorporates the events from Amazon Lookout for Equipment. SmartInsights can then pair the original measurements with the anomalies identified by Amazon Lookout for Equipment using the common timestamp. This allows SmartInsights to show measurements and anomalies on a common timescale and gives the operator context to these events. In the following graphical representation, a green bar is overlaid on top of the anomalies. You can deep dive by evaluating the diagnostics against the asset to determine when and how to respond to the event.

With the wind turbine data that was used in our example, SmartInsights provided visual evidence of the events with forewarning based on results for Amazon Lookout for Equipment. In a production environment, the prediction could create a notification or alert to operating personnel or trigger a work order to be created in another application to dispatch personnel to take corrective action before failure.

SmartInsights supports triggering alerts in response to certain conditions. For example, you can configure SmartInsights to send a message to a Slack channel or send a text message. Because SmartInsights is built on AWS, the notification endpoint can be any destination supported by Amazon SNS. For example, the following view of SmartInsights on a mobile device contains a list of alerts that have been triggered within a certain time window, to which a SmartInsights user can subscribe.

The following architecture diagram shows how Amazon Lookout for Equipment is used with SmartInsights. For many applications, Amazon Lookout for Equipment provides an accelerated path to anomaly detection without the need to hire a data scientist and meet business return on investment.

Maximize uptime, increase safety, and improve machine efficiency

Condition-based maintenance is beneficial for your business on a multitude of levels:

  • Maximized uptime – When maintenance events are predicted, you decide the optimal scheduling to minimize the impact on your operational efficiency.
  • Increased safety – Condition-based maintenance ensures that your equipment remains in safe operating conditions, which protects your operators and your machinery by catching issues before they become problems.
  • Improved machine efficiency – As your machines undergo normal wear and tear, their efficiency decreases. Condition-based maintenance keeps your machines in optimal conditions and extends the lifespan of your equipment.


Even before the release of Amazon Lookout for Equipment, TensorIoT helped industrial manufacturers innovate their machinery through the implementation of modern architectures, sensors for legacy augmentation, and ML to make the newly acquired data intelligible and actionable. With Amazon Lookout for Equipment and TensorIoT solutions, TensorIoT helps make your assets even smarter.

To explore how you can use Amazon Lookout for Equipment with SmartInsights to more rapidly gain insight into pending equipment failures and reduce downtime, get in touch with TensorIoT via

Details on how to start using Amazon Lookout for Equipment are available on the webpage.

About the Authors

Alicia Trent is a Worldwide Business Development Manager at Amazon Web Services. She has 15 years of experience in Technology across industrial sectors and is a graduate of the Georgia Institute of Technology, where she earned a BS degree in chemical and biomolecular engineering, and an MS degree in mechanical engineering.

Dastan Aitzhanov is a Solutions Architect in Applied AI with Amazon Web Services. He specializes in architecting and building scalable cloud-based platforms with an emphasis on Machine Learning, Internet of Things, and Big Data driven applications. When not working, he enjoys going camping, skiing, and just spending time in the great outdoors with his family.

Nicholas Burden is a Senior Technical Evangelist at TensorIoT, where he focuses on translating complex technical jargon into digestible information. He has over a decade of technical writing experience and a Master’s in Professional Writing from USC. Outside of work, he enjoys tending to an ever-growing collection of houseplants and spending time with pets and family.

Read More

Object detection with Detectron2 on Amazon SageMaker

Object detection with Detectron2 on Amazon SageMaker

Deep learning is at the forefront of most machine learning (ML) implementations across a broad set of business verticals. Driven by the highly flexible nature of neural networks, the boundary of what is possible has been pushed to a point where neural networks can outperform humans in a variety of tasks, such as object detection tasks in the context of computer vision (CV) problems.

Object detection, which is one type of CV task, has many applications in various fields like medicine, retail, or agriculture. For example, retail businesses want to be able to detect stock keeping units (SKUs) in store shelf images to analyze buyer trends or identify when product restock is necessary. Object detection models allow you to implement these diverse use cases and automate your in-store operations.

In this post, we discuss Detectron2, an object detection and segmentation framework released by Facebook AI Research (FAIR), and its implementation on Amazon SageMaker to solve a dense object detection task for retail. This post includes an associated sample notebook, which you can run to demonstrate all the features discussed in this post. For more information, see the GitHub repository.

Toolsets used in this solution

To implement this solution, we use Detectron2, PyTorch, SageMaker, and the public SKU-110K dataset.


Detectron2 is a ground-up rewrite of Detectron that started with maskrcnn-benchmark. The platform is now implemented in PyTorch. With a new, more modular design, Detectron2 is flexible and extensible, and provides fast training on single or multiple GPU servers. Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose, panoptic feature pyramid networks, and numerous variants of the pioneering Mask R-CNN model family also developed by FAIR. Its extensible design makes it easy to implement cutting-edge research projects without having to fork the entire code     base.


PyTorch is an open-source, deep learning framework that makes it easy to develop ML models and deploy them to production. With PyTorch’s TorchScript, developers can seamlessly transition between eager mode, which performs computations immediately for easy development, and graph mode, which creates computational graphs for efficient implementations in production environments. PyTorch also offers distributed training, deep integration into Python, and a rich ecosystem of tools and libraries, which makes it popular with researchers and engineers.

An example of that rich ecosystem of tools is TorchServe, a recently released model-serving framework for PyTorch that helps deploy trained models at scale without having to write custom code. TorchServe is built and maintained by AWS in collaboration with Facebook and is available as part of the PyTorch open-source project. For more information, see the TorchServe GitHub repo and Model Server for PyTorch Documentation.

Amazon SageMaker

SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.


For our use case, we use the SKU-110 dataset introduced by Goldman et al. in the paper “Precise Detection in Densely Packed Scenes” (Proceedings of 2019 conference on Computer Vision and Patter Recognition). This dataset contains 11,762 images of store shelves from around the world. Researchers use this dataset to test object detection algorithms on dense scenes. The term density here refers to the number of objects per image. The average number of items per image is 147.4, which is 19 times more than the COCO dataset. Moreover, the images contain multiple identical objects grouped together that are challenging to separate. The dataset contains bounding box annotation on SKUs. The categories of product aren’t distinguished because the bounding box labels only indicate the presence or absence of an item.

Introduction to Detectron2

Detectron2 is FAIR’s next generation software system that implements state-of-the-art object detection algorithms. It’s a ground-up rewrite of the previous version, Detectron, and it originates from maskrcnn-benchmark. The following screenshot is an example of the high-level structure of the Detectron2 repo, which will make more sense when we explore configuration files and network architectures later in this post.

For more information about the general layout of computer vision and deep learning architectures, see A Survey of the Recent Architectures of Deep Convolutional Neural Networks.

Additionally, if this is your first introduction to Detectron2, see the official documentation to learn more about the feature-rich capabilities of Detectron2. For the remainder of this post, we solely focus on implementation details pertaining to deploying Detectron2-powered object detection on SageMaker rather than discussing the underlying computer vision-specific theory.

Update the SageMaker role

To build custom training and serving containers, you need to attach additional Amazon Elastic Container Registry (Amazon ECR) permissions to your SageMaker AWS Identity and Access Management (IAM) role. You can use an AWS-authored policy (such as AmazonEC2ContainerRegistryPowerUser) or create your own custom policy. For more information, see How Amazon SageMaker Works with IAM.

Update the dataset

Detectron2 includes a set of utilities for data loading and visualization. However, you need to register your custom dataset to use Detectron2’s data utilities. You can do this by using the function register_dataset in the file from the GitHub repo. This function iterates on the training, validation, and test sets. At each iteration, it calls the function aws_file_mode, which returns a list of annotations given the path to the folder that contains the images and the path to the augmented manifest file that contains the annotations. Augmented manifest files are the output format of Amazon SageMaker Ground Truth bounding box jobs. You can reuse the code associated with this post on your own data labeled for object detection with Ground Truth.

Let’s prepare the SKU-110K dataset so that training, validation, and test images are in dedicated folders, and the annotations are in augmented manifest file format. First, import the required packages, define the S3 bucket, and set up the SageMaker session:

from pathlib import Path
from urllib import request
import tarfile
from typing import Sequence, Mapping, Optional
from tqdm import tqdm
from datetime import datetime
import tempfile
import json

import pandas as pd
import numpy as np
import boto3
import sagemaker

bucket = "my-bucket" # TODO: replace with your bucker
prefix_data = "detectron2/data"
prefix_model = "detectron2/training_artefacts"
prefix_code = "detectron2/model"
prefix_predictions = "detectron2/predictions"
local_folder = "cache"

sm_session = sagemaker.Session(default_bucket=bucket)
role = sagemaker.get_execution_role()

Then, download the dataset:

sku_dataset = ("SKU110K_fixed", "")

if not (Path(local_folder) / sku_dataset[0]).exists():
    compressed_file =[1]), mode="r|gz")
    print(f"Using the data in `{local_folder}` folder")
path_images = Path(local_folder) / sku_dataset[0] / "images"
assert path_images.exists(), f"{path_images} not found"

prefix_to_channel = {
    "train": "training",
    "val": "validation",
    "test": "test",
for channel_name in prefix_to_channel.values():
    if not (path_images.parent / channel_name).exists():
        (path_images.parent / channel_name).mkdir()

for path_img in path_images.iterdir():
    for prefix in prefix_to_channel:
            path_img.replace(path_images.parent / prefix_to_channel[prefix] /

Next, upload the image files to Amazon Simple Storage Service (Amazon S3) using the utilities from the SageMaker Python SDK:

channel_to_s3_imgs = {}

for channel_name in prefix_to_channel.values():
    inputs = sm_session.upload_data(
        path=str(path_images.parent / channel_name),
    print(f"{channel_name} images uploaded to {inputs}")
    channel_to_s3_imgs[channel_name] = inputs

SKU-110k annotations are stored in CSV files. The following function converts the annotations to JSON lines (refer to the GitHub repo to see the implementation):

def create_annotation_channel(
    channel_id: str, path_to_annotation: Path, bucket_name: str, data_prefix: str,
    img_annotation_to_ignore: Optional[Sequence[str]] = None
) -> Sequence[Mapping]:
    r"""Change format from original to augmented manifest files

    channel_id : str
        name of the channel, i.e. training, validation or test
    path_to_annotation : Path
        path to annotation file
    bucket_name : str
        bucket where the data are uploaded
    data_prefix : str
        bucket prefix
    img_annotation_to_ignore : Optional[Sequence[str]]
        annotation from these images are ignore because the corresponding images are corrupted, default to None

        List of json lines, each lines contains the annotations for a single. This recreates the
        format of augmented manifest files that are generated by Amazon SageMaker GroundTruth
        labeling jobs

channel_to_annotation_path = {
    "training": Path(local_folder) / sku_dataset[0] / "annotations" / "annotations_train.csv",
    "validation": Path(local_folder) / sku_dataset[0] / "annotations" / "annotations_val.csv",
    "test": Path(local_folder) / sku_dataset[0] / "annotations" / "annotations_test.csv",
channel_to_annotation = {}

for channel in channel_to_annotation_path:
    annotations = create_annotation_channel(
    print(f"Number of {channel} annotations: {len(annotations)}")
    channel_to_annotation[channel] = annotations

Finally, upload the manifest files to Amazon S3:

def upload_annotations(p_annotations, p_channel: str):
    rsc_bucket = boto3.resource("s3").Bucket(bucket)
    json_lines = [json.dumps(elem) for elem in p_annotations]
    to_write = "n".join(json_lines)

    with tempfile.NamedTemporaryFile(mode="w") as fid:
        rsc_bucket.upload_file(, f"{prefix_data}/annotations/{p_channel}.manifest")

for channel_id, annotations in channel_to_annotation.items():
    upload_annotations(annotations, channel_id)

Visualize the dataset

Detectron2 provides toolsets to inspect datasets. You can visualize the dataset input images and their ground truth bounding boxes. First, you need to add the dataset to the Detectron2 catalog:

import random
from typing import Sequence, Mapping
import cv2
from matplotlib import pyplot as plt
from import DatasetCatalog, MetadataCatalog
from detectron2.utils.visualizer import Visualizer
# custom code
from datasets.catalog import register_dataset, DataSetMeta

ds_name = "sku110k"
metadata = DataSetMeta(name=ds_name, classes=["SKU",])
channel_to_ds = {"test": ("data/test/", "data/test.manifest")}
    metadata=metadata, label_name="sku", channel_to_dataset=channel_to_ds,

You can now plot annotations on an image as follows:

dataset_samples: Sequence[Mapping] = DatasetCatalog.get(f"{ds_name}_test")
sample = random.choice(dataset_samples)
fname = sample["file_name"]
img = cv2.imread(fname)
visualizer = Visualizer(
    img[:, :, ::-1], metadata=MetadataCatalog.get(f"{ds_name}_test"), scale=1.0
out = visualizer.draw_dataset_dict(sample)


The following picture shows an example of ground truth bounding boxes on a test image.

Distributed training on Detectron2

You can use Docker containers with SageMaker to train Detectron2 models. In this post, we describe how you can run distributed Detectron2 training jobs for a larger number of iterations across multiple nodes and GPU devices on a SageMaker training cluster.

The process includes the following steps:

  1. Create a training script capable of running and coordinating training tasks in a distributed environment.
  2. Prepare a custom Docker container with configured training runtime and training scripts.
  3. Build and push the training container to Amazon ECR.
  4. Initialize training jobs via the SageMaker Python SDK.

Prepare the training script for the distributed cluster

The sku-100k folder contains the source code that we use to train the custom Detectron2 model. The script is the entry point of the training process. The following sections of the script are worth discussing in detail:

  • __main__ guard – The SageMaker Python SDK runs the code inside the main guard when used for training. The train function is called with the script arguments.
  • _parse_args() – This function parses arguments from the command line and from the SageMaker environments. For example, you can choose which model to train among Faster-RCNN and RetinaNet. The SageMaker environment variables define the input channel locations and where the model artifacts are stored. The number of GPUs and the number of hosts define the properties of the training cluster.
  • train() – We use the Detectron2 launch utility to start training on multiple nodes.
  • _train_impl()– This is the actual training script, which is run on all processes and GPU devices. This function runs the following steps:
    • Register the custom dataset to Detectron2’s catalog.
    • Create the configuration node for training.
    • Fit the training dataset to the chosen object detection architecture.
    • Save the training artifacts and run the evaluation on the test set if the current node is the primary.

Prepare the training container

We build a custom container with the specific Detectron2 training runtime environment. As a base image, we use the latest SageMaker PyTorch container and further extend it with Detectron2 requirements. We first need to make sure that we have access to the public Amazon ECR (to pull the base PyTorch image) and our account registry (to push the custom container). The following example code shows how to log in to both registries prior to building and pushing your custom containers:

# loging to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin
# loging to your private ECR
!aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <YOUR-ACCOUNT-ID>

After you successfully authenticate with Amazon ECR, you can build the Docker image for training. This Dockerfile runs the following instructions:

  1. Define the base container.
  2. Install the required dependencies for Detectron2.
  3. Copy the training script and the utilities to the container.
  4. Build Detectron2 from source.

Build and push the custom training container

We provide a simple bash script to build a local training container and push it to your account registry. If needed, you can specify a different image name, tag, or Dockerfile. The following code is a short snippet of the Dockerfile:

# Build an image of Detectron2 that can do distributing training on Amazon Sagemaker  using Sagemaker PyTorch container as base image
# from
ARG REGION=us-east-1

FROM 763104351884.dkr.ecr.${REGION}

############# Detectron2 pre-built binaries Pytorch default install ############
RUN pip install --upgrade torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f

############# Detectron2 section ##############
RUN pip install  
   --no-cache-dir pycocotools~=2.0.0 
   --no-cache-dir detectron2 -f

# Build D2 only for Volta architecture - V100 chips (ml.p3 AWS instances)

# Set a fixed model cache directory. Detectron2 requirement

############# SageMaker section ##############

COPY container_training/sku-110k /opt/ml/code
WORKDIR /opt/ml/code



# Starts PyTorch distributed framework
ENTRYPOINT ["bash", "-m", ""]

Schedule the training job

You’re now ready to schedule your distributed training job. First, you need to do several common imports and configurations, which are described in detail in our companion notebook. Second, it’s important to specify which metrics you want to track during the training, which you can do by creating a JSON file with the appropriate regular expressions for each metric of interest. See the following example code:

metrics = [
    {"Name": "training:loss", "Regex": "total_loss: ([0-9\.]+)",},
    {"Name": "training:loss_cls", "Regex": "loss_cls: ([0-9\.]+)",},
    {"Name": "training:loss_box_reg", "Regex": "loss_box_reg: ([0-9\.]+)",},
    {"Name": "training:loss_rpn_cls", "Regex": "loss_rpn_cls: ([0-9\.]+)",},
    {"Name": "training:loss_rpn_loc", "Regex": "loss_rpn_loc: ([0-9\.]+)",},
    {"Name": "validation:loss", "Regex": "total_val_loss: ([0-9\.]+)",},
    {"Name": "validation:loss_cls", "Regex": "val_loss_cls: ([0-9\.]+)",},
    {"Name": "validation:loss_box_reg", "Regex": "val_loss_box_reg: ([0-9\.]+)",},
    {"Name": "validation:loss_rpn_cls", "Regex": "val_loss_rpn_cls: ([0-9\.]+)",},
    {"Name": "validation:loss_rpn_loc", "Regex": "val_loss_rpn_loc: ([0-9\.]+)",},

Finally, you create the estimator to start the distributed training job by calling the fit method:

training_instance = "ml.p3.8xlarge"
od_algorithm = "faster_rcnn" # choose one in ("faster_rcnn", "retinanet")

d2_estimator = Estimator(
    base_job_name=f"detectron2-{od_algorithm.replace('_', '-')}",
        "training": training_channel,
        "validation": validation_channel,
        "test": test_channel,
        "annotation": annotation_channel,
    wait=training_instance == "local",

Benchmark the training job performance

This set of steps allows you to scale the training performance as needed without changing a single line of code. You just have to pick your training instance and the size of your cluster. Detectron2 automatically adapts to the training cluster size by using the launch utility. The following table compares the training runtime in seconds of jobs running for 3,000 iterations.

Faster-RCNN (seconds) RetinaNet (seconds)
ml.p3.2xlarge – 1 node 2,685 2,636
ml.p3.8xlarge – 1 node 774 742
ml.p3.16xlarge – 1 node 439 400
ml.p3.16xlarge – 2 nodes 338 311

The training time reduces on both Faster-RCNN and RetinaNet with the total number of GPUs. The distribution efficiency is approximately of 85% and 75% when passing from an instance with a single GPU to instances with four and eight GPUs, respectively.

Deploy the trained model to a remote endpoint

To deploy your trained model remotely, you need to prepare, build, and push a custom serving container and deploy this custom container for serving via the SageMaker SDK.

Build and push the custom serving container

We use the SageMaker inference container as a base image. This image includes a pre-installed PyTorch model server to host your PyTorch model, so no additional configuration or installation is required. For more information about the Docker files and shell scripts to push and build the containers, see the GitHub repo.

For this post, we build Detectron2 for the Volta and Turing chip architectures. Volta architecture is used to run SageMaker batch transform on P3 instance types. If you need real-time prediction, you should use G4 instance types because they provide optimal price-performance compromise. Amazon Elastic Compute Cloud (Amazon EC2) G4 instances provide the latest generation NVIDIA T4 GPUs, AWS custom Intel Cascade Lake CPUs, up to 100 Gbps of networking throughput, and up to 1.8 TB of local NVMe storage and direct access to GPU libraries such as CUDA and CuDNN.

Run batch transform jobs on the test set

The SageMaker Python SDK gives a simple way of running inference on a batch of images. You can get the predictions on the SKU-110K test set by running the following code:

model = PyTorchModel(
    name = "d2-sku110k-model",
    sagemaker_session = sm_session,
transformer = model.transformer(
    instance_type="ml.p3.2xlarge", # "ml.p2.xlarge"

The batch transform saves the predictions to an S3 bucket. You can evaluate your trained models by comparing the predictions to the ground truth. We use the pycocotools library to compute the metrics that official competitions use to evaluate object detection algorithms. The authors who published the SKU-110k dataset took into account three measures in their paper “Precise Detection in Densely Packed Scenes” (Goldman et al.):

  • Average Precision (AP) at 0.5:0.95 Intersection over Union (IoU)
  • AP at 75% IoU, i.e. AP75
  • Average Recall (AR) at 0.5:0.95 IoU

You can refer to the COCO website for the whole list of metrics that characterize the performance of an object detector on the COCO dataset. The following table compares the results from the paper to those obtained on SageMaker with Detectron2.

    AP AP75 AR
From paper by Goldman et al. RetinaNet 0.46 0.39 0.53
Faster-RCNN 0.04 0.01 0.05
Custom Method 0.49 0.56 0.55
Detectron2 on Amazon SageMaker RetinaNet 0.47 0.54 0.55
Faster-RCNN 0.49 0.53 0.55

We use SageMaker hyperparameter tuning jobs to optimize the hyperparameters of the object detectors. Faster-RCNN has the same performance in terms of AP and AR compared with the model proposed by Goldman et al. that is specifically conceived for object detection in dense scenes. Our Faster-RCNN loses three points on the AP75. However, this may be an acceptable performance decrease according to the business use case. Moreover, the advantage of our solution is that is doesn’t require any custom implementation because it only relies on Detecron2 modules. This proves that you can use Detectron2 to train at scale with SageMaker object detectors that compete with state-of-the-art solutions in challenging contexts such as dense scenes.


This post only scratches the surface of what is possible when deploying Detectron2 on the SageMaker platform. We hope that you found this introductory use case useful and we look forward to seeing what you build on AWS with this new tool in your ML toolset!

About the Authors

Vadim Dabravolski is Sr. AI/ML Architect at AWS. Areas of interest include distributed computations and data engineering, computer vision, and NLP algorithms. When not at work, he is catching up on his reading list (anything around business, technology, politics, and culture) and jogging in NYC boroughs.



Paolo Irrera is a Data Scientist at the Amazon Machine Learning Solutions Lab where he helps customers address business problems with ML and cloud capabilities. He holds a PhD in Computer Vision from Telecom ParisTech, Paris.


Read More

Save the date for the AWS Machine Learning Summit: June 2, 2021

Save the date for the AWS Machine Learning Summit: June 2, 2021

On June 2, 2021, don’t miss the opportunity to hear from some of the brightest minds in machine learning (ML) at the free virtual AWS Machine Learning Summit. Machine learning is one of the most disruptive technologies we will encounter in our generation. It’s improving customer experience, creating more efficiencies in operations, and spurring new innovations and discoveries like helping researchers discover new vaccines and aiding autonomous drones to sail our world’s oceans. But we’re just scratching the surface about what is possible. This Summit, which is open to all and free to attend, brings together industry luminaries, AWS customers, and leading ML experts to share the latest in machine learning. You’ll learn about the latest science breakthroughs in ML, how ML is impacting business, best practices in building ML, and how to get started now without prior ML expertise.

Hear from ML leaders from across AWS, Amazon, and the industry, including Swami Sivasubramanian, VP of AI and Machine Learning, AWS; Bratin Saha, VP of Machine Learning, AWS; and Yoelle Maarek, VP of Research, Alexa Shopping, who will share a keynote on how we’re applying customer-obsessed science to advance ML. Andrew Ng, founder and CEO of Landing AI and founder of, will join Swami Sivasubramanian in a fireside chat about the future of ML, the skills that are fundamental for the next generation of ML practitioners, and how we can bridge the gap from proof of concept to production in ML. You’ll also get an inside look at trends in deep learning and natural language in a powerhouse fireside chat with Amazon distinguished scientists Alex Smola and Bernhard Schölkopf, and Alexa AI senior principal scientist Dilek Hakkani-Tur.

Pick from over 30 session across four tracks, which all offer something for anyone who is interested in ML. Advanced practitioners and data scientists can learn about scientific breakthroughs and dive deep into the tools for building ML. Business and technical leaders can learn from their peers about implementing organization-wide ML initiatives. And developers can learn how to perform ML without needing any experience.

The science of machine learning

Advanced practitioners will get a technical deep dive into the groundbreaking work that ML scientists within AWS, Amazon, and beyond are doing to advance the science of ML in areas including computer vision, natural language processing, bias, and more. Speakers include two Amazon Scholars, Michael Kearns and Kathleen McKeown. Kearns a professor in the Computer and Information Science department at the University of Pennsylvania, where he holds the National Center Chair. He is co-author of the book “The Ethical Algorithm: The Science of Socially Aware Algorithm Design,” and joined Amazon as a scholar June 2020. McKeown is the Henry and Gertrude Rothschild professor of computer science at Columbia University, and the founding director of the school’s Data Science Institute. She joined Amazon as a scholar in 2019.

The impact of machine learning

Business leaders will learn from AWS customers that are leading the way in ML adoption. Customers including 3M, AstraZeneca, Vanguard, and Latent Space will share how they’re applying ML to create efficiencies, deliver new revenue streams, and launch entirely new products and business models. You’ll get best practices for scaling ML in an organization and showing impact.

How machine learning is done

Data scientists and ML developers will get practical deep dives into tools that can speed up the entire ML lifecycle, from building to training to deploying ML models. Sessions include how to choose the right algorithms, more accurate and speedy data prep, model explainability, and more.

Machine learning: no expertise required

If you’re a developer who wants to apply ML and AI to a use case but doesn’t have the expertise, this track is for you. Learn how to use AWS AI services and other tools to get started with your ML project right away, for use cases including contact center intelligence, personalization, intelligent document processing, business metrics analysis, computer vision, and more.

For more details, visit the website.

About the Author

Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS’s customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski and enjoying the great outdoors.

Read More

Use computer vision to detect crop disease through image analysis with Amazon Rekognition Custom Labels

Use computer vision to detect crop disease through image analysis with Amazon Rekognition Custom Labels

Currently, many diseases affect farming and lead to significant economic losses due to reduction of yield and loss of quality produce. In many cases, the health condition of a crop or a plant is often assessed by the condition of its leaves. For farmers, it is crucial to identify these symptoms early. Early identification is key to controlling diseases before they spread too far. However, manually identifying if a leaf is infected, the type of the infection, and the required disease control solution is a hard problem to solve. Current methods can be error prone and very costly. This is where an automated machine learning (ML) solution for computer vision (CV) can help. Typically, building complex machine learning models require hundreds of thousands of labeled images, along with expertise in data science. In this post, we showcase how you can build an end-to-end disease detection, identification, and resolution recommendation solution using Amazon Rekognition Custom Labels.

Amazon Rekognition is a fully managed service that provides CV capabilities for analyzing images and video at scale, using deep learning technology without requiring ML expertise. Amazon Rekognition Custom Labels, an automated ML feature of Amazon Rekognition, lets you quickly train custom CV models specific to your business needs, simply by bringing labeled images.

Solution overview

We create a custom model to detect the plant leaf disease. To create our custom model, we follow these steps:

  1. Create a project in Amazon Rekognition Custom Labels.
  2. Create a dataset with images containing multiple types of plant leaf diseases.
  3. Train the model and evaluate the performance.
  4. Test the new custom model using the automatically generated API endpoint.

Amazon Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end model development and inference process.

Creating your project

To create your plant leaf disease detection project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter plant-leaf-disease-detection.
  4. Choose Create project.

You can also create a project on the Projects page. You can access the Projects page via the navigation pane.

Creating your dataset

To create your leaf disease detection model, you first need to create a dataset to train the model with. For this post, our dataset is composed of three categories of plant leaf disease images: bacterial leaf blight, brown spots, and leaf smut.

The following images show examples of bacterial leaf blight.

The following images show examples of brown spots.

The following images show examples of leaf smut.

We sourced our images from UCI, Citation (Prajapati HB, Shah JP, Dabhi VK. Detection and classification of rice plant diseases. Intelligent Decision Technologies. 2017 Jan 1;11(3):357-73, doi: 10.3233/IDT-170301) (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository []. Irvine, CA: University of California, School of Information and Computer Science.)

To create your dataset, complete the following steps:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket.

For this post, I create an S3 bucket called plan-leaf-disease-data.

  1. Create three folders inside this bucket called Bacterial-Leaf-Blight, Brown-Spot, and Leaf-Smut to store images of each disease category.

  1. Upload each category of image files in their respective bucket.
  2. On the Amazon Rekognition console, under Datasets, choose Create dataset.
  3. Select Import images from Amazon S3 bucket.

  1. For S3 folder location, enter the S3 bucket path.
  2. For automatic labeling, select Automatically attach a label to my images based on the folder they’re stored in.

This creates data labeling of the images as folder names.

You can now see the generated S3 bucket permissions policy.

  1. Copy the JSON policy.

  1. Navigate to the S3 bucket.
  2. On the Permission tab, under Bucket policy, choose Edit.
  3. Enter the JSON policy you copied.
  4. Chose Save changes.

  1. Choose Submit.

You can see that image labeling is organized based on the folder name.

Training your model

After you label your images, you’re ready to train your model.

  1. Choose Train Model.
  2. For Choose project, choose your project plant-leaf-disease-detection.
  3. For Choose training dataset, choose your dataset plant-leaf-disease-dataset.

As part of model training, Amazon Rekognition Custom Labels requires a labeled test dataset. Amazon Rekognition Custom Labels uses the test dataset to verify how well your trained model predicts the correct labels and generates evaluation metrics. Images in the test dataset are not used to train your model and should represent the same types of images you use with your model to analyze.

  1. For Create test set, select how you want to create your test dataset.

Amazon Rekognition Custom Labels provides three options:

  • Choose an existing test dataset
  • Create a new test dataset
  • Split training dataset

For this post, we select Split training dataset and let Amazon Rekognition hold back 20% of the images for testing and use the remaining 80% of the images to train the model.

Our model took approximately 1 hour to train. The training time required for your model depends on many factors, including the number of images provided in the dataset and the complexity of the model.

When training is complete, Amazon Rekognition Custom Labels outputs key quality metrics, including F1 score, precision, recall, and the assumed threshold for each label. For more information about metrics, see Metrics for Evaluating Your Model.

Our evaluation results show that our model has a precision of 1.0 for Bacterial-Leaf-Blight and Brown-Spot, which means that no objects were mistakenly identified (false positives) in our test set. Our model also didn’t miss any objects in our test set (false negatives), which is reflected in our recall score of 1. You can often use the F1 score as an overall quality score because it takes both precision and recall into account. Finally, we see that our assumed threshold to generate the F1 score, precision, and recall metrics each category is 0.62, 0.69, and 0.54 for Bacterial-Leaf-Blight, Brown-Spot, and Leaf-Smut, respectively. By default, our model returns predictions above this assumed threshold.

We can also choose View test results to see how our model performed on each test image. The following screenshot shows an example of a correctly identified image of bacterial leaf blight during the model testing (true positive).

Testing your model

Your plant disease detection model is now ready for use. Amazon Rekognition Custom Labels provides the API calls for starting, using, and stopping your model; you don’t need to manage any infrastructure. For more information, see Starting or Stopping an Amazon Rekognition Custom Labels Model (Console).

In addition to using the API, you can also use the Custom Labels Demonstration. This CloudFormation template enables you to set up a custom, password-protected UI where you can start and stop your models and run demonstration inferences.

Once deployed, the application can be accessed using a web browser using the address specified in url output from the CloudFormation stack created during deployment of the solution.

  1. Choose Start the model.

  1. Provide the inference unit required. For this example, let’s give a value of 1.

You’re charged for the amount of time, in minutes, that the model is running. For more information, see Inference hours.

It might take a while to start.

  1. Choose the model name.

  1. Choose Upload.

A window opens for you to choose the plant leaf image from your local drive.

The model detects the disease in the uploaded leaf image along with confidence score. It also gives the pest control recommendation based on the type of disease.

Cleaning up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when not in use. For instructions, see the following:


In this post, we showed you how to create an object detection model with Amazon Rekognition Custom Labels. This feature makes it easy to train a custom model that can detect an object class without needing to specify other objects or losing accuracy in its results.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?

About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.



Sameer Goel is a Solutions Architect in Seattle, who drives customer success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from NEU Boston, with a concentration in data science. He enjoys building and experimenting with AI/ML projects on Raspberry Pi.

Read More

Join AWS at NVIDIA GTC 21, April 12–16

Join AWS at NVIDIA GTC 21, April 12–16

Starting Monday, April 12, 2021, the NVIDIA GPU Technology Conference (GTC) is offering online sessions for you to learn AWS best practices to accomplish your machine learning (ML), virtual workstations, high performance computing (HPC), and Internet of Things (IoT) goals faster and more easily.

Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast ML training, cost-effective ML inference, flexible remote virtual workstations, and powerful HPC computations. At the edge, you can use AWS IoT Greengrass and Amazon SageMaker Neo to extend a wide range of AWS Cloud services and ML inference to NVIDIA-based edge devices so the devices can act locally on the data they generate.

AWS is a Global Diamond Sponsor of the conference.

Available sessions

ML infrastructure:

ML with Amazon SageMaker:

ML deep dive:

High performance computing:

Internet of Things:

Edge computing with AWS Wavelength:


Computer vision with AWS Panorama:

Game tech:

Visit AWS at NVIDIA GTC 21 for more details and register for free for access to this content during the week of April 12, 2021. See you there!

About the Author

Geoff Murase is a Senior Product Marketing Manager for AWS EC2 accelerated computing instances, helping customers meet their compute needs by providing access to hardware-based compute accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). In his spare time, he enjoys playing basketball and biking with his family.

Read More

Build a CI/CD pipeline for deploying custom machine learning models using AWS services

Build a CI/CD pipeline for deploying custom machine learning models using AWS services

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality ML artifacts. AWS Serverless Application Model (AWS SAM) is an open-source framework for building serverless applications. It provides shorthand syntax to express functions, APIs, databases, event source mappings, steps in AWS Step Functions, and more.

Generally, ML workflows orchestrate and automate sequences of ML tasks. A workflow includes data collection, training, testing, human evaluation of the ML model, and deployment of the models for inference.

For continuous integration and continuous delivery (CI/CD) pipelines, AWS recently released Amazon SageMaker Pipelines, the first purpose-built, easy-to-use CI/CD service for ML. Pipelines is a native workflow orchestration tool for building ML pipelines that takes advantage of direct SageMaker integration. For more information, see Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines.

In this post, I show you an extensible way to automate and deploy custom ML models using service integrations between Amazon SageMaker, Step Functions, and AWS SAM using a CI/CD pipeline.

To build this pipeline, you also need to be familiar with the following AWS services:

  • AWS CodeBuild – A fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy
  • AWS CodePipeline – A fully managed continuous delivery service that helps you automate your release pipelines
  • Amazon Elastic Container Registry (Amazon ECR) – A container registry
  • AWS Lambda – A service that lets you run code without provisioning or managing servers. You pay only for the compute time you consume
  • Amazon Simple Storage Service (Amazon S3) – An object storage service that offers industry-leading scalability, data availability, security, and performance
  • AWS Step Functions – A serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services

Solution overview

The solution has two main sections:

  • Use AWS SAM to create a Step Functions workflow with SageMaker – Step Functions recently announced native service integrations with SageMaker. You can use this feature to train ML models, deploy ML models, test results, and expose an inference endpoint. This feature also provides a way to wait for human approval before the state transitions can progress towards the final ML model inference endpoint’s configuration and deployment.
  • Deploy the model with a CI/CD pipeline – One of the requirements of SageMaker is that the source code of custom models needs to be stored as a Docker image in an image registry such as Amazon ECR. SageMaker then references this Docker image for training and inference. For this post, we create a CI/CD pipeline using CodePipeline and CodeBuild to build, tag, and upload the Docker image to Amazon ECR and then start the Step Functions workflow to train and deploy the custom ML model on SageMaker, which references this tagged Docker image.

The following diagram describes the general overview of the MLOps CI/CD pipeline.

The workflow includes the following steps:

  1. The data scientist works on developing custom ML model code using their local notebook or a SageMaker notebook. They commit and push changes to a source code repository.
  2. A webhook on the code repository triggers a CodePipeline build in the AWS Cloud.
  3. CodePipeline downloads the source code and starts the build process.
  4. CodeBuild downloads the necessary source files and starts running commands to build and tag a local Docker container image.
  5. CodeBuild pushes the container image to Amazon ECR. The container image is tagged with a unique label derived from the repository commit hash.
  6. CodePipeline invokes Step Functions and passes the container image URI and the unique container image tag as parameters to Step Functions.
  7. Step Functions starts a workflow by initially calling the SageMaker training job and passing the necessary parameters.
  8. SageMaker downloads the necessary container image and starts the training job. When the job is complete, Step Functions directs SageMaker to create a model and store the model in the S3 bucket.
  9. Step Functions starts a SageMaker batch transform job on the test data provided in the S3 bucket.
  10. When the batch transform job is complete, Step Functions sends an email to the user using Amazon Simple Notification Service (Amazon SNS). This email includes the details of the batch transform job and links to the test data prediction outcome stored in the S3 bucket. After sending the email, Step Function enters a manual wait phase.
  11. The email sent by Amazon SNS has links to either accept or reject the test results. The recipient can manually look at the test data prediction outcomes in the S3 bucket. If they’re not satisfied with the results, they can reject the changes to cancel the Step Functions workflow.
  12. If the recipient accepts the changes, an Amazon API Gateway endpoint invokes a Lambda function with an embedded token that references the waiting Step Functions step.
  13. The Lambda function calls Step Functions to continue the workflow.
  14. Step Functions resumes the workflow.
  15. Step Functions creates a SageMaker endpoint config and a SageMaker inference endpoint.
  16. When the workflow is successful, Step Functions sends an email with a link to the final SageMaker inference endpoint.

Use AWS SAM to create a Step Functions workflow with SageMaker

In this first section, you visualize the Step Functions ML workflow easily in Visual Studio Code and deploy it to the AWS environment using AWS SAM. You use some of the new features and service integrations such as support in AWS SAM for AWS Step Functions, native support in Step Functions for SageMaker integrations, and support in Step Functions to visualize workflows directly in VS Code.


Before getting started, make sure you complete the following prerequisites:

Deploy the application template

To get started, follow the instructions on GitHub to complete the application setup. Alternatively, you can switch to the terminal and enter the following command:

git clone

The directory structure should be as follows:

. sam-sf-sagemaker-workflow
|– cfn
|—- sam-template.yaml
| — functions
| —- api_sagemaker_endpoint
| —- create_and_email_accept_reject_links
| —- respond_to_links
| —- update_sagemakerEndpoint_API
| — scripts
| — statemachine
| —- mlops.asl.json

The code has been broken down into subfolders with the main AWS SAM template residing in path cfn/sam-template.yaml.

The Step Functions workflows are stored in the folder statemachine/mlops.asl.json, and any other Lambda functions used are stored in functions folder.

To start with the AWS SAM template, run the following bash scripts from the root folder:

#Create S3 buckets if required before executing the commands.
S3_BUCKET=bucket-mlops #bucket to store AWS SAM template
S3_BUCKET_MODEL=ml-models   #bucket to store ML models
STACK_NAME=sam-sf-sagemaker-workflow   #Name of the AWS SAM stack
sam build  -t cfn/sam-template.yaml    #AWS SAM build 
sam deploy --template-file .aws-sam/build/template.yaml 
--stack-name ${STACK_NAME} --force-upload 
--s3-bucket ${S3_BUCKET} --s3-prefix sam 
--parameter-overrides S3ModelBucket=${S3_BUCKET_MODEL} 
--capabilities CAPABILITY_IAM

The sam build command builds all the functions and creates the final AWS CloudFormation template. The sam deploy command uploads the necessary files to the S3 bucket and starts creating or updating the CloudFormation template to create the necessary AWS infrastructure.

When the template has finished successfully, go to the CloudFormation console. On the Outputs tab, copy the MLOpsStateMachineArn value to use later.

The following diagram shows the workflow carried out in Step Functions, using VS Code integrations with Step Functions.

The following JSON based snippet of Amazon States Language describes the workflow visualized in the preceding diagram.

    "Comment": "This Step Function starts machine learning pipeline, once the custom model has been uploaded to ECR. Two parameters are expected by Step Functions are git commitID and the sagemaker ECR custom container URI",
    "StartAt": "SageMaker Create Training Job",
    "States": {
        "SageMaker Create Training Job": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "TrainingJobName.$": "$.commitID",
                "ResourceConfig": {
                    "InstanceCount": 1,
                    "InstanceType": "ml.c4.2xlarge",
                    "VolumeSizeInGB": 20
                "HyperParameters": {
                    "mode": "batch_skipgram",
                    "epochs": "5",
                    "min_count": "5",
                    "sampling_threshold": "0.0001",
                    "learning_rate": "0.025",
                    "window_size": "5",
                    "vector_dim": "300",
                    "negative_samples": "5",
                    "batch_size": "11"
                "AlgorithmSpecification": {
                    "TrainingImage.$": "$.imageUri",
                    "TrainingInputMode": "File"
                "OutputDataConfig": {
                    "S3OutputPath": "s3://${S3ModelBucket}/output"
                "StoppingCondition": {
                    "MaxRuntimeInSeconds": 100000
                "RoleArn": "${SagemakerRoleArn}",
                "InputDataConfig": [
                        "ChannelName": "training",
                        "DataSource": {
                            "S3DataSource": {
                                "S3DataType": "S3Prefix",
                                "S3Uri": "s3://${S3ModelBucket}/iris.csv",
                                "S3DataDistributionType": "FullyReplicated"
            "Retry": [
                    "ErrorEquals": [
                    "IntervalSeconds": 1,
                    "MaxAttempts": 1,
                    "BackoffRate": 1.1
                    "ErrorEquals": [
                    "IntervalSeconds": 60,
                    "MaxAttempts": 1,
                    "BackoffRate": 1
            "Catch": [
                    "ErrorEquals": [
                    "ResultPath": "$.cause",
                    "Next": "FailState"
            "Next": "SageMaker Create Model"
        "SageMaker Create Model": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createModel",
            "Parameters": {
                "ExecutionRoleArn": "${SagemakerRoleArn}",
                "ModelName.$": "$.TrainingJobName",
                "PrimaryContainer": {
                    "ModelDataUrl.$": "$.ModelArtifacts.S3ModelArtifacts",
                    "Image.$": "$.AlgorithmSpecification.TrainingImage"
            "ResultPath": "$.taskresult",
            "Next": "SageMaker Create Transform Job",
            "Catch": [
                "ErrorEquals": ["States.ALL" ],
                "Next": "FailState"
        "SageMaker Create Transform Job": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
            "Parameters": {
                "ModelName.$": "$.TrainingJobName",
                "TransformInput": {
                    "SplitType": "Line",
                    "CompressionType": "None",
                    "ContentType": "text/csv",
                    "DataSource": {
                        "S3DataSource": {
                            "S3DataType": "S3Prefix",
                            "S3Uri": "s3://${S3ModelBucket}/iris.csv"
                "TransformOutput": {
                    "S3OutputPath.$": "States.Format('s3://${S3ModelBucket}/transform_output/{}/iris.csv', $.TrainingJobName)" ,
                    "AssembleWith": "Line",
                    "Accept": "text/csv"
                "DataProcessing": {
                    "InputFilter": "$[1:]"
                "TransformResources": {
                    "InstanceCount": 1,
                    "InstanceType": "ml.m4.xlarge"
                "TransformJobName.$": "$.TrainingJobName"
            "ResultPath": "$.result",
            "Next": "Send Approve/Reject Email Request",
            "Catch": [
                "ErrorEquals": [
                "Next": "FailState"
        "Send Approve/Reject Email Request": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "${CreateAndEmailLinkFnName}",
                "Payload": {
            "ResultPath": "$.output",
            "Next": "Sagemaker Create Endpoint Config",
            "Catch": [
                    "ErrorEquals": [ "rejected" ],
                    "ResultPath": "$.output",
                    "Next": "FailState"
        "Sagemaker Create Endpoint Config": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createEndpointConfig",
            "Parameters": {
                "EndpointConfigName.$": "$.TrainingJobName",
                "ProductionVariants": [
                        "InitialInstanceCount": 1,
                        "InitialVariantWeight": 1,
                        "InstanceType": "ml.t2.medium",
                        "ModelName.$": "$.TrainingJobName",
                        "VariantName": "AllTraffic"
            "ResultPath": "$.result",
            "Next": "Sagemaker Create Endpoint",
            "Catch": [
                  "ErrorEquals": [
                  "Next": "FailState"
        "Sagemaker Create Endpoint": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createEndpoint",
            "Parameters": {
                "EndpointName.$": "$.TrainingJobName",
                "EndpointConfigName.$": "$.TrainingJobName"
            "Next": "Send Email With API Endpoint",
            "Catch": [
                  "ErrorEquals": [
                  "Next": "FailState"
        "Send Email With API Endpoint": {
            "Type": "Task",
            "Resource": "${UpdateSagemakerEndpointAPI}",
            "Catch": [
                  "ErrorEquals": [
                  "Next": "FailState"
             "Next": "SuccessState"
        "SuccessState": {
            "Type": "Succeed"            
        "FailState": {
            "Type": "Fail"          

Step Functions process to create the SageMaker workflow

In this section, we discuss the detailed steps involved in creating the SageMaker workflow using Step Functions.

Step Functions uses the commit ID passed by CodePipeline as a unique identifier to create a SageMaker training job. The training job can sometimes take a long time to complete; to wait for the job, you use .sync while specifying the resource section of the SageMaker training job.

When the training job is complete, Step Functions creates a model and saves the model in an S3 bucket.

Step Functions then uses a batch transform step to evaluate and test the model, based on batch data initially provided by the data scientist in an S3 bucket. When the evaluation step is complete, the output is stored in an S3 bucket.

Step Functions then enters a manual approval stage. To create this state, you use callback URLs. To implement this state in Step Functions, use .waitForTaskToken while calling a Lambda resource and pass a token to the Lambda function.

The Lambda function uses Amazon SNS or Amazon Simple Email Service (Amazon SES) to send an email to the subscribed party. You need to add your email address to the SNS topic to receive the accept/reject email while testing.

You receive an email, as in the following screenshot, with links to the data stored in the S3 bucket. This data has been batch transformed using the custom ML model created in the earlier step by SageMaker. You can choose Accept or Reject based on your findings.

If you choose Reject, Step Functions stops running the workflow. If you’re satisfied with the results, choose Accept, which triggers the API link. This link passes the embedded token and type to the API Gateway or Lambda endpoint as request parameters to progress to the next Step Functions step.

See the following Python code:

import json
import boto3
sf = boto3.client('stepfunctions')
def lambda_handler(event, context):
    type= event.get('queryStringParameters').get('type')
    token= event.get('queryStringParameters').get('token')    
    if type =='success':


    return {
        'statusCode': 200,
        'body': json.dumps('Responded to Step Function')

Step Functions then creates the final unique SageMaker endpoint configuration and inference endpoint. You can achieve this in Lambda code using special resource values, as shown in the following screenshot.

When the SageMaker endpoint is ready, an email is sent to the subscriber with a link to the API of the SageMaker inference endpoint.

Deploy the model with a CI/CD pipeline

In this section, you use the CI/CD pipeline to deploy a custom ML model.

The pipeline starts its run as soon as it detects updates to the source code of the custom model. The pipeline downloads the source code from the repository, builds and tags the Docker image, and uploads the Docker image to Amazon ECR. After uploading the Docker image, the pipeline triggers the Step Functions workflow to train and deploy the custom model to SageMaker. Finally, the pipeline sends an email to the specified users with details about the SageMaker inference endpoint.

We use Scikit Bring Your Own Container to build a custom container image and use the iris dataset to train and test the model.

When your Step Functions workflow is ready, build your full pipeline using the code provided in the GitHub repo.

After you download the code from the repo, the directory structure should look like the following:

. codepipeline-ecr-build-sf-execution
| — cfn
| —- params.json
| —- pipeline-cfn.yaml
| — container
| —- descision_trees
| —- local_test
| —- .dockerignore
| —- Dockerfile
| — scripts

In the params.json file in folder /cfn, provide in your GitHub token, repo name, the ARN of the Step Function state machine you created earlier.

You now create the necessary services and resources for the CI/CD pipeline. To create the CloudFormation stack, run the following code:

aws cloudformation create-stack --stack-name codepipeline-ecr-build-sf-execution --template-body file://cfn/pipeline-cfn.yaml  --parameters file://cfn/params.json --capabilities CAPABILITY_NAMED_IAM

Alternatively, to update the stack, run the following code:

aws cloudformation update-stack --stack-name codepipeline-ecr-build-sf-execution --template-body file://cfn/pipeline-cfn.yaml  --parameters file://cfn/params.json --capabilities CAPABILITY_NAMED_IAM

The CloudFormation template deploys a CodePipeline pipeline into your AWS account. The pipeline starts running as soon as code changes are committed to the repo. After the source code is downloaded by the pipeline stage, CodeBuild creates a Docker image and tags it with the commit ID and current timestamp before pushing the image to Amazon ECR. CodePipeline moves to the next stage to trigger a Step Functions step (which you created earlier).

When Step Functions is complete, a final email is generated with a link to the API Gateway URL that references the newly created SageMaker inference endpoint.

Test the workflow

To test your workflow, complete the following steps:

  1. Start the CodePipeline build by committing a code change to the codepipeline-ecr-build-sf-execution/container folder.
  2. On the CodePipeline console, check that the pipeline is transitioning through the different stages as expected.

When the pipeline reaches its final state, it starts the Step Functions workflow, which sends an email for approval.

  1. Approve the email to continue the Step Functions workflow.

When the SageMaker endpoint is ready, you should receive another email with a link to the API inference endpoint.

To test the iris dataset, you can try sending a single data point to the inference endpoint.

  1. Copy the inference endpoint link from the email and assign it to the bash variable INFERENCE_ENDPOINT as shown in the following code, then use the

curl --location --request POST ${INFERENCE_ENDPOINT}  --header 'Content-Type: application/json' --data-raw '{  "data": "4.5,1.3,0.3,0.3"
{"result": "setosa"}

curl --location --request POST ${INFERENCE_ENDPOINT}  --header 'Content-Type: application/json' --data-raw '{
  "data": "5.9,3,5.1,1.8"
{"result": "virginica"}

By sending different data, we get different sets of inference results back.

Clean up

To avoid ongoing charges, delete the resources created in the previous steps by deleting the CloudFormation templates. Additionally, on the SageMaker console, delete any unused models, endpoint configurations, and inference endpoints.


This post demonstrated how to create an ML pipeline for custom SageMaker ML models using some of the latest AWS service integrations.

You can extend this ML pipeline further by adding a layer of authentication and encryption while sending approval links. You can also add more steps to CodePipeline or Step Functions as deemed necessary for your project’s workflow.

The sample files are available in the GitHub repo. To explore related features of SageMaker and further reading, see the following:

About the Author

Sachin Doshi is a Senior Application Architect working in the AWS Professional Services team. He is based out of New York metropolitan area. Sachin helps customers optimize their applications using cloud native AWS services.

Read More

How Digitata provides intelligent pricing on mobile data and voice with Amazon Lookout for Metrics

How Digitata provides intelligent pricing on mobile data and voice with Amazon Lookout for Metrics

This is a guest post by Nico Kruger (CTO of Digitata) and Chris King (Sr. ML Specialist SA at AWS). In their own words, “Digitata intelligently transforms pricing and subscriber engagement for mobile operators, empowering operators to make better and more informed decisions to meet and exceed business objectives.”

As William Gibson said, “The future is here. It’s just not evenly distributed yet.” This is incredibly true in many emerging markets for connectivity. Users often pay 100 times more for data, voice, and SMS services than their counterparts in Europe or the US. Digitata aims to better democratize access to telecommunications services through dynamic pricing solutions for mobile network operators (MNOs) and to help deliver optimal returns on their large capital investments in infrastructure.

Our pricing models are classically based on supply and demand. We use machine learning (ML) algorithms to optimize two main variables: utilization (of infrastructure), and revenue (fees for telco services). For example, at 3:00 AM when a tower is idle, it’s better to charge a low rate for data than waste this fixed capacity and have no consumers. Comparatively, for a very busy tower, it’s prudent to raise the prices at certain times to reduce congestion, thereby reducing the number of dropped calls or sluggish downloads for customers.

Our models attempt to optimize utilization and revenue according to three main features, or dimensions: location, time, and user segment. Taking the traffic example further, the traffic profile over time for a tower located in a rural or suburban area is very different from a tower in a central downtown district. In general, the suburban tower is busier early in the mornings and later at night than the tower based in the central business district, which is much busier during traditional working hours.

Our customers (the MNOs) trust us to be their automated, intelligent pricing partner. As such, it’s imperative that we keep on top of any anomalous behavior patterns when it comes to their revenue or network utilization. If our model charges too little for data bundles (or even makes it free), it could lead to massive network congestion issues as well as the obvious lost revenue impact. Conversely, if we charge too much for services, it could lead to unhappy customers and loss of revenue, through the principles of supply and demand.

It’s therefore imperative that we have a robust, real-time anomaly detection system in place to alert us whenever there is anomalous behavior on revenue and utilization. It also needs to be aware of the dimensions we operate under (location, user segment, and time).

History of anomaly detection at Digitata

We have been through four phases of anomaly detection at Digitata in the last 13 years:

  1. Manually monitoring our KPIs in reports on a routine basis.
  2. Defining routine checks using static thresholds that alert if the threshold is exceeded.
  3. Using custom anomaly detection models to track basic KPIs over time, such as total unique customers per tower, revenue per GB, and network congestion.
  4. Creating complex collections of anomaly detection models to track even more KPIs over time.

Manual monitoring continued to grow to consume more of our staff hours and was the most error-prone, which led to the desire to automate in Phase 2. The automated alarms with static alert thresholds ensured that routine checks were actually and accurately performed, but not with sufficient sophistication. This led to alert fatigue, and pushed us to custom modeling.

Custom modeling can work well for a simple problem, but the approach for one particular problem doesn’t translate perfectly to the next problem. This leads to a number of models that must be operating in the field to provide relevant insights. The operational complexity of orchestrating these begins to scale beyond the means of our in-house developers and tooling. The cost of expansion also prohibits other teams from running experiments and identifying other opportunities for us to leverage ML-backed anomaly detection.

Additionally, although you can now detect anomalies via ML, you still need to do frequent deep-dive analysis to find other combinations of dimensions that may point to underlying anomalies. For example, when a competitor is strongly targeting a certain location or segment of users, it may have an adverse impact on sales that may not necessarily be reflected, depending on how deep you have set up your anomaly detection models to actively track the different dimensions.

The problem that remains to be solved

Given our earlier problem statement, it means that we have, at least, the following dimensions under which products are being sold:

  • Thousands of locations (towers).
  • Hundreds of products and bundles (different data bundles such as social or messaging).
  • Hundreds of customer segments. Segments are based on clusters of users according to hundreds of attributes that are system calculated from MNO data feeds.
  • Hourly detection for each day of the week.

We can use traditional anomaly detection methods to have anomaly detection on a measure, such as revenue or purchase count. We don’t, however, have the necessary insights on a dimension-based level to answer questions such as:

  • How is product A selling compared to product B?
  • What does revenue look like at location A vs. location B?
  • What do sales look like for customer segment A vs. customer segment B?
  • When you start combining dimensions, what does revenue look like on product A, segment A, vs. product A, segment B; product B, segment A; and product B, segment B?

The number of dimensions quickly add up. It becomes impractical to create anomaly detection models for each dimension and each combination of dimensions. And that is only with the four dimensions mentioned! What if we want to quickly add two or three additional dimensions to our anomaly detection systems? It requires time and resource investment, even to use existing off-the-shelf tools to create additional anomaly models, notwithstanding the weeks to months of investment required to build it in-house.

That is when we looked for a purpose-built tool to do exactly this, such as the dimension-aware managed anomaly detection service, Amazon Lookout for Metrics.

Amazon Lookout for Metrics

Amazon Lookout for Metrics uses ML to automatically detect and diagnose anomalies (outliers from the norm) in business and operational time series data, such as a sudden dip in sales revenue or customer acquisition rates.

In a couple of clicks, you can connect Amazon Lookout for Metrics to popular data stores like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon Relational Database Service (Amazon RDS), as well as third-party SaaS applications, such as Salesforce, ServiceNow, Zendesk, and Marketo, and start monitoring metrics that are important to your business.

Amazon Lookout for Metrics automatically inspects and prepares the data from these sources and builds a custom ML model—informed by over 20 years of experience at Amazon—to detect anomalies with greater speed and accuracy than traditional methods used for anomaly detection. You can also provide feedback on detected anomalies to tune the results and improve accuracy over time. Amazon Lookout for Metrics makes it easy to diagnose detected anomalies by grouping together anomalies that are related to the same event and sending an alert that includes a summary of the potential root cause. It also ranks anomalies in order of severity so that you can prioritize your attention to what matters the most to your business.

How we used Amazon Lookout for Metrics

Inside Amazon Lookout for Metrics, you need to describe your data in terms of measures and dimensions. Measures are variables or key performance indicators on which you want to detect anomalies, and dimensions are metadata that represent categorical information about the measures.

To detect outliers, Amazon Lookout for Metrics builds an ML model that is trained with your source data. This model, called a detector, is automatically trained with the ML algorithm that best fits your data and use case. You can either provide your historical data for training, if you have any, or get started with real-time data, and Amazon Lookout for Metrics learns as it goes.

We used Amazon Lookout for Metrics to convert our anomaly detection tracking on two of our most important datasets: bundle revenue and voice revenue.

For bundle revenue, we track the following measures:

  • Total revenue from sales
  • Total number of sales
  • Total number of sales to distinct users
  • Average price at which the product was bought

Additionally, we track the following dimensions:

  • Location (tower)
  • Product
  • Customer segment

For voice revenue, we track the following measures:

  • Total calls made
  • Total revenue from calls
  • Total distinct users that made a call
  • The average price at which a call was made

Additionally, we track the following dimensions:

  • Location (tower)
  • Type of call (international, on-net, roaming, off-net)
  • Whether the user received a discount or not
  • Customer spend

This allows us to have coverage on these two datasets, using only two anomaly detection models with Amazon Lookout for Metrics.

Architecture overview

Apache Nifi is an open-source data flow tool that we use for ETL tasks, both on premises and in AWS. We use it as a main flow engine for parsing, processing, and updating data we receive from the mobile network. This data ranges from call records, data usage records, airtime recharges, to network tower utilization and congestion information. This data is fed into our ML models to calculate the price on a product, location, time, and segment basis.

The following diagram illustrates our architecture.

The following diagram illustrates our architecture.

Because of the reality of the MNO industry (at the moment), it’s not always possible for us to leverage AWS for all of our deployments. Therefore, we have a mix of fully on-premises, hybrid, and fully native cloud deployments.

We use a setup where we leverage Apache Nifi, connected from AWS over VPC and VPN connections, to pull anonymized data on an event-based basis from all of our deployments (regardless of type) simultaneously. The data is then stored in Amazon S3 and in Amazon CloudWatch, from where we can use services such as Amazon Lookout for Metrics.

Results from our experiments

While getting to know Amazon Lookout for Metrics, we primarily focused the backtesting functionality within the service. This feature allows you to supply historical data, have Amazon Lookout for Metrics train on a large portion of your early historical data, and then identify anomalies in the remaining, more recent data.

We quickly discovered that this has massive potential, not only to start learning the service, but also to gather insights as to what other opportunities reside within your data, which you may have never thought of, or always expected to be there but never had the time to investigate.

For example, we quickly found the following very interesting example with one of our customers. We were tracking voice revenue as the measure, under the dimensions of call type (on-net, off-net, roaming), region (a high-level concept of an area, such as a province or big city), and timeband (after hours, business hours, weekends)

Amazon Lookout for Metrics identified an anomaly on international calls in a certain region, as shown in the following graph.

We quickly went to our source data, and saw the following visualization.

We quickly went to our source data, and saw the following visualization.

This graph looks at the total revenue for the days for international calls. As you can see, when looking at the global revenue, there is no real impact of the sort that Amazon Lookout for Metrics identified.

But when looking at the specific region that was identified, you see the following anomaly.

But when looking at the specific region that was identified, you see the following anomaly.

A clear spike in international calls took place on this day, in this region. We looked deeper into it and found that the specific city identified by this region is known as a tourist and conference destination. This begs the question: is there any business value to be found in an insight such as this? Can we react to anomalies like these in real time by using Amazon Lookout for Metrics and then providing specific pricing specials on international calls in the region, in order to take advantage of the influx of demand? The answer is yes, and we are! With stakeholders alerted to these notifications for future events and with exploratory efforts into our recent history, we’re prepared for future issues and are becoming more aware of operational gaps in our past.

In addition to the exploration using the back testing feature (which is still ongoing as of this writing), we also set up real-time detectors to work in parallel with our existing anomaly detection service.

Within two days, we found our first real operational issue, as shown in the following graph.

The graph shows revenue attributed to voice calls in another customer. In this case, we had a clear spike in our catchall NO LOCATION LOOKUP region. We map revenue from the towers to regions (such as city, province, or state) using a mapping table that we periodically refresh from within the MNO network, or by receiving such a mapping from the MNO themselves. When a tower isn’t mapped correctly by this table, it shows up as this catchall region in our data. In this case, there was a problem with the mapping feed from our customer.

The effect was that the number of towers that could not be classified was slowly growing. This could affect our pricing models, which could become less accurate at factoring the location aspect when generating the optimal price.

A very important operational anomaly to detect early!

Digitata in the future

We’re constantly evolving our ML and analytics capabilities, with the end goal of making connectivity more affordable for the entire globe. As we continue on this journey, we look to services such as Amazon Lookout for Metrics to help us ensure the quality of our services, find operational issues, and identify opportunities. It has made a dramatic difference in our anomaly detection capabilities, and has pointed us to some previously undiscovered opportunities. This all allows us to work on what really matters: getting everyone connected to the wonder of the internet at affordable prices!

Getting started

Amazon Lookout for Metrics is now available in preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). Request preview access to get started today!

You can interact with the service using the AWS Management Console, the AWS SDKs, and the AWS Command Line Interface (AWS CLI). For more information, see the Amazon Lookout for Metrics Developer Guide.

About the Authors

Nico Kruger is the CTO of Digitata and is a fan of programming computers, reading things, listening to music, and playing games. Nico has 10+ years experience in telco. In his own words: “From C++ to Javascript, AWS to on-prem, as long as the tool is fit for the job, it works and the customer is happy; it’s all good. Automate all the things, plan for failure and be adaptable and everybody wins.”


Chris King is a Senior Solutions Architect in Applied AI with AWS. He has a special interest in launching AI services and helped grow and build Amazon Personalize and Amazon Forecast before focusing on Amazon Lookout for Metrics. In his spare time, he enjoys cooking, reading, boxing, and building models to predict the outcome of combat sports.

Read More