Automate Amazon Rekognition Custom Labels model training and deployment using AWS Step Functions


With Amazon Rekognition Custom Labels, you can have Amazon Rekognition train a custom model for object detection or image classification specific to your business needs. For example, Rekognition Custom Labels can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, or detect animated characters in videos.

Developing a custom model to analyze images is a significant undertaking that requires time, expertise, and resources, often taking months to complete. It also often requires thousands or tens of thousands of hand-labeled images to give the model enough data to make accurate decisions. Gathering this data can take months, and preparing it for use in machine learning (ML) can require large teams of labelers.

With Rekognition Custom Labels, we take care of the heavy lifting for you. Rekognition Custom Labels builds on the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred images or fewer) that are specific to your use case via our easy-to-use console. If your images are already labeled, Amazon Rekognition can begin training in just a few clicks. If not, you can label them directly within the Amazon Rekognition labeling interface, or use Amazon SageMaker Ground Truth to label them for you. After Amazon Rekognition begins training from your image set, it produces a custom image analysis model for you in just a few hours. Behind the scenes, Rekognition Custom Labels automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Rekognition Custom Labels API and integrate it into your applications.

However, building a Rekognition Custom Labels model and hosting it for real-time predictions involves several steps: creating a project, creating the training and validation datasets, training the model, evaluating the model, and then creating an endpoint. After the model is deployed for inference, you might have to retrain the model when new data becomes available or if feedback is received from real-world inference. Automating the whole workflow can help reduce manual work.

In this post, we show how you can use AWS Step Functions to build and automate the workflow. Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and ML pipelines.

Solution overview

The Step Functions workflow is as follows:

  1. We first create an Amazon Rekognition project.
  2. In parallel, we create the training and the validation datasets using existing datasets. We can use the following methods:
    1. Import a folder structure from Amazon Simple Storage Service (Amazon S3) with the folders representing the labels.
    2. Use a local computer.
    3. Use Ground Truth.
    4. Create a dataset using an existing dataset with the AWS SDK.
    5. Create a dataset with a manifest file with the AWS SDK.
  3. After the datasets are created, we train a Custom Labels model using the CreateProjectVersion API. This could take from minutes to hours to complete.
  4. After the model is trained, we evaluate the model using the F1 score output from the previous step. We use the F1 score as our evaluation metric because it provides a balance between precision and recall. You can also use precision or recall as your model evaluation metrics. For more information on custom label evaluation metrics, refer to Metrics for evaluating your model.
  5. We then start to use the model for predictions if we are satisfied with the F1 score.
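Steps 3 to 5 can be sketched with the AWS SDK for Python (Boto3). This is a minimal illustration rather than the code in the repository: the function names, the 0.8 F1 threshold, the S3 key prefix, and the polling interval are assumptions, and `client` is expected to behave like `boto3.client("rekognition")`.

```python
import time


def wait_for_training(client, project_arn, version_name, poll_seconds=60, max_polls=600):
    """Poll DescribeProjectVersions until the version leaves TRAINING_IN_PROGRESS."""
    for _ in range(max_polls):
        response = client.describe_project_versions(
            ProjectArn=project_arn, VersionNames=[version_name]
        )
        description = response["ProjectVersionDescriptions"][0]
        if description["Status"] != "TRAINING_IN_PROGRESS":
            return description
        time.sleep(poll_seconds)
    raise TimeoutError("training did not finish within the polling budget")


def passes_evaluation(description, min_f1=0.8):
    """Return True only if training completed and the F1 score meets the threshold."""
    if description["Status"] != "TRAINING_COMPLETED":
        return False
    f1_score = description.get("EvaluationResult", {}).get("F1Score")
    return f1_score is not None and f1_score >= min_f1


def train_and_maybe_deploy(client, project_arn, version_name, output_bucket,
                           min_f1=0.8, poll_seconds=60):
    """Train a model version, then start it only if it passes the F1 gate."""
    # The project is assumed to already have its training and test datasets.
    client.create_project_version(
        ProjectArn=project_arn,
        VersionName=version_name,
        OutputConfig={"S3Bucket": output_bucket, "S3KeyPrefix": "training-output"},
    )
    description = wait_for_training(client, project_arn, version_name, poll_seconds)
    if not passes_evaluation(description, min_f1):
        raise RuntimeError("model did not pass the evaluation criteria")
    client.start_project_version(
        ProjectVersionArn=description["ProjectVersionArn"], MinInferenceUnits=1
    )
    return description["ProjectVersionArn"]
```

In the deployed solution, Step Functions states play the role of these functions, so the waiting and branching happen in the state machine rather than in a long-running script.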

The following diagram illustrates the Step Functions workflow.


Prerequisites

Before deploying the workflow, create the training and validation datasets that the workflow uses as its existing datasets. Complete the following steps:

  1. First, create an Amazon Rekognition project.
  2. Then, create the training and validation datasets.
  3. Finally, install the AWS SAM CLI.
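The first two prerequisite steps can also be done with Boto3 instead of the console. The following is a sketch under assumptions: the datasets are created from manifest files already stored in Amazon S3, the function name and arguments are hypothetical, and `client` is expected to behave like `boto3.client("rekognition")`.

```python
def create_project_with_datasets(client, project_name, bucket,
                                 train_manifest_key, test_manifest_key):
    """Create a project plus TRAIN and TEST datasets from S3 manifest files."""
    project_arn = client.create_project(ProjectName=project_name)["ProjectArn"]

    dataset_arns = {}
    for dataset_type, manifest_key in (("TRAIN", train_manifest_key),
                                       ("TEST", test_manifest_key)):
        response = client.create_dataset(
            ProjectArn=project_arn,
            DatasetType=dataset_type,
            DatasetSource={
                "GroundTruthManifest": {
                    "S3Object": {"Bucket": bucket, "Name": manifest_key}
                }
            },
        )
        dataset_arns[dataset_type] = response["DatasetArn"]
    return project_arn, dataset_arns
```

Dataset creation is asynchronous, so a real script would wait for each dataset to finish creating before using it.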

Deploy the workflow

To deploy the workflow, clone the GitHub repository:

git clone
cd rekognition-customlabels-automation-with-stepfunctions
sam build
sam deploy --guided

These commands build, package and deploy your application to AWS, with a series of prompts as explained in the repository.

Run the workflow

To test the workflow, navigate to the deployed workflow on the Step Functions console, then choose Start execution.

The workflow could take a few minutes to a few hours to complete. If the model passes the evaluation criteria, an endpoint for the model is created in Amazon Rekognition. If the model doesn’t pass the evaluation criteria or the training failed, the workflow fails. You can check the status of the workflow on the Step Functions console. For more information, refer to Viewing and debugging executions on the Step Functions console.
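If you prefer to start and monitor the workflow from a script, the following sketch shows one way to do it with Boto3. The function name, the polling interval, and the shape of `workflow_input` are assumptions (the input this state machine expects is defined in the repository), and `sfn_client` is expected to behave like `boto3.client("stepfunctions")`.

```python
import json
import time

# Statuses after which a Step Functions execution no longer changes
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def run_workflow(sfn_client, state_machine_arn, workflow_input,
                 poll_seconds=30, max_polls=1000):
    """Start an execution and poll until it reaches a terminal status."""
    execution_arn = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(workflow_input),
    )["executionArn"]

    for _ in range(max_polls):
        status = sfn_client.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return execution_arn, status
        time.sleep(poll_seconds)
    raise TimeoutError("execution did not finish within the polling budget")
```

A status of `FAILED` here corresponds to the cases described above: training failed or the model did not pass the evaluation criteria.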

Perform model predictions

To perform predictions against the model, you can call the Amazon Rekognition DetectCustomLabels API. To invoke this API, the caller needs to have the necessary AWS Identity and Access Management (IAM) permissions. For more details on performing predictions using this API, refer to Analyzing an image with a trained model.

However, if you need to expose the DetectCustomLabels API publicly, you can front the DetectCustomLabels API with Amazon API Gateway. API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. API Gateway acts as the front door for your DetectCustomLabels API, as shown in the following architecture diagram.

API Gateway forwards the user’s inference request to AWS Lambda. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Lambda receives the API request and calls the Amazon Rekognition DetectCustomLabels API with the necessary IAM permissions. For more information on how to set up API Gateway with Lambda integration, refer to Set up Lambda proxy integrations in API Gateway.

The following is an example Lambda function code to call the DetectCustomLabels API:

import base64
import json
import os

import boto3

client = boto3.client('rekognition', region_name="us-east-1")

# ARN of the trained model version. Reading it from an environment variable
# is an assumption; the variable name is not defined elsewhere in this post.
REKOGNITION_PROJECT_VERSION_ARN = os.environ['REKOGNITION_PROJECT_VERSION_ARN']

def lambda_handler(event, context):
    # The request body carries the image as a base64-encoded string
    image = event['body']

    # Decode the base64-encoded image body sent through API Gateway;
    # the SDK base64-encodes the bytes again when it calls Amazon Rekognition
    base64_decoded_image = base64.b64decode(image)

    min_confidence = 85

    # Call DetectCustomLabels
    response = client.detect_custom_labels(Image={'Bytes': base64_decoded_image},
                                           MinConfidence=min_confidence,
                                           ProjectVersionArn=REKOGNITION_PROJECT_VERSION_ARN)

    statusCode = response['ResponseMetadata']['HTTPStatusCode']
    predictions = {'Predictions': response['CustomLabels']}

    return {
        "statusCode": statusCode,
        "body": json.dumps(predictions)
    }
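On the caller's side, a request to this Lambda function through API Gateway could look like the following sketch. The endpoint URL and the function names are hypothetical; the only things taken from the handler above are that the request body is the base64-encoded image and that the response body is a JSON object with a `Predictions` list.

```python
import base64
import json
import urllib.request


def build_inference_request(image_bytes, api_url):
    """Build a POST request whose body is the base64-encoded image."""
    return urllib.request.Request(
        api_url,
        data=base64.b64encode(image_bytes),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def top_prediction(response_body):
    """Return (label, confidence) for the most confident prediction, or None."""
    predictions = json.loads(response_body)["Predictions"]
    if not predictions:
        return None
    best = max(predictions, key=lambda label: label["Confidence"])
    return best["Name"], best["Confidence"]


# Example usage (hypothetical URL; requires a deployed API):
# with open("image.jpg", "rb") as f:
#     request = build_inference_request(f.read(), "https://example.com/prod/predict")
# with urllib.request.urlopen(request) as response:
#     print(top_prediction(response.read()))
```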

Clean up

To delete the workflow, use the AWS SAM CLI:

sam delete --stack-name <your sam project name>

To delete the Rekognition Custom Labels model, you can either use the Amazon Rekognition console or the AWS SDK. For more information, refer to Deleting an Amazon Rekognition Custom Labels model.
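With the AWS SDK, deleting a running model takes two calls, because a model version has to be stopped before it can be deleted. The following Boto3 sketch assumes the model is currently running; the function name and polling values are illustrative, and `client` is expected to behave like `boto3.client("rekognition")`.

```python
import time


def stop_and_delete_model(client, project_arn, version_name, version_arn,
                          poll_seconds=30, max_polls=120):
    """Stop a running model version, wait until it is stopped, then delete it."""
    client.stop_project_version(ProjectVersionArn=version_arn)

    for _ in range(max_polls):
        description = client.describe_project_versions(
            ProjectArn=project_arn, VersionNames=[version_name]
        )["ProjectVersionDescriptions"][0]
        if description["Status"] == "STOPPED":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("model did not stop within the polling budget")

    client.delete_project_version(ProjectVersionArn=version_arn)
```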


Conclusion

In this post, we walked through a Step Functions workflow to create a dataset and then train, evaluate, and use a Rekognition Custom Labels model. The workflow allows application developers and ML engineers to automate the custom label classification steps for any computer vision use case. The code for the workflow is open-sourced.

For more serverless learning resources, visit Serverless Land. To learn more about Rekognition custom labels, visit Amazon Rekognition Custom Labels.

About the Author

Veda Raman is a Senior Specialist Solutions Architect for machine learning based in Maryland. Veda works with customers to help them architect efficient, secure and scalable machine learning applications. Veda is interested in helping customers leverage serverless technologies for Machine learning.


Build a machine learning model to predict student performance using Amazon SageMaker Canvas


There has been a paradigm change in the mindshare of education customers who are now willing to explore new technologies and analytics. Universities and other higher learning institutions have collected massive amounts of data over the years, and now they are exploring options to use that data for deeper insights and better educational outcomes.

You can use machine learning (ML) to generate these insights and build predictive models. Educators can also use ML to identify challenges in learning outcomes, increase success and retention among students, and broaden the reach and impact of online learning content.

However, higher education institutions often lack ML professionals and data scientists. As a result, they are looking for solutions that their existing business analysts can adopt quickly.

Amazon SageMaker Canvas is a low-code/no-code ML service that enables business analysts to perform data preparation and transformation, build ML models, and deploy these models into a governed workflow. Analysts can perform all these activities with a few clicks and without writing a single piece of code.

In this post, we show how to use SageMaker Canvas to build an ML model to predict student performance.

Solution overview

For this post, we discuss a specific use case: how universities can predict student dropout or continuation ahead of final exams using SageMaker Canvas. We predict whether the student will drop out, enroll (continue), or graduate at the end of the course. We can use the outcome from the prediction to take proactive action to improve student performance and prevent potential dropouts.

The solution includes the following components:

  • Data ingestion – Import the data from your local computer into SageMaker Canvas
  • Data preparation – Clean and transform the data (if required) within SageMaker Canvas
  • Build the ML model – Build the prediction model inside SageMaker Canvas to predict student performance
  • Prediction – Generate batch or single predictions
  • Collaboration – Analysts using SageMaker Canvas and data scientists using Amazon SageMaker Studio can interact while working in their respective settings, sharing domain knowledge and offering expert feedback to improve models

The following diagram illustrates the solution architecture.

Solution Diagram


Prerequisites

For this post, you should complete the following prerequisites:

  1. Have an AWS account.
  2. Set up SageMaker Canvas. For instructions, refer to Prerequisites for setting up Amazon SageMaker Canvas.
  3. Download the following student dataset to your local computer.

The dataset contains student background information like demographics, academic journey, economic background, and more. The dataset contains 37 columns, out of which 36 are features and 1 is a label. The label column name is Target, and it contains categorical data: dropout, enrolled, and graduate.

The dataset comes under the Attribution 4.0 International (CC BY 4.0) license and is free to share and adapt.
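Before importing the file, you can optionally check that it matches this description. The following sketch uses only the Python standard library; it assumes the file is comma-delimited with a header row, and the function name is hypothetical.

```python
import csv
from collections import Counter


def summarize_dataset(csv_file, label_column="Target"):
    """Return (number of feature columns, count of rows per label value)."""
    reader = csv.DictReader(csv_file)
    label_counts = Counter(row[label_column] for row in reader)
    # Every column except the label column is treated as a feature
    feature_count = len(reader.fieldnames) - 1
    return feature_count, label_counts


# Example usage with the downloaded file:
# with open("Dropout_Academic Success - Sheet1.csv", newline="") as f:
#     features, labels = summarize_dataset(f)
#     print(features, labels)
```

For the dataset described above, the feature count should be 36, and the label counts should contain only the three Target categories.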

Data ingestion

The first step for any ML process is to ingest the data. Complete the following steps:

  1. On the SageMaker Canvas console, choose Import.
  2. Import the Dropout_Academic Success - Sheet1.csv dataset into SageMaker Canvas.
  3. Select the dataset and choose Create a model.
  4. Name the model student-performance-model.

Import dataset and create model

Data preparation

For ML problems, data scientists analyze the dataset for outliers, handle the missing values, add or remove fields, and perform other transformations. Analysts can perform the same actions in SageMaker Canvas using the visual interface. Note that major data transformation is out of scope for this post.

In the following screenshot, the first highlighted section (annotated as 1 in the screenshot) shows the options available with SageMaker Canvas. IT staff can apply these actions on the dataset and can even explore the dataset for more details by choosing Data visualizer.

The second highlighted section (annotated as 2 in the screenshot) indicates that the dataset doesn’t have any missing or mismatched records.

Data preparation

Build the ML model

To proceed with training and building the ML model, we need to choose the column that needs to be predicted.

  1. On the SageMaker Canvas interface, for Select a column to predict, choose Target.

As soon as you choose the target column, SageMaker Canvas prompts you to validate the data.

  2. Choose Validate, and within a few minutes SageMaker Canvas will finish validating your data.

Now it’s time to build the model. You have two options: Quick build and Standard build. Analysts can choose either option based on their requirements.

  3. For this post, we choose Standard build.

Build Model

Apart from speed and accuracy, one major difference between Standard build and Quick build is that Standard build provides the capability to share the model with data scientists, which Quick build doesn’t.

SageMaker Canvas took approximately 25 minutes to train and build the model. Your models may take more or less time, depending on factors such as input data size and complexity. The accuracy of the model was around 80%, as shown in the following screenshot. You can explore the bottom section to see the impact of each column on the prediction.

So far, we have uploaded the dataset, prepared the dataset, and built the prediction model to measure student performance. Next, we have two options:

  • Generate a batch or single prediction
  • Share this model with the data scientists for feedback or improvements


Choose Predict to start generating predictions. You can choose from two options:

  • Batch prediction – You can upload datasets here and let SageMaker Canvas predict the performance for the students. You can use these predictions to take proactive actions.
  • Single prediction – In this option, you provide the values for a single student. SageMaker Canvas will predict the performance for that particular student.



In some cases, you as an analyst might want to get feedback from expert data scientists on the model before proceeding with the prediction. To do so, choose Share and specify the Studio user to share with.

Share model

Then the data scientist can complete the following steps:

  1. On the Studio console, in the navigation pane, under Models, choose Shared models.
  2. Choose View model to open the model.

Shared model

They can update the model in either of the following ways:

  • Share a new model – The data scientist can change the data transformations, retrain the model, and then share the model.
  • Share an alternate model – The data scientist can select an alternate model from the list of trained Amazon SageMaker Autopilot models and share that back with the SageMaker Canvas user.

Share back model

For this example, we choose Share an alternate model and, assuming inference latency is the key parameter, share the second-best model with the SageMaker Canvas user.

The data scientist can also use other parameters like F1 score, precision, recall, and log loss as decision criteria when choosing an alternate model to share with the SageMaker Canvas user.

In this scenario, the best model has an accuracy of 80% and inference latency of 0.781 seconds, whereas the second-best model has an accuracy of 79.9% and inference latency of 0.327 seconds.
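The trade-off behind this choice can be written down as a small rule: among the models whose accuracy is close enough to the best one, pick the fastest. The following sketch applies that rule to the two models above. The 0.5 percentage point tolerance and all names are assumptions for illustration; this is not how SageMaker Canvas or Autopilot ranks models.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy_percent: float
    latency_seconds: float


def pick_fastest_within_tolerance(candidates, max_accuracy_drop):
    """Pick the lowest-latency model whose accuracy is near the best accuracy."""
    best_accuracy = max(c.accuracy_percent for c in candidates)
    eligible = [
        c for c in candidates
        if best_accuracy - c.accuracy_percent <= max_accuracy_drop
    ]
    return min(eligible, key=lambda c: c.latency_seconds)


models = [
    Candidate("best", accuracy_percent=80.0, latency_seconds=0.781),
    Candidate("second-best", accuracy_percent=79.9, latency_seconds=0.327),
]
chosen = pick_fastest_within_tolerance(models, max_accuracy_drop=0.5)
latency_saving = 1 - chosen.latency_seconds / models[0].latency_seconds
print(chosen.name, f"{latency_saving:.1%}")
```

With these numbers, giving up 0.1 percentage points of accuracy cuts inference latency by more than half.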

Alternate model

  1. Choose Share to share an alternate model with the SageMaker Canvas user.
  2. Add the SageMaker Canvas user to share the model with.
  3. Add an optional note, then choose Share.
  4. Choose an alternate model to share.
  5. Add feedback and choose Share to share the model with the SageMaker Canvas user.

Data scientist share back model

After the data scientist has shared an updated model with you, you will get a notification and SageMaker Canvas will start importing the model into the console.

Canvas importing model

SageMaker Canvas takes a moment to import the updated model, which then appears as a new version (V3 in this case).

You can now switch between the versions and generate predictions from any version.

Switching model versions

If an administrator is worried about managing permissions for the analysts and data scientists, they can use Amazon SageMaker Role Manager.

Clean up

To avoid incurring future charges, delete the resources you created while following this post. SageMaker Canvas bills you for the duration of the session, and we recommend logging out of Canvas when you’re not using it. Refer to Logging out of Amazon SageMaker Canvas for more details.


Conclusion

In this post, we discussed how SageMaker Canvas can help higher learning institutions use ML capabilities without requiring ML expertise. In our example, we showed how an analyst can quickly build a highly accurate predictive ML model without writing any code. The university can now act on those insights by specifically targeting students at risk of dropping out of a course with individualized attention and resources, benefitting both parties.

We demonstrated the steps starting from loading the data into SageMaker Canvas, building the model in Canvas, and receiving the feedback from data scientists via Studio. The entire process was completed through web-based user interfaces.

To start your low-code/no-code ML journey, refer to Amazon SageMaker Canvas.

About the author

Ashutosh Kumar is a Solutions Architect with the Public Sector-Education Team. He is passionate about transforming businesses with digital solutions. He has good experience in databases, AI/ML, data analytics, compute, and storage.


Access Snowflake data using OAuth-based authentication in Amazon SageMaker Data Wrangler


In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions from data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.

Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP) such as Okta, Azure AD, or Ping Federate to connect to Snowflake via Data Wrangler.

Solution overview

In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.


Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. For admin:
    • A Snowflake user with permissions to create storage integrations and security integrations in Snowflake.
    • An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
    • Access and permissions to configure the IdP to register the Data Wrangler application and set up the authorization server or API.
  2. For data scientist:

Administrator setup

Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.

The following steps are involved to enable Data Wrangler OAuth access to Snowflake:

  1. Configure the IdP.
  2. Configure Snowflake.
  3. Configure SageMaker Studio.

Configure the IdP

To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.

Register the Data Wrangler application within the IdP

Refer to the following documentation for the IdPs that Data Wrangler supports:

Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.

Specific customizations in addition to the steps in the respective guides are called out in the subsections.

  1. Select the configuration that starts the process of registering Data Wrangler as an application.
  2. Provide the users within the IdP access to Data Wrangler.
  3. Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
  4. Specify a redirect URL using the following format:

You’re specifying the SageMaker domain ID and AWS Region that you’re using to run Data Wrangler. You must register a URL for each domain and Region where you’re running Data Wrangler. Users from a domain and Region that don’t have redirect URLs set up for them won’t be able to authenticate with the IdP to access the Snowflake connection.

  5. Make sure the authorization code and refresh token grant types are allowed for your Data Wrangler application.

Set up the authorization server or API within the IdP

Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.

Snowflake uses the concept of roles that are distinct from IAM roles used in AWS. You must configure the IdP to use ANY Role to use the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.

Use the following procedure to set up the authorization server or API within your IdP:

  1. From your IdP, begin the process of setting up the server or API.
  2. Configure the authorization server to use the authorization code and refresh token grant types.
  3. Specify the lifetime of the access token.
  4. Set the refresh token idle timeout.

The idle timeout is the period after which the refresh token expires if it isn’t used. If you’re scheduling jobs in Data Wrangler, we recommend making the idle timeout longer than the interval between processing jobs. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.

Note that Data Wrangler doesn’t support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.


  5. Specify session:role-any as the new scope.

For Azure AD, you must also specify a unique identifier for the scope.
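The recommendation about the refresh token idle timeout can be expressed as a simple check to run when you plan a job schedule. This is only an illustration of the rule: the function name and the one-hour safety margin are assumptions, and the actual timeout is configured in your IdP.

```python
from datetime import timedelta


def idle_timeout_covers_schedule(idle_timeout, job_interval,
                                 safety_margin=timedelta(hours=1)):
    """Return True if the refresh token stays valid between scheduled jobs."""
    return idle_timeout >= job_interval + safety_margin


# A daily job with a 7-day idle timeout is fine; a 12-hour timeout is not.
print(idle_timeout_covers_schedule(timedelta(days=7), timedelta(days=1)))
print(idle_timeout_covers_schedule(timedelta(hours=12), timedelta(days=1)))
```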

After you’ve set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:

  • Token URL – The URL of the token that the IdP sends to Data Wrangler
  • Authorization URL – The URL of the authorization server of the IdP
  • Client ID – The ID of the IdP
  • Client secret – The secret that only the authorization server or API recognizes
  • OAuth scope – This is for Azure AD only

Configure Snowflake

To configure Snowflake, complete the instructions in Import data from Snowflake.

Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section Register the Data Wrangler application within the IdP for more information on how to set up an external OAuth integration.

When you’re setting up the security integration in Snowflake, make sure you activate external_oauth_any_role_mode.
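As a rough illustration, the following Python helper renders a Snowflake `CREATE SECURITY INTEGRATION` statement with the any-role mode enabled. It is a sketch, not the exact statement for your IdP: the integration name, issuer, key URL, and audience are placeholders you take from your IdP, and the user-mapping values (`sub` and `login_name`) are assumptions that depend on how your Snowflake users are identified. Follow the Snowflake documentation for your IdP for the authoritative form.

```python
def external_oauth_integration_sql(name, issuer, jws_keys_url, audience,
                                   oauth_type="OKTA"):
    """Render a CREATE SECURITY INTEGRATION statement for external OAuth."""
    values = (name, issuer, jws_keys_url, audience, oauth_type)
    # Keep the sketch simple: reject values that would break the SQL literals
    if any("'" in value or ";" in value for value in values):
        raise ValueError("values must not contain quotes or semicolons")
    return "\n".join([
        f"CREATE SECURITY INTEGRATION {name}",
        "  TYPE = EXTERNAL_OAUTH",
        "  ENABLED = TRUE",
        f"  EXTERNAL_OAUTH_TYPE = {oauth_type}",
        f"  EXTERNAL_OAUTH_ISSUER = '{issuer}'",
        f"  EXTERNAL_OAUTH_JWS_KEYS_URL = '{jws_keys_url}'",
        f"  EXTERNAL_OAUTH_AUDIENCE_LIST = ('{audience}')",
        "  EXTERNAL_OAUTH_TOKEN_USER_MAPPING_CLAIM = 'sub'",
        "  EXTERNAL_OAUTH_SNOWFLAKE_USER_MAPPING_ATTRIBUTE = 'login_name'",
        "  EXTERNAL_OAUTH_ANY_ROLE_MODE = 'ENABLE';",
    ])
```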

Configure SageMaker Studio

You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you’re using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs into Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.

Create a secret for Snowflake credentials

To create your secret for Snowflake credentials, complete the following steps:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Specify the details of your secret as key-value pairs.

Key names must be lowercase because they are case-sensitive. Data Wrangler gives a warning if you enter any of these incorrectly. Enter the secret values as key-value pairs using the Key/value option, or use the Plaintext option.

The following is the format of the secret used for Okta. If you are using Azure AD, you need to add the datasource_oauth_scope field.

  4. Update the preceding values with your choice of IdP and information gathered after application registration.
  5. Choose Next.
  6. For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
  7. In the Tags section, add a tag with the key SageMaker and value true.
  8. Choose Next.
  9. The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.

After you store the secret, you’re returned to the Secrets Manager console.

  10. Choose the secret you just created, then retrieve the secret ARN.
  11. Store this in your preferred text editor for use later when you create the Data Wrangler data source.
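The console steps above can also be scripted. The following Boto3 sketch stores the settings as a JSON secret with the required name prefix and tag, and returns the secret ARN. The function name is hypothetical, and `secrets_client` is expected to behave like `boto3.client("secretsmanager")`. The key names in the usage comment are assumptions that mirror the fields listed earlier (only `datasource_oauth_scope` is named in this post), so check them against the Data Wrangler documentation.

```python
import json


def store_snowflake_oauth_secret(secrets_client, secret_name, oauth_settings):
    """Create the Secrets Manager secret that Data Wrangler reads, return its ARN."""
    if not secret_name.startswith("AmazonSageMaker"):
        raise ValueError("secret name must start with the prefix AmazonSageMaker")

    # Key names are case-sensitive and must be lowercase
    payload = {key.lower(): value for key, value in oauth_settings.items()}

    response = secrets_client.create_secret(
        Name=secret_name,
        SecretString=json.dumps(payload),
        Tags=[{"Key": "SageMaker", "Value": "true"}],
    )
    return response["ARN"]


# Example usage (key names and values are placeholders, not confirmed here):
# import boto3
# arn = store_snowflake_oauth_secret(
#     boto3.client("secretsmanager"),
#     "AmazonSageMaker-DataWranglerSnowflakeCreds",
#     {"token_url": "...", "authorization_url": "...",
#      "client_id": "...", "client_secret": "..."},
# )
```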

Create a Studio Lifecycle Configuration

To create a Lifecycle Configuration in Studio, complete the following steps:

  1. On the SageMaker console, choose Lifecycle configurations in the navigation pane.
  2. Choose Create configuration.
  3. Choose Jupyter server app.
  4. Create a new lifecycle configuration or append to an existing one with the following content:
    #!/bin/bash
    set -eux
    ## Script Body
    cat > ~/.snowflake_identity_provider_oauth_config <<EOL
    {
        "secret_arn": "<secret_arn>"
    }
    EOL

The configuration creates a file named ".snowflake_identity_provider_oauth_config" in the user’s home folder, containing the ARN of the secret.

  5. Choose Create Configuration.
    Create configuration
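The same Lifecycle Configuration can be created with Boto3. In this sketch, the script is generated from the secret ARN and base64-encoded, which is the form the SageMaker API expects for the script content. The function names are hypothetical, and `sagemaker_client` is expected to behave like `boto3.client("sagemaker")`.

```python
import base64
import json


def build_lifecycle_script(secret_arn):
    """Return a shell script that writes the Snowflake OAuth config file."""
    config = json.dumps({"secret_arn": secret_arn}, indent=4)
    return "\n".join([
        "#!/bin/bash",
        "set -eux",
        "cat > ~/.snowflake_identity_provider_oauth_config <<EOL",
        config,
        "EOL",
        "",
    ])


def create_lifecycle_config(sagemaker_client, name, secret_arn):
    """Register the script as a Studio Lifecycle Configuration, return its ARN."""
    script = build_lifecycle_script(secret_arn)
    response = sagemaker_client.create_studio_lifecycle_config(
        StudioLifecycleConfigName=name,
        StudioLifecycleConfigContent=base64.b64encode(script.encode("utf-8")).decode("ascii"),
        StudioLifecycleConfigAppType="JupyterServer",
    )
    return response["StudioLifecycleConfigArn"]
```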

Set the default Lifecycle Configuration

Complete the following steps to set the Lifecycle Configuration you just created as the default:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the Studio domain you’ll be using for this example.
  3. On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
  4. For Source, select Existing configuration.
  5. Select the configuration you just made, then choose Attach to domain.
  6. Select the new configuration and choose Set as default, then choose Set as default again in the pop-up message.

Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.

  7. Shut down the Studio app and relaunch for the changes to take effect.
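Attaching the configuration and setting it as the default can also be sketched with Boto3. This assumes the domain has no other Jupyter server Lifecycle Configurations you need to keep, because the list passed here is the one the domain ends up with. The function name is hypothetical, and `sagemaker_client` is expected to behave like `boto3.client("sagemaker")`.

```python
def set_default_lifecycle_config(sagemaker_client, domain_id, lifecycle_config_arn):
    """Attach a Lifecycle Configuration to a domain and make it the default."""
    sagemaker_client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings={
            "JupyterServerAppSettings": {
                # The default resource spec points at the configuration to run
                "DefaultResourceSpec": {
                    "InstanceType": "system",
                    "LifecycleConfigArn": lifecycle_config_arn,
                },
                # Configurations attached to the domain for Jupyter server apps
                "LifecycleConfigArns": [lifecycle_config_arn],
            }
        },
    )
```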

Data scientist experience

In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.

Create a new data flow

To create your data flow, complete the following steps:

  1. On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
  2. Choose Open Studio.
  3. On the Studio Home page, choose Import & prepare data visually. Alternatively, on the File drop-down, choose New, then choose SageMaker Data Wrangler Flow.

Creating a new flow can take a few minutes.

Creating a new flow

  4. On the Import data page, choose Create connection.
  5. Choose Snowflake from the list of data sources.
  6. For Authentication method, choose OAuth.

If you don’t see OAuth, verify the preceding Lifecycle Configuration steps.

  7. Enter details for Snowflake account name and Storage integration.
  8. Enter a connection name and choose Connect.

You’re redirected to an IdP authentication page. For this example, we’re using Okta.

  9. Enter your user name and password, then choose Sign in.
    Sign in to Okta

After the authentication is successful, you’re redirected to the Studio data flow page.

  10. On the Import data from Snowflake page, browse the database objects, or run a query for the targeted data.
  11. In the query editor, enter a query and preview the results.

In the following example, we load Loan Data and retrieve all columns from 5,000 rows.

  12. Choose Import.
    import columns
  13. Enter a dataset name (for this post, we use snowflake_loan_dataset) and choose Add.
    add dataset name

You’re redirected to the Prepare page, where you can add transformations and analyses to the data.

Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of the capabilities of Data Wrangler in this post on data preparation; you can use Data Wrangler for more advanced data analysis such as feature importance, target leakage, and model explainability using an easy and intuitive user interface.

Analyze data quality

Use the Data Quality and Insights Report to perform an analysis of the data that you’ve imported into Data Wrangler. Data Wrangler creates the report from the sampled data.

  1. On the Data Wrangler flow page, choose the plus sign next to Data types, then choose Get data insights.
    Get Data Insights
  2. Choose Data Quality And Insights Report for Analysis type.
  3. For Target column, choose your target column.
  4. For Problem type, select Classification.
  5. Choose Create.
    Create analysis

The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.

Insight report

Add transformations to the data

Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.

  1. On the Data Wrangler flow page, choose the plus sign, then choose Add transform.

If you’re following the steps in the post, you’re directed here automatically after adding your dataset.

Add Transform

  2. Verify and modify the data type of the columns.

Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than string.

  3. After applying the changes and adding the step, you can verify the column data type is changed to float.

Update the data type of the columns.

Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we can drop them.

  1. Choose Add Step, then choose Manage columns.
  2. For Transform, choose Drop column.
  3. For Column to drop, specify EMP_TITLE, URL, DESCRIPTION, and TITLE.
  4. Choose Preview and Add.

Drop Columns
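For readers who want to see what these two steps do to the data, the following is a minimal sketch in plain Python. It is not Data Wrangler code: the helper names cast_to_float and drop_columns and the sample row are hypothetical, and only the column names come from this post.

```python
def cast_to_float(rows, columns):
    """Return copies of the rows with the given columns parsed as float.

    Blank or unparsable values become None instead of raising an error.
    """
    converted = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            try:
                new_row[col] = float(new_row.get(col))
            except (TypeError, ValueError):
                new_row[col] = None
        converted.append(new_row)
    return converted


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    to_drop = set(columns)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Hypothetical sample row, for illustration only
rows = [{
    "MNTHS_SINCE_LAST_DELINQ": "12",
    "MNTHS_SINCE_LAST_RECORD": "",
    "EMP_TITLE": "analyst",
    "URL": "n/a",
    "DESCRIPTION": "n/a",
    "TITLE": "n/a",
    "TERM": "36 months",
}]

rows = cast_to_float(rows, ["MNTHS_SINCE_LAST_DELINQ", "MNTHS_SINCE_LAST_RECORD"])
rows = drop_columns(rows, ["EMP_TITLE", "URL", "DESCRIPTION", "TITLE"])
```

Mapping blank strings to None is a choice made for this sketch; how missing values are represented after a type change in your own flow may differ.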

Next, we want to look for categorical data in our dataset. Data Wrangler has a built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.

  1. Add another step and choose Encode categorical.
  2. For Transform, choose One-hot encode.
  3. For Input column, choose TERM.
  4. For Output style, choose Columns.
  5. Leave all other settings as default, then choose Preview and Add.

The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.

  1. Repeat the preceding steps to apply a one-hot encoding approach on these values.

Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.

  1. For Transform, choose One-hot encode.
  2. For Input column, choose PURPOSE.
  3. For Output style, choose Vector.
  4. For Output column, we call this column PURPOSE_VCTR.

This keeps the original PURPOSE column, if we decide to use it later.

  1. Leave all other settings as default, then choose Preview and Add.
    Preview and Add
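To make the difference between the Columns and Vector output styles concrete, here is a small sketch in plain Python. It only illustrates the idea of one-hot encoding and is not how Data Wrangler implements it; the function names and sample rows are hypothetical, and dropping the input column in the columns style is a simplification chosen for this example.

```python
def one_hot_columns(rows, column):
    """One-hot encode `column` into one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def one_hot_vector(rows, column, output_column):
    """One-hot encode `column` into a single list-valued column, keeping the input column."""
    categories = sorted({row[column] for row in rows})
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for row in rows:
        vector = [0] * len(categories)
        vector[position[row[column]]] = 1
        encoded.append({**row, output_column: vector})
    return encoded


# Hypothetical sample rows, for illustration only
sample = [
    {"TERM": "36 months", "PURPOSE": "car"},
    {"TERM": "60 months", "PURPOSE": "house"},
]

as_columns = one_hot_columns(sample, "TERM")
as_vector = one_hot_vector(sample, "PURPOSE", "PURPOSE_VCTR")
```

The vector style keeps the table narrow when a column has many categories, which is why it suits PURPOSE better than TERM.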

Export the data flow

Finally, we export this whole data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.

  1. On the data flow page, choose the plus sign, then choose Export to.
  2. Choose where to export. For our use case, we choose SageMaker Feature Store.

The exported notebook is now ready to run.

Notebook is ready to run

Export data and train a model with Autopilot

Now we can train the model using Amazon SageMaker Autopilot.

  1. On the data flow page, choose the Training tab.
  2. For Amazon S3 location, enter a location for the data to be saved.
  3. Choose Export and train.
    export and train
  4. Specify the settings in the Target and features, Training method, Deployment and advanced settings, and Review and create sections.
  5. Choose Create experiment to find the best model for your problem.

Clean up

If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.


Conclusion

In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting it to the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.

About the authors

Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience in working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.

Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, clinical, and finance, as well as advising in consultative roles.

Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audio books, gardening, hiking, and spending time with her family and friends.

Remote monitoring of raw material supply chains for sustainability with Amazon SageMaker geospatial capabilities

Deforestation is a major concern in many tropical geographies where local rainforests are at severe risk of destruction. About 17% of the Amazon rainforest has been destroyed over the past 50 years, and some tropical ecosystems are approaching a tipping point beyond which recovery is unlikely.

A key driver for deforestation is raw material extraction and production, for example the production of food and timber or mining operations. Businesses consuming these resources are increasingly recognizing their share of responsibility in tackling the deforestation issue. One way they can do this is by ensuring that their raw material supply is produced and sourced sustainably. For example, if a business uses palm oil in their products, they will want to ensure that natural forests were not burned down and cleared to make way for a new palm oil plantation.

Geospatial analysis of satellite imagery taken of the locations where suppliers operate can be a powerful tool to detect problematic deforestation events. However, running such analyses is difficult, time-consuming, and resource-intensive. Amazon SageMaker geospatial capabilities—now generally available in the AWS Oregon Region—provide a new and much simpler solution to this problem. The tool makes it easy to access geospatial data sources, run purpose-built processing operations, apply pre-trained ML models, and use built-in visualization tools faster and at scale.

In this post, you will learn how to use SageMaker geospatial capabilities to easily baseline and monitor the vegetation type and density of areas where suppliers operate. Supply chain and sustainability professionals can use this solution to track the temporal and spatial dynamics of unsustainable deforestation in their supply chains. Specifically, the guidance provides data-driven insights into the following questions:

  • When and over what period did deforestation occur – The guidance allows you to pinpoint when a new deforestation event occurred and monitor its duration, progression, or recovery
  • Which type of land cover was most affected – The guidance allows you to pinpoint which vegetation types were most affected by a land cover change event (for example, tropical forests or shrubs)
  • Where specifically did deforestation occur – Pixel-by-pixel comparisons between baseline and current satellite imagery (before vs. after) allow you to identify the precise locations where deforestation has occurred
  • How much forest was cleared – An estimate of the affected area (in km²) is provided by taking advantage of the fine-grained resolution of satellite data (for example, 10 m x 10 m raster cells for Sentinel 2)

Solution overview

The solution uses SageMaker geospatial capabilities to retrieve up-to-date satellite imagery for any area of interest with just a few lines of code, and apply pre-built algorithms such as land use classifiers and band math operations. You can then visualize results using built-in mapping and raster image visualization tooling. To derive further insights from the satellite data, the guidance uses the export functionality of Amazon SageMaker to save the processed satellite imagery to Amazon Simple Storage Service (Amazon S3), where data is cataloged and shared for custom postprocessing and analysis in an Amazon SageMaker Studio notebook with a SageMaker geospatial image. Results of these custom analyses are subsequently published and made observable in Amazon QuickSight so that procurement and sustainability teams can review supplier location vegetation data in one place. The following diagram illustrates this architecture.

solution architecture diagram

The notebooks and code with a deployment-ready implementation of the analyses shown in this post are available at the GitHub repository Guidance for Geospatial Insights for Sustainability on AWS.

Example use case

This post uses an area of interest (AOI) from Brazil where land clearing for cattle production, oilseed growing (soybean and palm oil), and timber harvesting is a major concern. You can also generalize this solution to any other desired AOI.

The following screenshot displays the AOI showing satellite imagery (visible band) from the European Space Agency’s Sentinel 2 satellite constellation retrieved and visualized in a SageMaker notebook. Agricultural regions are clearly visible against dark green natural rainforest. Note also the smoke originating from inside the AOI as well as a larger area to the North. Smoke is often an indicator of the use of fire in land clearing.

Agricultural regions are clearly visible against dark green natural rainforest

NDVI as a measure for vegetation density

To identify and quantify changes in forest cover over time, this solution uses the Normalized Difference Vegetation Index (NDVI). NDVI is calculated from the visible and near-infrared light reflected by vegetation. Healthy vegetation absorbs most of the visible light that hits it, and reflects a large portion of the near-infrared light. Unhealthy or sparse vegetation reflects more visible light and less near-infrared light. The index is computed by combining the red (visible) and near-infrared (NIR) bands of a satellite image into a single index ranging from -1 to 1.

Negative values of NDVI (values approaching -1) correspond to water. Values close to zero (-0.1 to 0.1) represent barren areas of rock, sand, or snow. Low positive values represent shrub, grassland, or farmland (approximately 0.2 to 0.4), whereas high NDVI values indicate temperate and tropical rainforests (values approaching 1). (Learn more about NDVI calculations here.) NDVI values can therefore be mapped easily to a corresponding vegetation class:

          (-1,0]: "no vegetation (water, rock, artificial structures)",
          (0,0.5]:"light vegetation (shrubs, grass, fields)",
          (0.5,0.7]:"dense vegetation (plantations)",
          (0.7,1]:"very dense vegetation (rainforest)"

By tracking changes in NDVI over time using the SageMaker built-in NDVI model, we can infer key information on whether suppliers operating in the AOI are doing so responsibly or whether they’re engaging in unsustainable forest clearing activity.
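As a quick illustration of the index itself, the sketch below computes NDVI for a single pixel with the standard formula (NIR - Red) / (NIR + Red) and maps the result to the vegetation classes listed above. It is independent of the SageMaker built-in NDVI model; the function names are hypothetical, and returning 0 when both bands are zero is an assumption made to avoid a division by zero.

```python
# Upper bound of each class interval and its label, in ascending order
NDVI_CLASSES = [
    (0.0, "no vegetation (water, rock, artificial structures)"),
    (0.5, "light vegetation (shrubs, grass, fields)"),
    (0.7, "dense vegetation (plantations)"),
    (1.0, "very dense vegetation (rainforest)"),
]


def ndvi(nir, red):
    """Compute NDVI from near-infrared and red reflectance values."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat a pixel with no signal as neutral
    return (nir - red) / total


def classify_ndvi(value):
    """Map an NDVI value in [-1, 1] to its vegetation class label."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("NDVI must be between -1 and 1")
    for upper_bound, label in NDVI_CLASSES:
        if value <= upper_bound:
            return label


# Illustrative reflectance values (not taken from the AOI in this post)
pixel_ndvi = ndvi(nir=4000, red=1000)
pixel_class = classify_ndvi(pixel_ndvi)
```

In this sketch a value of exactly -1 is assigned to the "no vegetation" class, even though the first interval above is written as open at -1.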

Retrieve, process, and visualize NDVI data using SageMaker geospatial capabilities

One primary function of the SageMaker Geospatial API is the Earth Observation Job (EOJ), which allows you to acquire and transform raster data collected from the Earth’s surface. An EOJ retrieves satellite imagery from a specified data source (i.e., a satellite constellation) for a specified area of interest and time period, and applies one or several models to the retrieved images.

EOJs can be created via a geospatial notebook. For this post, we use an example notebook.

To configure an EOJ, set the following parameters:

  • InputConfig – The input configuration defines data sources and filtering criteria to be applied during data acquisition:
    • RasterDataCollectionArn – Defines which satellite to collect data from.
    • AreaOfInterest – The geographical AOI; defines the polygon for which images are to be collected (in GeoJSON format).
    • TimeRangeFilter – The time range of interest: {StartTime: <string>, EndTime: <string> }.
    • PropertyFilters – Additional property filters, such as maximum acceptable cloud cover.
  • JobConfig – The model configuration defines the processing job to be applied to the retrieved satellite image data. An NDVI model is available as part of the pre-built BandMath operation.
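If you don't have a GeoJSON file at hand, the AreaOfInterest value can also be assembled from the corners of a bounding box. The following sketch builds a closed polygon ring in GeoJSON order (longitude first, then latitude) and wraps it in the AreaOfInterestGeometry structure used later in this post. The helper names and the coordinates are hypothetical and only serve as an example.

```python
def bounding_box_polygon(min_lon, min_lat, max_lon, max_lat):
    """Return GeoJSON polygon coordinates for a bounding box.

    The ring is listed counter-clockwise and closed, meaning the first
    and last positions are identical.
    """
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("minimum values must be smaller than maximum values")
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]
    return [ring]


def area_of_interest(coordinates):
    """Wrap polygon coordinates in the AreaOfInterest structure of the query."""
    return {
        "AreaOfInterestGeometry": {
            "PolygonGeometry": {"Coordinates": coordinates}
        }
    }


# Hypothetical bounding box, for illustration only
aoi = area_of_interest(bounding_box_polygon(-56.0, -12.5, -55.5, -12.0))
```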

Set InputConfig

SageMaker geospatial capabilities support satellite imagery from two different sources that can be referenced via their Amazon Resource Names (ARNs):

  • Landsat Collection 2 Level-2 Science Products, which measures the Earth’s surface reflectance (SR) and surface temperature (ST) at a spatial resolution of 30m
  • Sentinel 2 L2A COGs, which provides large-swath continuous spectral measurements across 13 individual bands (blue, green, near-infrared, and so on) with resolution down to 10m.

You can retrieve these ARNs directly via the API by calling list_raster_data_collections().

This solution uses Sentinel 2 data. The Sentinel-2 mission is based on a constellation of two satellites. As a constellation, the same spot over the equator is revisited every 5 days, allowing for frequent and high-resolution observations. To specify Sentinel 2 as data source for the EOJ, simply reference the ARN:

#set raster data collection arn to sentinel 2
data_collection_arn = "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8"

Next, the AreaOfInterest (AOI) for the EOJ needs to be defined. To do so, you need to provide a GeoJSON of the bounding box that defines the area where a supplier operates. The following code snippet extracts the bounding box coordinates and defines the EOJ request input:

import json
import geopandas as gpd

geojson_file_name = "../assets/aoi_samples/brazil_plantation_mato_grosso.geojson"
aoi_shape = gpd.read_file(geojson_file_name) #load geojson as shape file
aoi_shape = aoi_shape.set_crs(epsg=4326) #set projection, i.e. coordinate reference system (CRS)
aoi_coordinates = json.loads(aoi_shape.to_json())['features'][0]["geometry"]["coordinates"] #extract coordinates

#set aoi query parameters
selected_aoi_feature = {
    "AreaOfInterestGeometry": {
        "PolygonGeometry": {
            "Coordinates": aoi_coordinates
        }
    }
}

The time range is defined using the following request syntax:

start="2022-10-01T00:00:00" #time in UTC
end="2022-12-20T00:00:00" #time in UTC
time_filter = {
    "StartTime": start,
    "EndTime": end
}

Depending on the raster data collection selected, different additional property filters are supported. You can review the available options by calling get_raster_data_collection(Arn=data_collection_arn)["SupportedFilters"]. In the following example, a tight limit of 5% cloud cover is imposed to ensure a relatively unobstructed view on the AOI:

property_filters = {
    "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 5}}}],
    "LogicalOperator": "AND"
}

Review query results

Before you start the EOJ, make sure that the query parameters actually result in satellite images being returned as a response. In this example, the ApproximateResultCount is 3, which is sufficient. You may need to use a less restrictive PropertyFilter if no results are returned.

#consolidate query parameters
query_param = {
    "AreaOfInterest": selected_aoi_feature,
    "TimeRangeFilter": time_filter,
    "PropertyFilters": property_filters
}
#review query results
query_results = sm_geo_client.search_raster_data_collection(
    Arn = data_collection_arn,
    RasterDataCollectionQuery = query_param
)
#get result count
result_count = query_results["ApproximateResultCount"]

You can review thumbnails of the raw input images by indexing the query_results object. For example, the raw image thumbnail URL of the last item returned by the query can be accessed as follows: query_results['Items'][-1]["Assets"]["thumbnail"]["Href"].

Set JobConfig

Now that we have set all required parameters needed to acquire the raw Sentinel 2 satellite data, the next step is to infer vegetation density measured in terms of NDVI. This would typically involve identifying the satellite tiles that intersect the AOI and downloading the satellite imagery for the time frame in scope from a data provider. You would then have to go through the process of overlaying, merging, and clipping the acquired files, computing the NDVI for each raster cell of the combined image by performing mathematical operations on the respective bands (such as red and near-infrared), and finally saving the results to a new single-band raster image. SageMaker geospatial capabilities provide an end-to-end implementation of this workflow, including a built-in NDVI model that can be run with a simple API call. All you need to do is specify the job configuration and set it to the predefined NDVI model:

job_config = {
    "BandMathConfig": {"PredefinedIndices": ["NDVI"]}
}

Having defined all required inputs for SageMaker to acquire and transform the geospatial data of interest, you can now start the EOJ with a simple API call:

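#set job name
job_name = "EOJ-Brazil-MatoGrosso-2022-Q4"

#set input config
input_config = dict(query_param) #copy the query parameters consolidated earlier
input_config["RasterDataCollectionArn"] = data_collection_arn #add RasterDataCollectionArn to input_conf

#invoke EOJ
eoj = sm_geo_client.start_earth_observation_job(
    Name=job_name,
    ExecutionRoleArn=execution_role_arn, #assumption: ARN of an IAM role for the job, defined elsewhere in the notebook
    InputConfig={"RasterDataCollectionQuery": input_config},
    JobConfig=job_config #the job configuration (BandMathConfig with the predefined NDVI index)
)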

After the EOJ is complete, you can start exploring the results. SageMaker geospatial capabilities provide built-in visualization tooling powered by Foursquare Studio, which natively works from within a SageMaker notebook via the SageMaker geospatial Map SDK. The following code snippet initializes and renders a map and then adds several layers to it:

import sagemaker_geospatial_map

#initialize and render map
geo_map = sagemaker_geospatial_map.create_map({"is_raster": True})
geo_map.set_sagemaker_geospatial_client(sm_geo_client)
geo_map.render()

#add AOI layer
config = {"label": "EOJ AOI"}
aoi_layer = geo_map.visualize_eoj_aoi(
    Arn=eoj["Arn"], config=config
)
#add input layer
config = {"label": "EOJ Input"}
input_layer = geo_map.visualize_eoj_input(
    Arn=eoj["Arn"], config=config
)
#add output layer
config = {
    "label":"EOJ Output",
    "preset": "singleBand",
    "band_name": "ndvi"
}
output_layer = geo_map.visualize_eoj_output(
    Arn=eoj["Arn"], config=config
)

Once rendered, you can interact with the map by hiding or showing layers, zooming in and out, or modifying color schemes, among other options. The following screenshot shows the AOI bounding box layer superimposed on the output layer (the NDVI-transformed Sentinel 2 raster file). Bright yellow patches represent rainforest that is intact (NDVI=1), darker patches represent fields (0.5>NDVI>0), and dark-blue patches represent water (NDVI=-1).

AOI bounding box layer superimposed on the output layer

By comparing current period values vs. a defined baseline period, changes and anomalies in NDVI can be identified and tracked over time.

Custom postprocessing and QuickSight visualization for additional insights

SageMaker geospatial capabilities come with a powerful pre-built analysis and mapping toolkit that delivers the functionality needed for many geospatial analysis tasks. In some cases, you may require additional flexibility and want to run customized post-analyses on the EOJ results. SageMaker geospatial capabilities facilitate this flexibility via an export function. Exporting EOJ outputs is again a simple API call:

    OutputConfig={"S3Data": {"S3Uri":"s3://{}".format(main_bucket_name)+"/raw-eoj-output/"}},

Then you can download the output raster files for further local processing in a SageMaker geospatial notebook using common Python libraries for geospatial analysis such as GDAL, Fiona, GeoPandas, Shapely, and Rasterio, as well as SageMaker-specific libraries. With the analyses running in SageMaker, all AWS analytics tooling that natively integrate with SageMaker are also at your disposal. For example, the solution linked in the Guidance for Geospatial Insights for Sustainability on AWS GitHub repo uses Amazon S3 and Amazon Athena for querying the postprocessing results and making them observable in a QuickSight dashboard. All processing routines along with deployment code and instructions for the QuickSight dashboard are detailed in the GitHub repository.

The dashboard offers three core visualization components:

  • A time series plot of NDVI values normalized against a baseline period, which enables you to track the temporal dynamics in vegetation density
  • Full discrete distribution of NDVI values in the baseline period and the current period, providing transparency on which vegetation types have seen the largest change
  • NDVI-transformed satellite imagery for the baseline period, current period, and pixel-by-pixel differences between both periods, which allows you to identify the affected regions inside the AOI
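The first component, the baseline-normalized time series, boils down to a simple calculation. The sketch below expresses each period's mean NDVI as a percentage change against the baseline period. It is not the code from the GitHub repository; the function name and the NDVI values are hypothetical.

```python
def normalize_to_baseline(mean_ndvi_by_period, baseline_period):
    """Return the percentage change of each period's mean NDVI versus the baseline."""
    baseline = mean_ndvi_by_period[baseline_period]
    if baseline == 0:
        raise ValueError("baseline NDVI must be non-zero")
    return {
        period: (value - baseline) / baseline * 100.0
        for period, value in mean_ndvi_by_period.items()
    }


# Hypothetical mean NDVI values per quarter, for illustration only
mean_ndvi = {"2017-Q3": 0.80, "2019-Q3": 0.78, "2022-Q3": 0.72}
change_pct = normalize_to_baseline(mean_ndvi, "2017-Q3")
```

A negative value means the AOI has become less densely vegetated than it was in the baseline period.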

As shown in the following example, over the period of 5 years (2017-Q3 to 2022-Q3), the average NDVI of the AOI decreased by 7.6% against the baseline period (Q3 2017), affecting a total area of 250.21 km². This reduction was primarily driven by changes in high-NDVI areas (forest, rainforest), which can be seen when comparing the NDVI distributions of the current vs. the baseline period.

comparing the NDVI distributions of the current vs. the baseline period

The pixel-by-pixel spatial comparison against the baseline highlights that the deforestation event occurred in an area right at the center of the AOI where previously untouched natural forest has been converted into farmland. Supply chain professionals can take these data points as a basis for further investigation and a potential review of relationships with the supplier in question.
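A pixel-by-pixel comparison like this can be sketched with two small NDVI grids. The example below counts the cells that were above a forest threshold in the baseline and at or below it in the current period, then converts the count to km² using the 10 m x 10 m Sentinel 2 cell size mentioned earlier. The function name, the 0.7 threshold (taken from the class boundaries above), and the grid values are assumptions for illustration; the repository's postprocessing may work differently.

```python
def cleared_area_km2(baseline, current, forest_threshold=0.7, cell_size_m=10.0):
    """Estimate the area that changed from forest to non-forest between two NDVI grids.

    Both grids are lists of rows with identical shape. A cell counts as
    cleared if its NDVI was above the threshold in the baseline and is at
    or below the threshold in the current period.
    """
    cleared_cells = 0
    for baseline_row, current_row in zip(baseline, current):
        for before, after in zip(baseline_row, current_row):
            if before > forest_threshold and after <= forest_threshold:
                cleared_cells += 1
    return cleared_cells * cell_size_m ** 2 / 1_000_000  # m² to km²


# Hypothetical 2 x 3 NDVI grids, for illustration only
baseline_grid = [[0.85, 0.90, 0.30], [0.80, 0.75, 0.20]]
current_grid = [[0.85, 0.35, 0.30], [0.40, 0.76, 0.20]]

area = cleared_area_km2(baseline_grid, current_grid)
```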


Conclusion

SageMaker geospatial capabilities can form an integral part of tracking corporate climate action plans by making remote geospatial monitoring easy and accessible. This blog post focused on just one specific use case – monitoring raw material supply chain origins. Other use cases are easily conceivable. For example, you could use a similar architecture to track forest restoration efforts for emission offsetting, monitor plant health in reforestation or farming applications, or detect the impact of droughts on water bodies, among many other applications.

About the Authors

Karsten Schroer is a Solutions Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.

Tamara Herbert is an Application Developer with AWS Professional Services in the UK. She specializes in building modern & scalable applications for a wide variety of customers, currently focusing on those within the public sector. She is actively involved in building solutions and driving conversations that enable organizations to meet their sustainability goals both in and through the cloud.

Margaret O’Toole joined AWS in 2017 and has spent her time helping customers in various technical roles. Today, Margaret is the WW Tech Leader for Sustainability and leads a community of customer facing sustainability technical specialists. Together, the group helps customers optimize IT for sustainability and leverage AWS technology to solve some of the most difficult challenges in sustainability around the world. Margaret studied biology and computer science at the University of Virginia and Leading Sustainable Corporations at Oxford’s Saïd Business School.

Best practices for viewing and querying Amazon SageMaker service quota usage

Amazon SageMaker customers can view and manage their quota limits through Service Quotas. In addition, they can view near real-time utilization metrics and create Amazon CloudWatch metrics to view and programmatically query SageMaker quotas.

SageMaker helps you build, train, and deploy machine learning (ML) models with ease. To learn more, refer to Getting started with Amazon SageMaker. Service Quotas simplifies limit management by allowing you to view and manage your quotas for SageMaker from a central location.

With Service Quotas, you can view the maximum number of resources, actions, or items in your AWS account or AWS Region. You can also use Service Quotas to request an increase for adjustable quotas.

With the increasing usage of MLOps practices, and therefore the demand for resources designated for ML model experimentation and retraining, more customers need to run multiple instances, often of the same instance type at the same time.

Many data science teams often work in parallel, using several instances for processing, training, and tuning concurrently. Previously, users would sometimes reach an adjustable account limit for some particular instance type and have to manually request a limit increase from AWS.

To request quota increases manually from the Service Quotas UI, you can choose the quota from the list and choose Request quota increase. For more information, refer to Requesting a quota increase.

In this post, we show how you can use the new features to automatically request limit increases when a high level of instances is reached.

Solution overview

The following diagram illustrates the solution architecture.

This architecture includes the following workflow:

  1. A CloudWatch metric monitors the usage of the resource. A CloudWatch alarm triggers when the resource usage goes beyond a certain preconfigured threshold.
  2. A message is sent to Amazon Simple Notification Service (Amazon SNS).
  3. The message is received by an AWS Lambda function.
  4. The Lambda function requests the quota increase.

Aside from requesting a quota increase for the specific account, the Lambda function can also add the quota increase to the organization template (up to 10 quotas). This way, any new account created under a given AWS Organization has the increased quotas requested by default.
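The last two steps of the workflow can be sketched as a small Lambda handler. This is a simplified illustration, not the function deployed by the GitHub repo: the quota code is a placeholder, the 1.5x growth factor is an arbitrary choice, and the optional client argument exists only so the logic can be exercised without AWS access. The handler reads the CloudWatch alarm message from the SNS event and, if the alarm is in the ALARM state, requests an increase through the Service Quotas API.

```python
import json
import math


def desired_quota(current_value, factor=1.5):
    """Return the new quota value to request, rounded up to a whole number."""
    return float(math.ceil(current_value * factor))


def handler(event, context, client=None, quota_code="L-XXXXXXXX"):
    """Request a SageMaker quota increase for every ALARM notification in the event.

    quota_code is a placeholder; replace it with the code of the quota you monitor.
    """
    if client is None:
        import boto3  # imported lazily so the logic can be tested without boto3
        client = boto3.client("service-quotas")

    responses = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK and INSUFFICIENT_DATA notifications
        quota = client.get_service_quota(ServiceCode="sagemaker", QuotaCode=quota_code)
        responses.append(
            client.request_service_quota_increase(
                ServiceCode="sagemaker",
                QuotaCode=quota_code,
                DesiredValue=desired_quota(quota["Quota"]["Value"]),
            )
        )
    return responses
```

Passing the client in as an argument keeps the decision logic separate from the AWS calls, which makes it straightforward to unit test the handler with a stub before deploying it.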


Prerequisites

Complete the following prerequisite steps:

  1. Set up an AWS account and create an AWS Identity and Access Management (IAM) user. For instructions, refer to Secure Your AWS Account.
  2. Install the AWS SAM CLI.

Deploy using AWS Serverless Application Model

To deploy the application using the GitHub repo, run the following commands in the terminal:

git clone
cd sagemaker-quotas-alarm
sam build && sam deploy --stack-name usage --region us-east-1 --resolve-s3 --capabilities CAPABILITY_IAM --parameter-overrides ResourceUsageThreshold=50 SecurityGroupIds=<SECURITY-GROUP-IDS> SubnetIds=<SUBNETS>

After the solution is deployed, you should have a new alarm on the CloudWatch console. This alarm monitors usage for SageMaker notebook instances for the ml.t3.medium instance.

alarm monitors usage for SageMaker notebook instances

If your resource usage reaches more than 50%, the alarm triggers and the Lambda function requests an increase.

alarm triggers

If the account you have is part of an AWS Organization and you have the quota request template enabled, you should also see those increases on the template, if the template has available slots. This way, new accounts from that organization also have the increases configured upon creation.

increases on the template

Deploy using the CloudWatch console

To deploy the application using the CloudWatch console, complete the following steps:

  1. On the CloudWatch console, choose All alarms in the navigation pane.
  2. Choose Create alarm.
    create alarm
  3. Choose Select metric.
    select metric
  4. Choose Usage.
    choose usage
  5. Select the metric you want to monitor.
    select metric to monitor
  6. Select the condition of when you would like the alarm to trigger.

For more possible configurations when configuring the alarm, see Create a CloudWatch alarm based on a static threshold.

more possible configurations when configuring the alarm

  1. Configure the SNS topic to be notified about the alarm.

You can also use Amazon SNS to trigger a Lambda function when the alarm is triggered. See Using AWS Lambda with Amazon SNS for more information.

Configure the SNS topic

  1. For Alarm name, enter a name.
  2. Choose Next.
    choose next
  3. Choose Create alarm.
    create alarm

Clean up

To clean up the resources created as part of this post, make sure to delete all the created stacks. To do that, run the following command:

sam delete --stack-name usage --region us-east-1


Conclusion

In this post, we showed how you can use the new integration from SageMaker with Service Quotas to automate the requests for quota increases for SageMaker resources. This way, data science teams can effectively work in parallel and reduce issues related to unavailability of instances.

You can learn more about Amazon SageMaker quotas by accessing the documentation. You can also learn more about Service Quotas here.

About the authors

Bruno Klein is a Machine Learning Engineer in the AWS ProServe team. He particularly enjoys creating automations and improving the lifecycle of models in production. In his free time, he likes to spend time outdoors and hiking.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area. You can find him on LinkedIn.

Build custom code libraries for your Amazon SageMaker Data Wrangler Flows using AWS Code Commit

As organizations grow in size and scale, the complexities of running workloads increase, and the need to develop and operationalize processes and workflows becomes critical. Therefore, organizations have adopted technology best practices, including microservice architecture, MLOps, DevOps, and more, to improve delivery time, reduce defects, and increase employee productivity. This post introduces a best practice for managing custom code within your Amazon SageMaker Data Wrangler workflow.

Data Wrangler is a low-code tool that facilitates data analysis, preprocessing, and visualization. It contains over 300 built-in data transformation steps to aid with feature engineering, normalization, and cleansing to transform your data without having to write any code.

In addition to the built-in transforms, Data Wrangler contains a custom code editor that allows you to implement custom code written in Python, PySpark, or SparkSQL.

When using Data Wrangler custom transform steps to implement your custom functions, you need to implement best practices around developing and deploying code in Data Wrangler flows.

This post shows how you can use code stored in AWS CodeCommit in the Data Wrangler custom transform step. This provides you with additional benefits, including:

  • Improve productivity and collaboration across personnel and teams
  • Version your custom code
  • Modify your Data Wrangler custom transform step without having to log in to Amazon SageMaker Studio to use Data Wrangler
  • Reference parameter files in your custom transform step
  • Scan code in CodeCommit using Amazon CodeGuru or any third-party application for security vulnerabilities before using it in Data Wrangler flows

Solution overview

This post demonstrates how to build a Data Wrangler flow file with a custom transform step. Instead of hardcoding the custom function into your custom transform step, you pull a script containing the function from CodeCommit, load it, and call the loaded function in your custom transform step.

For this post, we use the bank-full.csv data from the University of California Irvine Machine Learning Repository to demonstrate these functionalities. The data is related to the direct marketing campaigns of a banking institution. Often, more than one contact with the same client was required to assess if the product (bank term deposit) would be subscribed (yes) or not subscribed (no).

The following diagram illustrates this solution.

The workflow is as follows:

  1. Create a Data Wrangler flow file and import the dataset from Amazon Simple Storage Service (Amazon S3).
  2. Create a series of Data Wrangler transformation steps:
    • A custom transform step to implement a custom code stored in CodeCommit.
    • Two built-in transform steps.

We keep the transformation steps to a minimum so as not to detract from the aim of this post, which is focused on the custom transform step. For more information about available transformation steps and implementation, refer to Transform Data and the Data Wrangler blog.

  3. In the custom transform step, write code to pull the script and configuration file from CodeCommit, load the script as a Python module, and call a function in the script. The function takes a configuration file as an argument.
  4. Run a Data Wrangler job and set Amazon S3 as the destination.

Destination options also include Amazon SageMaker Feature Store.


As a prerequisite, we set up the CodeCommit repository, Data Wrangler flow, and CodeCommit permissions.

Create a CodeCommit repository

For this post, we use an AWS CloudFormation template to set up a CodeCommit repository and copy the required files into this repository. Complete the following steps:

  1. Choose Launch Stack:

  2. Select the Region where you want to create the CodeCommit repository.
  3. Enter a name for Stack name.
  4. Enter a name for the repository to be created for RepoName.
  5. Choose Create stack.

AWS CloudFormation takes a few seconds to provision your CodeCommit repository. After the CREATE_COMPLETE status appears, navigate to the CodeCommit console to see your newly created repository.

Set up Data Wrangler

Download the dataset from the University of California Irvine Machine Learning Repository. Then, extract the contents and upload bank-full.csv to Amazon S3.

To create a Data Wrangler flow file and import the bank-full.csv dataset from Amazon S3, complete the following steps:

  1. Onboard to SageMaker Studio using the quick start for users new to Studio.
  2. Select your SageMaker domain and user profile and on the Launch menu, choose Studio.

  3. On the Studio console, on the File menu, choose New, then choose Data Wrangler Flow.
  4. Choose Amazon S3 for Data sources.
  5. Navigate to your S3 bucket containing the file and upload the bank-full.csv file.

A Preview Error will be thrown.

  6. Change the Delimiter in the Details pane to the right to SEMICOLON.

A preview of the dataset will be displayed in the result window.

  7. In the Details pane, on the Sampling drop-down menu, choose None.

This is a relatively small dataset, so you don’t need to sample.

  8. Choose Import.

Configure CodeCommit permissions

You need to provide Studio with permission to access CodeCommit. We use a CloudFormation template to provision an AWS Identity and Access Management (IAM) policy that gives your Studio role permission to access CodeCommit. Complete the following steps:

  1. Choose Launch Stack:

  2. Select the Region you are working in.
  3. Enter a name for Stack name.
  4. Enter your Studio domain ID for SageMakerDomainID. The domain information is available on the SageMaker console Domains page, as shown in the following screenshot.

  5. Enter your Studio domain user profile name for SageMakerUserProfileName. You can view your user profile name by navigating into your Studio domain. If you have multiple user profiles in your Studio domain, enter the name for the user profile used to launch Studio.

  6. Select the acknowledgement box.

The IAM resources used by this CloudFormation template provide the minimum permissions to successfully create the IAM policy attached to your Studio role for CodeCommit access.

  7. Choose Create stack.

Transformation steps

Next, we add transformations to process the data.

Custom transform step

In this post, we calculate the variance inflation factor (VIF) for each numerical feature and drop features that exceed a VIF threshold. We do this in the custom transform step because Data Wrangler doesn’t have a built-in transform for this task as of this writing.

However, we don’t hardcode this VIF function. Instead, we pull this function from the CodeCommit repository into the custom transform step. Then we run the function on the dataset.
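To make the VIF idea concrete, the following is a minimal NumPy sketch, not the function stored in the CodeCommit repository (its contents are not shown in this post). It relies on the identity that the VIF of feature i equals the i-th diagonal entry of the inverse correlation matrix, and it assumes one possible dropping strategy: remove the feature with the highest VIF and repeat until every remaining VIF is at or below the threshold. The names compute_vif and drop_high_vif are hypothetical.

```python
import numpy as np


def compute_vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_i = 1 / (1 - R_i^2), which equals the i-th diagonal entry of inv(corr).
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold):
    """Repeatedly drop the column with the largest VIF while it exceeds threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vif = compute_vif(X)
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names


# Synthetic data: column c is almost a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
features = np.column_stack([a, b, c])

vif = compute_vif(features)
reduced, kept = drop_high_vif(features, ["a", "b", "c"], threshold=5.0)
```

In this synthetic case the near-duplicate pair produces very large VIF values, so one of the two is removed and the independent column is kept. A lower threshold, such as the 1.2 used later in this post, is stricter about correlated features.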

  1. On the Data Wrangler console, navigate to your data flow.
  2. Choose the plus sign next to Data types and choose Add transform.

  1. Choose + Add step.
  2. Choose Custom transform.
  3. Optionally, enter a name in the Name field.
  4. Choose Python (PySpark) on the drop-down menu.
  5. For Your custom transform, enter the following code (provide the name of the CodeCommit repository and Region where the repository is located):
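# Table is available as variable `df`
import boto3
import os
import json
import numpy as np
from importlib import reload
import sys

# Initialize variables
repo_name = '<Enter Name of Repository>'  # Name of repository in CodeCommit
region = '<Name of region where repository is located>'  # Name of AWS Region
script_name = ''  # Name of script in CodeCommit (set this to the script's path in your repository)
config_name = 'parameter.json'  # Name of configuration file in CodeCommit

# Creating directory to hold downloaded files
# (the directory name is an assumption; any writable location works)
newpath = os.path.join(os.getcwd(), 'input')
if not os.path.exists(newpath):
    os.makedirs(newpath)
# Local file name matches the module imported below
module_path = os.path.join(newpath, 'pyspark_transform.py')

# Downloading configuration file to memory
client = boto3.client('codecommit', region_name=region)
response = client.get_file(
    repositoryName=repo_name,
    filePath=config_name)
param_file = json.loads(response['fileContent'])

# Downloading script to directory
script = client.get_file(
    repositoryName=repo_name,
    filePath=script_name)
with open(module_path, 'w') as f:
    f.write(script['fileContent'].decode())

# importing pyspark script as module
sys.path.append(newpath)
import pyspark_transform
reload(pyspark_transform)  # pick up the latest version on repeated runs

# Executing custom function in pyspark script
# Call the function defined in your script here, passing `df` and `param_file`,
# and assign the returned DataFrame back to `df`.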

The code uses the AWS SDK for Python (Boto3) to access CodeCommit API functions. We use the get_file API function to pull files from the CodeCommit repository into the Data Wrangler environment.

  1. Choose Preview.

In the Output pane, a table is displayed showing the different numerical features and their corresponding VIF value. For this exercise, the VIF threshold value is set to 1.2. However, you can modify this threshold value in the parameter.json file found in your CodeCommit repository. You will notice that two columns have been dropped (pdays and previous), bringing the total column count to 15.

  1. Choose Add.

Encode categorical features

Some feature types are categorical variables that need to be transformed into numerical forms. Use the one-hot encode built-in transform to achieve this data transformation. Let’s create numerical features representing the unique value in each categorical feature in the dataset. Complete the following steps:

  1. Choose + Add step.
  2. Choose the Encode categorical transform.
  3. On the Transform drop-down menu, choose One-hot encode.
  4. For Input column, choose all categorical features, including poutcome, y, month, marital, contact, default, education, housing, job, and loan.
  5. For Output style, choose Columns.
  6. Choose Preview to preview the results.

One-hot encoding might take a while to generate results, given the number of features and unique values within each feature.

  7. Choose Add.

For each numerical feature created with one-hot encoding, the name combines the categorical feature name appended with an underscore (_) and the unique categorical value within that feature.
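The naming convention can be illustrated with a small, dependency-free sketch. This is not how Data Wrangler implements one-hot encoding. It only mimics the resulting column names on two made-up rows, and the helper one_hot is hypothetical.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 column per unique value, named <column>_<value>."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: val for key, val in row.items() if key != column}
        for value in values:
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded


# Two made-up rows using the target column name from the dataset.
rows = [{"age": 30, "y": "yes"}, {"age": 45, "y": "no"}]
encoded = one_hot(rows, "y")
```

With a two-valued feature such as y, the two generated columns are complements of each other, which is why only one of them is needed as a target.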

Drop column

The y_yes feature is the target column for this exercise, so we drop the y_no feature.

  1. Choose + Add step.
  2. Choose Manage columns.
  3. Choose Drop column under Transform.
  4. Choose y_no under Columns to drop.
  5. Choose Preview, then choose Add.

Create a Data Wrangler job

Now that you have created all the transform steps, you can create a Data Wrangler job to process your input data and store the output in Amazon S3. Complete the following steps:

  1. Choose Data flow to go back to the Data Flow page.
  2. Choose the plus sign on the last tile of your flow visualization.
  3. Choose Add destination and choose Amazon S3.

  4. Enter the name of the output file for Dataset name.
  5. Choose Browse and choose the bucket destination for Amazon S3 location.
  6. Choose Add destination.
  7. Choose Create job.

  8. Change the Job name value as you see fit.
  9. Choose Next, 2. Configure job.
  10. Change Instance count to 1 to reduce the cost incurred, because we work with a relatively small dataset.
  11. Choose Create.

This will start an Amazon SageMaker Processing job to process your Data Wrangler flow file and store the output in the specified S3 bucket.


Now that you have created your Data Wrangler flow file, you can schedule your Data Wrangler jobs to automatically run at specific times and frequency. This is a feature that comes out of the box with Data Wrangler and simplifies the process of scheduling Data Wrangler jobs. Furthermore, CRON expressions are supported and provide additional customization and flexibility in scheduling your Data Wrangler jobs.

However, this post shows how you can automate the Data Wrangler job to run every time there is a modification to the files in the CodeCommit repository. This automation technique ensures that any changes to the custom code functions or changes to values in the configuration file in CodeCommit trigger a Data Wrangler job to reflect these changes immediately.

Therefore, you don’t have to manually start a Data Wrangler job to get the output data that reflects the changes you just made. With this automation, you can improve the agility and scale of your Data Wrangler workloads. To automate your Data Wrangler jobs, you configure the following:

  • Amazon SageMaker Pipelines – Pipelines helps you create machine learning (ML) workflows with an easy-to-use Python SDK, and you can visualize and manage your workflow using Studio
  • Amazon EventBridge – EventBridge facilitates connection to AWS services, software as a service (SaaS) applications, and custom applications as event producers to launch workflows.

Create a SageMaker pipeline

First, you need to create a SageMaker pipeline for your Data Wrangler job. Then complete the following steps to export your Data Wrangler flow to a SageMaker pipeline:

  1. Choose the plus sign on your last transform tile (the transform tile before the Destination tile).
  2. Choose Export to.
  3. Choose SageMaker Inference Pipeline (via Jupyter Notebook).

This creates a new Jupyter notebook prepopulated with code to create a SageMaker pipeline for your Data Wrangler job. Before running all the cells in the notebook, you may want to change certain variables.

  4. To add a training step to your pipeline, change the add_training_step variable to True.

Be aware that running a training job will incur additional costs on your account.

  5. Set the target_attribute_name variable to y_yes.

  6. To change the name of the pipeline, change the pipeline_name variable.

  7. Lastly, run the entire notebook by choosing Run and Run All Cells.

This creates a SageMaker pipeline and runs the Data Wrangler job.

  8. To view your pipeline, choose the home icon on the navigation pane and choose Pipelines.

You can see the new SageMaker pipeline created.

  9. Choose the newly created pipeline to see the run list.
  10. Note the name of the SageMaker pipeline, as you will use it later.
  11. Choose the first run and choose Graph to see a directed acyclic graph (DAG) flow of your SageMaker pipeline.

As shown in the following screenshot, we didn’t add a training step to our pipeline. If you added a training step to your pipeline, it will display in your pipeline run Graph tab under DataWranglerProcessingStep.

Create an EventBridge rule

After successfully creating your SageMaker pipeline for the Data Wrangler job, you can move on to setting up an EventBridge rule. This rule listens to activities in your CodeCommit repository and triggers the run of the pipeline in the event of a modification to any file in the CodeCommit repository. We use a CloudFormation template to automate creating this EventBridge rule. Complete the following steps:

  1. Choose Launch Stack:

  2. Select the Region you are working in.
  3. Enter a name for Stack name.
  4. Enter a name for your EventBridge rule for EventRuleName.
  5. Enter the name of the pipeline you created for PipelineName.
  6. Enter the name of the CodeCommit repository you are working with for RepoName.
  7. Select the acknowledgement box.

The IAM resources that this CloudFormation template uses provide the minimum permissions to successfully create the EventBridge rule.

  8. Choose Create stack.

It takes a few minutes for the CloudFormation template to run successfully. When the Status changes to CREATE_COMPLETE, you can navigate to the EventBridge console to see the created rule.

Now that you have created this rule, any changes you make to the file in your CodeCommit repository will trigger the run of the SageMaker pipeline.
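For readers who prefer to see what such a rule looks like in code, the following sketch builds an event pattern for CodeCommit repository changes and registers a pipeline as the target. The contents of the CloudFormation template are not shown in this post, so the pattern, the target Id, and the helper names build_event_pattern and create_rule are assumptions. The put_rule and put_targets calls are the standard EventBridge client operations in Boto3, and the client is passed in so the functions can be exercised without AWS access.

```python
import json


def build_event_pattern(repo_arn):
    """Event pattern matching branch creation or updates in one CodeCommit repository."""
    return {
        "source": ["aws.codecommit"],
        "detail-type": ["CodeCommit Repository State Change"],
        "resources": [repo_arn],
        "detail": {"event": ["referenceCreated", "referenceUpdated"]},
    }


def create_rule(events_client, rule_name, repo_arn, pipeline_arn, role_arn):
    """Create the rule and point it at a SageMaker pipeline (ARNs supplied by the caller)."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_event_pattern(repo_arn)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[
            {
                "Id": "data-wrangler-pipeline",  # hypothetical target identifier
                "Arn": pipeline_arn,
                "RoleArn": role_arn,  # role that EventBridge assumes to start the pipeline
            }
        ],
    )
```

In a real account you would call create_rule(boto3.client("events"), ...) with the ARNs of your repository, pipeline, and an IAM role that allows EventBridge to start pipeline runs.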

To test the pipeline, edit a file in your CodeCommit repository. For example, change the VIF threshold in your parameter.json file to a different number, then go to the SageMaker pipeline details page to see that a new run of your pipeline was created.

In this new pipeline run, Data Wrangler drops numerical features that have a greater VIF value than the threshold you specified in your parameter.json file in CodeCommit.

You have successfully automated and decoupled your Data Wrangler job. Furthermore, you can add more steps to your SageMaker pipeline. You can also modify the custom scripts in CodeCommit to implement various functions in your Data Wrangler flow.

It’s also possible to store your scripts and files in Amazon S3 and download them into your Data Wrangler custom transform step as an alternative to CodeCommit. In addition, you ran your custom transform step using the Python (PySpark) framework. However, you can also use the Python (Pandas) framework for your custom transform step, allowing you to run custom Python scripts. You can test this out by changing your framework in the custom transform step to Python (Pandas) and modifying your custom transform step code to pull and implement the Python script version stored in your CodeCommit repository. Note that the PySpark option for Data Wrangler provides better performance than the Python (Pandas) option when working on a large dataset.

Clean up

After you’re done experimenting with this use case, clean up the resources you created to avoid incurring additional charges to your account:

  1. Stop the underlying instance used to create your Data Wrangler flow.
  2. Delete the resources created by the various CloudFormation templates.
  3. If you see a DELETE_FAILED state when deleting a CloudFormation stack, delete the stack one more time to successfully delete it.


This post showed you how to decouple your Data Wrangler custom transform step by pulling scripts from CodeCommit. We also showed how to automate your Data Wrangler jobs using SageMaker Pipelines and EventBridge.

Now you can operationalize and scale your Data Wrangler jobs without modifying your Data Wrangler flow file. You can also scan your custom code in CodeCommit using CodeGuru or any third-party application for vulnerabilities before implementing it in Data Wrangler. To know more about end-to-end machine learning operations (MLOps) on AWS, visit Amazon SageMaker for MLOps.

About the Author

Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching herbs, teas, superfoods, and how to incorporate them into his daily diet.

Accelerate Amazon SageMaker inference with C6i Intel-based Amazon EC2 instances

This is a guest post co-written with Antony Vance from Intel.

Customers are always looking for ways to improve the performance and response times of their machine learning (ML) inference workloads without increasing the cost per transaction and without sacrificing the accuracy of the results. Running ML workloads on Amazon SageMaker with Amazon Elastic Compute Cloud (Amazon EC2) C6i instances and Intel’s INT8 inference deployment can help boost the overall performance by up to four times per dollar spent while keeping the loss in inference accuracy less than 1% as compared to FP32 when applied to certain ML workloads. Quantization can also help when running models on embedded devices, where the form factor and size of the model are important.

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (INT8) instead of the usual 32-bit floating point (FP32). In the following example figure, we show INT8 inference performance in C6i for a BERT-base model.

The BERT-base was fine-tuned with SQuAD v1.1, with PyTorch (v1.11) being the ML framework used with Intel® Extension for PyTorch. A batch size of 1 was used for the comparison. Higher batch sizes will provide different cost per 1 million inferences.

In this post, we show you how to build and deploy INT8 inference with your own processing container for PyTorch. We use Intel extensions for PyTorch for an effective INT8 deployment workflow.

Overview of the technology

EC2 C6i instances are powered by third-generation Intel Xeon Scalable processors (also called Ice Lake) with an all-core turbo frequency of 3.5 GHz.

In the context of deep learning, the predominant numerical format used for research and deployment has so far been 32-bit floating point, or FP32. However, the need for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy.
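As a rough illustration of what representing values with 8-bit integers means, the following NumPy sketch applies per-tensor affine quantization to an FP32 array and maps it back. It is a conceptual example only, not how Intel Extension for PyTorch performs quantization internally, and the function names quantize_affine and dequantize_affine are hypothetical.

```python
import numpy as np


def quantize_affine(x, num_bits=8):
    """Map a float array to unsigned integers using one scale and zero point per tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize_affine(q, scale, zero_point):
    """Recover approximate float values from the quantized representation."""
    return (q.astype(np.float32) - zero_point) * scale


weights = np.linspace(-1.0, 3.0, 1000, dtype=np.float32)
q, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(q, scale, zero_point)
max_error = float(np.abs(restored - weights).max())
```

Each value now occupies one byte instead of four, and the reconstruction error stays within about one quantization step (the scale), which is the intuition behind the small accuracy loss reported for INT8 models.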

EC2 C6i instances offer many new capabilities that result in performance improvements for AI and ML workloads. C6i instances provide performance advantages in FP32 and INT8 model deployments. FP32 inference is enabled with AVX-512 improvements, and INT8 inference is enabled by AVX-512 VNNI instructions.

C6i is now available on SageMaker endpoints, and developers should expect it to provide over two times price-performance improvements for INT8 inference over FP32 inference and up to four times performance improvement when compared with C5 instance FP32 inference. Refer to the appendix for instance details and benchmark data.

Deep learning deployment on the edge for real-time inference is key to many application areas. It significantly reduces the cost of communicating with the cloud in terms of network bandwidth, network latency, and power consumption. However, edge devices have limited memory, computing resources, and power. This means that a deep learning network must be optimized for embedded deployment. INT8 quantization has become a popular approach for such optimizations for ML frameworks like TensorFlow and PyTorch. SageMaker provides you with a bring your own container (BYOC) approach and integrated tools so that you can run quantization.

For more information, refer to Lower Numerical Precision Deep Learning Inference and Training.

Solution overview

The steps to implement the solution are as follows:

  1. Provision an EC2 C6i instance to quantize and create the ML model.
  2. Use the supplied Python scripts for quantization.
  3. Create a Docker image to deploy the model in SageMaker using the BYOC approach.
  4. Use an Amazon Simple Storage Service (Amazon S3) bucket to copy the model and code for SageMaker access.
  5. Use Amazon Elastic Container Registry (Amazon ECR) to host the Docker image.
  6. Use the AWS Command Line Interface (AWS CLI) to create an inference endpoint in SageMaker.
  7. Run the provided Python test scripts to invoke the SageMaker endpoint for both INT8 and FP32 versions.

This inference deployment setup uses a BERT-base model from the Hugging Face transformers repository (csarron/bert-base-uncased-squad-v1).


The following are prerequisites for creating the deployment setup:

  • A Linux shell terminal with the AWS CLI installed
  • An AWS account with access to EC2 instance creation (C6i instance type)
  • SageMaker access to deploy a SageMaker model, endpoint configuration, endpoint
  • AWS Identity and Access Management (IAM) access to configure an IAM role and policy
  • Access to Amazon ECR
  • SageMaker access to create a notebook with instructions to launch an endpoint

Generate and deploy a quantized INT8 model on SageMaker

Open an EC2 instance for creating your quantized model and push the model artifacts to Amazon S3. For endpoint deployment, create a custom container with PyTorch and Intel® Extension for PyTorch to deploy the optimized INT8 model. The container gets pushed into Amazon ECR and a C6i based endpoint is created to serve FP32 and INT8 models.

The following diagram illustrates the high-level flow.

To access the code and documentation, refer to the GitHub repo.

Example use case

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

The following example is a question answering algorithm using a BERT-base model. Given a document as an input, the model will answer simple questions based on the learning and contexts from the input document.

The following is an example input document:

The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometers (2,700,000 sq mi), of which 5,500,000 square kilometers (2,100,000 sq mi) are covered by the rainforest.

For the question “Which name is also used to describe the Amazon rainforest in English?” we get the answer:

also known in English as Amazonia or the Amazon Jungle,Amazonia or the Amazon Jungle, Amazonia.

For the question “How many square kilometers of rainforest is covered in the basin?” we get the answer:

5,500,000 square kilometers (2,100,000 sq mi) are covered by the rainforest.5,500,000.
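Extractive question answering models of this kind typically score every token as a possible answer start and answer end, then return the highest-scoring span. The sketch below shows that selection step only. The tokens and scores are made up for illustration, no model is involved, and best_span is a hypothetical helper rather than code from the example repository.

```python
import numpy as np


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start score + end score with end >= start."""
    best_score, best = float("-inf"), (0, 0)
    for start in range(len(start_logits)):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Made-up tokens and per-token scores standing in for model output.
tokens = ["covers", "5,500,000", "square", "kilometers", "of", "rainforest"]
start_logits = np.array([0.1, 4.0, 0.3, 0.2, 0.0, 0.1])
end_logits = np.array([0.0, 1.0, 0.5, 3.5, 0.1, 0.4])

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
```

The answer is always a contiguous segment of the input passage, which matches the SQuAD definition of an answer as a span of text.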

Quantizing the model in PyTorch

This section gives a quick overview of model quantization steps with PyTorch and Intel extensions.

The code snippets are derived from a SageMaker example.

Let’s go over the changes in detail for the function IPEX_quantize in the file

  1. Import Intel extensions for PyTorch to help with quantization and optimization, and import torch for array manipulations:
import intel_extension_for_pytorch as ipex
import torch
  2. Apply model calibration for 100 iterations. In this case, you are calibrating the model with the SQuAD dataset:
conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine)
print("Doing calibration...")
for step, batch in enumerate(eval_dataloader):
    print("Calibration step-", step)
    with torch.no_grad():
        with ipex.quantization.calibrate(conf):
            model(**batch)  # forward pass inside the calibration context
    if step == 100:
        break
  3. Prepare sample inputs:
jit_inputs = []
example_batch = next(iter(eval_dataloader))
for key in example_batch:
    example_tensor = torch.ones_like(example_batch[key])
    jit_inputs.append(example_tensor)
jit_inputs = tuple(jit_inputs)
  4. Convert the model to an INT8 model using the following configuration:
with torch.no_grad():
    model = ipex.quantization.convert(model, conf, jit_inputs)
  5. Run two iterations of forward pass to enable fusions:
with torch.no_grad():
    model(**example_batch)
    model(**example_batch)

  6. As a last step, save the TorchScript model.

Clean up

Refer to the GitHub repo for steps to clean up the AWS resources created.


New EC2 C6i instances in a SageMaker endpoint can accelerate inference by up to 2.5 times with INT8 quantization. Quantizing the model in PyTorch is possible with a few APIs from Intel PyTorch extensions. It’s recommended to quantize the model in C6i instances so that model accuracy is maintained in endpoint deployment. The SageMaker examples GitHub repo now provides an end-to-end deployment example pipeline for quantizing and hosting INT8 models.

We encourage you to create a new model or migrate an existing model using INT8 quantization using the EC2 C6i instance type and see the performance gains for yourself.

Notice and disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (0BSD)


New AWS instances in SageMaker with INT8 deployment support

The following table lists SageMaker instances with and without DL Boost support.

Instance Name | Xeon Gen Codename | INT8 Enabled? | DL Boost Enabled?
ml.c5.xlarge – ml.c5.9xlarge | Skylake/1st | Yes | No
ml.c5.18xlarge | Skylake/1st | Yes | No
ml.c6i.1x – 32xlarge | Ice Lake/3rd | Yes | Yes

To summarize, INT8 enabled supports the INT8 data type and computation; DL Boost enabled supports Deep Learning Boost.

Benchmark data

The following table compares the cost and relative performance between C5 and C6i instances.

Latency and throughput were measured with 10,000 inference queries to SageMaker endpoints.

E2E Latency of Inference Endpoint and Cost Analysis
Instance | P50 (ms) | P90 (ms) | Queries/Sec | $/1M Queries | Relative $/Performance
c5.2xlarge-FP32 | 76.6 | 125.3 | 11.5 | $10.2 | 1.0x
c6i.2xlarge-FP32 | 70 | 110.8 | 13 | $9.0 | 1.1x
c6i.2xlarge-INT8 | 35.7 | 48.9 | 25.56 | $4.5 | 2.3x

INT8 models are expected to provide 2–4 times practical performance improvements with less than 1% accuracy loss for most models. The preceding table includes overhead latency (network and demo application).
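The relative price-performance column can be reproduced from the cost column. The following sketch assumes that the relative figure is the baseline cost per million queries divided by each row's cost per million queries. That interpretation is an assumption, although it matches the published values after rounding. All numbers are copied from the benchmark table.

```python
# Values copied from the benchmark table.
rows = {
    "c5.2xlarge-FP32": {"qps": 11.5, "usd_per_million": 10.2},
    "c6i.2xlarge-FP32": {"qps": 13.0, "usd_per_million": 9.0},
    "c6i.2xlarge-INT8": {"qps": 25.56, "usd_per_million": 4.5},
}

# Assumed definition: baseline cost divided by the row's cost.
baseline_cost = rows["c5.2xlarge-FP32"]["usd_per_million"]
relative = {
    name: round(baseline_cost / row["usd_per_million"], 1)
    for name, row in rows.items()
}

# Throughput gain of INT8 over FP32 on the same C6i instance size.
int8_speedup = rows["c6i.2xlarge-INT8"]["qps"] / rows["c6i.2xlarge-FP32"]["qps"]
```

On the same instance size, the INT8 model serves roughly twice as many queries per second as the FP32 model, which is where most of the cost reduction comes from.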

Accuracy for BERT-base model

The following table summarizes the accuracy for the INT8 model with the SQuAD v1.1 dataset.

Metric | FP32 | INT8
Exact Match | 85.8751 | 85.5061
F1 | 92.0807 | 91.8728

The GitHub repo comes with the scripts to check the accuracy of the SQuAD dataset. Refer to and scripts for testing.
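As a quick check of the accuracy figures, the relative loss of the INT8 model against FP32 can be computed directly from the two reported metrics. The values are copied from the accuracy table, and the variable names are illustrative.

```python
# Values copied from the accuracy table.
fp32 = {"exact_match": 85.8751, "f1": 92.0807}
int8 = {"exact_match": 85.5061, "f1": 91.8728}

# Relative loss in percent of the FP32 score for each metric.
relative_loss_pct = {
    metric: 100 * (fp32[metric] - int8[metric]) / fp32[metric]
    for metric in fp32
}
```

Both metrics drop by less than half a percent relative to FP32, which is within the less than 1% accuracy loss stated earlier.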

Intel Extension for PyTorch

Intel® Extension for PyTorch* (an open-source project at GitHub) extends PyTorch with optimizations for extra performance boosts on Intel hardware. Most of the optimizations will be included in stock PyTorch releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware. Examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).

The following figure illustrates the Intel Extension for PyTorch architecture.

For more detailed user guidance (features, performance tuning, and more) for Intel® Extension for PyTorch, refer to Intel® Extension for PyTorch* user guidance.

About the Authors

Rohit Chowdhary is a Sr. Solutions Architect in the Strategic Accounts team at AWS.

Aniruddha Kappagantu is a Software Development Engineer in the AI Platforms team at AWS.

Antony Vance is an AI Architect at Intel with 19 years of experience in computer vision, machine learning, deep learning, embedded software, GPU, and FPGA.

Intelligently search your organization’s Microsoft Teams data source with the Amazon Kendra connector for Microsoft Teams

Organizations use messaging platforms like Microsoft Teams to bring the right people together to securely communicate with each other and collaborate to get work done. Microsoft Teams captures invaluable organizational knowledge in the form of the information that flows through it as users collaborate. However, making this knowledge easily and securely available to users can be challenging due to the fragmented nature of conversations across groups, channels, and chats within an organization. Additionally, the conversational nature of Microsoft Teams communication renders a traditional keyword-based approach to search ineffective when trying to find accurate answers to questions from the content and therefore requires intelligent search capabilities that have the ability to process natural language queries.

You can now use the Amazon Kendra connector for Microsoft Teams to index Microsoft Teams messages and documents, and search this content using intelligent search in Amazon Kendra, powered by machine learning (ML).

This post shows how to configure the Amazon Kendra connector for Microsoft Teams and take advantage of the service’s intelligent search capabilities. We use an example of an illustrative Microsoft Teams instance where users discuss technical topics related to AWS.

Solution overview

Microsoft Teams content for active organizations is dynamic in nature due to continuous collaboration. Microsoft Teams includes public channels where any user can participate, and private channels where only those users who are members of these channels can communicate with each other. Furthermore, individuals can directly communicate with one another in one-on-one and ad hoc groups. This communication is in the form of messages and threads of replies, with optional document attachments.

In our solution, we configure Microsoft Teams as a data source for an Amazon Kendra search index using the Amazon Kendra connector for Microsoft Teams. Based on the configuration, when the data source is synchronized, the connector crawls and indexes all the content from Microsoft Teams that was created on or before a specific date. The connector also indexes the Access Control List (ACL) information for each message and document. When access control or user context filtering is enabled, the search results of a query made by a user include results only from those documents that the user is authorized to read.
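The effect of user context filtering can be pictured with a small sketch. This is a conceptual illustration of ACL-based filtering, not the Amazon Kendra implementation or API. The document structure, the acl field layout, and the filter_by_acl helper are hypothetical.

```python
def filter_by_acl(results, user, user_groups):
    """Keep only results whose ACL names the user or one of the user's groups."""
    groups = set(user_groups)
    allowed = []
    for doc in results:
        acl = doc["acl"]
        if user in acl.get("users", ()) or groups & set(acl.get("groups", ())):
            allowed.append(doc)
    return allowed


# Hypothetical indexed items with the ACL captured at crawl time.
results = [
    {"title": "Public channel post", "acl": {"groups": ["all-staff"]}},
    {"title": "Private channel post", "acl": {"groups": ["architects"]}},
    {"title": "One-on-one chat", "acl": {"users": ["alice@example.com"]}},
]

visible_to_bob = filter_by_acl(results, "bob@example.com", ["all-staff"])
visible_to_alice = filter_by_acl(results, "alice@example.com", ["all-staff", "architects"])
```

Two users issuing the same query can therefore see different result sets, depending on the channels and chats they belong to.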

The Amazon Kendra connector for Microsoft Teams can integrate with AWS IAM Identity Center (Successor to AWS Single Sign-On). You first must enable IAM Identity Center and create an organization to sync users and groups from your active directory. The connector will use the user name and group lookup for the user context of the search queries.

With Amazon Kendra Experience Builder, you can build and deploy a low-code, fully functional search application to search your Microsoft Teams data source.


To try out the Amazon Kendra connector for Microsoft Teams using this post as a reference, you need the following:

Note that the Microsoft Graph API places throttling limits on the number of concurrent calls to a service to prevent overuse of resources.

Configure Microsoft Teams

The following screenshot shows our example Microsoft Teams instance with sample content and the PDF file AWS_Well-Architect_Framework.pdf that we will use for our Amazon Kendra search queries.

The following steps describe the configuration of a new Amazon Kendra connector application in the Azure portal. This will create a user OAuth token to be used in configuring the Amazon Kendra connector for Microsoft Teams.

  1. Log in to Azure Portal with your Microsoft credentials.
  2. Register an application with the Microsoft Identity platform.

  3. Next to Client credentials, choose Add a certificate or secret to add a new client secret.

  4. For Description, enter a description (for example, KendraConnectorSecret).
  5. For Expires, choose an expiry date (for example, 6 months).
  6. Choose Add.

  7. Save the secret ID and secret value to use later when creating an Amazon Kendra data source.

  8. Choose Add a permission.

  9. Choose Microsoft Graph to add all necessary Microsoft Graph permissions.

  10. Choose Application permissions.

The registered application should have the following API permissions to allow crawling all entities supported by the Amazon Kendra connector for Microsoft Teams:

  • ChannelMessage.Read.All
  • Chat.Read
  • Chat.Read.All
  • Chat.ReadBasic
  • Chat.ReadBasic.All
  • ChatMessage.Read.All
  • Directory.Read.All
  • Files.Read.All
  • Group.Read.All
  • Mail.Read
  • Mail.ReadBasic
  • User.Read
  • User.Read.All
  • TeamMember.Read.All

However, you can select a lesser scope based on the entities chosen to be crawled. The following lists are the minimum sets of permissions needed for each entity:

  • Channel Post:
    • ChannelMessage.Read.All
    • Group.Read.All
    • User.Read
    • User.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Channel Attachment:
    • ChannelMessage.Read.All
    • Group.Read.All
    • User.Read
    • User.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Channel Wiki:
    • Group.Read.All
    • User.Read
    • User.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Chat Message:
    • Chat.Read.All
    • ChatMessage.Read.All
    • ChatMember.Read.All
    • User.Read
    • User.Read.All
    • Group.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Meeting Chat:
    • Chat.Read.All
    • ChatMessage.Read.All
    • ChatMember.Read.All
    • User.Read
    • User.Read.All
    • Group.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Chat Attachment:
    • Chat.Read.All
    • ChatMessage.Read.All
    • ChatMember.Read.All
    • User.Read
    • User.Read.All
    • Group.Read.All
    • Files.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Meeting File:
    • Chat.Read.All
    • ChatMessage.Read.All
    • ChatMember.Read.All
    • User.Read
    • User.Read.All
    • Group.Read.All
    • Files.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Calendar Meeting:
    • Calendars.Read
    • Group.Read.All
    • User.Read
    • User.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  • Meeting Notes:
    • Group.Read.All
    • User.Read
    • User.Read.All
    • Files.Read.All
    • TeamMember.Read.All (user-group mapping for identity crawl)
  1. Select your permissions and choose Add permissions.
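
When you crawl only some entities, the permission set to request is the union of the per-entity minimums listed above. The following sketch encodes those lists as a lookup table and computes that union. The names `MIN_PERMISSIONS` and `required_permissions` are made up for this illustration. The permission strings are copied from the lists in this post.

```python
from typing import Dict, FrozenSet, Iterable, List

# Permissions shared by every entity in the lists above.
_COMMON = frozenset({"User.Read", "User.Read.All",
                     "Group.Read.All", "TeamMember.Read.All"})
# Additional permissions shared by all chat-based entities.
_CHAT = _COMMON | {"Chat.Read.All", "ChatMessage.Read.All",
                   "ChatMember.Read.All"}

MIN_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "Channel Post": _COMMON | {"ChannelMessage.Read.All"},
    "Channel Attachment": _COMMON | {"ChannelMessage.Read.All"},
    "Channel Wiki": _COMMON,
    "Chat Message": _CHAT,
    "Meeting Chat": _CHAT,
    "Chat Attachment": _CHAT | {"Files.Read.All"},
    "Meeting File": _CHAT | {"Files.Read.All"},
    "Calendar Meeting": _COMMON | {"Calendars.Read"},
    "Meeting Notes": _COMMON | {"Files.Read.All"},
}


def required_permissions(entities: Iterable[str]) -> List[str]:
    """Return the sorted union of minimum permissions for the entities."""
    needed = set()
    for entity in entities:
        needed |= MIN_PERMISSIONS[entity]  # KeyError on an unknown entity
    return sorted(needed)


print(required_permissions(["Channel Post", "Meeting Notes"]))
```

For Channel Post plus Meeting Notes, the result is the four common permissions together with ChannelMessage.Read.All and Files.Read.All.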

Configure the data source using the Amazon Kendra connector for Microsoft Teams

To add a data source to your Amazon Kendra index using the Microsoft Teams connector, you can use an existing Amazon Kendra index, or create a new Amazon Kendra index. Then complete the steps in this section. For more information on this topic, refer to Microsoft Teams.

  1. On the Amazon Kendra console, open the index and choose Data sources in the navigation pane.
  2. Choose Add data source.
  3. Under Microsoft Teams connector, choose Add connector.

  1. In the Specify data source details section, enter the details of your data source and choose Next.
  2. In the Define access and security section, for Tenant ID, enter the Microsoft Teams tenant ID from the Microsoft account dashboard.
  3. Under Authentication, you can either choose Create to add a new secret with the client ID and client secret of the Microsoft Teams tenant, or use an existing AWS Secrets Manager secret that has the client ID and client secret of the Microsoft Teams tenant that you want the connector to access.
  4. Choose Save.
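
If you prefer to prepare the Secrets Manager secret ahead of time rather than choosing Create on the console, the secret value is a small JSON document holding the client ID and client secret. The following sketch builds that string. The key names `clientId` and `clientSecret` are an assumption here, so confirm them against the connector documentation before relying on them. The function name is hypothetical.

```python
import json


def build_teams_secret(client_id: str, client_secret: str) -> str:
    """Build the JSON secret string for the Microsoft Teams connector.

    The key names are assumed, not taken from this post.
    """
    if not client_id or not client_secret:
        raise ValueError("client_id and client_secret must be non-empty")
    return json.dumps({"clientId": client_id, "clientSecret": client_secret})


# Placeholder values only; never hard-code real credentials.
secret_string = build_teams_secret("example-client-id", "example-client-secret")
```

With AWS credentials configured, the resulting string could be stored with the boto3 Secrets Manager client, for example `boto3.client("secretsmanager").create_secret(Name=..., SecretString=secret_string)`, where the secret name is yours to choose.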

  1. Optionally, choose the appropriate payment model:
    • Model A payment models are restricted to licensing and payment models that require security compliance.
    • Model B payment models are suitable for licensing and payment models that don’t require security compliance.
    • Use Evaluation Mode (default) for limited usage evaluation purposes.
  2. For IAM role, you can choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
  3. Choose Next.

  1. In the Configure sync settings section, provide information regarding your sync scope.

  1. For Sync mode, choose your sync mode (for this post, select Full sync).

With the Full sync option, every time the sync runs, Amazon Kendra crawls all documents and re-ingests each one, even if it was ingested earlier. The full refresh enables you to reset your Amazon Kendra index without the need to delete and create a new data source. If you choose New or modified content sync or New, modified, or deleted content sync, every time the sync job runs, it processes only the objects added, modified, or deleted since the last crawl. Incremental crawls can help reduce runtime and cost when used with datasets that append new objects to existing data sources on a regular basis.

  1. For Sync run schedule, choose Run on demand.
  2. Choose Next.

  1. In the Set field mappings section, you can optionally configure the field mappings, wherein Microsoft Teams field names may be mapped to a different Amazon Kendra attribute or facet.
  2. Choose Next.

  1. Review your settings and confirm to add the data source.
  2. After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.

The sync process can take upwards of 30 minutes (depending on the amount of data to be crawled).
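
The Sync now step can also be scripted. The following is a minimal sketch that starts a sync job and polls until it reaches a terminal status. It assumes `kendra` is a boto3 Amazon Kendra client (`boto3.client("kendra")`) and uses its `start_data_source_sync_job` and `list_data_source_sync_jobs` operations. The function name, polling interval, and poll limit are choices made for this illustration, and only the first page of sync history is inspected.

```python
import time
from typing import Any, Callable

# Sync job statuses after which no further progress is expected.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "INCOMPLETE"}


def run_sync(kendra: Any, index_id: str, data_source_id: str,
             poll_seconds: float = 60.0, max_polls: int = 120,
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Start a data source sync and return its final status."""
    started = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id)
    execution_id = started["ExecutionId"]
    for _ in range(max_polls):
        page = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id)
        # Find our job in the history; treat "not listed yet" as still syncing.
        status = next((job["Status"] for job in page["History"]
                       if job.get("ExecutionId") == execution_id), "SYNCING")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("sync did not finish within the polling budget")
```

A return value of `SUCCEEDED` corresponds to the successful last sync that you check for before searching.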

Now let’s enable access control for the Amazon Kendra index.

  1. In the navigation pane, choose your index.
  2. On the User access control tab, choose Edit settings and change the settings to look like the following screenshot.
  3. Choose Next, then choose Update.

Perform intelligent search with Amazon Kendra

Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify that the last sync was successful.

Now we’re ready to search our index.

  1. On the Amazon Kendra console, navigate to the index and choose Search indexed content in the navigation pane.
  2. Let’s use the query “How do you detect security events” and not provide an access token.

Based on our access control settings, a valid access token is needed to access authenticated content; therefore, when we use this search query without setting any user name or group, no results are returned.

  1. Next, choose Apply token and set the user name to a user in the domain (for example, usertest4) that has access to the Microsoft Teams content.

In this example, the search will return a result from the PDF file uploaded in the Microsoft Teams chat message.

  1. Finally, choose Apply token and set the user name to a different user in the domain (for example, usertest) that has access to different Microsoft Teams content.

In this example, the search will return a different Microsoft Teams chat message.

This confirms that the ACLs ingested in Amazon Kendra by the connector for Microsoft Teams are being enforced in the search results based on the user name.
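
The same check can be made through the Amazon Kendra Query API by passing the user name in the request's `UserContext`. The following sketch builds the request and extracts result titles. It assumes that the user name applied on the console corresponds to `UserContext.UserId`, and the helper names are hypothetical. The index ID and user names are placeholders.

```python
from typing import Any, Dict, List, Optional


def build_query_request(index_id: str, query_text: str,
                        user_id: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for the Amazon Kendra Query operation."""
    request: Dict[str, Any] = {"IndexId": index_id, "QueryText": query_text}
    if user_id is not None:
        # Results are filtered to documents this user is allowed to read.
        request["UserContext"] = {"UserId": user_id}
    return request


def result_titles(response: Dict[str, Any]) -> List[str]:
    """Return the document title of each result item in a Query response."""
    return [item.get("DocumentTitle", {}).get("Text", "")
            for item in response.get("ResultItems", [])]


request = build_query_request(
    "your-index-id", "How do you detect security events", user_id="usertest4")
```

With AWS credentials configured, `boto3.client("kendra").query(**request)` sends the query. Running it once per user name and comparing `result_titles` for each response reproduces the comparison made on the console.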

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Microsoft Teams, delete that data source.
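
The cleanup can be scripted as well. The following sketch assumes `kendra` is a boto3 Amazon Kendra client and uses its `delete_index` and `delete_data_source` operations. The function name and its flags are made up for this illustration. Deleting an index is irreversible, so the sketch requires an explicit opt-in.

```python
from typing import Any, Optional


def clean_up(kendra: Any, index_id: str,
             data_source_id: Optional[str] = None,
             delete_index: bool = False) -> str:
    """Delete the index, or only the data source, and report what was deleted."""
    if delete_index:
        # Use this path only if the index was created for this walkthrough.
        kendra.delete_index(Id=index_id)
        return "index"
    if data_source_id:
        # Keep the existing index and remove only the Teams data source.
        kendra.delete_data_source(Id=data_source_id, IndexId=index_id)
        return "data_source"
    return "nothing"
```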


With the Amazon Kendra connector for Microsoft Teams, organizations can make invaluable information trapped in their Microsoft Teams instances available to their users securely using intelligent search powered by Amazon Kendra. Additionally, the connector provides facets for Microsoft Teams attributes such as channels, authors, and categories for the users to interactively refine the search results based on what they’re looking for.

To learn more about the Amazon Kendra connector for Microsoft Teams, refer to Microsoft Teams.

For more information on how you can create, modify, or delete metadata and content when ingesting your data from Microsoft Teams, refer to Customizing document metadata during the ingestion process and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.

About the Authors

Praveen Edem is a Senior Solutions Architect at Amazon Web Services. He works with major financial services customers, architecting and modernizing their critical large-scale applications while adopting AWS services. He has over 20 years of IT experience in application development and software architecture.

Gunwant Walbe is a Software Development Engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex business applications, and Java is his primary language of choice.
