Enable Amazon Kendra search for a scanned or image-based text document

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.

Amazon Kendra supports a variety of document formats, such as Microsoft Word, PDF, and plain text. While working with a leading EdTech customer, we were asked to build an enterprise search solution that also covers images and PPT files. This post focuses on extending the document support in Amazon Kendra by preprocessing text images and scanned documents (in JPEG, PNG, or PDF format) to make them searchable. The solution combines Amazon Textract for document preprocessing and optical character recognition (OCR) with Amazon Kendra for intelligent search.

With the new Custom Document Enrichment feature in Amazon Kendra, you can now preprocess your documents during ingestion and augment your documents with new metadata. Custom Document Enrichment allows you to call external services like Amazon Comprehend, Amazon Textract, and Amazon Transcribe to extract text from images, transcribe audio, and analyze video. For more information about using Custom Document Enrichment, refer to Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.

In this post, we propose an alternate method of preprocessing the content prior to calling the ingestion process in Amazon Kendra.

Solution overview

Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents and goes beyond basic OCR to identify, understand, and extract data from forms and tables. Today, many companies manually extract data from scanned documents like PDFs, images, tables, and forms through basic OCR software that requires manual configuration, which often requires reconfiguration when the form changes.

To overcome these manual and expensive processes, Amazon Textract uses machine learning to read and process a wide range of documents, accurately extracting text, handwriting, tables, and other data without any manual effort. You can quickly automate document processing and take action on the information extracted, whether it’s automating loans processing or extracting information from invoices and receipts.

Amazon Kendra is an easy-to-use enterprise search service that allows you to add search capabilities to your applications so that end-users can easily find information stored in different data sources within your company. This could include invoices, business documents, technical manuals, sales reports, corporate glossaries, internal websites, and more. You can harvest this information from storage solutions like Amazon Simple Storage Service (Amazon S3) and OneDrive; applications such as Salesforce, SharePoint, and ServiceNow; or relational databases like Amazon Relational Database Service (Amazon RDS).

The proposed solution enables you to unlock the search potential in scanned documents, extending the ability of Amazon Kendra to find accurate answers in a wider range of document types. The workflow includes the following steps:

  1. Upload a document (or documents of various types) to Amazon S3.
  2. The upload event triggers an AWS Lambda function that calls the synchronous Amazon Textract API (DetectDocumentText).
  3. Amazon Textract reads the document in Amazon S3, extracts the text from it, and returns the extracted text to the Lambda function.
  4. The Amazon Kendra data source that contains the new text file then needs to be reindexed.
  5. When reindexing is complete, you can search the new dataset either via the Amazon Kendra console or API.

The following diagram illustrates the solution architecture.

In the following sections, we demonstrate how to configure the Lambda function, create the event trigger, process a document, and then reindex the data.

Configure the Lambda function

To configure your Lambda function, add the following code in the function code editor:

import urllib.parse

import boto3

# Amazon Textract client used to run OCR on the uploaded image
textract = boto3.client('textract')

def handler(event, context):
    # Get the bucket and object key of the image that triggered this function
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    object_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Run synchronous text detection on the image stored in Amazon S3
    textract_result = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': source_bucket,
                'Name': object_key
            }
        })

    # Concatenate the detected LINE blocks into a single text string
    page = ""
    blocks = [x for x in textract_result['Blocks'] if x['BlockType'] == "LINE"]
    for block in blocks:
        page += " " + block['Text']

    print(page)

    # Write the extracted text to the location crawled by the Amazon Kendra data source
    s3 = boto3.resource('s3')
    text_object = s3.Object('demo-kendra-test', 'text/apollo11-summary.txt')
    text_object.put(Body=page)

We use the DetectDocumentText API to extract the text from an image (JPEG or PNG) stored in Amazon S3.
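DetectDocumentText is a synchronous API and works on single-page JPEG or PNG images. If your scanned documents are multipage PDFs, you can use the asynchronous Amazon Textract APIs instead. The following is a minimal sketch of how that might look (the bucket and key are whatever your trigger provides); note that the asynchronous job can return paginated results, which the sketch walks through with NextToken.

import time

import boto3

textract = boto3.client('textract')

def extract_pdf_text(bucket, key):
    # Start an asynchronous text detection job for a multipage PDF
    job = textract.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}})
    job_id = job['JobId']

    # Poll until the job finishes (a production setup would use an SNS notification instead)
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result['JobStatus'] in ('SUCCEEDED', 'FAILED'):
            break
        time.sleep(5)

    # Collect the LINE blocks across all result pages
    page = ""
    while True:
        for block in result.get('Blocks', []):
            if block['BlockType'] == 'LINE':
                page += " " + block['Text']
        next_token = result.get('NextToken')
        if not next_token:
            break
        result = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)
    return page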

Create an event trigger at Amazon S3

In this step, we create an event trigger to start the Lambda function when a new document is uploaded to a specific bucket. The following screenshot shows our new function on the Amazon S3 console.

You can also verify the event trigger on the Lambda console.

Process a document

To test the process, we upload an image to the S3 folder that we defined for the S3 event trigger. We use the following sample image.

When the Lambda function is complete, we can go to the Amazon CloudWatch console to check the output. The following screenshot shows the extracted text, which confirms that the Lambda function ran successfully.

Reindex the data with Amazon Kendra

We can now reindex our data.

  1. On the Amazon Kendra console, under Data management in the navigation pane, choose Data sources.
  2. Select the data source demo-s3-datasource.
  3. Choose Sync now.

The sync state changes to Syncing - crawling.

When the sync is complete, the sync status changes to Succeeded and the sync state changes to Idle.
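You can also trigger the sync programmatically rather than through the console. The following is a minimal sketch using boto3; the index ID and data source ID are placeholders for your own values.

import boto3

kendra = boto3.client('kendra')

# Placeholder identifiers for your Amazon Kendra index and S3 data source
INDEX_ID = '<your-index-id>'
DATA_SOURCE_ID = '<your-data-source-id>'

# Start a sync (crawl and index) job on the data source
response = kendra.start_data_source_sync_job(Id=DATA_SOURCE_ID, IndexId=INDEX_ID)
print('Started sync job:', response['ExecutionId'])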

Now we can go back to the search console and see our faceted search in action.

  1. In the navigation pane, choose Search console.

We added metadata for a few items; two of them are the ML algorithms XGBoost and BlazingText.

  2. Let’s try searching for Sagemaker.

Our search was successful, and we got a list of results. Let’s see what we have for facets.

  3. Expand Filter search results.

We have the category and tags facets that were part of our item metadata.

  4. Choose BlazingText to filter results just for that algorithm.
  5. Now let’s perform the search on newly uploaded image files. The following screenshot shows the search on new preprocessed documents.
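You can also run the same search programmatically against the index. The following is a minimal sketch using boto3; the index ID is a placeholder for your own value.

import boto3

kendra = boto3.client('kendra')

# Placeholder identifier for your Amazon Kendra index
INDEX_ID = '<your-index-id>'

# Query the index the same way the search console does
response = kendra.query(IndexId=INDEX_ID, QueryText='Sagemaker')

for result in response['ResultItems']:
    print(result['Type'], '-', result.get('DocumentTitle', {}).get('Text'))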

Conclusion

This solution helps improve the effectiveness of your search results and the overall search experience. You can use Amazon Textract to extract text from scanned images, add that text as metadata, and later make it available as facets to interact with the search results. This is just one illustration of how you can combine AWS native services to create a differentiated search experience for your users and unlock the full potential of your knowledge assets.

For a deeper dive into what you can achieve by combining other AWS services with Amazon Kendra, refer to Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra, Build an intelligent search solution with automated content enrichment, and other posts on the Amazon Kendra blog.


About the Author

Sanjay Tiwary is a Specialist Solutions Architect for AI/ML. He spends his time working with strategic customers to define business requirements, provide L300 sessions around specific use cases, and design ML applications and services that are scalable, reliable, and performant. He has helped launch and scale the AI/ML powered Amazon SageMaker service and has implemented several proofs of concept using Amazon AI services. He has also developed an advanced analytics platform as part of a digital transformation journey.

Read More

Interpret caller input using grammar slot types in Amazon Lex

Customer service calls require customer agents to have the customer’s account information to process the caller’s request. For example, to provide a status on an insurance claim, the support agent needs policy holder information such as the policy ID and claim number. Such information is often collected in the interactive voice response (IVR) flow at the beginning of a customer support call. IVR systems have typically used grammars based on the Speech Recognition Grammar Specification (SRGS) format to define rules and parse caller information (policy ID, claim number). You can now use the same grammars in Amazon Lex to collect information in a speech conversation. You can also provide semantic interpretation rules using ECMAScript tags within the grammar files. The grammar support in Amazon Lex provides granular control for collecting and postprocessing user input so you can manage an effective dialog.

In this post, we review the grammar support in Amazon Lex and author a sample grammar for use in an Amazon Connect contact flow.

Use grammars to collect information in a conversation

You can author the grammar as a slot type in Amazon Lex. First, you provide a set of rules in the SRGS format to interpret user input. As an optional second step, you can write an ECMA script that transforms the information collected in the dialog. Lastly, you store the grammar as an XML file in an Amazon Simple Storage Service (Amazon S3) bucket and reference the link in your bot definition. SRGS grammars are specifically designed for voice and DTMF modality. We use the following sample conversations to model our bot:

Conversation 1

IVR: Hello! How can I help you today?

User: I want to check my account balance.

IVR: Sure. Which account should I pull up?

User: Checking.

IVR: What is the account number?

User: 1111 2222 3333 4444

IVR: For verification purposes, what is your date of birth?

User: Jan 1st 2000.

IVR: Thank you. The balance on your checking account is $123 dollars.

Conversation 2

IVR: Hello! How can I help you today?

User: I want to check my account balance.

IVR: Sure. Which account should I pull up?

User: Savings.

IVR: What is the account number?

User: I want to talk to an agent.

IVR: Ok. Let me transfer the call. An agent should be able to help you with your request.

In the sample conversations, the IVR requests the account type, account number, and date of birth to process the caller’s requests. In this post, we review how to use the grammars to collect the information and postprocess it with ECMA scripts. The grammars for account ID and date cover multiple ways to provide the information. We also review the grammar in case the caller can’t provide the requested details (for example, their savings account number) and instead opts to speak with an agent.

Build an Amazon Lex chatbot with grammars

We build an Amazon Lex bot with intents to perform common retail banking functions such as checking account balance, transferring funds, and ordering checks. The CheckAccountBalance intent collects details such as account type, account ID, and date of birth, and provides the balance amount. We use a grammar slot type to collect the account ID and date of birth. If the caller doesn’t know the information or asks for an agent, the call is transferred to a human agent. Let’s review the grammar for the account ID:

<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics/1.0" root="captureAccount"><!-- Header definition for US language and the root rule "captureAccount" to start with-->

	<rule id="captureAccount" scope="public">
		<tag> out=""</tag>
		<one-of>
			<item><ruleref uri="#digit"/><tag>out += rules.digit.accountNumber</tag></item><!--Call the subrule to capture 16 digits--> 
			<item><ruleref uri="#agent"/><tag>out =rules.agent;</tag></item><!--Exit point to route the caller to an agent--> 
		</one-of>
	</rule>

	<rule id="digit" scope="public"> <!-- Capture digits from 1 to 9 -->
		<tag>out.accountNumber=""</tag>
		<item repeat="16"><!-- Repeat the rule exactly 16 times -->
			<one-of>
				<item>1<tag>out.accountNumber+=1;</tag></item>
				<item>2<tag>out.accountNumber+=2;</tag></item>
				<item>3<tag>out.accountNumber+=3;</tag></item>
				<item>4<tag>out.accountNumber+=4;</tag></item>
				<item>5<tag>out.accountNumber+=5;</tag></item>
				<item>6<tag>out.accountNumber+=6;</tag></item>
				<item>7<tag>out.accountNumber+=7;</tag></item>
				<item>8<tag>out.accountNumber+=8;</tag></item>
				<item>9<tag>out.accountNumber+=9;</tag></item>
				<item>0<tag>out.accountNumber+=0;</tag></item>
				<item>oh<tag>out.accountNumber+=0</tag></item>
				<item>null<tag>out.accountNumber+=0;</tag></item>
			</one-of>
		</item>
	</rule>
	
	<rule id="agent" scope="public"><!-- Exit point to talk to an agent-->
		<item>
			<item repeat="0-1">i</item>
			<item repeat="0-1">want to</item>
			<one-of>
				<item repeat="0-1">speak</item>
				<item repeat="0-1">talk</item>
			</one-of>
			<one-of>
				<item repeat="0-1">to an</item>
				<item repeat="0-1">with an</item>
			</one-of>
			<one-of>
				<item>agent<tag>out="agent"</tag></item>
				<item>employee<tag>out="agent"</tag></item>
			</one-of>
		</item>
    </rule>
</grammar>

The grammar has two rules to parse user input. The first rule interprets the digits provided by the caller. These digits are appended to the output via an ECMAScript tag variable (out). The second rule manages the dialog if the caller wants to talk to an agent; in this case, the out tag is populated with the word agent. After the rules are parsed, the out tag carries the account number (out.accountNumber) or the string agent. The downstream business logic can now use the out tag to handle the call.

Deploy the sample Amazon Lex bot

To create the sample bot and add the grammars, perform the following steps. This creates an Amazon Lex bot called BankingBot, and two grammar slot types (accountNumber, dateOfBirth).

  1. Download the Amazon Lex bot.
  2. On the Amazon Lex console, choose Actions, then choose Import.
  3. Choose the file BankingBot.zip that you downloaded, and choose Import. In the IAM Permissions section, for Runtime role, choose Create a new role with basic Amazon Lex permissions.
  4. Choose the bot BankingBot on the Amazon Lex console.
  5. Download the XML files for accountNumber and dateOfBirth. (In some browsers, you may have to use Save link as to download the XML files.)
  6. On the Amazon S3 console, upload the XML files.
  7. Navigate to the slot types on the Amazon Lex console and choose the accountNumber slot type.
  8. In the slot type grammar, select the S3 bucket that contains the XML file and provide the object key. Choose Save slot type.
  9. Navigate to the slot types on the Amazon Lex console and choose the dateOfBirth slot type.
  10. In the slot type grammar, select the S3 bucket that contains the XML file and provide the object key. Choose Save slot type.
  11. After the grammars are saved, choose Build.
  12. Download the supporting AWS Lambda function code and navigate to the AWS Lambda console.
  13. On the Create function page, select Author from scratch. For the basic information, enter BankingBotEnglish as the function name and choose Python 3.8 as the runtime.
  14. Choose Create function. In the Code source section, open lambda_function.py and delete the existing code. Download the code, open it in a text editor, and copy and paste it into the empty lambda_function.py tab.
  15. Choose Deploy.
  16. Navigate to the Amazon Lex console and choose BankingBot. Choose Deployment, then Aliases, followed by TestBotAlias.
  17. On the Aliases page, select the languages and navigate to English (US).
  18. For Source, select BankingBotEnglish; for Lambda version or alias, select $LATEST.
  19. Navigate to the Amazon Connect console and choose Contact flows.
  20. Download the contact flow to integrate with the Amazon Lex bot.
  21. In the Amazon Lex section, select your Amazon Lex bot and make it available for use in the Amazon Connect contact flows.
  22. Select the contact flow to load it into the application.
  23. Make sure the right bot is configured in the Get Customer Input block, and add a phone number to the contact flow.
  24. Choose a queue in the Set working queue block.
  25. Test the IVR flow by calling in to the phone number.

Test the solution

You can call in to the Amazon Connect phone number and interact with the bot. You can also test the solution directly on the Amazon Lex V2 console using voice and DTMF.
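If you want to exercise the bot programmatically in addition to the console, you can use the Amazon Lex V2 runtime API. The following is a minimal sketch using boto3; the bot ID and alias ID are placeholders you would replace with the values from your deployed BankingBot.

import boto3

lex_runtime = boto3.client('lexv2-runtime')

# Placeholder identifiers for the deployed bot and its test alias
BOT_ID = '<your-bot-id>'
BOT_ALIAS_ID = '<your-bot-alias-id>'
LOCALE_ID = 'en_US'

# Send a text utterance to the bot, as if the caller had said it
response = lex_runtime.recognize_text(
    botId=BOT_ID,
    botAliasId=BOT_ALIAS_ID,
    localeId=LOCALE_ID,
    sessionId='test-session-1',
    text='I want to check my account balance')

for message in response.get('messages', []):
    print(message['content'])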

Conclusion

Custom grammar slots provide the ability to collect different types of information in a conversation. You have the flexibility to capture transitions such as handover to an agent. Additionally, you can postprocess the information before running the business logic. You can enable grammar slot types via the Amazon Lex V2 console or AWS SDK. The capability is available in all AWS Regions where Amazon Lex operates in the English (Australia), English (UK), and English (US) locales.

To learn more, refer to Using a custom grammar slot type. You can also view the Amazon Lex documentation for SRGS or ECMAScript for more information.


About the Authors

Kai Loreck is a professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Harshal Pimpalkhute is a Product Manager on the Amazon Lex team. He spends his time trying to get machines to engage (nicely) with humans.

Read More

Whitepaper: Machine Learning Best Practices in Healthcare and Life Sciences

For customers looking to implement a GxP-compliant environment on AWS for artificial intelligence (AI) and machine learning (ML) systems, we have released a new whitepaper: Machine Learning Best Practices in Healthcare and Life Sciences.

This whitepaper provides an overview of security and good ML compliance practices and guidance on building GxP-regulated AI/ML systems using AWS services. We cover the points raised by the FDA discussion paper and Good Machine Learning Practices (GMLP) while also drawing from AWS resources: the whitepaper GxP Systems on AWS and the Machine Learning Lens from the AWS Well-Architected Framework. The whitepaper was developed based on our experience with and feedback from AWS pharmaceutical and medical device customers, as well as AWS partners, who are currently using AWS services to develop ML models.

Healthcare and life sciences (HCLS) customers are adopting AWS AI and ML services faster than ever before, but they also face the following regulatory challenges during implementation:

  • Building a secure infrastructure that complies with stringent regulatory processes for working on the public cloud and aligning to the FDA framework for AI and ML.
  • Supporting AI/ML-enabled solutions for GxP workloads covering the following:
    • Reproducibility
    • Traceability
    • Data integrity
  • Monitoring ML models with respect to various changes to parameters and data.
  • Handling model uncertainty and confidence calibration.

In our whitepaper, you learn about the following topics:

  • How AWS approaches ML in a regulated environment and provides guidance on Good Machine Learning Practices using AWS services.
  • Our organizational approach to security and compliance that supports GxP requirements as part of the shared responsibility model.
  • How to reproduce the workflow steps, track model and dataset lineage, and establish model governance and traceability.
  • How to monitor and maintain data integrity and quality checks to detect drifts in data and model quality.
  • Security and compliance best practices for managing AI/ML models on AWS.
  • Various AWS services for managing ML models in a regulated environment.

AWS is dedicated to helping you successfully use AWS services in regulated life science environments to accelerate your research, development, and delivery of the next generation of medical, health, and wellness solutions.

Contact us with questions about using AWS services for AI/ML in GxP systems. To learn more about compliance in the cloud, visit AWS Compliance.


About the Authors

Susant Mallick is an industry specialist and digital evangelist in AWS’ Global Healthcare and Life Sciences practice. He has more than 20 years of experience in the life science industry, working with biopharmaceutical and medical device companies across the North America, APAC, and EMEA regions. He has built many digital health platform and patient engagement solutions using mobile apps, AI/ML, IoT, and other technologies for customers in various therapeutic areas. He holds a B.Tech degree in Electrical Engineering and an MBA in Finance. His thought leadership and industry expertise have earned him many accolades in pharma industry forums.

Sai Sharanya Nalla is a Sr. Data Scientist at AWS Professional Services. She works with customers to develop and implement AI/ ML and HPC solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, taking long walks, and engaging in outreach activities.

Read More

Prepare data from Databricks for machine learning using Amazon SageMaker Data Wrangler

Data science and data engineering teams spend a significant portion of their time in the data preparation phase of a machine learning (ML) lifecycle performing data selection, cleaning, and transformation steps. It’s a necessary and important step of any ML workflow in order to generate meaningful insights and predictions, because bad or low-quality data greatly reduces the relevance of the insights derived.

Data engineering teams are traditionally responsible for the ingestion, consolidation, and transformation of raw data for downstream consumption. Data scientists often need to do additional processing on data for domain-specific ML use cases such as natural language and time series. For example, certain ML algorithms may be sensitive to missing values, sparse features, or outliers and require special consideration. Even in cases where the dataset is in a good shape, data scientists may want to transform the feature distributions or create new features in order to maximize the insights obtained from the models. To achieve these objectives, data scientists have to rely on data engineering teams to accommodate requested changes, resulting in dependency and delay in the model development process. Alternatively, data science teams may choose to perform data preparation and feature engineering internally using various programming paradigms. However, it requires an investment of time and effort in installation and configuration of libraries and frameworks, which isn’t ideal because that time can be better spent optimizing model performance.

Amazon SageMaker Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes to aggregate and prepare data for ML from weeks to minutes by providing a single visual interface for data scientists to select, clean, and explore their datasets. Data Wrangler offers over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. You can now also use Databricks as a data source in Data Wrangler to easily prepare data for ML.

The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes. With Databricks as a data source for Data Wrangler, you can now quickly and easily connect to Databricks, interactively query data stored in Databricks using SQL, and preview data before importing. Additionally, you can join your data in Databricks with data stored in Amazon S3, and data queried through Amazon Athena, Amazon Redshift, and Snowflake to create the right dataset for your ML use case.

In this post, we transform the Lending Club Loan dataset using Amazon SageMaker Data Wrangler for use in ML model training.

Solution overview

The following diagram illustrates our solution architecture.

The Lending Club Loan dataset contains complete loan data for all loans issued through 2007–2011, including the current loan status and latest payment information. It has 39,717 rows, 22 feature columns, and 3 target labels.

To transform our data using Data Wrangler, we complete the following high-level steps:

  1. Download and split the dataset.
  2. Create a Data Wrangler flow.
  3. Import data from Databricks to Data Wrangler.
  4. Import data from Amazon S3 to Data Wrangler.
  5. Join the data.
  6. Apply transformations.
  7. Export the dataset.

Prerequisites

This post assumes you have a running Databricks cluster. If your cluster is running on AWS, verify you have the configuration described in the following section.

Databricks setup

Follow Secure access to S3 buckets using instance profiles for the required AWS Identity and Access Management (IAM) roles, S3 bucket policy, and Databricks cluster configuration. Ensure the Databricks cluster is configured with the proper instance profile, selected under the advanced options, to access the desired S3 bucket.

After the Databricks cluster is up and running with required access to Amazon S3, you can fetch the JDBC URL from your Databricks cluster to be used by Data Wrangler to connect to it.

Fetch the JDBC URL

To fetch the JDBC URL, complete the following steps:

  1. In Databricks, navigate to the clusters UI.
  2. Choose your cluster.
  3. On the Configuration tab, choose Advanced options.
  4. Under Advanced options, choose the JDBC/ODBC tab.
  5. Copy the JDBC URL.

Make sure to substitute your personal access token in the URL.

Data Wrangler setup

This step assumes you have access to Amazon SageMaker, an instance of Amazon SageMaker Studio, and a Studio user.

To allow access to the Databricks JDBC connection from Data Wrangler, the Studio user requires the following permission:

  • secretsmanager:PutResourcePolicy

As an IAM administrative user, follow these steps to add the preceding permission to the IAM execution role assigned to the Studio user.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose the role assigned to your Studio user.
  3. Choose Add permissions.
  4. Choose Create inline policy.
  5. For Service, choose Secrets Manager.
  6. On Actions, choose Access level.
  7. Choose Permissions management.
  8. Choose PutResourcePolicy.
  9. For Resources, choose Specific and select Any in this account.

Download and split the dataset

Start by downloading the dataset. For demonstration purposes, we split the dataset by copying the feature columns id, emp_title, emp_length, home_owner, and annual_inc into a second file, loans_2.csv. We then remove those columns (except the id column) from the original loans file and rename it loans_1.csv. Upload loans_1.csv to Databricks to create the table loans_1, and upload loans_2.csv to an S3 bucket.
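A minimal sketch of this split using pandas follows; the input file name loans.csv is an assumption about how you saved the downloaded dataset.

import pandas as pd

# Columns copied into the second file (the id column is kept in both files as the join key)
SPLIT_COLUMNS = ['id', 'emp_title', 'emp_length', 'home_owner', 'annual_inc']

# Load the full Lending Club loan dataset (file name is an assumption)
loans = pd.read_csv('loans.csv')

# loans_2.csv: only the copied feature columns
loans[SPLIT_COLUMNS].to_csv('loans_2.csv', index=False)

# loans_1.csv: everything else, plus the id column
loans_1_columns = [c for c in loans.columns if c not in SPLIT_COLUMNS or c == 'id']
loans[loans_1_columns].to_csv('loans_1.csv', index=False)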

Create a Data Wrangler flow

For information on Data Wrangler prerequisites, see Get Started with Data Wrangler.

Let’s get started by creating a new data flow.

  1. On the Studio console, on the File menu, choose New.
  2. Choose Data Wrangler flow.
  3. Rename the flow as desired.

Alternatively, you can create a new data flow from the Launcher.

  • On the Studio console, choose Amazon SageMaker Studio in the navigation pane.
  • Choose New data flow.

Creating a new flow can take a few minutes to complete. After the flow has been created, you see the Import data page.

Import data from Databricks into Data Wrangler

Next, we set up Databricks (JDBC) as a data source in Data Wrangler. To import data from Databricks, we first need to add Databricks as a data source.

  1. On the Import data tab of your Data Wrangler flow, choose Add data source.
  2. On the drop-down menu, choose Databricks (JDBC).

On the Import data from Databricks page, you enter your cluster details.

  1. For Dataset name, enter a name you want to use in the flow file.
  2. For Driver, choose the driver com.simba.spark.jdbc.Driver.
  3. For JDBC URL, enter the URL of your Databricks cluster obtained earlier.

The URL should resemble the following format: jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>.

  1. In the SQL query editor, specify the following SQL SELECT statement:
    select * from loans_1

If you chose a different table name while uploading data to Databricks, replace loans_1 in the above SQL query accordingly.

In the SQL query section in Data Wrangler, you can query any table connected to the JDBC Databricks database. The pre-selected Enable sampling setting retrieves the first 50,000 rows of your dataset by default. Depending on the size of the dataset, unselecting Enable sampling may result in longer import time.

  1. Choose Run.

Running the query gives a preview of your Databricks dataset directly in Data Wrangler.

  1. Choose Import.

Data Wrangler provides the flexibility to set up multiple concurrent connections to one Databricks cluster, or to multiple clusters if required, enabling analysis and preparation of combined datasets.

Import the data from Amazon S3 into Data Wrangler

Next, let’s import the loans_2.csv file from Amazon S3.

  1. On the Import tab, choose Amazon S3 as the data source.
  2. Navigate to the S3 bucket for the loans_2.csv file.

When you select the CSV file, you can preview the data.

  1. In the Details pane, choose Advanced configuration to make sure Enable sampling is selected and COMMA is chosen for Delimiter.
  2. Choose Import.

After the loans_2.csv dataset is successfully imported, the data flow interface displays both the Databricks JDBC and Amazon S3 data sources.

Join the data

Now that we have imported data from Databricks and Amazon S3, let’s join the datasets using a common unique identifier column.

  1. On the Data flow tab, for Data types, choose the plus sign for loans_1.
  2. Choose Join.
  3. Choose the loans_2.csv file as the Right dataset.
  4. Choose Configure to set up the join criteria.
  5. For Name, enter a name for the join.
  6. For Join type, choose Inner for this post.
  7. Choose the id column to join on.
  8. Choose Apply to preview the joined dataset.
  9. Choose Add to add it to the data flow.

Apply transformations

Data Wrangler comes with over 300 built-in transforms, which require no coding. Let’s use built-in transforms to prepare the dataset.

Drop column

First we drop the redundant ID column.

  1. On the joined node, choose the plus sign.
  2. Choose Add transform.
  3. Under Transforms, choose + Add step.
  4. Choose Manage columns.
  5. For Transform, choose Drop column.
  6. For Columns to drop, choose the column id_0.
  7. Choose Preview.
  8. Choose Add.

Format string

Let’s apply string formatting to remove the percentage symbol from the int_rate and revol_util columns.

  1. On the Data tab, under Transforms, choose + Add step.
  2. Choose Format string.
  3. For Transform, choose Strip characters from right.

Data Wrangler allows you to apply your chosen transformation on multiple columns simultaneously.

  1. For Input columns, choose int_rate and revol_util.
  2. For Characters to remove, enter %.
  3. Choose Preview.
  4. Choose Add.

Featurize text

Let’s now vectorize verification_status, a text feature column. We convert the text column into term frequency–inverse document frequency (TF-IDF) vectors by applying the count vectorizer and a standard tokenizer as described below. Data Wrangler also provides the option to bring your own tokenizer, if desired.

  1. Under Transforms, choose + Add step.
  2. Choose Featurize text.
  3. For Transform, choose Vectorize.
  4. For Input columns, choose verification_status.
  5. Choose Preview.
  6. Choose Add.

Export the dataset

After we apply multiple transformations on different column types, including text, categorical, and numeric, we’re ready to use the transformed dataset for ML model training. The last step is to export the transformed dataset to Amazon S3. In Data Wrangler, you have multiple options to choose from for downstream consumption of the transformations.

In this post, we take advantage of the Export data option in the Transform view to export the transformed dataset directly to Amazon S3.

  1. Choose Export data.
  2. For S3 location, choose Browse and choose your S3 bucket.
  3. Choose Export data.

Clean up

If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.

Conclusion

In this post, we covered how you can quickly and easily set up and connect Databricks as a data source in Data Wrangler, interactively query data stored in Databricks using SQL, and preview data before importing. Additionally, we looked at how you can join your data in Databricks with data stored in Amazon S3. We then applied data transformations on the combined dataset to create a data preparation pipeline. To explore more of Data Wrangler’s analysis capabilities, including target leakage and bias report generation, refer to the blog post Accelerate data preparation using Amazon SageMaker Data Wrangler for diabetic patient readmission prediction.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.


About the Authors

Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using Artificial Intelligence and Machine Learning. In his spare time, Roop enjoys reading and hiking.

Igor Alekseev is a Partner Solution Architect at AWS in Data and Analytics. Igor works with strategic partners helping them build complex, AWS-optimized architectures. Prior to joining AWS, as a Data/Solution Architect, he implemented many projects in Big Data, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI/ML to fraud detection and office automation. Igor’s projects were in a variety of industries including communications, finance, public safety, manufacturing, and healthcare. Earlier in his career, Igor worked as a full stack engineer/tech lead.

Huong Nguyen is a Sr. Product Manager at AWS. She is leading the user experience for SageMaker Studio. She has 13 years’ experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys reading, being in nature, and spending time with her family.

Henry Wang is a software development engineer at AWS. He recently joined the Data Wrangler team after graduating from UC Davis. He has an interest in data science and machine learning and does 3D printing as a hobby.

Read More

Personalize cross-channel customer experiences with Amazon SageMaker, Amazon Personalize, and Twilio Segment

Today, customers interact with brands over an increasingly large digital and offline footprint, generating a wealth of interaction data known as behavioral data. As a result, marketers and customer experience teams must work with multiple overlapping tools to engage and target those customers across touchpoints. This increases complexity, creates multiple views of each customer, and makes it more challenging to provide an individual experience with relevant content, messaging, and product suggestions to each customer. In response, marketing teams use customer data platforms (CDPs) and cross-channel campaign management tools (CCCMs) to simplify the process of consolidating multiple views of their customers. These technologies provide non-technical users with an accelerated path to enable cross-channel targeting, engagement, and personalization, while reducing marketing teams’ dependency on technical teams and specialist skills to engage with customers.

Despite this, marketers find themselves with blind spots in customer activity when these technologies aren’t integrated with systems from other parts of the business. This is particularly true with non-digital channels, for example, in-store transactions or customer feedback from customer support. Marketing teams and their customer experience counterparts also struggle to integrate predictive capabilities developed by data scientists into their cross-channel campaigns or customer touchpoints. As a result, customers receive messaging and recommendations that aren’t relevant or are inconsistent with their expectations.

This post outlines how cross-functional teams can work together to address these challenges using an omnichannel personalization use case. We use a fictional retail scenario to illustrate how those teams interlock to provide a personalized experience at various points along the customer journey. We use Twilio Segment in our scenario, a customer data platform built on AWS. There are more than 12 CDPs in the market to choose from, many of which are also AWS partners, but we use Segment in this post because they provide a self-serve free tier that allows you to explore and experiment. We explain how to combine the output from Segment with in-store sales data, product metadata, and inventory information. Building on this, we explain how to integrate Segment with Amazon Personalize to power real-time recommendations. We also describe how we create scores for churn and repeat-purchase propensity using Amazon SageMaker. Lastly, we explore how to target new and existing customers in three ways:

  • With banners on third-party websites, also known as display advertising, using a propensity-to-buy score to attract similar customers.
  • On web and mobile channels presented with personalized recommendations powered by Amazon Personalize, which uses machine learning (ML) algorithms to create content recommendations.
  • With personalized messaging using Amazon Pinpoint, an outbound and inbound marketing communications service. These messages target disengaged customers and those showing a high propensity to churn.

Solution overview

Imagine you are a product owner leading the charge on cross-channel customer experience for a retail company. The company has a diverse set of online and offline channels, but sees digital channels as its primary opportunity for growth. They want to grow the size and value of their customer base with the following methods:

  • Attract new, highly qualified customers who are more likely to convert
  • Increase the average order value of all their customers
  • Re-attract disengaged customers to return and hopefully make repeat purchases

To ensure those customers receive a consistent experience across channels, you as a product owner need to work with teams such as digital marketing, front-end development, mobile development, campaign delivery, and creative agencies. To ensure customers receive relevant recommendations, you also need to work with data engineering and data science teams. Each of these teams are responsible for interacting with or developing features within the architecture illustrated in the following diagram.

The solution workflow contains the following high-level steps:

  1. Collect data from multiple sources to store in Amazon Simple Storage Service (Amazon S3).
  2. Use AWS Step Functions to orchestrate data onboarding and feature engineering.
  3. Build segments and predictions using SageMaker.
  4. Use propensity scores for display targeting.
  5. Send personalized messaging using Amazon Pinpoint.
  6. Integrate real-time personalized suggestions using Amazon Personalize.

In the following sections, we walk through each step, explain the activities of each team at a high level, provide references to related resources, and share hands-on labs that provide more detailed guidance.

Collect data from multiple sources

Digital marketing, front-end, and mobile development teams can configure Segment to capture and integrate web and mobile analytics, digital media performance, and online sales sources using Segment Connections. Segment Personas allows digital marketing teams to resolve the identity of users by stitching together interactions across these sources into a single user profile with one persistent identifier. These profiles, along with calculated metrics called Computed Traits and raw events, can be exported to Amazon S3. The following screenshot shows how identity rules are set up in Segment Personas.

In parallel, engineering teams can use AWS Data Migration Service (AWS DMS) to replicate in-store sales, product metadata, and inventory data sources from databases such as Microsoft SQL or Oracle and store the output in Amazon S3.

Data onboarding and feature engineering

After data is collected and stored in the landing zone on Amazon S3, data engineers can use components from the serverless data lake framework (SDLF) to accelerate data onboarding and build out the foundational structure of a data lake. With SDLF, engineers can automate the preparation of user-item data used to train Amazon Personalize or create a single view of customer behavior by joining online and offline behavioral data and sales data, using attributes such as customer ID or email address as a common identifier.

Step Functions is the key orchestrator driving these transformation jobs within SDLF. You can use Step Functions to build and orchestrate both scheduled and event-driven data workflows. The engineering team can orchestrate the tasks of other AWS services within a data pipeline. The outputs from this process are stored in a trusted zone on Amazon S3 to use for ML development. For more information on implementing the serverless data lake framework, see AWS serverless data analytics pipeline reference architecture.

Build segments and predictions

The process of building segments and predictions can be broken down into three steps: access the environment, build propensity models, and create output files.

Access the environment

After the engineering team has prepared and transformed the ML development data, the data science team can build propensity models using SageMaker. First, they build, train, and test an initial set of ML models. This allows them to see early results, decide which direction to go next, and reproduce experiments.

The data science team needs an active Amazon SageMaker Studio instance, an integrated development environment (IDE) for rapid ML experimentation. It unifies all the key features of SageMaker and offers an environment to manage the end-to-end ML pipelines. It removes complexity and reduces the time it takes to build ML models and deploy them into production. Developers can use SageMaker Studio notebooks, which are one-click Jupyter notebooks that you can quickly spin up to enable the entire ML workflow from data preparation to model deployment. For more information on SageMaker for ML, see Amazon SageMaker for Data Science.

Build the propensity models

To estimate churn and repeat-purchase propensity, the customer experience and data science teams should agree on the known driving factors for either outcome.

The data science team validates these known factors while also discovering unknown factors through the modeling process. An example of a factor driving churn can be the number of returns in the last 3 months. An example of a factor driving repurchases can be the number of items saved on the website or mobile app.

For our use case, we assume that the digital marketing team wants to create a target audience using lookalike modeling to find customers most likely to repurchase in the next month. We also assume that the campaign team wants to send an email offer to customers who will likely end their subscription in the next 3 months to encourage them to renew their subscription.

The data science team can start by analyzing the data (features) and summarizing the main characteristics of the dataset to understand the key data behaviors. They can then shuffle and split the data into training and test sets and upload these datasets to the trusted zone. You can use an algorithm such as the XGBoost classifier to train the model and automatically perform feature selection, identifying the best set of candidate features for determining the propensity scores (or predicted values).

You can then tune the model by optimizing the algorithm metrics (such as hyperparameters) based on the ranges provided within the XGBoost framework. Test data is used to evaluate the model’s performance and estimate how well it generalizes to new data. For more information on evaluation metrics, see Tune an XGBoost Model.

Lastly, the propensity scores are calculated for each customer and stored in the trusted S3 zone to be accessed, reviewed, and validated by the marketing and campaign teams. This process also provides a prioritized evaluation of feature importance, which helps to explain how the scores were produced.
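As an illustration of the modeling step described above, the following is a minimal sketch of training a propensity model with the XGBoost scikit-learn API on an already-prepared feature table; the file and column names (for example, customer_features.csv and churned) are hypothetical and stand in for the factors agreed on with the customer experience team.

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical feature table prepared by the engineering team in the trusted zone
features = pd.read_csv('customer_features.csv')

X = features.drop(columns=['customer_id', 'churned'])
y = features['churned']

# Shuffle and split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a gradient-boosted tree classifier
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Evaluate on held-out data and score every customer with a churn propensity
print('AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
features['churn_propensity'] = model.predict_proba(X)[:, 1]
features[['customer_id', 'churn_propensity']].to_csv('propensity_scores.csv', index=False)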

Create the output files

After the data science team has completed the model training and tuning, they work with the engineering team to deploy the best model to production. We can use SageMaker batch transform to run predictions as new data is collected and generate scores for each customer. The engineering team can orchestrate and automate the ML workflow using Amazon SageMaker Pipelines, a purpose-built continuous integration and continuous delivery (CI/CD) service for ML, which offers an environment to manage the end-to-end ML workflow. It saves time and reduces errors typically caused by manual orchestration.

The output of the ML workflow is imported by Amazon Pinpoint for sending personalized messaging and exported to Segment to use when targeting on display channels. The following illustration provides a visual overview of the ML workflow.

The following screenshot shows an example output file.

Use propensity scores for display targeting

The engineering and digital marketing teams can create the reverse data flow back to Segment to increase reach. This uses a combination of AWS Lambda and Amazon S3. Every time a new output file is generated by the ML workflow and saved in the trusted S3 bucket, a Lambda function is invoked that triggers an export to Segment. Digital marketing can then use regularly updated propensity scores as customer attributes to build and export audiences to Segment destinations (see the following screenshot). For more information on the file structure of the Segment export, see Amazon S3 from Lambda.
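The following is a minimal sketch of what such a Lambda function could look like, using Segment's analytics-python library to write the scores back as user traits; the file layout and trait names are assumptions for illustration, and the write key would normally come from configuration rather than code.

import csv
import io

import boto3
import analytics  # Segment's analytics-python library, bundled with the function

# Segment write key for the source receiving the traits (assumption: stored as an env variable in practice)
analytics.write_key = '<your-segment-write-key>'

s3 = boto3.client('s3')

def handler(event, context):
    # Read the newly created propensity score file that triggered this function
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

    # Attach each score to the corresponding Segment user profile as a trait
    for row in csv.DictReader(io.StringIO(body)):
        analytics.identify(row['customer_id'], {
            'churn_propensity': float(row['churn_propensity'])
        })

    # Ensure queued events are delivered before the function exits
    analytics.flush()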

When the data is available in Segment, digital marketing can see the propensity scores developed in SageMaker as attributes when they create customer segments. They can generate lookalike audiences to target them with digital advertising. To create a feedback loop, digital marketing must ensure that impressions, clicks, and campaigns are being ingested back into Segment to optimize performance.

Send personalized outbound messaging

The campaign delivery team can implement and deploy AI-driven win-back campaigns to re-engage customers at risk of churn. These campaigns use the list of customer contacts generated in SageMaker as segments while integrating with Amazon Personalize to present personalized product recommendations. See the following diagram.

The digital marketing team can experiment using Amazon Pinpoint journeys to split win-back segments into subgroups and reserve a percentage of users as a control group that isn’t exposed to the campaign. This allows them to measure the campaign’s impact and creates a feedback loop.

Integrate real-time recommendations

To personalize inbound channels, the digital marketing and engineering teams work together to integrate and configure Amazon Personalize to provide product recommendations at different points in the customer’s journey. For example, they can deploy a similar item recommender on product detail pages to suggest complementary items (see the following diagram). Additionally, they can deploy a content-based filtering recommender in the checkout journey to remind customers of products they would typically buy before completing their order.

First, the engineering team needs to create RESTful microservices that respond to web, mobile, and other channel application requests with product recommendations. These microservices call Amazon Personalize to get recommendations, resolve product IDs into more meaningful information like name and price, check inventory stock levels, and determine which Amazon Personalize campaign endpoint to query based on the user’s current page or screen.
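A minimal sketch of the call such a microservice might make to Amazon Personalize is shown below; the campaign ARN is a placeholder, and the lookup of product names, prices, and stock levels against your catalog service is only hinted at in the comments.

import boto3

personalize_runtime = boto3.client('personalize-runtime')

# Placeholder ARN for the Amazon Personalize campaign serving this page type
CAMPAIGN_ARN = 'arn:aws:personalize:<region>:<account-id>:campaign/<campaign-name>'

def get_product_recommendations(user_id, num_results=10):
    # Ask Amazon Personalize for the top items for this user
    response = personalize_runtime.get_recommendations(
        campaignArn=CAMPAIGN_ARN,
        userId=user_id,
        numResults=num_results)

    # Each item only carries an item ID; the microservice would resolve names,
    # prices, and inventory levels against the product catalog before responding
    return [item['itemId'] for item in response['itemList']]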

The front-end and mobile development teams need to add tracking events for specific customer actions to their applications. They can then use Segment to send those events directly to Amazon Personalize in real time. These tracking events are the same as the user-item data we extracted earlier. They allow Amazon Personalize solutions to refine recommendations based on live customer interactions. It’s essential to capture impressions, product views, cart additions, and purchases because these events create a feedback loop for the recommenders. Lambda is an intermediary, collecting user events from Segment and sending them to Amazon Personalize. Lambda also facilitates the reverse data exchange, relaying updated recommendations for the user back to Segment. For more information on configuring real-time recommendations with Segment and Amazon Personalize, see the Segment Real-time data and Amazon Personalize Workshop.
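The event forwarding from Lambda to Amazon Personalize can be sketched as follows; the event tracker ID is a placeholder, and the mapping from the Segment payload to the userId, sessionId, and itemId fields is an assumption about how your events are structured.

from datetime import datetime

import boto3

personalize_events = boto3.client('personalize-events')

# Placeholder ID of the Amazon Personalize event tracker for the dataset group
TRACKING_ID = '<your-event-tracker-id>'

def forward_event(segment_event):
    # Relay a single interaction (for example, a product view) to Amazon Personalize
    personalize_events.put_events(
        trackingId=TRACKING_ID,
        userId=segment_event['userId'],
        sessionId=segment_event['sessionId'],
        eventList=[{
            'eventType': segment_event['event'],  # e.g. 'Product Viewed'
            'itemId': segment_event['properties']['product_id'],
            'sentAt': datetime.now()
        }])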

Conclusion

This post described how to deliver an omnichannel customer experience using a combination of Segment customer data platform and AWS services such as Amazon SageMaker, Amazon Personalize, and Amazon Pinpoint. We explored the role cross-functional teams play at each stage in the customer journey and in the data value chain. The architecture and approach discussed are focused on a retail environment, but you can apply it to other verticals such as financial services or media and entertainment. If you’re interested in trying out some of what we discussed, check out the Retail Demo Store, where you can find hands-on workshops that include Segment and other AWS partners.

Additional references

For additional information, see the following resources:

About Segment

Segment is an AWS Advanced Technology Partner and holder of the following AWS Independent Software Vendor (ISV) competencies: Data & Analytics, Digital Customer Experience, Retail, and Machine Learning. Brands such as Atlassian and Digital Ocean use real-time analytics solutions powered by Segment.


About the Authors

Dwayne Browne is a Principal Analytics Platform Specialist at AWS based in London. He is part of the Data-Driven Everything (D2E) customer program, where he helps customers become more data-driven and customer experience focused. He has a background in digital analytics, personalization, and marketing automation. In his spare time, Dwayne enjoys indoor climbing and exploring nature.

Hara Gavriliadi is a Senior Data Analytics Strategist at AWS Professional Services based in London. She helps customers transform their business using data, analytics, and machine learning. She specializes in customer analytics and data strategy. Hara loves countryside walks and enjoys discovering local bookstores and yoga studios in her free time.

Kenny Rajan is a Senior Partner Solution Architect. Kenny helps customers get the most from AWS and its partners by demonstrating how AWS partners and AWS services work better together. He’s interested in machine learning, data, ERP implementation, and voice-based solutions on the cloud. Outside of work, Kenny enjoys reading books and helping with charity activities.

Read More

Automated, scalable, and cost-effective ML on AWS: Detecting invasive Australian tree ferns in Hawaiian forests

This blog post is co-written by Theresa Cabrera Menard, an Applied Scientist/Geographic Information Systems Specialist at The Nature Conservancy (TNC) in Hawaii.

In recent years, Amazon and AWS have developed a series of sustainability initiatives with the overall goal of helping preserve the natural environment. As part of these efforts, AWS Professional Services establishes partnerships with organizations such as The Nature Conservancy (TNC), offering financial support and consulting services towards environmental preservation efforts. The advent of big data technologies is rapidly scaling up ecological data collection, while machine learning (ML) techniques are increasingly utilized in ecological data analysis. AWS is in a unique position to help with data storage and ingestion as well as with data analysis.

Hawaiian forests are essential as a source of clean water and for preservation of traditional cultural practices. However, they face critical threats from deforestation, species extinction, and displacement of native species by invasive plants. The state of Hawaii spends about half a billion dollars yearly fighting invasive species. TNC is helping to address the invasive plant problem through initiatives such as the Hawaii Challenge, which allows anyone with a computer and internet access to participate in tagging invasive weeds across the landscape. AWS has partnered with TNC to build upon these efforts and develop a scalable, cloud-based solution that automates and expedites the detection and localization of invasive ferns.

Among the most aggressive species invading the Hawaiian forests is the Australian tree fern, originally introduced as an ornamental, but now rapidly spreading across several islands by producing numerous spores that are easily transported by the wind. The Australian tree fern is fast growing and outcompetes other plants, smothering the canopy and affecting several native species, resulting in a loss of biological diversity.

Currently, detection of the ferns is accomplished by capturing images from fixed wing aircraft surveying the forest canopy. The imagery is manually inspected by human labelers. This process takes significant effort and time, potentially delaying the mitigation efforts by ground crews by weeks or longer. One of the advantages of utilizing a computer vision (CV) algorithm is the potential time savings because the inference time is expected to take only a few hours.

Machine learning pipeline

The following diagram shows the overall ML workflow of this project. The first goal of the AWS-TNC partnership was to automate the detection of ferns from aerial imagery. A second goal was to evaluate the potential of CV algorithms to reliably classify ferns as either native or invasive. The CV model inference can then form the basis of a fully automated AWS Cloud-native solution that enhances the capacity of TNC to efficiently and in a timely manner detect invasive ferns and direct resources to highly affected areas. The following diagram illustrates this architecture.

In the following sections, we cover the following topics:

  • The data processing and analysis tools utilized.
  • The fern detection model pipeline, including training and evaluation.
  • How native and invasive ferns are classified.
  • The benefits TNC experienced through this implementation.

Data processing and analysis

Aerial footage is acquired by TNC contractors by flying fixed winged aircraft above affected areas within the Hawaiian Islands. Heavy and persistent cloud cover prevents use of satellite imagery. The data available to TNC and AWS consists of raw images and metadata allowing the geographical localization of the inferred ferns.

Images and geographical coordinates

Images received from aerial surveys are in the range of 100,000 x 100,000 pixels and are stored in the JPEG2000 (JP2) format, which incorporates geolocation and other metadata. Each pixel can be associated with specific Universal Transverse Mercator (UTM) geospatial coordinates. The UTM coordinate system divides the world into north-south zones, each 6 degrees of longitude wide. The first UTM coordinate (northing) refers to the distance between a geographical position and the equator, measured with the north as the positive direction. The second coordinate (easting) measures the distance, in meters, towards east, starting from a central meridian that is uniquely assigned for each zone. By convention, the central meridian in each region has a value of 500,000, and a meter east of the region central meridian therefore has the value 500,001. To convert between pixel coordinates and UTM coordinates, we utilize the affine transform x' = a*x + b*y + c and y' = d*x + e*y + f, where x', y' are UTM coordinates and x, y are pixel coordinates. The parameters a, b, c, d, e, and f of the affine transform are provided as part of the JP2 file metadata.

For the purposes of labeling, training, and inference, the raw JP2 files are divided into non-overlapping 512 x 512-pixel JPG files. The extraction of smaller sub-images from the original JP2 requires deriving an individual affine transform for each extracted JPG file. These operations were performed utilizing the rasterio and affine Python packages with AWS Batch, and facilitated the reporting of the position of inferred ferns in UTM coordinates.
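As a rough illustration of this step, the following sketch tiles a JP2 file and derives a per-tile affine transform that maps tile pixel coordinates back to UTM coordinates. The input file name and the example pixel location are hypothetical; the actual pipeline additionally writes each tile out as a JPG and runs at scale on AWS Batch.

```python
import rasterio
from rasterio.windows import Window
from rasterio.windows import transform as window_transform

TILE = 512  # non-overlapping 512 x 512-pixel tiles

# "survey_area.jp2" is a hypothetical input file name
with rasterio.open("survey_area.jp2") as src:
    for row_off in range(0, src.height, TILE):
        for col_off in range(0, src.width, TILE):
            window = Window(col_off, row_off, TILE, TILE)
            tile = src.read(window=window)  # pixel data for this tile

            # Affine transform (a, b, c, d, e, f) specific to this tile, derived
            # from the JP2 metadata, so detections reported in tile pixel
            # coordinates can later be mapped to UTM coordinates.
            tile_affine = window_transform(window, src.transform)

            # Example: UTM (easting, northing) of a detection at tile pixel (x, y)
            x, y = 100, 250
            easting, northing = tile_affine * (x, y)
```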

Data labeling

Visual identification of ferns in the aerial images is complicated by several factors. Most of the information is aggregated in the green channel, and there is a high density of foliage with frequent partial occlusion of ferns by both nearby ferns and other vegetation. The information of interest to TNC is the relative density of ferns per acre; therefore, it's important to count each individual fern even in the presence of occlusion. Given these goals and constraints, we chose to utilize an object detection CV framework.

To label the data, we set up an Amazon SageMaker Ground Truth labeling job. Each bounding box was intended to be centered on the fern and to cover most of the fern branches, while at the same time minimizing the inclusion of other vegetation. The labeling was performed by the authors following consultation with TNC domain experts. The initial labeled dataset included 500 images, each typically containing several ferns, as shown in the following example images. In this initial labeled set, we did not distinguish between native and invasive ferns.

Fern object detection model training and update

In this section, we discuss training the initial fern detection model, data labeling in Ground Truth, and updating the model through retraining. We also discuss using Amazon Augmented AI (Amazon A2I) for the model update, and using AWS Step Functions for the overall fern detection inference pipeline.

Initial fern detection model training

We utilized the Amazon SageMaker object detection algorithm because it provides state-of-the-art performance and can be easily integrated with other SageMaker services such as Ground Truth, endpoints, and batch transform jobs. We utilized the Single Shot MultiBox Detector (SSD) framework with the vgg-16 base network. This network comes pre-trained on millions of images and thousands of classes from the ImageNet dataset. We broke all the TNC-provided JP2 images into 512 x 512-pixel tiles as the training dataset. This produced about 5,000 small JPG images, of which we randomly selected 4,500 as the training dataset and 500 as the validation dataset. After hyperparameter tuning, we chose the following hyperparameters for the model training: num_classes=1, overlap_threshold=0.3, learning_rate=0.001, and epochs=50. The initial model's mean average precision (mAP) computed on the validation set was 0.49. After checking the detection results and TNC labels, we discovered that many ferns detected by our object detection model were not labeled as ferns in the TNC labels, as shown in the following images.
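For reference, the following sketch shows how such a training job can be configured with the SageMaker Python SDK. The instance type, S3 bucket, and channel locations are placeholders not specified in this post, and the snippet assumes it runs in an environment where a SageMaker execution role is available.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker notebook or similar environment

# Container for the built-in SageMaker object detection (SSD) algorithm
container = image_uris.retrieve("object-detection", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # hypothetical GPU instance choice
    output_path="s3://my-bucket/fern-detection/output",  # hypothetical bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    base_network="vgg-16",
    use_pretrained_model=1,   # start from the ImageNet pre-trained base network
    num_classes=1,            # single "fern" class
    num_training_samples=4500,
    image_shape=512,
    epochs=50,
    learning_rate=0.001,
    overlap_threshold=0.3,
)

# Hypothetical channel locations; annotations follow the algorithm's JSON format
estimator.fit({
    "train": "s3://my-bucket/fern-detection/train",
    "validation": "s3://my-bucket/fern-detection/validation",
    "train_annotation": "s3://my-bucket/fern-detection/train_annotation",
    "validation_annotation": "s3://my-bucket/fern-detection/validation_annotation",
})
```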

Based on this mismatch, we decided to use Ground Truth to relabel a subset of the fern dataset in an attempt to improve model performance, and then compare the inference results with those of this initial model to check which approach is better.

Data labeling in Ground Truth

To label the fern dataset, we set up a Ground Truth job with 500 randomly selected 512 x 512-pixel images. Each bounding box was intended to be centered on the fern and to cover most of the fern branches, while at the same time minimizing the inclusion of other vegetation. The labeling was performed by AWS data scientists following consultation with TNC domain experts. In this labeled dataset, we didn't distinguish between native and invasive ferns.

Retraining the fern detection model

The first model training iteration utilized a set of 500 labeled images, of which 400 were in the training set and 100 in the validation set. This model achieved a mAP score (computed on the validation set) of 0.46, which isn't very high. We next used this initial model to produce predictions on a larger set of 3,888 JPG images extracted from the available JP2 data, and used these predictions as automatically generated labels for retraining. With this larger image set for training, the model achieved a mAP score of 0.87. This marked improvement (as shown in the following example images) illustrates the value of automated labeling and model iteration.

Based on these findings, we determined that Ground Truth labeling plus automated labeling and model iteration appear to significantly increase prediction performance. To further quantify the performance of the resulting model, a set of 300 images was randomly selected for an additional round of validation. We found that when utilizing a threshold of 0.3 for detection confidence, 84% of the images were deemed by the labeler to have the correct number of predicted ferns, with 6.3% being overcounts and 9.7% being undercounts. In most cases, the over- or undercounting was off by only one or two ferns out of five or six present in an image, and is therefore not expected to significantly affect the overall estimation of fern density per acre.
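As a small illustration of this validation step, the following sketch counts the detections that clear the 0.3 confidence threshold in one batch transform output file, so the predicted count can be compared against the labeler's count. The file name is hypothetical; the JSON structure matches the output format of the built-in SageMaker object detection algorithm.

```python
import json

THRESHOLD = 0.3  # detection confidence threshold used in the validation round

def count_ferns(prediction_file):
    """Count detections above the confidence threshold in one prediction file.

    The built-in object detection algorithm emits JSON of the form
    {"prediction": [[class_id, score, xmin, ymin, xmax, ymax], ...]}.
    """
    with open(prediction_file) as f:
        detections = json.load(f)["prediction"]
    return sum(1 for _, score, *_ in detections if score >= THRESHOLD)

# Hypothetical file name for one 512 x 512 tile's predictions
predicted_count = count_ferns("tile_0012.jpg.out")
```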

Amazon A2I for fern detection model update

One challenge for this project is that new images come in every year and are taken from aircraft, so the altitude, angles, and light conditions of the images may differ. The model trained on the previous dataset needs to be retrained to maintain good performance, but labeling ferns for a new dataset is labor-intensive. Therefore, we used Amazon A2I to integrate human review to ensure accuracy with new data. We used 360 images as a test dataset; 35 images were sent back for review because they didn't have predictions with a confidence score over 0.3. We relabeled these 35 images and retrained the model using incremental learning in Amazon A2I. The retrained model showed significant improvement over the previous model in many aspects, such as detections under darker light conditions, as shown in the following images. These improvements made the new model perform fairly well on the new dataset with very little human review and relabeling work.
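To make the human review step concrete, the following sketch starts an Amazon A2I human loop for an image whose detections all fall below the 0.3 confidence threshold. The flow definition ARN, loop name, and input payload shape are hypothetical; in practice they depend on the human review workflow configured in Amazon A2I.

```python
import json
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")

CONFIDENCE_THRESHOLD = 0.3
# Hypothetical flow definition created for the fern review workflow
FLOW_DEFINITION_ARN = "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/fern-review"

def maybe_send_for_review(image_s3_uri, detections, loop_name):
    """Start a human review loop when no detection clears the confidence threshold.

    `detections` is a list of [class_id, score, xmin, ymin, xmax, ymax] entries,
    as produced by the object detection model.
    """
    top_score = max((det[1] for det in detections), default=0.0)
    if top_score >= CONFIDENCE_THRESHOLD:
        return None  # confident prediction, no human review needed
    return a2i.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=FLOW_DEFINITION_ARN,
        HumanLoopInput={
            "InputContent": json.dumps({
                "taskObject": image_s3_uri,
                "detections": detections,
            })
        },
    )
```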

Fern detection inference pipeline

The overall goal of the TNC-AWS partnership is the creation of an automated pipeline that takes as input the JP2 files and produces as output UTM coordinates of the predicted ferns. There are three main tasks:

  • The first is the ingestion of the large JP2 file and its division into smaller 512 x 512 JPG files. Each of these has an associated affine transform that can generate UTM coordinates from the pixel coordinates.
  • The second task is the actual inference and detection of potential ferns and their locations.
  • The final task assembles the inference results into a single CSV file that is delivered to TNC.

The orchestration of the pipeline was implemented using Step Functions. As with the inference itself, this choice automates many aspects of provisioning and releasing computing resources on an as-needed basis. Additionally, the pipeline architecture can be visually inspected, which makes it easier to share with the customer. Finally, as updated models become available in the future, they can be swapped in with little or no disruption to the workflow. The following diagram illustrates this workflow.
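The following sketch shows one way such a three-stage state machine could be defined and registered with Step Functions. All names, ARNs, S3 locations, and the instance type are placeholders, and the Lambda functions for tiling and CSV assembly are assumed to exist already.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Minimal sketch of the three-stage pipeline (names and ARNs are hypothetical)
definition = {
    "Comment": "Fern detection: tile JP2 -> batch inference -> assemble CSV",
    "StartAt": "TileJp2Images",
    "States": {
        "TileJp2Images": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:111122223333:function:tile-jp2-images",
            "Next": "RunBatchTransform",
        },
        "RunBatchTransform": {
            "Type": "Task",
            # Built-in Step Functions integration that waits for the transform job to finish
            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
            "Parameters": {
                "TransformJobName.$": "$.transform_job_name",
                "ModelName": "fern-detection-model",
                "TransformInput": {
                    "DataSource": {"S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://my-bucket/fern-detection/tiles/",
                    }},
                    "ContentType": "image/jpeg",
                },
                "TransformOutput": {"S3OutputPath": "s3://my-bucket/fern-detection/predictions/"},
                "TransformResources": {"InstanceType": "ml.m4.xlarge", "InstanceCount": 1},
            },
            "Next": "AssembleCsv",
        },
        "AssembleCsv": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:111122223333:function:assemble-utm-csv",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="fern-detection-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/StepFunctionsFernPipelineRole",  # hypothetical role
)
```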

When the inference pipeline was run in batch mode on a source image of 10,000 x 10,000 pixels, with an m4.large instance allocated to the SageMaker batch transform job, the whole inference workflow completed within 25 minutes. Of these, 10 minutes were taken by the batch transform, and the rest by Step Functions steps and AWS Lambda functions. TNC expects sets of up to 24 JP2 images at a time, about twice a year. By adjusting the size and number of instances used by the batch transform, we expect that the inference pipeline can be fully run within 24 hours.
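As a minimal sketch of this scaling knob, the snippet below runs the batch transform with more instances using the SageMaker Python SDK. The model name, S3 locations, and the specific instance type and count are assumptions made for illustration.

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="fern-detection-model",   # hypothetical SageMaker model name
    instance_count=2,                    # scale out to keep a 24-image set within the time budget
    instance_type="ml.m4.xlarge",        # hypothetical; larger instances reduce runtime further
    output_path="s3://my-bucket/fern-detection/predictions/",
)

transformer.transform(
    data="s3://my-bucket/fern-detection/tiles/",  # prefix containing the 512 x 512 JPG tiles
    content_type="image/jpeg",
)
transformer.wait()
```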

Fern classification

In this section, we discuss how we applied the SageMaker Principal Component Analysis (PCA) algorithm to the bounding boxes and validated the classification results.

Application of PCA to fern bounding boxes

To determine whether it is possible to distinguish between the Australian tree fern and native ferns without the substantial effort of labeling a large set of images, we implemented an unsupervised image analysis procedure. For each predicted fern, we extracted the region inside the bounding box and saved it as a separate image. Next, these images were embedded in a high-dimensional vector space by utilizing the img2vec approach. This procedure generated a 2,048-dimensional vector for each input image. These vectors were analyzed utilizing PCA as implemented in the SageMaker PCA algorithm. We retained for further analysis the top three components, which together accounted for more than 85% of the variance in the vector data.
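A rough sketch of this procedure is shown below. It approximates the img2vec embedding with a pre-trained ResNet-50 backbone whose classification head is removed (yielding 2,048-dimensional features), then fits the SageMaker PCA algorithm with three components. The file names, instance type, and choice of backbone are assumptions for illustration.

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
import sagemaker
from sagemaker import PCA

# --- 1. img2vec-style embedding: ResNet-50 without its classification layer ---
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # keep the 2,048-dimensional pooled features
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(image_path):
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        return backbone(preprocess(img).unsqueeze(0)).squeeze(0).numpy()

# Hypothetical cropped bounding-box images of individual ferns
crops = ["fern_0001.jpg", "fern_0002.jpg", "fern_0003.jpg"]
vectors = np.stack([embed(p) for p in crops]).astype("float32")

# --- 2. Fit SageMaker PCA and keep the top three components ---
session = sagemaker.Session()
pca = PCA(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",   # hypothetical instance choice
    num_components=3,
    sagemaker_session=session,
)
pca.fit(pca.record_set(vectors))
```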

For each of the top three components, we extracted the associated images with the highest and lowest scores along the component. These images were visually inspected by AWS data scientists and TNC domain experts, with the goal of identifying whether the highest and lowest scores are associated with native or invasive ferns. We further quantified the classification power of each principal component by manually labeling a small set of 100 fern images as either invasive or native and using scikit-learn to obtain metrics such as the area under the precision-recall curve for each of the three PCA components. When the PCA scores were used as inputs to a binary classifier (see the following graph), we found that PCA2 was the most discriminative, followed by PCA3, with PCA1 displaying only modest performance in distinguishing between native and invasive ferns.
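A minimal sketch of this evaluation, treating each component's score as the output of a binary classifier, might look like the following. The label and score arrays are hypothetical placeholders standing in for the roughly 100 manually labeled crops.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical placeholders for the manually labeled crops:
# y_true: 1 = invasive Australian tree fern, 0 = native fern
# pca_scores: shape (n_crops, 3), score of each crop along the top three components
y_true = np.load("fern_labels.npy")          # hypothetical file
pca_scores = np.load("fern_pca_scores.npy")  # hypothetical file

for k in range(3):
    auprc = average_precision_score(y_true, pca_scores[:, k])
    print(f"PCA{k + 1}: area under the precision-recall curve = {auprc:.2f}")
```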

Validation of classification results

We then examined images with the biggest and smallest PCA2 values with TNC domain experts to check if the algorithm can differentiate native and invasive ferns effectively. After going over 100 sample fern images, TNC experts determined that the images with the smallest PCA2 values are very likely to be native ferns, and the images with the largest PCA2 values are very likely to be invasive ferns (see the following example images). We would like to further investigate this approach with TNC in the near future.

Conclusion

The major benefits to TNC from adopting the inference pipeline proposed in this post are twofold. First, substantial cost savings are achieved by replacing months-long efforts by human labelers with an automatic pipeline that incurs minimal inference costs. Although exact costs depend on several factors, we estimate the cost reduction to be at least an order of magnitude. The second benefit is the reduction of time from data collection to the initiation of mitigation efforts. Currently, manual labeling for a dozen large JP2 files takes several weeks to complete, whereas the inference pipeline is expected to take a matter of hours, depending on the number and size of inference instances allocated. A faster turnaround time would improve the capacity of TNC to plan routes for the crews responsible for treating the invasive ferns in a timely manner, and potentially to find appropriate treatment windows considering the seasonality and weather patterns on the islands.

To get started using Ground Truth, see Build a highly accurate training dataset with Amazon SageMaker Ground Truth. Also learn more about Amazon ML by going to the Amazon SageMaker product page, and explore visual workflows for modern applications by going to the AWS Step Functions product page.


About the Authors

Dan Iancu is a data scientist with AWS. He joined AWS three years ago and has worked with a variety of customers, including in healthcare and life sciences, the space industry, and the public sector. He believes in the importance of bringing value to the customer as well as contributing to environmental preservation by utilizing ML tools.

Kara Yang is a Data Scientist in AWS Professional Services. She is passionate about helping customers achieve their business goals with AWS cloud services. She has helped organizations build ML solutions across multiple industries such as manufacturing, automotive, environmental sustainability and aerospace.

Arkajyoti Misra is a Data Scientist at Amazon LastMile Transportation. He is passionate about applying Computer Vision techniques to solve problems that help the earth. He loves to work with non-profit organizations and is a founding member of ekipi.org.

Annalyn Ng is a Senior Solutions Architect based in Singapore, where she designs and builds cloud solutions for public sector agencies. Annalyn graduated from the University of Cambridge, and blogs about machine learning at algobeans.com. Her book, Numsense! Data Science for the Layman, has been translated into multiple languages and is used in top universities as reference text.

Theresa Cabrera Menard is an Applied Scientist/Geographic Information Systems Specialist at The Nature Conservancy (TNC) in Hawai`i, where she manages a large dataset of high-resolution imagery from across the Hawaiian Islands.  She was previously involved with the Hawai`i Challenge that used armchair conservationists to tag imagery for weeds in the forests of Kaua`i.

Veronika Megler is a Principal Consultant, Big Data, Analytics & Data Science, for AWS Professional Services. She holds a PhD in Computer Science, with a focus on spatio-temporal data search. She specializes in technology adoption, helping customers use new technologies to solve new problems and to solve old problems more efficiently and effectively.
