Intelligent governance of document processing pipelines for regulated industries

Processing large volumes of documents such as PDFs and static images is a cornerstone of today’s highly regulated industries. From healthcare records like doctor-patient visit notes and medical bills, to financial documents like loan applications, tax filings, research reports, and regulatory filings, these documents are integral to how these industries conduct business. The mechanisms by which these documents are processed and analyzed, however, are often manual, time-consuming, error-prone, expensive, and not easily scalable.

Fortunately, recent innovations in this space are helping companies improve these methods. Machine learning (ML) techniques such as optical character recognition (OCR) and natural language processing (NLP) enable firms to digitize and extract text from millions of documents and understand the content, including contextual nuances of the language within them. Furthermore, you can then transform the extracted text by merging it with supplemental data to produce additional business insights.

This step-by-step method is called a document processing pipeline. The pipeline includes various components to extract, transform, enrich, and conform the data. New data domains are often introduced and used for numerous downstream business purposes. For example, in financial services, you could be identifying connected financial events, calculating environmental risk scores, and developing risk models. Because these documents help inform or even dictate such important data-driven decisions, it’s imperative for regulated industry companies to establish and maintain a robust data governance framework as part of these document processing pipelines. Without governance, pipelines become a dumping ground where documents are inconsistently stored, duplicated, and processed, and the business is unable to explain to potential auditors where the data that fed their decisions came from, or what that data was used for.

A data governance framework is made up of people, processes, and technology. It enables business users to work collaboratively with technologists to drive clean, certified, and trusted data. It consists of several components including data quality, data catalog, data ownership, data lineage, operation, and compliance. In this post, we discuss data catalog, data ownership, and data lineage, and how they tie together with building document processing pipelines for regulated industries.

For more information about design patterns on data quality, see How to Architect Data Quality on the AWS Cloud.

Data lineage

Data lineage is the part of data governance that acts as a GPS for your data. At any point in time, it can explain where the data originated, what happened to it, what its latest status is, and where it’s headed from this point on.

It provides visibility while simplifying the ability to trace financial numbers back to their origin, and provides transparency on potential errors and their root cause analyses.

Furthermore, you can use data lineage captured over time as analytical inputs to drive accuracy scores.

It’s imperative for a document processing pipeline to have a well-defined data lineage framework. The framework should include an end-to-end lifecycle, responsibility model, and the technology to enable data transformation transparency. Without lineage, the data can’t be trusted.

To illustrate this end-to-end data lineage concept, we walk you through creating an NLP-powered document search engine with built-in lineage at each step. Every object and piece of data processed by this ML pipeline can be traced back to the original document.

Each processing component can be replaced by your choice of tooling or bespoke ML model. Furthermore, you can customize the solution to include other use cases, such as central document data lakes or supplemental tabular data feed to an online transaction processing (OLTP) application.

The solution follows an event-driven architecture in which the completion of each stage within the pipeline triggers the next step, while providing self-service lineage for traceability and monitoring. In addition, hooks have been included to provide capabilities to extend the pipeline to additional workloads.

This design uses the following AWS services (you can also follow along in the GitHub repo):

  • Amazon Comprehend – An NLP service that uses ML to find insights and relationships in text, and can do so in multiple languages.
  • Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
  • Amazon DynamoDB Streams – A change data capture (CDC) service. It captures an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table.
  • Amazon Elasticsearch Service (Amazon ES) – A fully managed service that makes it possible for you to deploy, secure, and run Elasticsearch cost-effectively and at scale. You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.
  • AWS Lambda – A serverless compute service that runs code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • Amazon Simple Notification Service (Amazon SNS) – An AWS managed service for application-to-application communications, with a pub/sub model enabling high-throughput, low-latency message relaying.
  • Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
  • Amazon Simple Storage Service (Amazon S3) – An object storage service that stores your documents and allows for central management with fine-tuned access controls.
  • Amazon Textract – A fully managed ML service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond OCR to identify, understand, and extract data from forms and tables.

Architecture

The overall design is grouped into five segments:

  • Metadata services module
  • Ingestion module
  • OCR module
  • NLP module
  • Analytics module

All components interact via asynchronous events to allow for scalability. The following diagram illustrates the conceptual design.

The following diagram illustrates the physical design.

Metadata services

This is an encapsulated module to register, track, and trace incoming documents. It’s designed to be used across many different document processing pipelines. In your organization, one team might decide to use the OCR and NLP modules designed in this post. Another team might decide to use a different pipeline. However, governance practices of each pipeline should be consistent, and documents should be registered one time with full transparency on movement and downstream usage. Each document can be processed several times. You can extend the catalog and lineage services designed in this post to keep track of many pipelines, from multiple sources of data.

At the core, the metadata services module contains four reference tables, an SNS topic, three SQS queues, and three self-contained Lambda functions. Tables are created in DynamoDB, and schemas can be easily extended to include additional data attributes deemed important for your pipeline.

In addition, you can extend this design to include additional data governance components such as data quality.

The tables are defined as follows.

| Table Name | Purpose | DynamoDB Stream Enabled? | Data Governance Component | Sample Use |
| --- | --- | --- | --- | --- |
| Document Registry | Keeps track of all incoming documents. Each document is assigned a unique document ID and registered one time in this table. | Yes | Catalog | Provides the ability to quickly look up and understand the document source and context metadata. |
| Document Ownership | Covers the responsibility model of the data governance in which each document acquired by the pipeline has a defined owner. | No | Ownership | Provides notification services and can be extended to manage data quality controls. |
| Document Lineage | Keeps track of all data movements. It provides detailed lineage info that includes the source S3 bucket name, destination S3 bucket name, source file name, target file name, ARN of the AWS service that processed the document, and timestamp. | No | Lineage | A simple PartiQL query against this table based on the document ID provides a list of all steps the original document has taken. |
| Pipeline Operations | Keeps a record of all pipeline actions taken on a document ID, including the current pipeline stage and its status, and keeps a timeline of the stages in chronological order. | Yes | Operation | An operational query on a document ID determines where in the pipeline the current document processing is. |

Query output from the Document Lineage table can include the following columns:

  • Document ID
  • Original document name
  • Timestamp
  • Source S3 bucket
  • Source file name
  • Destination S3 bucket
  • Destination file name
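To make the lineage lookup concrete, the following is a minimal sketch of such a PartiQL query issued with boto3. The table and attribute names (DocumentLineage, documentId, timestamp) are illustrative placeholders rather than the exact schema used in the GitHub repo.

import boto3

dynamodb = boto3.client("dynamodb")

def get_document_lineage(document_id: str):
    """Return every recorded movement for a document, oldest first."""
    # Table and attribute names are illustrative; substitute the names used in your pipeline.
    response = dynamodb.execute_statement(
        Statement='SELECT * FROM "DocumentLineage" WHERE documentId = ?',
        Parameters=[{"S": document_id}],
    )
    items = response["Items"]
    # Each item carries the source/destination S3 buckets and file names,
    # the ARN of the service that processed the document, and a timestamp.
    return sorted(items, key=lambda item: item.get("timestamp", {}).get("S", ""))

print(get_document_lineage("7c0e9f2a-example-id"))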

DynamoDB Streams allows downstream application code to react to updates to objects in DynamoDB. It provides a mechanism to keep an event-based microservices architecture in place by triggering subsequent steps of a workflow whenever new documents are written to our Document Registry table, and subsequently when new document references are created in the Pipeline Operations table.

In addition, DynamoDB Streams provides developer teams with an efficient way of connecting your application logic to various updates in the tables (for example, to keep track of a particular document ID based on owner tags, or alert when certain unexpected problems arise while processing some documents).

The Lambda functions provide microservice APIs that the document pipeline calls to self-register its movements and the actions undertaken by the pipeline code:

  • Document Arrival Register API – Registers the incoming document’s metadata and location within the Document Registry table
  • Document Lineage API – Registers the lineage information within the Document Lineage table
  • Pipeline Operations API – Provides up-to-date information on the state of the pipeline
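As an illustration of what one of these microservices might look like, the following is a minimal sketch of a Document Lineage API handler in Python. The DynamoDB table name and attribute names are assumptions for illustration, not the exact schema used in the GitHub repo.

import time
import boto3

# Illustrative sketch of a Document Lineage API handler; the table and
# attribute names are assumptions, not the exact schema used in the repo.
lineage_table = boto3.resource("dynamodb").Table("DocumentLineage")

def lambda_handler(event, context):
    record = {
        "documentId": event["documentId"],            # unique ID assigned at registration
        "timestamp": str(time.time()),                # when this movement happened
        "sourceBucket": event["sourceBucket"],
        "sourceFileName": event["sourceFileName"],
        "destinationBucket": event["destinationBucket"],
        "destinationFileName": event["destinationFileName"],
        "processorArn": event["processorArn"],        # ARN of the service/function that moved the data
    }
    lineage_table.put_item(Item=record)
    return {"statusCode": 200, "body": "lineage registered"}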

The SNS topic is used as a sink for incoming messages from all pipeline movements and document registrations. It disseminates the messages to each downstream subscribed SQS queue according to what type of message was received. In this model, the number of consumers of the messages coming through the SNS topic could be greatly expanded as needed, and all messages are guaranteed to stay in order, because both the SNS topics and SQS queues are created in a First-In-First-Out (FIFO) configuration to prevent duplicates and maintain single-threaded processing in the pipeline.

Using Amazon SNS in the design provides scalability by creating a pub/sub architecture. A pub/sub architecture design is a pattern that provides a framework to decouple the services that produce an event from services that process the event. Many subscribers can subscribe to the same event and trigger different pipelines. For example, this design can easily be extended to process incoming XML file formats by subscribing an additional XML process pipeline for the same event.
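The following sketch shows how a pipeline step could publish an event to such a FIFO topic with boto3. The topic ARN is a placeholder, and using the document ID as the message group ID is one way to preserve per-document ordering.

import json
import uuid
import boto3

sns = boto3.client("sns")

# Hypothetical FIFO topic ARN; replace with the topic created by the metadata services stack.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:document-pipeline.fifo"

def publish_pipeline_event(document_id: str, event_type: str, payload: dict):
    """Publish a pipeline event; grouping by document ID keeps per-document ordering."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"documentId": document_id, "eventType": event_type, **payload}),
        MessageGroupId=document_id,               # FIFO ordering per document
        MessageDeduplicationId=str(uuid.uuid4()),  # or rely on content-based deduplication
    )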

The following table provides schema information. The document ID is identical and unique for each document and is part of the composite primary key used to identify movement of each document within the pipeline.

The following diagram shows the architecture of our metadata services.

Ingestion module

The ingestion workload is triggered when a new document is uploaded to the NLP/Raw S3 bucket (or the bucket where raw documents are placed from users or front-end applications).

The ingestion module follows a four-step process (as shown in the following diagram):

  1. A document is uploaded to the NLP/Raw S3 bucket.
  2. The Document Registrar Lambda function is invoked, which calls the metadata services API to register the document and receive a unique ID. This ID is added to the document as a tag, and the metadata is registered within the DynamoDB table Document Registry.
  3. After the document metadata is registered with Metadata Services, the DynamoDB Document Registration stream is invoked to start the Document Classification Lambda function. This function examines the metadata registered on the document and determines if the downstream OCR segment should be invoked on this document. The result of this examination is written back to the metadata services.
  4. The metadata registration of the previous step invokes the DynamoDB Pipeline Operations stream, which invokes the Document Extension Detector Lambda function. This function examines the incoming file formats and separates the image files from the PDF documents.

All steps are registered in metadata services. The red dotted lines in the following diagram represent the metadata asynchronous API calls.
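The following is a simplified sketch of what the Document Registrar function could look like. The table name, attribute names, and tagging scheme are illustrative assumptions, not the exact implementation in the GitHub repo.

import uuid
import boto3

s3 = boto3.client("s3")
registry_table = boto3.resource("dynamodb").Table("DocumentRegistry")  # illustrative name

def lambda_handler(event, context):
    # Triggered by an s3:ObjectCreated event on the NLP/Raw bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        document_id = str(uuid.uuid4())  # unique ID for the document's whole lifecycle

        # Register the document in the catalog.
        registry_table.put_item(Item={
            "documentId": document_id,
            "bucket": bucket,
            "key": key,
            "eventTime": record["eventTime"],
        })

        # Tag the object so every downstream step can trace it back to the registry entry.
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "documentId", "Value": document_id}]},
        )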

OCR module

This module detects the incoming file format and uses Amazon Textract in this implementation to convert the incoming documents into text. Amazon Textract can process image files synchronously, and PDF and other documents asynchronously, to allow time for the service to complete its analysis.

The OCR module consists of the following process, as illustrated in the architecture diagram:

  1. Image files are uploaded to the NLP/image S3 bucket and the Sync Processor Lambda function is invoked. The function synchronously points Amazon Textract to the S3 location of the image file, and waits for a response.
  2. Amazon Textract transforms the format to text and deposits the text output in the NLP/Textract S3 bucket. This step concludes OCR processing of the image file types.
  3. PDF files are placed within the NLP/PDF S3 bucket. This bucket invokes the Async Processor Lambda function. This function submits the document to Amazon Textract for asynchronous processing and registers its state with the metadata services.
  4. When the Amazon Textract document analysis is complete, an SNS message is sent to a specified SNS topic, notifying downstream consumers of the job completion. In this implementation, an SQS queue captures that message.
  5. The SQS queue message is the event that triggers the Result Processor Lambda function.
  6. The function extracts the results of document analysis from Amazon Textract and formats it according to the type of text it analyzed (forms, tables, and raw text).
  7. The results are pushed to the NLP/Textract S3 bucket, page by page for every type of text, and as a complete JSON response.

All the progress is registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
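The following sketch illustrates the two Amazon Textract invocation patterns used by this module: synchronous for images and asynchronous for PDFs. The bucket names, SNS topic, and IAM role ARNs are placeholders.

import boto3

textract = boto3.client("textract")

# Hypothetical ARNs; use the SNS topic and IAM role provisioned for the pipeline.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:textract-job-complete"
ROLE_ARN = "arn:aws:iam::123456789012:role/TextractPublishRole"

def process_image_sync(bucket: str, key: str) -> str:
    """Synchronous path for image files: returns the extracted raw text."""
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)

def process_pdf_async(bucket: str, key: str) -> str:
    """Asynchronous path for PDFs: Textract notifies the SNS topic when the job completes."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={"SNSTopicArn": TOPIC_ARN, "RoleArn": ROLE_ARN},
    )
    return response["JobId"]  # the Result Processor function later calls get_document_text_detection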

NLP module

This module detects key phrases and entities within the document by using the text output from the OCR module. A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, “day” is a noun; “a beautiful day” is a noun phrase that includes an article (“a”) and an adjective (“beautiful”).

After key phrases are extracted, indexing them in an analytical tool lets you find the relevant documents quickly and accurately. For example, if you want to analyze corporate social responsibility (CSR) reports, you can find attributes such as “reducing carbon footprints,” “improving labor policies,” “participating in fair-trade,” and “charitable giving” by indexing the results of this module.

We use Amazon Comprehend to perform this function in this pipeline. However, as we explained earlier, you can easily swap the tooling used for this design with your preferred tool. For example, you can replace Amazon Comprehend with an Amazon SageMaker custom model as an alternative to extract key phrases and entities in a more domain-focused way. SageMaker is an ML service that you can use to build, train, and deploy ML models for virtually any use case.

Amazon Comprehend is called on a synchronous basis to extract key phrases in the following steps (as illustrated in the following diagram):

  1. The incoming text file uploaded to the NLP/Textract S3 bucket invokes the Sync Comprehend Processor Lambda function.
  2. The function feeds the incoming file to Amazon Comprehend for processing.
  3. The results from Amazon Comprehend, in JSON format, are deposited in the NLP/JSON S3 bucket.
  4. The results from Amazon Comprehend are sent to Amazon ES, the service we incorporate as our document search engine.

All steps are registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
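As a minimal illustration of the synchronous Amazon Comprehend calls, the following sketch detects key phrases and entities for a page of extracted text and shapes a document for indexing. The field names are assumptions, and per-request size limits apply, so long pages may need to be split.

import boto3

comprehend = boto3.client("comprehend")

def extract_key_phrases_and_entities(text: str, document_id: str) -> dict:
    """Detect key phrases and entities for one page of OCR text."""
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")["KeyPhrases"]
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"]

    # Shape of the document sent to the search index; field names are illustrative.
    return {
        "documentId": document_id,
        "keyPhrases": [p["Text"] for p in key_phrases],
        "entities": [{"text": e["Text"], "type": e["Type"]} for e in entities],
    }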

Analytics module

This module is responsible for the consumption and analytics segment of the pipeline. The steps are illustrated in the following diagram:

  1. The output from Amazon Comprehend, in JSON format, is fed to Amazon Neptune. Neptune allows end users to discover relationships across documents. This is an example of a downstream analytics application that is not implemented in this post.
  2. The end users have access to the original document in four formats (CSV, JSON, original, text), and can search key phrases using Amazon ES. They can identify relationships using Neptune. A JSON version of the document is available in the NLP/JSON S3 bucket. The original document is available in the NLP/Raw S3 bucket.
  3. Full lineage can be obtained from the Document Lineage table in DynamoDB.

The analytics module has many potential implementations. For example, you can use a relational datastore like Amazon Relational Database Service (Amazon RDS) or Amazon Aurora to analyze extracted tabular data using SQL.
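For example, once key phrases are indexed, a search could look like the following sketch. The domain endpoint, index name, and field names are placeholders, and request signing or authentication (which an Amazon ES domain typically requires) is omitted for brevity.

import requests

# Hypothetical domain endpoint and index; substitute your Amazon ES domain and index name.
ES_ENDPOINT = "https://search-nlp-demo.us-east-1.es.amazonaws.com"

def search_documents(phrase: str):
    """Find documents whose indexed key phrases match a search term."""
    query = {
        "query": {"match": {"keyPhrases": phrase}},
        "_source": ["documentId", "keyPhrases"],
    }
    response = requests.post(f"{ES_ENDPOINT}/documents/_search", json=query, timeout=10)
    return [hit["_source"] for hit in response.json()["hits"]["hits"]]

print(search_documents("charitable giving"))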

Conclusion

In this post, we architected an end-to-end document processing pipeline using AWS managed ML services. In addition, we introduced metadata services to help organizations create a centralized document repository to store documents one time but process them multiple times. A data governance framework as illustrated in this design provides you with the necessary guardrails to ensure documents are governed in a standard fashion across the organization, while giving lines of business the autonomy to choose their own NLP and OCR models and tooling.

The architecture discussed in this post has been coded and is available for deployment in the GitHub repo. You can download the code and create your pipeline within a few days.


About the Authors

  David Kheyman is a Solutions Architect at Amazon Web Services based out of New York City, where he designs and implements repeatable AWS architecture patterns and solutions for large organizations.

 

 

Mojgan Ahmadi is a Principal Solutions Architect with Amazon Web Services based in New York, where she guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience on Software Development and Architecture, Data Governance and Engineering, and IT Management.

 

Anirudh Menon is a Solutions Architect with Amazon Web Services based in New York, where he helps financial services customers drive innovation with AWS solutions and industry-specific patterns.

Read More

Announcing the AWS DeepComposer Chartbusters challenges 2021 season launch

We’re back with two new challenges for the AWS DeepComposer Chartbusters 2021 season! Chartbusters is a global challenge in which developers use AWS DeepComposer to create original compositions and compete in monthly challenges to showcase their machine learning (ML) and generative artificial intelligence (AI) skills. Regardless of your background in music or ML, one of the two new challenges will be right for you.

You can choose between two different challenges this season. In the basic challenge, Melody-Go-Round, you can use any of the generative AI models available in the AWS DeepComposer Music studio to create new compositions. In the advanced challenge, Melody Harvest, you train a custom generative AI model with your own dataset using Amazon SageMaker. In this challenge, you can dive deeper into the mechanics of data preparation, model training, and evaluation to teach a model to play your favorite style of music.

The 2021 season runs through October 31, 2021. Winners of each challenge are selected on the last day of each month, and we’ll feature the winners in an AWS Machine Learning Blog post. Monthly winners of the Melody Harvest challenge will also win a ticket to AWS re:Invent 2021. To participate, go to the AWS DeepComposer console and choose the Chartbusters challenge that’s right for you in the navigation pane.

Compete in the Melody-Go-Round challenge

You can compete in the AWS DeepComposer Chartbusters Melody-Go-Round challenge in just a few simple steps:

  1. In the AWS DeepComposer Music studio, record a track, import a track, or pick any of the available input tracks.

  2. Get creative and explore different combinations of available models. You can also explore advanced parameters under each model.

  3. Use the Edit melody feature to add or remove notes, or change the note duration and pitch. When finished, choose Apply changes. You can iterate by adjusting the advanced parameters and choosing Enhance again. Repeat these steps until you’re satisfied with the generated music.

You can also download the melody and import it into a digital audio workstation like GarageBand and further indulge your creativity.

  4. When your melody is complete, go to the submission form and choose an existing composition or import a post-processed audio track. Choose Melody-Go-Round for the competition type, register or sign in to SoundCloud, and choose Submit.

For more information on judging criteria, visit the AWS DeepComposer Melody-Go-Round page.

Compete in the Melody Harvest challenge

  1. Explore our GitHub pages for Generative Adversarial Networks (GANs), Autoregressive Convolutional Neural Networks (AR-CNNs), and Transformers. Then train your own model and start composing your music.
  2. You can upload the generated MIDI file to a digital audio workstation like GarageBand and further improve it.
  3. When your melody is complete, go to the submission form, choose Melody Harvest for the competition type, import a postprocessed audio track, and add the link to your GitHub repository. Make sure your GitHub repository has your notebook and your model’s checkpoint files.

For more information on datasets and judging criteria, visit the AWS DeepComposer Melody Harvest page.

Conclusion

Congratulations! You have successfully submitted your composition to the AWS DeepComposer Chartbusters challenge. Now you can invite your friends and family to listen to your creation on SoundCloud, vote for their favorite, and join the fun by participating in the competition.

Although you don’t need a physical keyboard to compete, we’re offering the AWS DeepComposer keyboard at a special price of $69.00 (30% off) for a limited time on Amazon.com to improve your music generation experience. The pricing includes the keyboard and 3 months of the AWS DeepComposer free trial. To learn more about the different generative AI techniques supported by AWS DeepComposer, check out the learning capsules available on the AWS DeepComposer console.


About the Authors

Maryam Rezapoor is a Senior Product Manager with AWS AI Devices team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

 

 

 Chris Whittam is a Senior Product Manager on the AWS AI Devices team helping developers get hands on (literally) with machine learning.

Read More

AWS DeepRacer device software now open source

AWS DeepRacer is the fastest way to get started with machine learning (ML). You can train reinforcement learning (RL) models by using a 1/18th scale autonomous vehicle in a cloud-based virtual simulator and compete for prizes and glory in the global AWS DeepRacer League. Today, we’re expanding AWS DeepRacer’s ability to provide fun, hands-on learning by open-sourcing the AWS DeepRacer device software.

Why open source

The AWS DeepRacer virtual and in-person leagues have been a hit, but now developers want to go beyond league racing with their car. Because the AWS DeepRacer is an Ubuntu-based computer on wheels powered by the Robot Operating System (ROS), we are able to open source the code, making it straightforward for a developer with basic Linux coding skills to prototype new and interesting uses for their car. Now that the AWS DeepRacer device software is openly available, anyone with the car and an idea can make new uses for their device a reality.

We’ve compiled six sample projects from the AWS DeepRacer team and members of the global AWS DeepRacer community to help you get started exploring the possibilities that open source provides. As developers share new projects using #deepracerproject, we will highlight our favorites on the AWS DeepRacer robotics projects page. Whether you’re mounting a Nerf cannon on the car with the DeepBlaster project, creating visualizations of your home or office with the Mapping project, or coming up with new ways of racing your friends and colleagues with the DeepDriver project, you can do all that and more with the open source code and sample projects. Documentation is available in GitHub and open for collaboration with thousands of community members in the AWS DeepRacer Slack channel. The only limit to what you can do with AWS DeepRacer is your imagination (and, well, the laws of physics).

Let the experiments begin

With the open-sourcing of the AWS DeepRacer device code, you can quickly and easily change the default behavior of your currently track-obsessed race car. Want to block other cars from overtaking it by deploying countermeasures? Want to deploy your own custom algorithm to make the car go faster from point A to B? You just need to dream it and code it. We can’t wait to see the ideas that you come up with, from new racing formats to new uses for AWS DeepRacer.

Starting today, you can choose from six projects (Follow the Leader, Mapping, and Off Road created by AWS, and RoboCat, DeepBlaster, and DeepDriver created by the open source community) or create your own. You can get started with the Follow the Leader sample project, which trains the car to detect and follow an object. It’s the quickest project to build and run, and in the next section we’ll demonstrate how easy it is to modify the default behavior of your AWS DeepRacer car. To complete this setup, upgrade to the latest software version and access the car via SSH.

Download the Follow the Leader project

Connect to the car using SSH, switch to the root user, and create a working directory. Then clone the Follow the Leader GitHub repository:

sudo su
mkdir -p ~/deepracer_ws
cd ~/deepracer_ws
git clone https://github.com/aws-deepracer/aws-deepracer-follow-the-leader-sample-project.git

The process to fully clone the project repository to your car can take a few minutes (depending on the speed of your internet connection). The Follow the Leader project contains several installation scripts that help shortcut the process to get you up and running faster. You can also complete the next few steps manually if you’re more comfortable with running shell-based commands or want to learn more about the process using the links to the relevant documentation for each stage.

Download and convert the object detection model

First, we need to download and convert the object detection model. To do this, we run the script that came in the Follow the Leader repository:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash install_object_detection_model.sh

The installer script downloads and optimizes the model before copying the optimized artifacts to the model location. This process takes approximately 3–4 minutes to complete.

You can complete this stage manually using the detailed instructions to download and convert the object detection model.

Initialize rosdep if it hasn’t been initialized previously

rosdep helps install the dependency packages. Initialize rosdep if it hasn’t been done on the device before:

sudo rosdep init
sudo rosdep update

Build the Follow the Leader packages

Next, we fetch the package dependencies needed for the project and build them:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash build_and_install_ftl_application.sh

When successful, you should see a screen similar to the following:

The script downloads and installs the required package dependencies and builds the packages. This process can take approximately 8–10 minutes to complete.

You can also complete this stage manually by following steps 1–10 in “Download and Building” in the Follow the Leader README.md. The install script performs the same steps (it just saves you some typing).

Launch the Follow the Leader application

Now we run the Follow the Leader application:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash run_ftl_application.sh

Enable Follow the Leader mode

Finally, we need to open another SSH session to the car to enable Follow the Leader mode using the command line interface (CLI):

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash enable_ftl_mode.sh

Now you, or a willing volunteer (or object), can move around and watch the car begin to follow! How cool is that?

Share your results

Congratulations! You completed your first sample project. Share your experience with friends and family on social media with the tag #deepracerproject so we can see what you’re up to. As the community invents new projects for AWS DeepRacer, we’ll be adding them to the AWS DeepRacer GitHub Organization as well as featuring them in future blog posts so that everyone can get inspired. Purchase an AWS DeepRacer car to start experimenting with your first AWS DeepRacer robotics project today! We’re offering a 25% discount on the AWS DeepRacer ($100 off) and AWS DeepRacer Evo ($150 off) until May 27, 2021.


About the Author

David Smith is a Sr. Solutions Architect for AWS DeepRacer. He is passionate about AWS DeepRacer, technology as an enabler and learning. Outside of work he’s into Formula 1, flying (and crashing) drones, 3d printing, running (Parkrun), tinkering with code and spending time with the family.

Read More

Monitor and Manage Anomaly Detection Models on a fleet of Wind Turbines with Amazon SageMaker Edge Manager

In industrial IoT, running machine learning (ML) models on edge devices is necessary for many use cases, such as predictive maintenance, quality improvement, real-time monitoring, process optimization, and security. The energy industry, for instance, invests heavily in ML to automate power delivery, monitor consumption, optimize efficiency, and extend the lifetime of their equipment.

Wind energy is one of the most popular renewable energy sources. According to the Global Wind Energy Council, 22,893 wind turbines were installed globally in 2019, produced by 33 suppliers and accounting for over 63 GW of wind power capacity. With such scale, energy companies need an efficient platform to manage and maintain their wind turbine fleets, and the ML models running on the devices. A commercial wind turbine costs around $3–4 million. If a turbine is out of service, it costs $800–1,600 per day and results in a total loss of 7.5 megawatts, which is enough energy to power approximately 2,500 homes.

A wind turbine is a complex piece of engineering and consists of many sensors that can be used by a monitoring mechanism to capture data such as vibration, temperature, wind speed, and air humidity. You could train an ML model with this data, deploy it to an edge device connected to the turbine’s sensors, and predict anomalies in real time at the edge. It would reduce the operational cost of your fleet of turbines. But imagine the effort to maintain this solution on a fleet of thousands or millions of devices. How do you operate, secure, deploy, run, and monitor ML models on a fleet of devices at the edge?

Amazon SageMaker Edge Manager can help you to answer this question. The service allows you to optimize, secure, monitor, and maintain ML models on fleets of smart cameras, robots, personal computers, industrial equipment, mobile devices, and more. With Edge Manager, you can manage the lifecycle of each ML model on each device in your device fleets for up to thousands or millions of devices. The service provides a software agent that runs on edge devices and a management interface on the AWS Management Console.

In this post, we show how to use Edge Manager to create a robust end-to-end solution that manages the lifecycle of ML models deployed to a wind turbine fleet. But instead of using real wind turbines, you learn how to build your own fleet of mini 3D printed wind turbines. This is a DIY open-source, open-hardware project created to demonstrate how to build an ML-at-the-edge solution with Amazon SageMaker. You can use it as a platform to learn, experiment, and get inspired.

The next sections cover the following topics:

  • The specifications of the wind turbine farm
  • How to configure each Jetson Nano
  • How to build an anomaly detection model using SageMaker
  • How to run your own mini wind turbine farm

The wind turbine farm

The wind turbine farm created for this project has five mini 3D printed wind turbines connected to five distinct Jetson Nanos via USB. The Jetson Nanos are connected to the internet through Ethernet cables plugged to a cable modem. A fan, positioned in front of the farm, produces the wind to simulate an outdoor condition. The following image shows how the wind farm is organized.

The mini wind turbine

The mini wind turbine of this project is a mechanical device integrated with a microcontroller (Arduino) and some sensors. It was modeled using FreeCAD, an open-source tool for designing industrial parts. These parts were then 3D printed using PETG (plastic filament type) and assembled with the electronics components. Its base is static, which means that the turbine doesn’t align with the wind direction by itself. This restriction was important to simplify the project.

Each turbine has one voltage generator (small motor) and seven different sensors:

  • Vibration (MPU6050: 6 axis accelerometer/gyroscope)
  • Infrared rotation encoder (rotations per second)
  • Gearbox temperature (MPU6050)
  • Ambient temperature (BME680)
  • Atmospheric pressure (BME680)
  • Air humidity (BME680)
  • Air quality (BME680)

An Arduino Mini Pro is responsible for interfacing with these sensors and collecting data from them. This data is streamed through the serial pins (TX, RX). An FTDI device that converts this serial signal to USB is the bridge between the Arduino and the Jetson Nano. A Python application that runs on the Jetson Nano receives the raw data from the sensors through this bridge.

A micro servo was modified and transformed into a voltage generator. Its internal gearbox increases the generator (motor) speed by five times to produce a (low) voltage between 0–3.3V. This generator is also connected to the Arduino through an analog input pin. This information is also sent with the sensor’s readings.

The frequency at which the data is collected depends on the sensor. All the signals from the BME680 are collected every 150 milliseconds, the rotation encoder every 1 second, and the voltage generator and the vibration sensor every 50 milliseconds.

If you want to know more about these technical details and learn how to build your own mini wind turbine, see the GitHub repository.

The edge device

Each Jetson Nano has a built-in GPU with 128-core NVIDIA Maxwell™ and a Quad-core ARM® A57 CPU running at 1.43 GHz. This hardware is enough to run a Python application that collects and formats the data from the sensors of the turbine and then calls the Edge Manager agent API to get the predictions. This application compares the prediction with a threshold to check for anomalies in the data. The model is invoked in real time.

When SageMaker Neo compiles the ML model for Jetson Nano, a runtime (DLR) optimized for this target device is included in the deployment package. This runtime detects automatically that it’s running on a Jetson Nano and loads the model directly into the device’s GPU for maximum performance.
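For reference, loading and invoking a Neo-compiled model with the DLR runtime looks roughly like the following sketch. The model path, input name, and tensor shape are illustrative (the shape matches the 6x10x10 input described later in this post).

import numpy as np
from dlr import DLRModel  # runtime packaged alongside the Neo-compiled model

# Illustrative path and device; on a Jetson Nano the model can be loaded onto the GPU.
model = DLRModel("/home/user/agent/models/windturbine", dev_type="gpu", dev_id=0)

# Dummy input with the anomaly detection model's expected shape (batch, features, steps, steps).
input_tensor = np.random.rand(1, 6, 10, 10).astype(np.float32)
prediction = model.run({"input": input_tensor})[0]  # input name is an assumption
print(prediction.shape)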

The Edge Manager agent is also distributed as a Linux (arm64) application that can be run as a background process (daemon) on your Jetson Nano. It uses the runtime SageMaker Neo includes in the compilation package to interface with the optimized model and expose it as a well-defined API. This API is integrated with the local application through a low latency protocol (grpc + unix socket).

The cloud services

Now that you know some details about the physical hardware used to develop the wind turbine farm, it’s time to see which AWS services support the solution on the cloud side. A minimal, standalone setup to get a model deployed and running on the Edge Manager agent requires only SageMaker and nothing more. However, other services were used in this project to provide two important features: a mechanism for over-the-air (OTA) deployment and a dashboard for monitoring the anomalies in near-real time.

In summary, the components required for this project are:

  • A device fleet (Edge Manager), which organizes and controls one or more registered devices through the agent (running on each device)
  • One IoT thing per device and IoT thing group, which is used by the OTA mechanism to communicate with the devices via MQTT
  • AWS IoT rules, and an AWS Lambda function to get and filter application logs and ingest them into Amazon Elasticsearch Service (Amazon ES)
  • A Lambda function to parse the model metrics captured by the agent and ingest them into Amazon ES
  • An Elasticsearch server with Kibana, which has dashboards for monitoring the anomalies (optional)
  • SageMaker to build, compile, and package the ML model

The following diagram illustrates this architecture.

Putting everything together

Now that we have all the components of our wind turbine farm, it’s time to understand the steps we need to take to integrate all these moving parts, deploy a model to our edge devices, and keep an application running and predicting anomalies in real time.

The following diagram shows all the steps involved in the process.

The solution consists of the following steps:

  1. The data scientist explores the dataset and designs an anomaly detection model (autoencoder) with PyTorch, using SageMaker Studio.
  2. The model is trained with a SageMaker training job.
  3. With Neo, the model is optimized (compiled) to Jetson Nano.
  4. Edge Manager creates a deployment package with the compiled model.
  5. The data scientist creates an IoT job that sends a notification of the new model available to the edge devices.
  6. The application running on Jetson Nano performs the following:
    1. Receives this notification and downloads the model package from the Amazon Simple Storage Service (Amazon S3) bucket.
    2. Unpacks the model and loads it using the Edge Manager agent API (LoadModel).
    3. Reads the sensors from the wind turbine, prepares the data, invokes the ML model, and captures some model metrics using the Edge Manager agent API.
    4. Compares the prediction with a baseline to detect potential anomalies.
    5. Sends the raw sensor data to an AWS IoT topic.
  7. Through a rule, AWS IoT reads the app logs topic and exports the data to Amazon ES.
  8. A Lambda function captures the model metrics (mean absolute error) exported by the agent and ingests the data into Amazon ES.
  9. The operator uses a Kibana dashboard to check for any anomalies.

Configure your edge device

The Edge Manager agent uses certificates provided by AWS IoT Core to authenticate and call other AWS services. Therefore, you need to create an IoT thing first and then an edge device fleet. But first, you need to prepare some basic resources to support your solution.

Create prerequisite resources

Before getting started, you need to configure the AWS Command Line Interface (AWS CLI) on your workstation first (if necessary) and then create the following resources:

  • An S3 bucket to store the captured data
  • An AWS Identity and Access Management (IAM) role for your devices
  • An IoT thing to map to your Edge Manager device
  • An IoT policy to control the permissions of the temporary credentials of the edge device
  1. Create a new bucket for the solution.

Each time you call CaptureData in the agent API, it uploads the tensors (input and predictions) into this bucket.

Next, you create your IAM role.

  1. On the IAM console, create a role named WindTurbineFarm so the devices can access resources in your account.
  2. Add permissions to this role to upload files to the S3 bucket you created.
  3. Add the following trusted entities to the role:
    1. credentials.iot.amazonaws.com
    2. iot.amazonaws.com
    3. sagemaker.amazonaws.com

Use the following code (provide the name for the S3 bucket, your AWS account, and Region):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<<S3_BUCKET_NAME>>",
                "arn:aws:s3:::<<S3_BUCKET_NAME>>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iot:CreateRoleAlias",
                "iot:DescribeRoleAlias",
                "iot:UpdateRoleAlias",
                "iot:TagResource",
                "iot:ListTagsForResource"
            ],
            "Resource": [
                "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:rolealias/SageMakerEdge*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*SageMaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*Sagemaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*sagemaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/WindTurbineFarm"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:GetDeviceRegistration",
                "sagemaker:SendHeartbeat",
		  "iot:DescribeEndpoint",
		  "s3:ListAllMyBuckets”
            ],
            "Resource": "*",
            "Effect": "Allow"
        }, 
        {
            "Action": [
                "sagemaker:DescribeDevice"
            ],
            "Resource": [
                "arn:aws:sagemaker:<<REGION>>:<<AWS_ACCOUNT_ID>>:device-fleet/windturbinefarm*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iot:Publish",
                "iot:Receive"
            ],
            "Resource": [
                "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/wind-turbine/*"
            ],
            "Effect": "Allow"
        }
    ]
}

You’re now ready to create your IoT thing, which you later map to your Edge Manager device.

  1. On the AWS IoT Core console, under Manage, choose Things.
  2. Choose Create.
  3. Name your device (for this post, edge-device-0).
  4. Create a new group or choose an existing group (for this post, WindTurbineFarm).
  5. Create a certificate.
  6. Download the certificates, including the root CA.
  7. Activate the certificate.

You now create your policy, which controls the permissions of the temporary credentials of the edge device.

  1. On the AWS IoT Core console, under Secure, choose Policies.
  2. Choose Create.
  3. Name the policy (for this post, WindTurbine).
  4. Choose Advanced Mode.
  5. Enter the following policy, providing your AWS account and Region:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iot:Connect"
      ],
      "Resource": "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:client/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Publish",
        "iot:Receive"
      ],
      "Resource": [
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/wind-turbine/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/$aws/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Subscribe"
      ],
      "Resource": [
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/wind-turbine/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/$aws/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/$aws/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:UpdateThingShadow"
      ],
      "Resource": [
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/wind-turbine/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:thing/edge-device-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iot:AssumeRoleWithCertificate",
      "Resource": "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:rolealias/SageMakerEdge-WindTurbineFarm"
    }
  ]
}
  6. Choose Create.

Lastly, you attach the policy to the certificate.

  1. On the AWS IoT Core console, under Secure, choose Certificates.
  2. Select the certificate you created.
  3. On the Actions menu, choose Attach policy.
  4. Select the policy WindTurbine.
  5. Choose Attach.

Now your IoT thing is ready to be linked to an edge device. Repeat these steps (except for creating the policy) for each additional device in your device fleet. For a production environment with hundreds or thousands of devices, you would apply a different approach, using automated scripts and parameter files to provision all the IoT things.

Create the edge fleet

To create your edge fleet, complete the following steps:

  1. On the SageMaker console, under Edge Inference, choose Edge device fleets.
  2. Choose Create device fleet.

  1. Enter a name for the device fleet (for this post, WindTurbineFarm).
  2. Enter the ARN of the IAM role you used in the previous steps (arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/WindTurbineFarm).
  3. Enter the output S3 bucket URI (s3://<<NAME_OF_YOUR_BUCKET>>/wind_turbine_data/).
  4. Choose Submit.

Now you need to add a new device to the fleet.

  1. On the SageMaker console, under Edge Inference, choose Edge devices.
  2. Choose Register devices.

  1. For Device Properties, enter the name of the device fleet you created (WindTurbineFarm).
  2. Choose Next.
  3. For Device name, enter any unique name for your device (for this post, edge-device-wind-turbine-00000000000).
  4. For IoT name, enter the name of the thing you created earlier (edge-device-0).
  5. Choose Submit.

Repeat the registering process for all your other devices. Now you can SSH to your Jetson Nano and complete the configuration of your device.

Prepare the edge device

Before you start configuring your Jetson Nano, you need to install JetPack 4.4.1 on your Nano. This is the version you use to build, run, and test this demo.

The model preparation process for your target device is very sensitive to the versions of the libraries installed on your device. For instance, because the target device is a Jetson Nano, Neo optimizes the model and runtime for a given version of TensorRT and CUDA. The runtime (libdlr.so) is physically linked to the versions you specify in the compilation job. This means that if you compile your model using Neo for JetPack 4.4.1, it doesn’t work with JetPack 3.x, and vice versa.

  1. With JetPack 4.4.1 running on your Jetson Nano, you can start configuring your device with the following commands:
echo "export TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647" >> ~/.bashrc
echo "export SM_EDGE_AGENT_HOME=/home/${USER}/agent" >> ~/.bashrc

# Also export the variables for the current session
export TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647
export SM_EDGE_AGENT_HOME=/home/${USER}/agent


sudo apt install -y protobuf-compiler python3-serial 
sudo apt install -y python3-pip python3-joblib python3-boto3 libssl-dev
sudo apt install -y curl
sudo pip3 install grpcio-tools grpcio PyWavelets paho-mqtt
  1. Download the Linux ARMv8 version of the Edge Manager agent.
  2. Copy the package to your Jetson Nano (scp). Create a folder for the agent and unpack the package in your home directory:
mkdir -p ~/agent/certificates/iot
mkdir -p ~/agent/certificates/root
tar -xzvf <<agent_package>>.tgz -C ~/agent
  1. Copy the AWS IoT Core certificates you provisioned for your thing in the previous section to the directory ~/agent/certificates/iot in your Jetson Nano.

You should see the following files in this directory:

  • AmazonRootCA1.pem – CA root
  • <<CERT_PREFIX>>-public.pem.key – Public key
  • <<CERT_PREFIX>>-private.pem.key – Private key
  • <<CERT_PREFIX>>-certificate.pem.crt – Certificate
  1. Get the root certificate used to sign the deployment package created by Edge Manager. The agent uses this to validate the model.
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Certificates/<<AWS_REGION>>/<<AWS_REGION>>.pem .
  1. Copy this certificate to the directory ~/agent/certificates/root in your Jetson Nano.

Next, you create the Edge Manager agent configuration file.

  1. Open an empty file named ~/agent/sagemaker_edge_config.json and enter the following code:
{
    "sagemaker_edge_core_device_uuid": "<<SAGEMAKER_EDGE_DEVICE_NAME>>",
    "sagemaker_edge_core_device_fleet_name": "WindTurbineFarm",
    "sagemaker_edge_core_capture_data_buffer_size": 30,
    "sagemaker_edge_core_capture_data_batch_size": 10,
    "sagemaker_edge_core_capture_data_push_period_seconds": 4,
    "sagemaker_edge_core_folder_prefix": "wind_turbine_data",
    "sagemaker_edge_core_region": "<<AWS_REGION>>",
    "sagemaker_edge_core_root_certs_path": "/home/<<LINUX_USER>>/agent/certificates/root",
    "sagemaker_edge_provider_aws_ca_cert_file": "/home/<<LINUX_USER>>/agent/certificates/iot/AmazonRootCA1.pem",
    "sagemaker_edge_provider_aws_cert_file": "/home/<<LINUX_USER>>/agent/certificates/iot/<<CERT_PREFIX>>-certificate.pem.crt",
    "sagemaker_edge_provider_aws_cert_pk_file": "/home/<<LINUX_USER>>/agent/certificates/iot/<<CERT_PREFIX>>-private.pem.key",
    "sagemaker_edge_provider_aws_iot_cred_endpoint": "https://<<CREDENTIALS_ENDPOINT_HOST>>/role-aliases/SageMakerEdge-WindTurbineFarm/credentials",
    "sagemaker_edge_provider_provider": "Aws",
    "sagemaker_edge_provider_s3_bucket_name": "<<S3_BUCKET>>",
    "sagemaker_edge_core_capture_data_destination": "Cloud"
}

Provide the information for the following resources:

  • SAGEMAKER_EDGE_DEVICE_NAME – The unique name of your device you defined previously.
  • AWS_REGION – The Region where you created your edge device.
  • LINUX_USER – The Linux user name you’re using in Jetson Nano.
  • CERT_PREFIX – The prefix of the certificate files you created when you provisioned your IoT thing in the previous section.
  • CREDENTIALS_ENDPOINT_HOST – Your endpoint host. You can get this endpoint through the AWS Command Line Interface (AWS CLI). (Install the AWS CLI if you don’t have it already). Use credentials of the same account and the same Region you used in the previous sections (this isn’t the IoT thing shadow URL). Then run the following command to retrieve the endpoint host:
aws iot describe-endpoint --endpoint-type iot:CredentialProvider
  • S3_BUCKET – The name of the S3 bucket you used to configure your edge device fleet in the previous section.
  1. Save the file with all these modifications.

Now you’re ready to run the Edge Manager agent in your Jetson Nano.

  1. To test the agent, run the following commands:
cd ~/agent
rm -f /tmp/edge_agent
./bin/sagemaker_edge_agent_binary -c sagemaker_edge_config.json -a /tmp/edge_agent &

The following screenshot shows your output.

The agent is now running. After a few minutes, you can see the heartbeat of the device, reported on the console. To see it on the SageMaker console, under Edge Inference, choose Edge Devices and choose your device.

Configure the application

Now it’s time to set up the application that runs on the edge device. This application is responsible for the following:

  • Get the temporary credentials using the certificate
  • Listen to the OTA update topics to see whether a new model package is ready to deploy
  • Deploy the available model package to the edge device
  • Load the model to the agent if necessary
  • Perform an infinite loop:
    • Read the sensor data
    • Format the input data
    • Invoke the ML model and capture some metrics of the prediction
    • Compare the prediction’s MAE (mean absolute error) to the baseline
    • Publish raw data to an IoT topic (MQTT)

To install the application, first get the custom AWS IoT endpoint. On the AWS IoT Core console, choose Settings. Copy the endpoint and use it in the following code:

cd ~/
git clone https://github.com/aws-samples/amazon-sagemaker-edge-manager-demo wind_turbine
cd wind_turbine/04_EdgeApplication
## Replace the IoT endpoint placeholder in the application code with the AWS IoT endpoint host you just copied, and save the file
chmod +x run.py
./run.py &

The application outputs something like the following screenshot.

Optional: run this application with the parameter --test-mode if you just want to run a test with no wind turbine connected to the edge device.

If everything went fine, the application keeps waiting for a new model. It’s time to train a new model and deploy it to the Jetson Nano.

Train and deploy the ML model

This post demonstrates how to detect anomalies in the components of a wind turbine. There are many ways of doing this with the data collected by its sensors. To keep this example as simple as possible, you prepare a model that analyzes vibration, wind speed, rotation (per second), and the produced voltage to determine whether an anomaly exists or not. For that purpose, we train an autoencoder using PyTorch on SageMaker and prepare it for deployment on your Jetson Nano.

This model architecture has two advantages: it’s unsupervised, so we don’t need to label our data, and you can collect data from wind turbines that are working perfectly. Therefore, your model is trained to detect what you consider normal behavior of your wind turbines. When a defect appears in any part of the turbine, a drift occurs on the sensors data, which the model interprets as abnormal behavior (an anomaly).

The following screenshot is a sample of the raw data captured by the turbine sensors.

The data has the following features:

  • nanoId – ID of the edge device that collected the data
  • turbineId – ID of the turbine that produced this data
  • arduino_timestamp – Timestamp of the Arduino that was operating this turbine
  • nanoFreemem – Amount of free memory in bytes
  • eventTime – Timestamp of the row
  • rps – Rotation of the rotor in rotations per second
  • voltage – Voltage produced by the generator in millivolts
  • qw, qx, qy, qz – Quaternion angular acceleration
  • gx, gy, gz – Gravity acceleration
  • ax, ay, az – Linear acceleration
  • gearboxtemp – Internal temperature
  • ambtemp – External temperature
  • humidity – Air humidity
  • pressure – Air pressure
  • gas – Air quality
  • wind_speed_rps – Wind speed in rotations per second

The selected features, based on our goals, are qw, qx, qy, qz (angular acceleration), wind_speed_rps, rps, and voltage. The following image is a sample of the feature qx. The data produced by the accelerometer is noisy, so we need to clean it first.

The angular velocity (quaternion) is first converted to Euler Angles (roll, pitch, yaw). Then we denoise all the features with Wavelets (PyWavelets), and normalize them. The following screenshot shows the signals after these transformations.

Finally, we apply a sliding window to this resulting dataset (six features) to capture the temporal relationship between neighboring readings and create the input tensor of our ML model. The average interval between two sequential samples is approximately 50 milliseconds. Each time window (of our sliding window) is then converted into a tensor, using the following structure:

  • Tensor – 6 features × 10 steps (100 samples) = 6×100
    • Step – Group of time steps
    • Time step – Group of intervals (time_step=20 = ~5 seconds)
    • Interval – Group of samples (interval=5 = ~250 milliseconds)
  • Reshaped tensor – 6×10×10

Interval, time step, and step are hyperparameters that you can adjust during training. The final result is a stream of data, encoded as a multidimensional tensor (representing a few seconds in the past). The trained autoencoder tries to recreate the input tensor as the output (prediction). By measuring the MAE between the input and output and comparing it with a predefined threshold, you can identify potential anomalies.
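A minimal sketch of the windowing and thresholding logic looks like the following; the shapes follow the 6×10×10 layout described above, and the trained model and threshold are assumed to come from the training notebook.

import numpy as np
import torch

FEATURES, STEPS, TIME_STEPS = 6, 10, 10  # reshaped tensor: 6 x 10 x 10

def make_windows(data, window=STEPS * TIME_STEPS, stride=1):
    # data has shape (num_samples, FEATURES); each window becomes a 6x10x10 tensor
    tensors = []
    for start in range(0, len(data) - window + 1, stride):
        chunk = data[start:start + window]          # shape (100, 6)
        tensors.append(chunk.T.reshape(FEATURES, STEPS, TIME_STEPS))
    return torch.tensor(np.stack(tensors), dtype=torch.float32)

def detect_anomalies(model, windows, threshold):
    # Flag windows whose reconstruction error (MAE) exceeds the trained threshold
    with torch.no_grad():
        reconstruction = model(windows)
    mae = torch.mean(torch.abs(reconstruction - windows), dim=(1, 2, 3))
    return mae > threshold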

One important aspect of this approach is that it extracts the linear and non-linear correlations between the features to better capture the impact of one feature on another, such as the effect of wind speed on rotation or produced voltage.

Now it’s time to run this experiment.

  1. First, you need to set up your Studio environment if you don’t have one yet.
  2. Clone the GitHub repo https://github.com/aws-samples/amazon-sagemaker-edge-manager-demo inside a Studio terminal.

The repository contains a folder named 03_Notebooks with two Jupyter notebooks.

  3. Follow the instructions in the first notebook to prepare the dataset. Because the accelerometer data is a signal, it contains noise, so you run a denoising mechanism to clean the data.

The final dataset has only six features: roll, pitch, yaw (converted from a quaternion to Euler angles), wind_speed_rps, rps (rotations per second), and voltage (produced by the generator).

  4. Follow the instructions in the second notebook to train, package, and deploy the model:
    1. Use SageMaker to train your PyTorch autoencoder (CNN based).
    2. Run a batch prediction to compute MAE and threshold used by the app to determine whether the prediction is an anomaly or not.
    3. Compile the model for the Jetson Nano using SageMaker Neo.
    4. Create a deployment package with Edge Manager.
    5. Create an IoT job that publishes a JSON document to a topic listened to by the application that is running on your Jetson Nano.

The application gets the package, unpacks it, loads the model in the Edge Manager agent, and unblocks the application run.

Both notebooks are very detailed, so follow the steps carefully, after which you’ll have an anomaly detection model to deploy in your Jetson Nano.

Compilation job and model optimization

One of the most important steps of the whole process is the model optimization step in the second notebook. When you compile a model with SageMaker Neo, it not only optimizes the model to improve the prediction performance on the target device, it also converts the original model into an intermediate representation. After this conversion, you don't need to use the original framework anymore (PyTorch, TensorFlow, MXNet). This representation is then interpreted by a light runtime (DLR), which is packaged with the model by Neo. Both the runtime and the optimized model are libraries, compiled as native programs for a specific operating system and architecture. In the case of the Jetson Nano, the OS is a Linux distro and the architecture is ARMv8 64-bit. The runtime in this case uses TensorRT for maximum performance on the Jetson's GPU.

When you launch a compilation job on Neo, you need to specify some parameters related to the setup of your target device, for instance:

  • trt-ver – 7.1.3
  • cuda-ver – 10.2
  • gpu-code – sm_53

The Jetson Nano's GPU is an NVIDIA Maxwell, architecture version 53, so the gpu-code parameter is the same for all compilation jobs. However, trt-ver and cuda-ver depend on the versions of TensorRT and CUDA installed on your Nano. When you were preparing your edge device, you set up your Jetson Nano with JetPack 4.4.1. This ensures that the model you optimize using Neo is compatible with your Jetson Nano.
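For reference, the same compilation job can be launched programmatically with boto3; the job name, S3 locations, role, and input shape below are placeholders, and the compiler options mirror the values above. The notebook remains the authoritative source for the exact settings.

import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="wind-turbine-anomaly-jetson-nano",       # placeholder name
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",      # your execution role
    InputConfig={
        "S3Uri": "s3://your-bucket/models/model.tar.gz",         # trained PyTorch model
        "DataInputConfig": '{"input0": [1, 6, 10, 10]}',         # must match your input tensor shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://your-bucket/compiled/",
        "TargetDevice": "jetson_nano",
        "CompilerOptions": '{"trt-ver": "7.1.3", "cuda-ver": "10.2", "gpu-code": "sm_53"}',
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)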

Visualize the results

The dashboard setup is out of scope for this post. For more information, see Analyze device-generated data with AWS IoT and Amazon Elasticsearch Service.

Now that you have your model deployed and running on your Jetson Nano, it’s time to look at the behavior of your wind turbines through a dashboard. The application you deployed to the Jetson Nano collects some logs and sends them to two different places:

  • The IoT MQTT topic wind-turbine/logs/<<iot_thing_name>> contains the app logs and raw data collected from the wind turbine sensors
  • The S3 bucket s3://<<S3_BUCKET>>/wind_turbine_data contains the metrics of the ML model

You can get this data and ingest it into Amazon ES or another database. Then you can use your preferred reporting tool to prepare dashboards.

The following visualization shows three different but correlated things for each one of the five turbines: the rotation speed (in RPS), the produced voltage, and the detected anomalies for voltage, rotation, and vibration.

Some noise was injected in the raw data from the turbines to simulate failures.

The following visualization shows an aggregation of the turbines’ speed and produced voltage anomalies over time.

Conclusion

Securely and reliably maintaining the lifecycle of an ML model deployed across a fleet of devices isn’t an easy task. However, with Edge Manager, you can reduce the implementation effort and operational cost of such a solution. Also, with a demo like the mini wind turbine farm, you can experiment, optimize, and automate your ML pipeline with the services and expertise provided by AWS.

To build a solution for your own needs, get the code and artifacts used in this project from the GitHub repo. If you want more practice using Edge Manager, check out the end-to-end workshop for Edge Manager on Studio.


About the Author

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers create AI/ML solutions that solve their business challenges using AWS. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.

Read More

Build a medical sentence matching application using BERT and Amazon SageMaker

Determining the relevance of a sentence when compared to a specific document is essential for many different types of applications across various industries. In this post, we focus on a use case within the healthcare field to help determine the accuracy of information regarding patient health.

Frequently, during each patient visit, a new document is created with the information from the visit. This information often consists of a medical transcription that has been dictated by either the nurse or the physician. Such a document may contain a brief description statement (also known as a restatement) that explains the main details from that specific patient visit. In future visits, doctors may rely on previous visits' restatements to quickly get an overview of the patient's overall status. Such restatements may also be used during patient handoffs. However, this introduces the potential for errors to be made during patient handoffs to new medical teams if the restatements are difficult to understand or if they contain inadequate information (Staggers et al. 2011). Therefore, having an accurate description of the patient's status is important, because the cost of errors in such restatements can be high and may negatively affect the patient's overall care (Garcia et al. 2017).

This post walks you through how to deploy a machine learning (ML) model that aims to determine the top sentences from the document that best match the corresponding document restatement; this can be a first step to ensure the accuracy of the patient’s health records overall by determining the relevance of the restatement. We emphasize that this model determines the top ranking sentences that match the restatement; it does not generate the restatement itself.

When creating this solution, we were faced with a dual-sided challenge. Beyond the technical challenge of actually creating an AI/ML model, several surrounding components complicate actually using such models in the real world. Indeed, the actual ML code may be a very small part of the system as a whole (Sculley et al. 2015). This is especially so in complex architectures frequently deployed in the context of the healthcare and life science space.

We focused on one particular challenge: creating the ability to serve the model so that others (applications, services, or people) can use it. By serving a model, we mean to grant others the ability to pass new data to the model so they can get the predictions they need. This post provides a broad overview of the problem, the solution, and a few points to keep in mind if you plan to use a similar approach in your own use cases. A full technical write-up, including a readme and a step-by-step deployment of the architecture, is available in the GitHub code repository. For more information about approaches to serving models, see Build, Train, and Deploy a Machine Learning Model With Amazon SageMaker and AWS Deep Learning Containers on Amazon ECS.

Background and use case

In the medical field (as well as other industries), documents are frequently associated with a shorter restatement text of the original document. We use the term restatement, but in fact this shorter text can be a summary, highlight, description, or other metadata about the document. For example, an after-visit clinical summary given to a patient summarizes the content of the patient visit to a physician.

For illustration purposes, the following is an example that’s unrelated to the medical industry.

Document:

On Monday morning, Joshua ate a large breakfast of bacon and eggs. He then went for a brisk walk. Finally, he returned home and sat at his desk.

Restatement:

Joshua went for a walk.

In this example, the restatement is just a rewording of the highlighted sentence in the full document. This example shows that, although the use case that we focus on in this post is specific to the medical field, you can use and modify this approach for many other text analysis applications.

Let’s now take a closer look at the use case for this post. We used data taken from MTSamples (which we downloaded from Kaggle). This data contains many different samples of transcribed medical texts. It includes documents with raw transcriptions of sample notes, as well as shorter descriptions of those notes (which we treat as restatements).

The following is an example from the MTSamples dataset.

Document:

HISTORY OF PRESENT ILLNESS: , I have seen ABC today. He is a very pleasant gentleman who is 42 years old, 344 pounds. He is 5’9″. He has a BMI of 51. He has been overweight for ten years since the age of 33, at his highest he was 358 pounds, at his lowest 260. He is pursuing surgical attempts of weight loss to feel good, get healthy, and begin to exercise again. He wants to be able to exercise and play volleyball. Physically, he is sluggish. He gets tired quickly. He does not go out often. When he loses weight he always regains it and he gains back more than he lost. His biggest weight loss is 25 pounds and it was three months before he gained it back. He did six months of not drinking alcohol and not taking in many calories. He has been on multiple commercial weight loss programs including Slim Fast for one month one year ago and Atkin’s Diet for one month two years ago.,PAST MEDICAL HISTORY: , He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to public seating, difficulty walking, high cholesterol, and high blood pressure. He has asthma and difficulty walking two blocks or going eight to ten steps. He has sleep apnea and snoring. He is a diabetic, on medication. He has joint pain, knee pain, back pain, foot and ankle pain, leg and foot swelling. He has hemorrhoids.,PAST SURGICAL HISTORY: , Includes orthopedic or knee surgery.,SOCIAL HISTORY: , He is currently single. He drinks alcohol ten to twelve drinks a week, but does not drink five days a week and then will binge drink. He smokes one and a half pack a day for 15 years, but he has recently stopped smoking for the past two weeks.,FAMILY HISTORY: , Obesity, heart disease, and diabetes. Family history is negative for hypertension and stroke.,CURRENT MEDICATIONS:, Include Diovan, Crestor, and Tricor.,MISCELLANEOUS/EATING HISTORY: ,He says a couple of friends of his have had heart attacks and have had died. He used to drink everyday, but stopped two years ago. He now only drinks on weekends. He is on his second week of Chantix, which is a medication to come off smoking completely. Eating, he eats bad food. He is single. He eats things like bacon, eggs, and cheese, cheeseburgers, fast food, eats four times a day, seven in the morning, at noon, 9 p.m., and 2 a.m. He currently weighs 344 pounds and 5’9″. His ideal body weight is 160 pounds. He is 184 pounds overweight. If he lost 70% of his excess body weight that would be 129 pounds and that would get him down to 215.,REVIEW OF SYSTEMS: , Negative for head, neck, heart, lungs, GI, GU, orthopedic, or skin. He also is positive for gout. He denies chest pain, heart attack, coronary artery disease, congestive heart failure, arrhythmia, atrial fibrillation, pacemaker, pulmonary embolism, or CVA. He denies venous insufficiency or thrombophlebitis. Denies shortness of breath, COPD, or emphysema. Denies thyroid problems, hip pain, osteoarthritis, rheumatoid arthritis, GERD, hiatal hernia, peptic ulcer disease, gallstones, infected gallbladder, pancreatitis, fatty liver, hepatitis, rectal bleeding, polyps, incontinence of stool, urinary stress incontinence, or cancer. He denies cellulitis, pseudotumor cerebri, meningitis, or encephalitis.,PHYSICAL EXAMINATION: ,He is alert and oriented x 3. Cranial nerves II-XII are intact. Neck is soft and supple. Lungs: He has positive wheezing bilaterally. Heart is regular rhythm and rate. His abdomen is soft. 
Extremities: He has 1+ pitting edema.,IMPRESSION/PLAN:, I have explained to him the risks and potential complications of laparoscopic gastric bypass in detail and these include bleeding, infection, deep venous thrombosis, pulmonary embolism, leakage from the gastrojejuno-anastomosis, jejunojejuno-anastomosis, and possible bowel obstruction among other potential complications. He understands. He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass. He will need to get a letter of approval from Dr. XYZ. He will need to see a nutritionist and mental health worker. He will need an upper endoscopy by either Dr. XYZ. He will need to go to Dr. XYZ as he previously had a sleep study. We will need another sleep study. He will need H. pylori testing, thyroid function tests, LFTs, glycosylated hemoglobin, and fasting blood sugar. After this is performed, we will submit him for insurance approval.

Restatement:

Consult for laparoscopic gastric bypass.

Although the raw transcript document is quite long, only a few of the sentences actually appear to be related to the restatement “Consult for laparoscopic gastric bypass.” We highlighted two sentences within the document that you might intuitively think best match the restatement. The approach we deployed quantifies the similarities and reports the sentences in the document that best match the restatement. We did this by using a pretrained BERT language model trained specifically on clinical texts (published by Alsentzer et al. 2019). The model itself is hosted by HuggingFace, a platform for sharing open-source natural language processing (NLP) projects. We used this model to calculate sentence-by-sentence similarities using the sentence-transformers Python library.
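The following sketch shows the general idea with the sentence-transformers library. The model identifier is our assumption for the clinical BERT checkpoint published by Alsentzer et al. and hosted on HuggingFace; the exact model and ranking logic used by the solution are described in the technical write-up.

from sentence_transformers import SentenceTransformer, util

# Wrap the clinical BERT checkpoint with mean pooling to get sentence embeddings
model = SentenceTransformer("emilyalsentzer/Bio_ClinicalBERT")  # assumed model ID

restatement = "Consult for laparoscopic gastric bypass."
sentences = [
    "He is a very pleasant gentleman who is 42 years old, 344 pounds.",
    "He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass.",
    "He has sleep apnea and snoring.",
]

sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
restatement_embedding = model.encode(restatement, convert_to_tensor=True)

# Cosine distance = 1 - cosine similarity; a lower distance means a closer match
distances = 1 - util.cos_sim(restatement_embedding, sentence_embeddings)[0]
for distance, sentence in sorted(zip(distances.tolist(), sentences))[:5]:
    print(f"{distance:.4f}  {sentence}")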

Note that in this example and in this solution, we perform the sentence ranking without explicitly detecting or extracting medical entities. However, many applications rely on explicitly extracting and analyzing diagnoses, medications, and other health information. To detect medical entities such as conditions and medications in medical text, consider using Amazon Comprehend Medical, a HIPAA-eligible service built to extract medical information from unstructured medical text.

More information about this approach is available in our technical write-up.

Architecture diagram

In this section, we go over the architecture diagram for this solution at a very high level. For more details and to see the step-by-step framework, see our technical write-up.

In the model development and testing phase, we use Amazon SageMaker Studio. Studio is a powerful integrated development environment (IDE) for building, training, testing, and deploying ML models. Because we use a prebuilt model for this solution, we don’t need to use Studio’s full ability to train algorithms at scale. Instead, we use it for development and deployment purposes.

We created a Jupyter notebook that you can import into Studio. This notebook walks you through the entire development and deployment process. We start by writing the code for our model to a file. The model is then built using an NGINX/Flask framework, so that new data can be passed to it at inference time. Prior to deploying the model, we package it as a Docker container, build it using AWS CodeBuild, and push it to Amazon Elastic Container Registry (Amazon ECR). Then we deploy the model using Amazon Elastic Container Service (Amazon ECS).
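As an illustration of the serving layer, a minimal Flask endpoint could look like the following sketch; the route names and JSON fields are assumptions for illustration, not the exact application code in the repository.

from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer, util

app = Flask(__name__)
model = SentenceTransformer("emilyalsentzer/Bio_ClinicalBERT")  # assumed model ID

@app.route("/ping", methods=["GET"])
def ping():
    # Health check used by the load balancer
    return jsonify(status="ok")

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json()
    sentences = payload["document_sentences"]   # assumed request schema
    restatement = payload["restatement"]
    doc_emb = model.encode(sentences, convert_to_tensor=True)
    query_emb = model.encode(restatement, convert_to_tensor=True)
    distances = (1 - util.cos_sim(query_emb, doc_emb)[0]).tolist()
    ranked = sorted(zip(distances, sentences))[:5]
    return jsonify(results=[{"distance": d, "sentence": s} for d, s in ranked])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)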

The final result is a model that you can query using a simple API call. This is an important point: the ability to query models via an API is an essential component of designing scalable, easy-to-use interfaces. For more information, see Implementing Microservices on AWS.

After we deploy our model, we create a graphical user interface (using Streamlit) so that our model can be easily accessed through a webpage. Streamlit is an open-source library used to create front ends for ML applications. After we create our webpage, we deploy it in a similar way to how we deployed our model: we package it as a separate Docker container, build it using CodeBuild, push it to Amazon ECR, and deploy it using Amazon ECS.
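For completeness, a minimal Streamlit front end that calls such an endpoint might look like the following; the endpoint URL and JSON schema match the assumptions in the serving sketch above.

import requests
import streamlit as st

st.title("Medical sentence matching")  # illustrative page title

restatement = st.text_input("Restatement")
document = st.text_area("Document text")

if st.button("Find matching sentences") and restatement and document:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    response = requests.post(
        "http://localhost:8080/invocations",   # assumed endpoint URL
        json={"restatement": restatement, "document_sentences": sentences},
        timeout=60,
    )
    for result in response.json()["results"]:
        st.write(f'{result["distance"]:.4f}  {result["sentence"]}')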

By creating and deploying this webpage, we provide users with no programming experience the ability to use our model to test their own documents and restatements. The following screenshot shows what the webpage looks like.

After the user inputs their restatement and corresponding document, the top five results (the five sentences that best match the statement) are returned. If you deploy the entire solution using our original MTSamples example, the final result looks like the following screenshot.

The solution reports the following results:

  • The top five sentences within the document that best match the restatement.
  • The similarity distance between each sentence and the restatement. A lower distance means closer similarities between that sentence and the restatement sentence.

In this example, the best matching sentence is “He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass” with a distance of 0.0672. Therefore, this approach has correctly identified a sentence within the document that matches the restatement.

Limitations

Like any algorithm, this approach has some limitations. For instance, this approach is not designed to handle cases where the restatement of the document is actually high-level metadata about the document not directly related to the text of the document itself. You can solve such use cases by using Amazon Comprehend custom models. For more information, see Comprehend Custom and Building a custom classifier using Amazon Comprehend.

Another limitation in our approach is that it doesn’t explicitly handle negation (words such as “not,” “no,” and “denies”), which may change the meaning of the text. AWS services such as Amazon Comprehend and Amazon Comprehend Medical use deep learning models to handle negation.

Conclusion

In this post, we walked through the high-level steps to deploy a pre-built NLP model to analyze medical texts. If you’re interested in deploying this yourself, see our step-by-step technical write-up.


About the Authors

Joshua Broyde is an AI/ML Specialist Solutions Architect on the Global Healthcare and Life Sciences team at Amazon Web Services. He works with customers in the healthcare and life sciences industry at all levels of the machine learning lifecycle on a number of AI/ML fronts, including analyzing medical images and video, analyzing machine sensor data, and performing natural language processing of medical and healthcare texts.

 

Claire Palmer is a Solutions Architect at Amazon Web Services. She is on the Global Account Development team, supporting healthcare and life sciences customers. Claire has a passion for driving innovation initiatives and developing solutions that are both secure and scalable. She is based out of Seattle, Washington and enjoys exploring the PNW in her free time.

Read More

Cognitive document processing for automated mortgage processing

This post was guest authored by AWS Advanced Consulting Partner Quantiphi.

The mortgage industry is highly complex and largely dependent on documents for the information required across different stages in their business value chain. Day-to-day operations for mortgage underwriting, property appraisal, and mortgage insurance underwriting are heavily dependent on the comprehension of different types of documents. The slow pace of document transfer between different business units of an organization slows down the overall approval process, leading to poor customer experience.

The mortgage loan approval process usually takes multiple weeks because a multitude of user-submitted documents are scrutinized at each stage to assess the underlying risk. Organizations need the right information at the right time to increase operational efficiency and improve document management.

In the wake of COVID-19, the mortgage industry is reeling under pressure to undergo a digital transformation to provide a better customer experience. Large companies are cutting down capital and operational expenditure to sustain operations. The need for operational efficiency is higher than ever.

This post analyzes the role of machine learning (ML) solutions in document extraction in the mortgage industry to enhance business operations.

We highlight the key aspects of Quantiphi’s document processing solution built on AWS, and unveil how it helped a US-based mortgage insurance company address document management challenges through artificial intelligence (AI) and ML techniques.

Quantiphi is an AWS Partner Network (APN) Advanced Consulting Partner with AWS competencies in Machine Learning, Financial Services, Data & Analytics, and DevOps. Quantiphi also has multiple AWS Service Delivery designations, recognizing its expertise in leveraging specific AWS services.

ML-based document extraction for the mortgage industry

Lenders usually have to manually sift through large volumes of loan packages containing structured and unstructured information to classify documents and identify key information. The identified information is further used for risk assessment. Most of this key information is usually contained in paragraphs, key-value pairs, and tables.

These lenders usually receive loan packages in bulk containing different types of documents, such as W2s, tax statements, and 1008 forms. Currently, people have to first classify these documents manually and extract the relevant information. Therefore, mortgage firms are looking for meaningful ways to incorporate cognitive capabilities and solutions into their existing mortgage processing pipelines to automate the identification of key information and facilitate risk scoring, in order to improve operational excellence and reduce manual effort.

Quantiphi’s cognitive document processing solution combines state-of-the-art AI and ML services from AWS with Quantiphi’s custom document processing models to digitize a wide variety of mortgage documents. Quantiphi’s solution leverages services like Amazon Textract, Amazon SageMaker, Amazon Comprehend, Amazon Kendra, and Amazon Augmented AI (Amazon A2I) to help mortgage firms extract information from structured and unstructured documents, classify them into document types, and further address needs around risk assessment through ML.

Document classification and information extraction

Mortgage underwriting is done to assess the underlying risk for each application by analyzing the multitude of user-submitted documents, such as W2 or I9 forms, tax returns, loan application (1003) forms, underwriting transmittal (1008) forms, demographic addendum, credit reports, bank account statements, and paycheck stubs. For example, the underwriting transmittal (1008) form contains the summary of the key information used during the risk assessment, such as monthly income, qualifying rate, property details, and occupancy status. Paycheck stubs are another example of such documents, used to verify a borrower's income and confirm that the borrower is able to repay the loan.

Similarly, property appraisal documents such as chain of title document and deed documents (assignment, trust, quitclaim) along with the property appraisal report are used to complement the property valuation process. Deed documents are processed to establish ownership and legal rights to a property. For example, if the lender sells a mortgage loan to another lender, they need to issue an assignment of deed of trust to give the new lender the same legal rights to the property.

Based on the inherent structure of the different types of mortgage documents, we have defined three broad segments to classify these documents:

  • Structured documents, which are standard documents with some variations over the years and across different states. All values, check boxes, and tables are usually contained in predefined areas of the document.
  • Semi-structured documents, which don’t follow a standardized template strictly but have a similar format.
  • Unstructured documents, which don’t follow any defined format.

Structured documents

Examples of structured documents include the loan application (1003) form, underwriting transmittal summary (1008) form, verification of employment (1005) form, and W2 form.

Consider the underwriting transmittal summary (1008) form. Quantiphi's solution uses the standardized 1008 document as a reference for training, which is then used for extraction (see the following screenshot).

Key information that can be extracted from the 1008 form includes borrower and co-borrower names, property address, SSN, sales price, and appraised value.

Semi-structured documents

Semi-structured documents include pay stubs, bank statements, credit reports, and loan estimates. Here, Quantiphi’s solution uses a generic key-value pair and table detection model to extract the relevant features. Searching for certain common keywords results in a more efficient extraction.

The following screenshot shows data extraction from a pay stub.

Key information that can be extracted from a pay stub includes pay period, deductions, net pay, 401K summary, and more.

Unstructured documents

Unstructured documents include deed documents, appraisal reports, and more. Consider the assignment of deed of trust. Quantiphi's solution uses custom NLP techniques like entity recognition and syntax analysis to extract information (see the following screenshot).

Key information that can be extracted from the assignment of deed of trust includes the date of assignment, assignor, assignee, executor name, principal sum, and more.

Quantiphi’s cognitive document processing solution

Quantiphi’s cognitive document processing solution works across all types of structured and unstructured mortgage documents. Some key aspects of Quantiphi’s solution are as follows:

  • Capture – Powered by Amazon Textract and SageMaker. This feature uses deep-learning based OCR to identify and extract information such as key-value pairs, check boxes, tables, and signatures for further consumption by risk scoring models.
  • Categorize – Powered by SageMaker and Amazon Comprehend. This feature includes the automated classification of various types of mortgage documents like loan application (1003) forms, underwriting transmittal summary (1008) forms, W2 forms, pay stubs, bank statements, and credit reports.
  • Call out – Powered by Amazon Textract. This feature converts scanned applications and supporting documents into digital (searchable) PDFs and highlights the important information in the document, along with bookmarks, to help loan underwriters quickly navigate the document and consume the relevant information for the decision-making process.
  • Redaction – Powered by Amazon Textract and Amazon Comprehend. This feature enables identification and redaction of PII and PCI data like addresses, phone numbers, and names.
  • Interpret – Powered by Amazon Kendra and Amazon Elasticsearch Service (Amazon ES). This feature offers tools for easy consumption like a contextual and keyword-based search engine. Users can directly perform a search on a repository of processed documents to retrieve the relevant information along with the corresponding document link.
  • Human in the loop – Powered by Amazon A2I. This feature allows you to review and edit the extracted information based on the confidence score against the ground truth through an augmented AI workflow. This manual feedback is then used for continuous improvement of the extraction results through active learning.

The extracted information can be further fed into a risk assessment module to enable risk scoring of submissions in which low-risk applications are auto-approved and high-risk applications are marked for human review.

Quantiphi’s cognitive document processing solution is capable of achieving over 90% accuracy, provides substantial cost reductions, and facilitates better visibility of the mortgage processing workflow while assuring faster processing.

Let’s look at how Quantiphi built this solution by using a combination of AI and ML services provided by AWS.

Components used in the architecture ensure that the complete solution remains robust and scalable while providing high performance and reliability to process the incoming workload of documents in a cost-effective manner.

Solution architecture

The following diagram illustrates the architecture of Quantiphi’s solution.

The architecture consists of the following elements:

  1. The UI is hosted on Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront is used for web distribution to ensure low latency.
  2. Through the UI, the user can upload multiple types of documents to Amazon S3 and select use cases like information extraction, document searchability, document classification, entity recognition, and insight generation.
  3. AWS Batch carries out preprocessing and stores these preprocessed images back into Amazon S3. The metadata information is captured in Amazon Aurora.
  4. AWS Lambda invokes Amazon Textract.
  5. Amazon Textract performs OCR to extract information (a minimal invocation sketch follows this list). When the OCR job is complete, Amazon Textract triggers an Amazon Simple Notification Service (Amazon SNS) notification to add the completed job to an Amazon Simple Queue Service (Amazon SQS) queue.
  6. Lambda receives the Amazon Textract output and stores it in Amazon S3.
  7. Amazon Elastic Compute Cloud (Amazon EC2) instances in an Auto Scaling group convert these scanned images into digital (searchable) PDFs and write the output to Amazon S3. Depending on the selected use cases, they write the message into the respective four postprocessing SQS queues.
  8. If the uploaded PDFs are digital, AWS Batch skips this conversion step and writes directly to the four postprocessing SQS queues.
  9. The solution uses a document classifier Docker container to classify pages and documents into categories.
  10. The enterprise search Docker container enables the contextual search engine with the Q&A option on document content, content snippet generation, and document ranking.
  11. A document entity recognition and insights Docker container is used for keyword and entity recognition and highlighting, masking of confidential data, chronological distribution of information, summarization, and topic modeling.
  12. The information extraction Docker container has two functions:
    1. If the uploaded documents match any pre-trained documents, it extracts information from the documents and presents them to the user via the UI.
    2. If the documents don’t match any pre-trained model documents, it calls SageMaker via Amazon API Gateway and extracts key-value pairs, tables, check boxes, signatures, and stamps.
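The following boto3 sketch makes the Amazon Textract step (step 5) more concrete by starting an asynchronous text detection job with an SNS completion notification; the bucket, document key, topic, and role are placeholders rather than Quantiphi's actual configuration.

import boto3

textract = boto3.client("textract")

response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {"Bucket": "your-mortgage-docs-bucket", "Name": "loan-package.pdf"}
    },
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-job-complete",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractPublishToSNS",
    },
)
job_id = response["JobId"]

# After the SNS/SQS notification signals completion, fetch the OCR output
result = textract.get_document_text_detection(JobId=job_id)
lines = [block["Text"] for block in result["Blocks"] if block["BlockType"] == "LINE"]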

Customer use case: US-based mortgage insurance company

For this post, we present a use case in which the client is a leading US-based mortgage insurance company with a suite of mortgage, risk, real estate, and title services.

The client had millions of scanned pages of mortgage documents containing both handwritten and typed content, which were manually parsed to extract information and classify them accordingly. Processing of new mortgage loans was extremely time-consuming due to manual handling of over 400 different document types.

Quantiphi solution

Quantiphi developed an AI virtual assistant that takes user-uploaded documents and automatically classifies the documents and pages contained in them into different categories, such as bank statements, credit reports, tax returns, and property tax bills and statements.

To augment the consumption of information, the solution highlights key entities (with bounding boxes) present in them. The user can review and edit the extraction results via a custom reviewer UI tool, which is then used for accuracy benchmarking and re-training purposes.

Amazon QuickSight was used to build an interactive dashboard for presenting accuracy metrics and reconciliation.

The solution successfully digitized the processing of more than 50 million pages yearly, with an accuracy of over 97% in classification and 90% in the extraction of more than 40 different data points, such as borrower's name and loan amount.

Quantiphi helped expand the customer's profit margin by lowering its document processing costs. Processing efficiency was enhanced through quick and accurate extraction and detection of data, and the elimination of manual effort greatly reduced loan processing time.

Summary

Traditional methods of mortgage loan processing are manual in nature and highly time-consuming. Customers are often asked to provide a large number of documents that lenders have to manually go through for assessment.

Quantiphi’s cognitive document processing solution expedites the process by automating information extraction from the documents. Mortgage companies can use Quantiphi’s solution to increase their operational efficiency and significantly reduce their mortgage processing time.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Arnav Gupta is AWS Practice Head at Quantiphi.

Bhaskar Kalita is FSI Head at Quantiphi.

Read More

Securing Amazon SageMaker Studio internet traffic using AWS Network Firewall

Amazon SageMaker Studio is a web-based fully integrated development environment (IDE) where you can perform end-to-end machine learning (ML) development to prepare data and build, train, and deploy models.

Like other AWS services, Studio supports a rich set of security-related features that allow you to build highly secure and compliant environments.

One of these fundamental security features allows you to launch Studio in your own Amazon Virtual Private Cloud (Amazon VPC). This allows you to control, monitor, and inspect network traffic within and outside your VPC using standard AWS networking and security capabilities. For more information, see Securing Amazon SageMaker Studio connectivity using a private VPC.

Customers in regulated industries, such as financial services, often don't allow any internet access in ML environments. They often use only VPC endpoints for AWS services, and connect only to private source code repositories in which all libraries have been vetted both in terms of security and licensing. Customers may want to provide internet access but still require controls such as domain name or URL filtering to allow access to only specific public repositories and websites, packet inspection, or other network traffic-related security controls. For these cases, a deployment based on AWS Network Firewall and a NAT gateway may provide a suitable solution.

In this post, we show how you can use Network Firewall to build a secure and compliant environment by restricting and monitoring internet access, inspecting traffic, and using stateless and stateful firewall engine rules to control the network flow between Studio notebooks and the internet.

Depending on your security, compliance, and governance rules, you may not need to or cannot completely block internet access from Studio and your AI and ML workloads. You may have requirements beyond the scope of network security controls implemented by security groups and network access control lists (ACLs), such as application protocol protection, deep packet inspection, domain name filtering, and intrusion prevention system (IPS). Your network traffic controls may also require many more rules compared to what is currently supported in security groups and network ACLs. In these scenarios, you can use Network Firewall—a managed network firewall and IPS for your VPC.

Solution overview

When you deploy Studio in your VPC, you control how Studio accesses the internet with the parameter AppNetworkAccessType (via the Amazon SageMaker API) or by selecting your preference on the console when you create a Studio domain.

If you select Public internet Only (PublicInternetOnly), all the ingress and egress internet traffic from Amazon SageMaker notebooks flows through an AWS managed internet gateway attached to a VPC in your SageMaker account. The following diagram shows this network configuration.

Studio provides public internet egress through a platform-managed VPC for data scientists to download notebooks, packages, and datasets. Traffic to the attached Amazon Elastic File System (Amazon EFS) volume always goes through the customer VPC and never through the public internet egress.

To use your own control flow for the internet traffic, like a NAT or internet gateway, you must set the AppNetworkAccessType parameter to VpcOnly (or select VPC Only on the console). When you launch your app, this creates an elastic network interface in the specified subnets in your VPC. You can apply all available layers of security control—security groups, network ACLs, VPC endpoints, AWS PrivateLink, or Network Firewall endpoints—to the internal network and internet traffic to exercise fine-grained control of network access in Studio. The following diagram shows the VpcOnly network configuration.

In this mode, the direct internet access to or from notebooks is completely disabled, and all traffic is routed through an elastic network interface in your private VPC. This also includes traffic from Studio UI widgets and interfaces, such as Experiments, Autopilot, and Model Monitor, to their respective backend SageMaker APIs.

For more information about network access parameters when creating a domain, see CreateDomain.
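For illustration, creating a domain in VpcOnly mode through the API looks roughly like the following; the VPC, subnet, security group, and role identifiers are placeholders.

import boto3

sagemaker = boto3.client("sagemaker")
sagemaker.create_domain(
    DomainName="studio-vpc-only-domain",                     # placeholder name
    AuthMode="IAM",
    AppNetworkAccessType="VpcOnly",                          # route all traffic through your VPC
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],                  # the SageMaker subnet
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        "SecurityGroups": ["sg-0123456789abcdef0"],          # controls Studio ingress/egress
    },
)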

The solution in this post uses the VpcOnly option and deploys the Studio domain into a VPC with three subnets:

  • SageMaker subnet – Hosts all Studio workloads. All ingress and egress network flow is controlled by a security group.
  • NAT subnet – Contains a NAT gateway. We use the NAT gateway to access the internet without exposing any private IP addresses from our private network.
  • Network Firewall subnet – Contains a Network Firewall endpoint. The route tables are configured so that all inbound and outbound external network traffic is routed via Network Firewall. You can configure stateful and stateless Network Firewall policies to inspect, monitor, and control the traffic.

The following diagram shows the overview of the solution architecture and the deployed components.

VPC resources

The solution deploys the following resources in your account:

  • A VPC with a specified Classless Inter-Domain Routing (CIDR) block
  • Three private subnets with specified CIDRs
  • Internet gateway, NAT gateway, Network Firewall, and a Network Firewall endpoint in the Network Firewall subnet
  • A Network Firewall policy and stateful domain list group with an allow domain list
  • Elastic IP allocated to the NAT gateway
  • Two security groups for SageMaker workloads and VPC endpoints, respectively
  • Four route tables with configured routes
  • An Amazon S3 VPC endpoint (type Gateway)
  • AWS service access VPC endpoints (type Interface) for various AWS services that need to be accessed from Studio

The solution also creates an AWS Identity and Access Management (IAM) execution role for SageMaker notebooks and Studio with preconfigured IAM policies.

Network routing for targets outside the VPC is configured in such a way that all ingress and egress internet traffic goes via the Network Firewall and NAT gateway. For details and reference network architectures with Network Firewall and NAT gateway, see Architecture with an internet gateway and a NAT gateway, Deployment models for AWS Network Firewall, and Enforce your AWS Network Firewall protections at scale with AWS Firewall Manager. The AWS re:Invent 2020 video Which inspection architecture is right for you? discusses which inspection architecture is right for your use case.

SageMaker resources

The solution creates a SageMaker domain and user profile.

The solution uses only one Availability Zone and is not highly available. A best practice is to use a Multi-AZ configuration for any production deployment. You can implement the highly available solution by duplicating the Single-AZ setup—subnets, NAT gateway, and Network Firewall endpoints—to additional Availability Zones.

You use Network Firewall and its policies to control entry and exit of the internet traffic in your VPC. You create an allow domain list rule to allow internet access to the specified network domains only and block traffic to any domain not on the allow list.

AWS CloudFormation resources

The source code and AWS CloudFormation template for solution deployment are provided in the GitHub repository. To deploy the solution on your account, you need:

Network Firewall is a Regional service; for more information on Region availability, see the AWS Region Table.

Your CloudFormation stack doesn’t have any required parameters. You may want to change the DomainName or *CIDR parameters to avoid naming conflicts with the existing resources and your VPC CIDR allocations. Otherwise, use the following default values:

  • ProjectName – sagemaker-studio-vpc-firewall
  • DomainName – sagemaker-anfw-domain
  • UserProfileName – anfw-user-profile
  • VPCCIDR – 10.2.0.0/16
  • FirewallSubnetCIDR – 10.2.1.0/24
  • NATGatewaySubnetCIDR – 10.2.2.0/24
  • SageMakerStudioSubnetCIDR – 10.2.3.0/24

Deploy the CloudFormation template

To start experimenting with the Network Firewall and stateful rules, you need first to deploy the provided CloudFormation template to your AWS account.

  1. Clone the GitHub repository:
git clone https://github.com/aws-samples/amazon-sagemaker-studio-vpc-networkfirewall.git
cd amazon-sagemaker-studio-vpc-networkfirewall 
  2. Create an S3 bucket in the Region where you deploy the solution:
aws s3 mb s3://<your s3 bucket name>

You can skip this step if you already have an S3 bucket.

  3. Deploy the CloudFormation stack:
make deploy CFN_ARTEFACT_S3_BUCKET=<your s3 bucket name>

The deployment procedure packages the CloudFormation template and copies it to the S3 bucket you provided. Then the CloudFormation template is deployed from the S3 bucket to your AWS account.

The stack deploys all the needed resources like VPC, network devices, route tables, security groups, S3 buckets, IAM policies and roles, and VPC endpoints, and also creates a new Studio domain and user profile.

When the deployment is complete, you can see the full list of stack output values by running the following command in the terminal:

aws cloudformation describe-stacks \
    --stack-name sagemaker-studio-demo \
    --output table \
    --query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"
  4. Launch Studio via the SageMaker console.

Experiment with Network Firewall

Now you can learn how to control the internet inbound and outbound access with Network Firewall. In this section, we discuss the initial setup, accessing resources not on the allow list, adding domains to the allow list, configuring logging, and additional firewall rules.

Initial setup

The solution deploys a Network Firewall policy with a stateful rule group with an allow domain list. This policy is attached to the Network Firewall. All inbound and outbound internet traffic is blocked now, except for the .kaggle.com domain, which is on the allow list.

Let’s try to access https://kaggle.com by opening a new notebook in Studio and attempting to download the front page from kaggle.com:

!wget https://kaggle.com

The following screenshot shows that the request succeeds because the domain is allowed by the firewall policy. Users can connect to this and only to this domain from any Studio notebook.

 

Access resources not on the allowed domain list

In the Studio notebook, try to clone any public GitHub repository, such as the following:

!git clone https://github.com/aws-samples/amazon-sagemaker-studio-vpc-networkfirewall.git

This operation times out after 5 minutes because any internet traffic except to and from the .kaggle.com domain isn’t allowed and is dropped by Network Firewall.

Add a domain to the allowed domain list

To be able to run the git clone command, you must allow internet traffic to the .github.com domain.

  1. On the Amazon VPC console, choose Firewall policies.
  2. Choose the policy network-firewall-policy-<ProjectName>.

  3. In the Stateful rule groups section, select the rule group domain-allow-sagemaker-<ProjectName>.

You can see the domain .kaggle.com on the allow list.

  4. Choose Add domain.

  5. Enter .github.com.
  6. Choose Save.

You now have two names on the allow domain list.

The firewall policy is propagated to Network Firewall in real time, and your changes take effect immediately. Any inbound or outbound traffic from or to these domains is now allowed by the firewall, and all other traffic is dropped.
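You can make the same change programmatically. The following boto3 sketch fetches the current rule group and rewrites its domain allow list; the rule group name is an assumption based on the stack naming above.

import boto3

nfw = boto3.client("network-firewall")
rule_group_name = "domain-allow-sagemaker-sagemaker-studio-vpc-firewall"  # assumed name

current = nfw.describe_rule_group(RuleGroupName=rule_group_name, Type="STATEFUL")
rule_group = current["RuleGroup"]

# Append the new domain to the existing allow list
targets = rule_group["RulesSource"]["RulesSourceList"]["Targets"]
if ".github.com" not in targets:
    targets.append(".github.com")

nfw.update_rule_group(
    UpdateToken=current["UpdateToken"],   # optimistic-locking token from the describe call
    RuleGroupName=rule_group_name,
    Type="STATEFUL",
    RuleGroup=rule_group,
)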

To validate the new configuration, go to your Studio notebook and try to clone the same GitHub repository again:

!git clone https://github.com/aws-samples/amazon-sagemaker-studio-vpc-networkfirewall.git

The operation succeeds this time—Network Firewall allows access to the .github.com domain.

Network Firewall logging

In this section, you configure Network Firewall logging for your firewall’s stateful engine. Logging gives you detailed information about network traffic, including the time that the stateful engine received a packet, detailed information about the packet, and any stateful rule action taken against the packet. The logs are published to the log destination that you configured, where you can retrieve and view them.

  1. On the Amazon VPC console, choose Firewalls.
  2. Choose your firewall.

  3. Choose the Firewall details tab.

  4. In the Logging section, choose Edit.

  5. Configure your firewall logging by selecting what log types you want to capture and providing the log destination.

For this post, select Alert log type, set Log destination for alerts to CloudWatch Log group, and provide an existing or a new log group where the firewall logs are delivered.

  6. Choose Save.

To check your settings, go back to Studio and try to access pypi.org to install a Python package:

!pip install -U scikit-learn

This command fails with ReadTimeoutError because Network Firewall drops any traffic to any domain not on the allow list (which contains only two domains: .github.com and .kaggle.com).

On the Amazon CloudWatch console, navigate to the log group and browse through the recent log streams.

The pypi.org domain shows the blocked action. The log event also provides additional details such as various timestamps, protocol, port and IP details, event type, Availability Zone, and the firewall name.
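You can also query the alert log group programmatically, for example to list recently blocked hostnames; the log group name is a placeholder for the one you configured, and the JSON layout follows the Suricata-style alert records that Network Firewall emits.

import json
import boto3

logs = boto3.client("logs")
events = logs.filter_log_events(
    logGroupName="/network-firewall/alerts",   # the log group you configured above
    filterPattern='"blocked"',
    limit=50,
)

for event in events["events"]:
    alert = json.loads(event["message"])
    detail = alert.get("event", {})
    host = detail.get("tls", {}).get("sni") or detail.get("http", {}).get("hostname")
    print(detail.get("timestamp"), host)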

You can continue experimenting with Network Firewall by adding .pypi.org and .pythonhosted.org domains to the allowed domain list.

Then validate your access to them via your Studio notebook.

Additional firewall rules

You can create any other stateless or stateful firewall rules and implement traffic filtering based on a standard stateful 5-tuple rule for network traffic inspection (protocol, source IP, source port, destination IP, destination port). Network Firewall also supports industry standard stateful Suricata compatible IPS rule groups. You can implement protocol-based rules to detect and block any non-standard or promiscuous usage or activity. For more information about creating and managing Network Firewall rule groups, see Rule groups in AWS Network Firewall.

Additional security controls with Network Firewall

In the previous section, we looked at one feature of Network Firewall: filtering network traffic based on the domain name. In addition to stateless or stateful firewall rules, Network Firewall provides several other tools and features for further security controls and monitoring.

Build secure ML environments

A robust security design normally includes multi-layer security controls for the system. For SageMaker environments and workloads, you can use the following AWS security services and concepts to secure, control, and monitor your environment:

  • VPC and private subnets to perform secure API calls to other AWS services and restrict internet access for downloading packages.
  • S3 bucket policies that restrict access to specific VPC endpoints.
  • Encryption of ML model artifacts and other system artifacts that are either in transit or at rest. Requests to the SageMaker API and console are made over a Secure Sockets Layer (SSL) connection.
  • Restricted IAM roles and policies for SageMaker runs and notebook access based on resource tags and project ID.
  • Restricted access to Amazon public services, such as Amazon Elastic Container Registry (Amazon ECR) to VPC endpoints only.

For a reference deployment architecture and ready-to-use deployable constructs for your environment, see Amazon SageMaker with Guardrails on AWS.

Conclusion

In this post, we showed how you can secure, log, and monitor internet ingress and egress traffic in Studio notebooks for your sensitive ML workloads using managed Network Firewall. You can use the provided CloudFormation templates to automate SageMaker deployment as part of your Infrastructure as Code (IaC) strategy.

For more information about other possibilities to secure your SageMaker deployments and ML workloads, see Building secure machine learning environments with Amazon SageMaker.


About the Author

Yevgeniy Ilyin is a Solutions Architect at AWS. He has over 20 years of experience working at all levels of software development and solutions architecture and has used programming languages from COBOL and Assembler to .NET, Java, and Python. He develops and codes cloud native solutions with a focus on big data, analytics, and data engineering.

Read More

Perform medical transcription analysis in real-time with AWS AI services and Twilio Media Streams

Medical providers often need to analyze and dictate patient phone conversations, doctors' notes, clinical trial reports, and patient health records. By automating transcription, providers can quickly and accurately capture details such as medical conditions, medication, dosage, strength, and frequency.

Generic artificial intelligence-based transcription models can be used to transcribe voice to text. However, medical voice data often uses complex medical terms and abbreviations. Transcribing such data needs medical/healthcare-specific machine learning (ML) models. To address this issue, AWS launched Amazon Transcribe Medical, an automatic speech recognition (ASR) service that makes it easy for you to add medical speech-to-text capabilities to your voice-enabled applications.

Additionally, Amazon Comprehend Medical is a HIPAA-eligible service that helps providers extract information from unstructured medical text accurately and quickly. To transcribe voice in real time, providers need access to raw audio from the call while in-progress. Twilio, an AWS partner, offers real-time telephone voice integration.

In this post, we show you how to integrate Twilio Media Streams with Amazon Transcribe Medical and Amazon Comprehend Medical to transcribe and analyze data from phone calls. For non-healthcare industries, you can use this same solution with Amazon Transcribe and Amazon Comprehend.

Twilio Media Streams works in the context of a traditional Twilio voice application, like an Interactive Voice Response (IVR), that serves customers directly, as well as a contact center, like Twilio Flex, where agents are serving consumers. You have discrete control over your voice data within your contact center to build the experience your customers prefer.

Amazon Transcribe Medical is an ML service that makes it easy to quickly create accurate transcriptions between patients and physicians. Amazon Comprehend Medical is a natural language processing (NLP) service that makes it easy to use ML to extract relevant medical information from unstructured text. You can quickly and accurately gather information (such as medical condition, medication, dosage, strength, and frequency), from a variety of sources (like doctors’ notes, clinical trial reports, and patient health records). Amazon Comprehend Medical can also link the detected information to medical ontologies such as ICD-10-CM or RxNorm so downstream healthcare applications can use it easily.
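For example, once a transcript segment is available as text, extracting entities and ICD-10-CM links takes a couple of boto3 calls; the sample sentence below is illustrative.

import boto3

cm = boto3.client("comprehendmedical")
text = "Patient reports taking metformin 500 mg twice daily for type 2 diabetes."

# Detect medications, dosage, frequency, and medical conditions
entities = cm.detect_entities_v2(Text=text)
for entity in entities["Entities"]:
    print(entity["Category"], entity["Type"], entity["Text"], round(entity["Score"], 2))

# Link detected conditions to ICD-10-CM codes
icd10 = cm.infer_icd10_cm(Text=text)
for entity in icd10["Entities"]:
    top = entity["ICD10CMConcepts"][0]
    print(entity["Text"], "->", top["Code"], top["Description"])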

The following diagram illustrates how Amazon Comprehend Medical supports medical named entity and relationship extractions.

Amazon Transcribe Medical, Amazon Comprehend Medical, and Twilio Media Streams are all managed platforms. This means that data scientists and healthcare IT teams don't need to build services from the ground up. Voice integration is provided by Twilio and AWS ML service APIs, and only requires a simple plug-and-play with AWS and Twilio services to build the end-to-end workflow.

Solution overview

Our solution uses Twilio Media Streams to provide telephony service to the customer. This service provides a telephone number and backend to media services to integrate it with REST API-based web applications. In this solution, we build a Node.js web app and deploy it with AWS Amplify. Amplify helps front-end web and mobile developers build secure, scalable, full stack applications. The web app interfaces with Twilio Media Streams to receive phone calls in voice format, and uses Amazon Transcribe Medical to convert voice to text. Upon receiving the transcription, the application interfaces with Amazon Comprehend Medical to extract medical terms and insights from the transcription. The insights are displayed on the web app and stored in an Amazon DynamoDB table for further analysis. The solution also uses Amazon Simple Storage Service (Amazon S3) and an AWS Cloud9 environment.

The following diagram illustrates the solution architecture.

To implement the solution, we complete the following high-level steps:

  1. Create a trial Twilio account.
  2. Create an AWS Identity and Access Management (IAM) user.
  3. Create an AWS Cloud9 integrated development environment (IDE).
  4. Clone the GitHub repo.
  5. Create a secured HTTPS tunnel using ngrok and set up Twilio phone number’s voice configuration.
  6. Run the application.

Create a trial Twilio account

Before getting started, make sure to sign up for a trial Twilio account (https://www.twilio.com/try-twilio), if you don’t already have one.

Create an IAM user

To create an IAM user, complete the following steps:

  1. On the IAM console, under Access management, choose Users.
  2. Choose Add user.
  3. On the Set user details page, for User name, enter a name.
  4. For Access type, select Programmatic access.
  5. Choose Next: Permissions.

  6. On the Set permissions page, choose Attach existing policies directly.
  7. Select the following AWS managed policies: AmazonTranscribeFullAccess, ComprehendMedicalFullAccess, AmazonDynamoDBFullAccess, and AmazonS3FullAccess.
  8. Choose Next: Tags.
  9. Skip adding tags and choose Next: Review.
  10. Review the IAM user details and attached policies, and choose Create user.
  11. On the next page, copy the access key ID and secret access key to your clipboard or download the CSV file.

We use these credentials for testing the Node.js application.
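If you prefer the command line, the following AWS CLI sketch creates an equivalent user; the user name transcribe-medical-demo-user is only an example:

# Create the IAM user
aws iam create-user --user-name transcribe-medical-demo-user

# Attach the four managed policies the application needs
for policy in AmazonTranscribeFullAccess ComprehendMedicalFullAccess \
    AmazonDynamoDBFullAccess AmazonS3FullAccess; do
    aws iam attach-user-policy \
        --user-name transcribe-medical-demo-user \
        --policy-arn arn:aws:iam::aws:policy/$policy
done

# Generate the access key ID and secret access key
aws iam create-access-key --user-name transcribe-medical-demo-user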

Create an S3 bucket

To create your S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a name for the bucket.
  3. For Block Public Access settings for this bucket, select Block all public access.
  4. Review the settings and choose Create bucket.
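You can also create the bucket with the AWS CLI; the bucket name below is only an example and must be globally unique:

# Create the bucket (outside us-east-1, add --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket --bucket my-twilio-transcribe-demo-bucket

# Block all public access on the bucket
aws s3api put-public-access-block \
    --bucket my-twilio-transcribe-demo-bucket \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true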

Create a DynamoDB table

To create your DynamoDB table, complete the following steps:

  1. On the DynamoDB console, choose Create table.
  2. For Table name, enter a name for the table.
  3. For Primary key, enter ROWID.
  4. Review the table settings and choose Create.
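The equivalent AWS CLI command follows; the table name is an example, and we assume ROWID is a string partition key:

# Create the table with ROWID as the partition key (assumed to be a string)
aws dynamodb create-table \
    --table-name twilio-medical-transcribe-demo \
    --attribute-definitions AttributeName=ROWID,AttributeType=S \
    --key-schema AttributeName=ROWID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST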

Create an AWS Cloud9 environment

To create your AWS Cloud9 environment, complete the following steps:

  1. On the AWS Cloud9 console, choose Environments.
  2. Choose Create environment.
  3. For Name, enter a name for the environment.
  4. For Description, enter an optional description.
  5. Choose Next step.
  6. On the Configure settings page, select Ubuntu Server 18.04 LTS for Platform and leave the other settings at their defaults.
  7. Review the settings and choose Create environment.

The AWS Cloud9 IDE opens in a new browser tab; you may have to wait a few minutes for the environment creation to complete.

Clone the GitHub repo

In the AWS Cloud9 environment, close the Welcome and AWS Toolkit – QuickStart tabs. To clone the GitHub repository, enter the following commands in the bash terminal:

git clone https://github.com/aws-samples/amazon-transcribe-comprehend-medical-twilio

cd amazon-transcribe-comprehend-medical-twilio && npm install --silent

Edit the config.json file in the project directory, replacing the values with the names of your S3 bucket and DynamoDB table.

Set up ngrok and the Twilio phone number

Before we start the Node.js application, we need to start a secured HTTPS tunnel using ngrok and set up the Twilio phone number’s voice configuration.

  1. On the terminal tab bar, choose the + icon.
  2. Choose New Terminal.
  3. In the new terminal, install ngrok:
    sudo snap install ngrok
  4. After ngrok is installed, run the following command to expose the local Express Node.js server to the internet:
    ngrok http 8080
  5. Copy the public HTTPS URL.

You use this URL for the Twilio phone number’s voice configuration.

  1. Sign in to your Twilio account.
  2. On the dashboard, choose the icon to open the Settings menu.
  3. Choose Phone Numbers.
  4. On the Phone Numbers page, choose your Twilio phone number.
  5. In the Voice section, for A Call Comes In, choose Webhook.
  6. Enter the ngrok tunnel URL followed by /twiml (for example, https://<your-subdomain>.ngrok.io/twiml).
  7. Save the configuration.

Run the application

Now run the application, which ties together Twilio Media Streams, Amazon Transcribe Medical, and Amazon Comprehend Medical, by entering the following command:

npm start

We can preview the application in AWS Cloud9. In the environment, on the Preview menu, choose Preview Running Application.

You can copy the public URL to view the application in another browser tab.

Enter the IAM user’s access key ID and secret access key, and your Twilio account SID, auth token, and phone number.

Demonstration

In this section, we use two sample recordings to demonstrate real-time audio transcription with Twilio Media Streams.

After you enter your IAM and Twilio credentials, choose Submit Credentials.

The following screenshot shows the transcription for our first audio file, sample-1.mp4.

The following screenshot shows the transcription for our second file, sample-3.mp4.

This application uses Amazon Transcribe Medical to transcribe media content in real time, and stores the output in Amazon S3 for further analysis. The application then uses Amazon Comprehend Medical to detect the following entities:

  • ANATOMY – Detects references to the parts of the body or body systems and the locations of those parts or systems
  • MEDICAL_CONDITION – Detects the signs, symptoms, and diagnosis of medical conditions
  • MEDICATION – Detects medication and dosage information for the patient
  • PROTECTED_HEALTH_INFORMATION – Detects the patient’s personal information
  • TEST_TREATMENT_PROCEDURE – Detects the procedures that are used to determine a medical condition
  • TIME_EXPRESSION – Detects entities related to time when they are associated with a detected entity

These entities are stored in the DynamoDB table. Healthcare providers can use this data to create patient diagnoses and treatment plans.
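To spot-check what the application has written, you can scan a few items from the table with the AWS CLI (the table name below is the example used earlier):

# Return a handful of stored records for inspection
aws dynamodb scan --table-name twilio-medical-transcribe-demo --max-items 5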

You can further analyze this data through services such as Amazon Elasticsearch Service (Amazon ES) and Amazon Kendra.

Clean up your resources

Many of the AWS services used in this solution are covered by the AWS Free Tier. To avoid incurring additional charges when you’re finished, clean up the following resources:

  • AWS Cloud9 environment
  • S3 bucket
  • DynamoDB table
  • IAM user
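If you created these resources with the AWS CLI commands shown earlier, the following sketch removes them; replace the example names and IDs with your own:

# Empty and delete the S3 bucket
aws s3 rb s3://my-twilio-transcribe-demo-bucket --force

# Delete the DynamoDB table
aws dynamodb delete-table --table-name twilio-medical-transcribe-demo

# Delete the AWS Cloud9 environment (list environment IDs with: aws cloud9 list-environments)
aws cloud9 delete-environment --environment-id <environment-id>

# Remove the IAM user's access key, detach its policies, and then delete the user
aws iam delete-access-key --user-name transcribe-medical-demo-user --access-key-id <access-key-id>
for policy in AmazonTranscribeFullAccess ComprehendMedicalFullAccess \
    AmazonDynamoDBFullAccess AmazonS3FullAccess; do
    aws iam detach-user-policy \
        --user-name transcribe-medical-demo-user \
        --policy-arn arn:aws:iam::aws:policy/$policy
done
aws iam delete-user --user-name transcribe-medical-demo-user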

Conclusion

In this post, we showed how to integrate Twilio Media Streams with Amazon Transcribe Medical and Amazon Comprehend Medical to transcribe and analyze medical data from audio files. You can also use this solution in non-healthcare industries to transcribe information from audio.

We invite you to check out the code in the GitHub repo and try out the solution, and even expand on the data analysis with Amazon ES or Amazon Kendra.


About the Author

Mahendra Bairagi is a Principal Machine Learning Prototyping Architect at Amazon Web Services. He helps customers build machine learning solutions on AWS. He has extensive experience with ML, robotics, IoT, and analytics services. Prior to joining Amazon Web Services, he had a long tenure as an entrepreneur, enterprise architect, and software developer.

Jay Park is a Prototyping Solutions Architect for AWS. Jay is focused on helping AWS customers speed their adoption of cloud-native workloads through rapid prototyping.
