AWS DeepRacer device software now open source

AWS DeepRacer is the fastest way to get started with machine learning (ML). You can train reinforcement learning (RL) models by using a 1/18th scale autonomous vehicle in a cloud-based virtual simulator and compete for prizes and glory in the global AWS DeepRacer League. Today, we’re expanding AWS DeepRacer’s ability to provide fun, hands-on learning by open-sourcing the AWS DeepRacer device software.

Why open source

The AWS DeepRacer virtual and in-person leagues have been a hit, but now developers want to go beyond league racing with their car. Because the AWS DeepRacer is an Ubuntu-based computer on wheels powered by the Robot Operating System (ROS) we are able to open source the code, making it straightforward for a developer with basic Linux coding skills to prototype new and interesting uses for their car. Now that the AWS DeepRacer device software is openly available, anyone with the car and an idea can make new uses for their device a reality.

We’ve compiled 6 sample projects from the AWS DeepRacer team and members of the global AWS DeepRacer community to help you get started exploring the possibilities that open source provides. As developers share new projects using #deepracerproject, we will highlight our favorites on the AWS DeepRacer robotics projects page. Whether you’re mounting a Nerf cannon on the car with the DeepBlaster project, creating visualizations of your home or office with the Mapping project, or coming up with new ways of racing your friends and colleagues with the DeepDriver project, you can do all that and more with the open source code and sample projects. Documentation is available in GitHub and open for collaboration with thousands of community members in the AWS DeepRacer Slack channel. The only limit to what you can do with AWS DeepRacer is your imagination (and, well, the laws of physics).

Let the experiments begin

With the open-sourcing of the AWS DeepRacer device code, you can quickly and easily change the default behavior of your currently track-obsessed race car. Want to block other cars from overtaking it by deploying countermeasures? Want to deploy your own custom algorithm to make the car go faster from point A to B? You just need to dream it and code it. We can’t wait to see the ideas that you come up with, from new racing formats to new uses for AWS DeepRacer.

Starting today, you can choose from six projects (Follow the leader, Mapping, and Off Road created by AWS, and RoboCat, DeepBlaster, and DeepDriver created by the open source community) or create your own. You can get started with the Follow the Leader sample project, which trains the car to detect and follow an object. It’s the quickest project to build and run, and in the next section we’ll demonstrate how easy it is to modify the default the behavior of your AWS DeepRacer car. To complete this setup, upgrade to the latest software version and access the car via SSH.

Download the Follow the Leader project

Connect to the car using SSH, switch to the root user, and create a working directory. Then clone the Follow the Leader GitHub repository:

sudo su
mkdir -p ~/deepracer_ws
cd ~/deepracer_ws
git clone https://github.com/aws-deepracer/aws-deepracer-follow-the-leader-sample-project.git

The process to fully clone the project repository to your car can take a few minutes (depending on the speed of your internet connection). The Follow the Leader project contains several installation scripts that help shortcut the process to get you up and running faster. You can also complete the next few steps manually if you’re more comfortable with running shell-based commands or want to learn more about the process using the links to the relevant documentation for each stage.

Download and convert the object detection model

First, we need to download and convert the object detection model. To do this, we run the script that came in the Follow the Leader repository:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash install_object_detection_model.sh

The installer script downloads and optimizes the model before copying the optimized artifacts to the model location. This process takes approximately 3–4 minutes to complete.

You can complete this stage manually using the detailed instructions to download and convert the object detection model.

Initialize rosdep if it’s not initialized previously

Rosdep helps to install the dependency packages. Initialize the rosdep if it’s not done before on the device:

sudo rosdep init
sudo rosdep update

Build the Follow the Leader packages

Next, we fetch the package dependencies needed for the project and build them:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash build_and_install_ftl_application.sh

When successful, you should see a screen similar to the following:

The script downloads and installs the required package dependencies and builds the packages. This process can take approximately 8–10 minutes to complete.

You can also complete this stage manually by following the steps 1–10 in “Download and Building” in the Follow the Leader README.md. The install script does the same steps (just saves you some typing).

Launch the Follow the Leader application

Now we run the Follow the Leader application:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash run_ftl_application.sh

Enable Follow the Leader mode

Finally, we need to open another SSH session to the car to enable Follow the Leader mode using the command line interface (CLI):

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash enable_ftl_mode.sh

Now you, or a willing volunteer (or object), can move around and watch the car begin to follow! How cool is that?

Share your results

Congratulations! You completed your first sample project. Share your experience with friends and family on social media with the tag #deepracerproject so we can see what you’re up to. As the community invents new projects for AWS DeepRacer, we’ll be adding them to the AWS DeepRacer GitHub Organization as well as featuring them in future blog posts so that everyone can get inspired. Purchase an AWS DeepRacer car today to start experimenting with your first AWS DeepRacer robotics project today! We are offering a 25% discount on the AWS DeepRacer ($100 off) and AWS DeepRacer Evo ($150 off) till May 27th, 2021.


About the Author

David Smith is a Sr. Solutions Architect for AWS DeepRacer. He is passionate about AWS DeepRacer, technology as an enabler and learning. Outside of work he’s into Formula 1, flying (and crashing) drones, 3d printing, running (Parkrun), tinkering with code and spending time with the family.

Read More

Conventions in Multi-Agent Collaboration

Humans are good at collaborating with each other — e.g., playing team sports — in part because we adapt to our teammates over multiple repeated interactions. Through these interactions, teammates build a shared understanding of the collaboration strategy, which we refer to as conventions. For example, when playing basketball, teams formulate conventions for signaling when to pass the ball, which offensive formation to take, which players on the opposing team to guard, and more. This ability to build conventions is critical to a team’s success.

The notion of conventions as applied to collaborative tasks has been well-studied, especially in the linguistics literature 12, where people have been shown to reduce their speech length when referring to the same objects with the same partners over repeated interactions. Even outside of linguistics, there are many cultural conventions (e.g., in the U.S., driving on the right side of an unmarked road) or personal conventions (e.g., personalized handshakes with friends) that we use.

It would be nice if we could apply the idea of conventions to human-AI collaboration, for example through assistive robotics. But before deploying robots and artificial agents into people’s homes (for cooking, cleaning, assembling furniture), we must be sure they can identify and learn such conventions in order to collaborate seamlessly with human partners.

In particular, collaboration in multi-agent tasks (e.g., team basketball) often involves two types of skills:

  • Task-specific: fundamental skills relevant to the task (e.g., dribbling/shooting basketball)
  • Partner-specific: shared strategy developed with the partner (e.g., when to pass the ball)

Task-specific skills are useful no matter who the partner is, such as learning the rules of the game. Partner-specific skills, on the other hand, refer to the shared strategy developed with the partner, i.e. conventions.

Breaking Symmetry

A challenge in collaborating with others is that of symmetry. When there is only one optimal strategy, players can unambiguously go for that strategy. However, when there exist many strategies that are optimal, players might go for different ones, resulting in a combined joint action that is suboptimal. Breaking symmetry, thus, is when players develop conventions as a mechanism to break ties between a set of equally optimal strategies.

For example, in the image above, Toronto Raptors point guard Kyle Lowry is giving a signal to coordinate an offensive play. To us, this signal could refer to almost anything, and there’s no way a priori to break symmetry between any of its possible meanings (e.g., does he want pick-and-roll, isolation, or something else?). Fortunately, his teammates can understand these signals based on conventions they’ve built through practice. As we can see, conventions are important in multi-agent collaboration since they solve the problem of breaking symmetry.

More concretely, consider a game of friendly Rock-Paper-Scissors, where the goal is for two friends to throw the same hand. The joint action space of the two friends is , and the joint actions are all optimal.

This problem seems trivial, but the two friends must make their actions independently without communicating, and without prior knowledge of the other person’s strategy. That is, their joint policy is factored: . Even though any of the 3 optimal joint actions are good, on their first attempt there is no way to know which of the 3 to pick! This is the symmetry breaking problem — there may be many optimal joint actions, but the players must still collectively decide on the same one.

Fortunately, by trial-and-error and building a history of repeated interactions, we can eventually converge on a convention (always pick Rock) with our partners and break symmetry with this shared strategy.

Generalizing to New Partners and New Tasks

So far, we’ve described task-specific skills as important for learning about the fundamentals of the task, and partner-specific skills (i.e. conventions) as important for breaking symmetry. Why make this distinction between the two types of skills? The point is that if we can learn separate representations for tasks and partners, then we can perhaps transfer our knowledge over to new partners and new tasks!

Let’s look at a block placing game that reveals the interplay between task-specific and partner-specific skills. There is a 2×2 grid with a target Goal configuration that a red player (Bob) and a blue player (Alice) have to construct together. Only the red player sees the Goal configuration. The players start with an empty grid, and take turns. On each turn, a player can choose to move/place a block of their own color.

Suppose that Alice and Bob have been playing this game many times, and through trial-and-error have converged on a signaling strategy where on turn 1 Bob always places the red block horizontally opposite the blue block location. Below we see a possible progression of their game (with 4 turns, from left to right). On turn 1 Bob places the red block at the top-right corner. On turn 2 Alice does nothing. On turn 3 Bob places the red block in the correct bottom-right corner. On turn 4, based on signaling conventions that they’ve established, Alice correctly deduces that the blue block should be at the top-left corner.

From Bob’s perspective, the task-specific skill is moving the red block to the bottom-left to match the Goal configuration, whereas the partner-specific skill is signaling the correct blue block location to Alice using his action on turn 1. With this example, we can see how task-specific skills (placing the red block correctly) can be transferred to new partners, and how partner-specific signals can be transferred to tasks with similar symmetries (e.g., if the rules change such that the red block must end up at one of the positions that is empty in the Goal configuration, Bob can still re-use the same conventions to signal to Alice the location of the blue block).

Building Convention-Aware Agents

Given the importance of building conventions, how can we build convention-aware artificial agents? In this work, we design an artificial agent to work well with new tasks and new partners based on the above intuition of separating task-specific and partner-specific representations. We consider two-player collaborative tasks with no external communication, where our agent plays as one of the players, and knows the identity of the task and the partner (e.g., a cooking robot might know if it is working in the same kitchen or with the same person that it has worked with before).

Modular Policy

We use a modular architecture that learns a task module for each task and a partner module for each partner. Given the task and the partner we are playing with, we use the corresponding modules to parameterize our policy. In our design, the task module first processes the input (the state observations of the task), and outputs a 1) latent representation and 2) an action distribution . Then the partner modules takes as input and predicts another action distribution , and the final policy is given by the multiple of the two actions distributions .

The intuition behind this sequential setup is that the task module action distribution assigns high probability to all the actions that are potentially good (roughly speaking, there exists a complementary partner action such that is good). If there is only one such action, then may be very sharp; if all actions are good then may be uniform. Then, the partner module action distribution outputs how to break the tie between the equally good actions, which we can interpret as the convention built with this partner.

Finally, to prevent the task module from being uninformative and pushing all the hard work to the partner module, we add a regularization term (Eq 1) so that the task module output distribution should match the marginal of the different partner module output distributions (that is, what we should do if we don’t know which partner we’re playing with).

(1)

When given a new partner, we train a new partner module with the same task module — this enables us to transfer over the marginal distribution of the good actions to make, and only worry about learning the tie-breaking preferences of the new partner.

When given a new task (with the same states/actions/dynamics, but different rewards), we train a new task module with the same partner module. This enables us to learn the complexities of the new task, while also recalling the preferences of the partner in terms of breaking ties between equally optimal actions.

Experiments

We ran experiments on multi-armed bandits, the block placing task described above, and a simplified 2-player version of Hanabi.

Multi-armed Bandit Block Placing Hanabi

We won’t go into details about the Multi-armed Bandit and the Hanabi tasks in this blogpost, but check out our paper for more details and results!

Here we show some plots from the block placing task for both transferring to new partners and new tasks. The max reward for the block placing task is 20. For transferring to new partners, we first train a single task module by playing with a pool of 6 partners, and then test with 6 new partners. Throughout, we use the same task module but use a different partner module for each partner. We compare with baselines BaselineAgg, which aggregates the gradients from all the training partners during training, and First-Order Model-Agnostic Meta-Learning (FOMAML). In contrast to these baselines, our modular setup allows us to reinitialize only the partner-specific representations while re-using the task-specific representations, and we see that this enables faster adaptation to new partners.

For transferring to a new task, we tweak the rule of the game such that the red player (Bob) must place the red block at one of the positions that is empty (white) in the Goal configuration. We train a task module for this tweaked task, and test if our modular architecture can directly generalize to the new task rules while remembering signalling conventions with old partners in a zero-shot manner. We compare with a baseline method that is similarly modular, but does not use a marginal regularization (see Equation 1 above) to push the task module to learn the right representations. Our results suggest that the marginal regularization term is important for transferring to new tasks.

Takeaways

We studied the role of task-specific skills and partner-specific skills (i.e., conventions) in multi-agent collaborative tasks. We explored the use of a modular architecture to train agents that can separate task-specific and partner-specific representations. With the modular setup, we are able to piece together new combinations of modules to adapt more quickly to novel combinations of tasks and partners!

For more details check out our ICLR 2021 paper “On the Critical Role of Conventions in Adaptive Human-AI Collaboration”.

Paper

Code

Acknowledgments

Thanks to Sidd Karamcheti and Jacob Schreiber for their helpful comments on this blogpost!

  1. Robert D Hawkins, Mike Frank, and Noah D Goodman. Convention-formation in iterated reference games. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society, 2017. 

  2. Robert D. Hawkins, Michael Franke, Michael C. Frank, Kenny Smith, Thomas L. Griffiths, Noah D. Goodman. From partners to populations: A hierarchical Bayesian account of coordination and convention. 2021. 

Read More

Monitor and Manage Anomaly Detection Models on a fleet of Wind Turbines with Amazon SageMaker Edge Manager

In industrial IoT, running machine learning (ML) models on edge devices is necessary for many use cases, such as predictive maintenance, quality improvement, real-time monitoring, process optimization, and security. The energy industry, for instance, invests heavily in ML to automate power delivery, monitor consumption, optimize efficiency, and extend the lifetime of their equipment.

Wind energy is one of the most popular renewable energy sources. According to the Global Wind Energy Council, 22,893 wind turbines were installed globally in 2019, produced from 33 suppliers and accounting for over 63 GW of wind power capacity. With such scale, energy companies need an efficient platform to manage and maintain their wind turbine fleets, and the ML models running on the devices. A commercial wind turbine costs around $3–4 million. If a turbine is out of service, it costs $800–1,600 per day and results in a total loss of 7.5 megawatts, which is enough energy to power approximately 2,500 homes.

A wind turbine is a complex piece of engineering and consists of many sensors that can be used by a monitoring mechanism to capture data such as vibration, temperature, wind speed, and air humidity. You could train an ML model with this data, deploy it to an edge device connected to the turbine’s sensors, and predict anomalies in real time at the edge. It would reduce the operational cost of your fleet of turbines. But imagine the effort to maintain this solution on a fleet of thousands or millions of devices. How do you operate, secure, deploy, run, and monitor ML models on a fleet of devices at the edge?

Amazon SageMaker Edge Manager can help you to answer this question. The service allows you to optimize, secure, monitor, and maintain ML models on fleets of smart cameras, robots, personal computers, industrial equipment, mobile devices, and more. With Edge Manager, you can manage the lifecycle of each ML model on each device in your device fleets for up to thousands or millions of devices. The service provides a software agent that runs on edge devices and a management interface on the AWS Management Console.

In this post, we show how to use Edge Manager to create a robust end-to-end solution that manages the lifecycle of ML models deployed to a wind turbine fleet. But instead of using real wind turbines, you learn how to build your own fleet of mini 3D printed wind turbines. This is a DIY open-source, open-hardware project created to demonstrate how to build an ML at the edge solution with Amazon SageMaker. You can use to it as a platform to learn, experiment, and get inspired.

The next sections cover the following topics:

  • The specifications of the wind turbine farm
  • How to configure each Jetson Nano
  • How to build an anomaly detection model using SageMaker
  • How to run your own mini wind turbine farm

The wind turbine farm

The wind turbine farm created for this project has five mini 3D printed wind turbines connected to five distinct Jetson Nanos via USB. The Jetson Nanos are connected to the internet through Ethernet cables plugged to a cable modem. A fan, positioned in front of the farm, produces the wind to simulate an outdoor condition. The following image shows how the wind farm is organized.

The mini wind turbine

The mini wind turbine of this project is a mechanical device integrated with a microcontroller (Arduino) and some sensors. It was modeled using FreeCAD, an open-source tool for designing industrial parts. These parts were then 3D printed using PETG (plastic filament type) and assembled with the electronics components. Its base is static, which means that the turbine doesn’t align with the wind direction by itself. This restriction was important to simplify the project.

Each turbine has one voltage generator (small motor) and seven different sensors:

  • Vibration (MPU6050: 6 axis accelerometer/gyroscope)
  • Infrared rotation encoder (rotations per second)
  • Gearbox temperature (MPU6050)
  • Ambient temperature (BME680)
  • Atmospheric pressure (BME680)
  • Air humidity (BME680)
  • Air quality (BME680)

An Arduini Mini Pro is responsible for interfacing with these sensors and collecting data from them. This data is streamed through the serial pins (TX, RX). An FTDI device that converts this serial signal to USB is the bridge between the Arduino and the Jetson Nano. A Python application that runs on Jetson Nano receives the raw data from the sensors through this bridge.

A micro servo was modified and transformed into a voltage generator. Its internal gearbox increases the generator (motor) speed by five times to produce a (low) voltage between 0–3.3v. This generator is also connected to the Arduino through an analog input pin. This information is also sent with the sensor’s readings.

The frequency at which the data is collected depends on the sensor. All the signals from BME650 are collected each 150 milliseconds, the rotation encoder each 1 second, and the voltage generator and the vibration sensor each 50 milliseconds.

If you want to know more about these technical details and learn how to build your own mini wind turbine, see the GitHub repository.

The edge device

Each Jetson Nano has a built-in GPU with 128-core NVIDIA Maxwell™ and a Quad-core ARM® A57 CPU running at 1.43 GHz. This hardware is enough to run a Python application that collects and formats the data from the sensors of the turbine and then calls the Edge Manager agent API to get the predictions. This application compares the prediction with a threshold to check for anomalies in the data. The model is invoked in real time.

When SageMaker Neo compiles the ML model for Jetson Nano, a runtime (DLR) optimized for this target device is included in the deployment package. This runtime detects automatically that it’s running on a Jetson Nano and loads the model directly into the device’s GPU for maximum performance.

The Edge Manager agent is also distributed as a Linux (arm64) application that can be run as a background process (daemon) on your Jetson Nano. It uses the runtime SageMaker Neo includes in the compilation package to interface with the optimized model and expose it as a well-defined API. This API is integrated with the local application through a low latency protocol (grpc + unix socket).

The cloud services

Now that you know some details about the physical hardware used to develop the wind turbine farm, it’s time to see which AWS services support the solution on the cloud side. A minimal, standalone setup to get a model deployed and running on the Edge Manager agent requires only SageMaker and nothing more. However, other services were used in this project with two important features: a mechanism for over-the-air (OTA) deployment and a dashboard for monitoring the anomalies in near-real time.

In summary, the components required for this project are:

  • A device fleet (Edge Manager), which organizes and controls one or more registered devices through the agent (running on each device)
  • One IoT thing per device and IoT thing group, which is used by the OTA mechanism to communicate with the devices via MQTT
  • AWS IoT rules, and an AWS Lambda function to get and filter application logs and ingest them into Amazon Elasticsearch Service (Amazon ES)
  • A Lambda function to parse the model metrics captured by agent in ingest them into Amazon ES
  • An Elasticsearch server with Kibana, which has dashboards for monitoring the anomalies (optional)
  • SageMaker to build, compile, and package the ML model

The following diagram illustrates this architecture.

Putting everything together

Now that we have all the components of our wind turbine farm, it’s time to understand the steps we need to take to integrate all these moving parts, deploy a model to our edge devices, and keep an application running and predicting anomalies in real time.

The following diagram shows all the steps involved in the process.

The solution consists of the following steps:

  1. The data scientist explores the dataset and designs an anomaly detection model (autoencoder) with PyTorch, using SageMaker Studio.
  2. The model is trained with a SageMaker training job.
  3. With Neo, the model is optimized (compiled) to Jetson Nano.
  4. Edge Manager creates a deployment package with the compiled model.
  5. The data scientist creates an IoT job that sends a notification of the new model available to the edge devices.
  6. The application running on Jetson Nano performs the following:
    1. Receives this notification and downloads the model package from the Amazon Simple Storage Service (Amazon S3) bucket.
    2. Unpacks the model and loads it using the Edge Manager agent API (LoadModel).
    3. Reads the sensors from the wind turbine, prepares the data, invokes the ML model, and captures some model metrics using the Edge Manager agent API.
    4. Compares the prediction with a baseline to detect potential anomalies.
    5. Sends the raw sensor data to an AWS IoT topic.
  7. Through a rule, AWS IoT reads the app logs topic and exports the data to Amazon ES.
  8. A Lambda function captures the model metrics (mean average error) exported by the agent and ingests the data into Amazon ES.
  9. The operator uses a Kibana dashboard to check for any anomalies.

Configure your edge device

The Edge Manager agent uses certificates provided by AWS IoT Core to authenticate and call other AWS services. That way you need to create an IoT thing first and then an edge device fleet. But first, you need to prepare some basic resources to support your solution.

Create prerequisite resources

Before getting started, you need to configure AWS Command Line Interface in your workstation first (if necessary) and then to create the following resources:

  • An S3 bucket to store the captured data
  • An AWS Identity and Access Management (IAM) role for your devices
  • An IoT thing to map to your Edge Manager device
  • An IoT policy to control the permissions of the temporary credentials of the edge device
  1. Create a new bucket for the solution.

Each time you call CaptureData in the agent API, it uploads the tensors (input and predictions) into this bucket.

Next, you create your IAM role.

  1. On the IAM console, create a role named WindTurbineFarm so the devices can access resources in your account.
  2. Add permissions to this role to upload files to the S3 bucket you created.
  3. Add the following trusted entities to the role:
    1. amazonaws.com
    2. iot.amazonaws.com
    3. amazonaws.com

Use the following code (provide the name for the S3 bucket, your AWS account, and Region):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<<S3_BUCKET_NAME>>",
                "arn:aws:s3:::<<S3_BUCKET_NAME>>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iot:CreateRoleAlias",
                "iot:DescribeRoleAlias",
                "iot:UpdateRoleAlias",
                "iot:TagResource",
                "iot:ListTagsForResource"
            ],
            "Resource": [
                "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:rolealias/SageMakerEdge*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*SageMaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*Sagemaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*sagemaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/WindTurbineFarm"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:GetDeviceRegistration",
                "sagemaker:SendHeartbeat",
		  "iot:DescribeEndpoint",
		  "s3:ListAllMyBuckets”
            ],
            "Resource": "*",
            "Effect": "Allow"
        }, 
        {
            "Action": [
                "sagemaker:DescribeDevice"
            ],
            "Resource": [
                "arn:aws:sagemaker:<<REGION>>:<<AWS_ACCOUNT_ID>>:device-fleet/windturbinefarm*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iot:Publish",
                "iot:Receive"
            ],
            "Resource": [
                "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/wind-turbine/*"
            ],
            "Effect": "Allow"
        }
    ]
}

You’re now ready to create your IoT thing, which you later map to your Edge Manager device.

  1. On the AWS IoT Core console, under Manage, choose Things
  2. Choose Create.
  3. Name your device (for this post, edge-device-0).
  4. Create a new group or choose an existing group (for this post, WindTurbineFarm).
  5. Create a certificate.
  6. Download the certificates, including the root CA.
  7. Activate the certificate.

You now create your policy, which controls the permissions of the temporary credentials of the edge device.

  1. On the AWS IoT Core console, under Secure, choose Policies.
  2. Choose Create.
  3. Name the policy (for this post, WindTurbine).
  4. Choose Advanced Mode.
  5. Enter the following policy, providing your AWS account and Region:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iot:Connect"
      ],
      "Resource": "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:client/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Publish",
        "iot:Receive"
      ],
      "Resource": [
	"arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/wind-turbine/*",
	"arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/$aws/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Subscribe"
      ],
      "Resource": [
	"arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/wind-turbine/*",
	"arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/$aws/*",
          "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/$aws/*"
    ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:UpdateThingShadow"
      ],
      "Resource": [	
	"arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/wind-turbine/*",
          "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:thing/edge-device-*"

      ]
    },
    {
      "Effect": "Allow",
      "Action": "iot:AssumeRoleWithCertificate",
      "Resource": "arn:aws:iot: <<REGION>>:<<AWS_ACCOUNT_ID>>:rolealias/SageMakerEdge-WindTurbineFarm"
    }
  ]
}
  1. Choose Create.

Lastly, you attach the policy to the certificate.

  1. On the AWS IoT Core console, under Secure, choose Certificates.
  2. Select the certificate you created.
  3. On the Actions menu, choose Attach policy.
  4. Select the policy WindTurbine.
  5. Choose Attach.

Now your IoT thing is ready to be linked to an edge device. Repeat these steps (except for creating the policy) for each additional device in your device fleet. For a production environment with hundreds or thousands of devices, you just apply a different approach, using automated scripts and parameter files to provision all the IoT things.

Create the edge fleet

To create your edge fleet, complete the following steps:

  1. On the SageMaker console, under Edge Inference, choose Edge device fleets.
  2. Choose Create device fleet.

  1. Enter a name for the device (for this post, WindTurbineFarm).
  2. Enter the ARN of the IAM role you used in the previous steps (arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/WindTurbineFarm).
  3. Enter the output S3 bucket URI (s3://<<NAME_OF_YOUR_BUCKET>>/wind_turbine_data/).
  4. Choose Submit.

Now you need to add a new device to the fleet.

  1. On the SageMaker console, under Edge Inference, choose Edge devices.
  2. Choose Register devices.

  1. For Device Properties, enter the name of the device fleet you created (WindTurbineFarm).
  2. Choose Next.
  3. For Device name, enter any unique name for your device (for this post, we use the same name as our IoT thing, edge-device-wind-turbine-00000000000).
  4. For IoT name, enter the name of the thing you created earlier (edge-device-0).
  5. Choose Submit.

Repeat the registering process for all your other devices. Now you can SSH to your Jetson Nano and complete the configuration of your device.

Prepare the edge device

Before you start configuring your Jetson Nano, you need to install JetPack 4.4.1 in your Nano. This is the version you use to build, run, and test this demo.

The model preparation process for your target device is very sensitive in relation to the versions of the libraries installed in your device. For instance, because the target device is a Jetson Nano, Neo optimizes the model and runtime to a given version of the TensorRT and CUDA. The runtime (libdlr.so) is physically linked to the versions you specify in the compilation job. This means that if you compile your model using Neo for JetPack 4.4.1, it doesn’t work with JetPack 3.x. and vice versa.

  1. With JetPack 4.4.1 running on your Jetson Nano, you can start configuring your device with the following commands:
echo "export TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647" >> ~/.bashrc
echo "export SM_EDGE_AGENT_HOME=/home/${USER}/agent" >> ~/.bashrc

# Also export the variables for the current session
export TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647
export SM_EDGE_AGENT_HOME=/home/${USER}/agent


sudo apt install -y protobuf-compiler python3-serial 
sudo apt install -y python3-pip python3-joblib python3-boto3 libssl-dev
sudo apt install -y curl
sudo pip3 install grpcio-tools grpcio PyWavelets paho-mqtt
  1. Download the Linux ARMv8 version of the Edge Manager agent.
  2. Copy the package to your Jetson Nano (scp). Create a folder for the agent and unpack the package in your home directory:
mkdir -p ~/agent/certificates/iot
mkdir -p ~/agent/certificates/root
tar -xzvf <<agent_package>>.tgz -C ~/agent
  1. Copy the AWS IoT Core certificates you provisioned for your thing in the previous section to the directory ~/agent/certificates/iot in your Jetson Nano.

You should see the following files in this directory:

  • pem – CA root
  • <<CERT_PREFIX>>-public.pem.key – Public key
  • <<CERT_PREFIX>>-private.pem.key – Private key
  • <<CERT_PREFIX>>-certificate.pem.crt – Certificate
  1. Get the root certificate used to sign the deployment package created by Edge Manager. The agent uses this to validate the model.
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Certificates/<<AWS_REGION>>/<<AWS_REGION>>.pem .
  1. Copy this certificate to the directory ~/agent/certificates/root in your Jetson Nano.

Next, you create the Edge Manager agent configuration file.

  1. Open an empty file named ~/agent/sagemaker_edge_config.json and enter the following code:
{
    "sagemaker_edge_core_device_uuid": "<<SAGEMAKER_EDGE_DEVICE_NAME>>",
    "sagemaker_edge_core_device_fleet_name": "WindTurbineFarm",
    "sagemaker_edge_core_capture_data_buffer_size": 30,
    "sagemaker_edge_core_capture_data_batch_size": 10,
    "sagemaker_edge_core_capture_data_push_period_seconds": 4,
    "sagemaker_edge_core_folder_prefix": "wind_turbine_data",
    "sagemaker_edge_core_region": "<<AWS_REGION>>",
    "sagemaker_edge_core_root_certs_path": "/home/<<LINUX_USER>>/agent/certificates/root",
    "sagemaker_edge_provider_aws_ca_cert_file": "/home/<<LINUX_USER>>/agent/certificates/iot/AmazonRootCA1.pem",
    "sagemaker_edge_provider_aws_cert_file": "/home/<<LINUX_USER>>/agent/certificates/iot/<<CERT_PREFIX>>-certificate.pem.crt",
    "sagemaker_edge_provider_aws_cert_pk_file": "/home/<<LINUX_USER>>/agent/certificates/iot/<<CERT_PREFIX>>-private.pem.key",
    "sagemaker_edge_provider_aws_iot_cred_endpoint": "https://<<CREDENTIALS_ENDPOINT_HOST>>/role-aliases/SageMakerEdge-WindTurbineFarm/credentials",
    "sagemaker_edge_provider_provider": "Aws",
    "sagemaker_edge_provider_s3_bucket_name": "<<S3_BUCKET>>",
    "sagemaker_edge_core_capture_data_destination": "Cloud"
}

Provide the information for the following resources:

  • SAGEMAKER_EDGE_DEVICE_NAME – The unique name of your device you defined previously.
  • AWS_REGION – The Region where you created your edge device.
  • LINUX_USER – The Linux user name you’re using in Jetson Nano.
  • CERT_PREFIX – The prefix of the certificate files you created when you provisioned your IoT thing in the previous section.
  • CREDENTIALS_ENDPOINT_HOST – Your endpoint host. You can get this endpoint through the AWS Command Line Interface (AWS CLI). (Install the AWS CLI if you don’t have it already). Use credentials of the same account and the same Region you used in the previous sections (this isn’t the IoT thing shadow URL). Then run the following command to retrieve the endpoint host:
aws iot describe-endpoint --endpoint-type iot:CredentialProvider
  • S3_BUCKET – The name of the S3 bucket you used to configure your edge device fleet in the previous section.
  1. Save the file with all these modifications.

Now you’re ready to run the Edge Manager agent in your Jetson Nano.

  1. To test the agent, run the following commands:
cd ~/agent
rm -f /tmp/edge_agent
./bin/sagemaker_edge_agent_binary -c sagemaker_edge_config.json -a /tmp/edge_agent &

The following screenshot shows your output.

The agent is now running. After a few minutes, you can see the heartbeat of the device, reported on the console. To see it on the SageMaker console, under Edge Inference, choose Edge Devices and choose your device.

Configure the application

Now it’s time to set up the application that runs on the edge device. This application is responsible for the following:

  • Get the temporary credentials using the certificate
  • Listen to the OTA update topics to see whether a new model package is ready to deploy
  • Deploy the available model package to the edge device
  • Load the model to the agent if necessary
  • Perform an infinite loop:
    • Read the sensor data
    • Format the input data
    • Invoke the ML model and capture some metrics of the prediction
    • Compare the predictions MAE (mean average error) to the baseline
    • Publish raw data to an IoT topic (MQTT)

To install the application, first get the custom AWS IoT endpoint. On the AWS IoT Core console, choose Settings. Copy the endpoint and use it in the following code:

cd ~/
git clone https://github.com/aws-samples/amazon-sagemaker-edge-manager-demo wind_turbine
cd wind_turbine/04_EdgeApplication
## by the AWS IoT Endpoint host you just copied and save the file
chmod +x run.py
./run.py &

The application outputs something like the following screenshot.

Optional: run this application with the parameter –test-mode if you just want to run a test with no wind turbine connected to the edge device.

If everything went fine, the application keeps waiting for a new model. It’s time to train a new model and deploy it to the Jetson Nano.

Train and deploy the ML model

This post demonstrates how to detect anomalies in the components of a wind turbine. There are many ways of doing this with the data collected by its sensors. To keep this example as simple as possible, you prepare a model that analyzes vibration, wind speed, rotation (per second), and the produced voltage to determine whether an anomaly exists or not. For that purpose, we train an autoencoder using PyTorch on SageMaker and prepare it for deployment on your Jetson Nano.

This model architecture has two advantages: it’s unsupervised, so we don’t need to label our data, and you can collect data from wind turbines that are working perfectly. Therefore, your model is trained to detect what you consider normal behavior of your wind turbines. When a defect appears in any part of the turbine, a drift occurs on the sensors data, which the model interprets as abnormal behavior (an anomaly).

The following screenshot is a sample of the raw data captured by the turbine sensors.

The data has the following features:

  • nanoId – ID of the edge device that collected the data
  • turbineId – ID of the turbine that produced this data
  • arduino_timestamp – Timestamp of the Arduino that was operating this turbine
  • nanoFreemem: Amount of free memory in bytes
  • eventTime – Timestamp of the row
  • rps – Rotation of the rotor in rotations per second
  • voltage – Voltage produced by the generator in milivolts
  • qw, qx, qy, qz – Quaternion angular acceleration
  • gx, gy, gz – Gravity acceleration
  • ax, ay, az – Linear acceleration
  • gearboxtemp – Internal temperature
  • ambtemp – External temperature
  • humidity – Air humidity
  • pressure – Air pressure
  • gas – Air quality
  • wind_speed_rps – Wind speed in rotations per second

The selected features based on our goals are: qx,qx,qy,qz (angular acceleration), wind_speed_rps, rps, and voltage. The following image is a sample of the feature qx. The data produced by the accelerometer is too noisy so we need to clean it first.

The angular velocity (quaternion) is first converted to Euler Angles (roll, pitch, yaw). Then we denoise all the features with Wavelets (PyWavelets), and normalize them. The following screenshot shows the signals after these transformations.

Finally, we apply a sliding window to this resulting dataset (six features) to capture the temporal relationship between neighbor readings and create the input tensor of our ML model. The average interval between two sequential samples is approximately 50 milliseconds. Each time window (of our sliding window) is then converted into a tensor, using the following structure:

  • Tensor – 6 features x 10 steps (100 samples) = 6×100
    • Step – Group of time steps
    • Time step – Group of intervals (time_step=20 = ~5 seconds)
    • Interval – Group of samples (interval=5 = ~250 milliseconds)
  • Reshaped tensor – 6x10x10

Interval, time step and step are hyperparameters that you can adjust during training. The final result is a stream of data, encoded as a multidimensional tensor (representing a few seconds in the past). The trained autoencoder tries to recreate the input tensor as the output (prediction). By measuring the MAE between the input and output and comparing it with a pre-defined threshold, you can identify potential anomalies.

One important aspect of this approach is that it extracts the linear and non-linear correlations between the features, to better understand the impacts of one feature into another, such as wind speed on the rotation or produced voltage.

Now it’s time to run this experiment.

  1. First, you need to set up your Studio environment if you don’t have one yet.
  2. Clone the GitHub repo https://github.com/aws-samples/amazon-sagemaker-edge-manager-demo inside a Studio terminal.

The repository contains a folder named 03_Notebooks with two Jupyter notebooks.

  1. Follow the instructions in the first notebook to prepare the dataset – Because the accelerator data is a signal, it contains noise, so you run a denoise mechanism to clean the data.

The final dataset has only six features: roll, pitch, yaw (converted from a Quaternion to Euler angles), wind_speed_rps, rps (rotations per second), voltage (produced by the generator).

  1. Follow the instructions in the second notebook to train, package, and deploy the model:
    1. Use SageMaker to train your PyTorch autoencoder (CNN based).
    2. Run a batch prediction to compute MAE and threshold used by the app to determine whether the prediction is an anomaly or not.
    3. Compile the model to Jetson Nano using Neo.
    4. Create a deployment package with Edge Manager.
    5. Create an IoT job that publishes a JSON document to a topic listened to by the application that is running on your Jetson Nano.

The application gets the package, unpacks it, loads the model in the Edge Manager agent, and unblocks the application run.

Both notebooks are very detailed, so follow the steps carefully, after which you’ll have an anomaly detection model to deploy in your Jetson Nano.

Compilation job and model optimization

One of the most important steps of the whole process is the model optimization step in the second notebook. When you compile a model with SageMaker Neo, it not only optimizes the model to improve the prediction performance in the target device, it also converts the original model into an intermediate representation. After this conversion, you don’t need to use the original framework anymore (PyTorch, TensorFlow, MXNet). This representation is then interpreted by a light runtime (DLR), which is packaged with the model by Neo. Both the runtime and optimized model are libraries, compiled as native programs for a specific operational system and architecture. In the case of Jetson Nano, the OS is a Linux distro and the architecture: ARM8 64bits. The runtime in this case uses TensorRT for maximum performance on the Jetson’s GPU.

When you launch a compilation job on Neo, you need to specify some parameters related to the setup of your target device, for instance:

  • trt-ver – 7.1.3
  • cuda-ver – 10.2
  • gpu-code – sm_53

The Jetson Nano’s GPU is a NVIDIA Maxwell, architecture version 53, so the parameter gpu-code is the same for all compilation jobs. However, trt-ver and cuda-ver depend of the version of the TensorRT and CUDA installed on your Nano. When you were preparing your edge device, you set up your Jetson Nano with JetPack 4.4.1. This makes sure that the model you optimize using Neo is compatible with your Jetson Nano.

Visualize the results

The dashboard setup is out of scope for this post. For more information, see Analyze device-generated data with AWS IoT and Amazon Elasticsearch Service.

Now that you have your model deployed and running on your Jetson Nano, it’s time to look at the behavior of your wind turbines through a dashboard. The application you deployed to the Jetson Nano collects some logs and sends them to two different places:

  • The IoT MQTT topic wind-turbine/logs/<<iot_thing_name>> contains the app logs and raw data collected from the wind turbine sensors
  • The S3 bucket s3://<<S3_BUCKET>>/wind_turbine_data contains the metrics of the ML model

You can get this data and ingest it into Amazon ES or another database. Then you can use your preferred reporting to prepare dashboards.

The following visualization shows three different but correlated things for each one of the five turbines: the rotation speed (in RPS), the produced voltage, and the detected anomalies for voltage, rotation, and vibration.

Some noise was injected in the raw data from the turbines to simulate failures.

The following visualization shows an aggregation of the turbines’ speed and produced voltage anomalies over time.

Conclusion

Securely and reliably maintaining the lifecycle of an ML model deployed across a fleet of devices isn’t an easy task. However, with Edge Manager, you can reduce the implementation effort and operational cost of such a solution. Also, with a demo like the mini wind turbine farm, you can experiment, optimize, and automate your ML pipeline with the services and expertise provided by AWS.

To build a solution for your own needs, get the code and artifacts used in this project from the GitHub repo. If you want more practice using Edge Manager, check out the end-to-end workshop for Edge Manager on Studio.


About the Author

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers creating AI/ML solutions which solve their business challenges using AWS. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest for robotics.

Read More

Build a medical sentence matching application using BERT and Amazon SageMaker

Determining the relevance of a sentence when compared to a specific document is essential for many different types of applications across various industries. In this post, we focus on a use case within the healthcare field to help determine the accuracy of information regarding patient health.

Frequently, during each patient visit, a new document is created with the information from the visit. This information often consists of a medical transcription that has been dictated by either the nurse or the physician. Such a document may contain a brief description statement (also known as a restatement) that explains the main details from that specific patient visit. In future visits, doctors may rely on previous visits’ restatements to quickly get an overview of the patient’s overall status. Such restatements may also be used during patient handoffs. However, this introduces the potential for errors to be made during patient handoffs to new medical teams if the restatements are difficult to understand or if they contain inadequate information (Staggers et. al. 2011). Therefore, having an accurate description of the patient’s status is important, because the cost of errors in such restatements can be high and may negatively affect the patient’s overall care (Garcia et. al. 2017).

This post walks you through how to deploy a machine learning (ML) model that aims to determine the top sentences from the document that best match the corresponding document restatement; this can be a first step to ensure the accuracy of the patient’s health records overall by determining the relevance of the restatement. We emphasize that this model determines the top ranking sentences that match the restatement; it does not generate the restatement itself.

When creating this solution, we were faced with a dual-sided challenge. Beyond the technical challenge of actually creating an AI/ML model, several surrounding components complicate actually using such models in the real world. Indeed, the actual ML code may be a very small part of the system as a whole (Sculley et al. 2015). This is especially so in complex architectures frequently deployed in the context of the healthcare and life science space.

We focused on one particular challenge: creating the ability to serve the model so that others (applications, services, or people) can use it. By serving a model, we mean to grant others the ability to pass new data to the model so they can get the predictions they need. This post provides a broad overview of the problem, the solution, and a few points to keep in mind if you plan to use a similar approach in your own use cases. A full technical write up, including a readme and a step-by-step deployment of the architecture, is available in the GitHub code repository. For more information about approaches to serving models, see Build, Train, and Deploy a Machine Learning Model With Amazon SageMaker and AWS Deep Learning Containers on Amazon ECS.

Background and use case

In the medical field (as well as other industries), documents are frequently associated with a shorter restatement text of the original document. We use the term restatement, but in fact this shorter text can be a summary, highlight, description, or other metadata about the document. For example, an after-visit clinical summary given to a patient summarizes the content of the patient visit to a physician.

For illustration purposes, the following is an example that’s unrelated to the medical industry.

Document:

On Monday morning, Joshua ate a large breakfast of bacon and eggs. He then went for a brisk walk. Finally, he returned home and sat at his desk.

Restatement:

Joshua went for a walk.

In this example, the restatement is just a rewording of the highlighted sentence in the full document. This example shows that, although the use case that we focus on in this post is specific to the medical field, you can use and modify this approach for many other text analysis applications.

Let’s now take a closer look at the use case for this post. We used data taken from MTSamples (which we downloaded from Kaggle). This data contains many different samples of transcribed medical texts. It includes documents with raw transcriptions of sample notes, as well as shorter descriptions of those notes (which we treat as restatements).

The following is an example from the MTSamples dataset.

Document:

HISTORY OF PRESENT ILLNESS: , I have seen ABC today. He is a very pleasant gentleman who is 42 years old, 344 pounds. He is 5’9″. He has a BMI of 51. He has been overweight for ten years since the age of 33, at his highest he was 358 pounds, at his lowest 260. He is pursuing surgical attempts of weight loss to feel good, get healthy, and begin to exercise again. He wants to be able to exercise and play volleyball. Physically, he is sluggish. He gets tired quickly. He does not go out often. When he loses weight he always regains it and he gains back more than he lost. His biggest weight loss is 25 pounds and it was three months before he gained it back. He did six months of not drinking alcohol and not taking in many calories. He has been on multiple commercial weight loss programs including Slim Fast for one month one year ago and Atkin’s Diet for one month two years ago.,PAST MEDICAL HISTORY: , He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to public seating, difficulty walking, high cholesterol, and high blood pressure. He has asthma and difficulty walking two blocks or going eight to ten steps. He has sleep apnea and snoring. He is a diabetic, on medication. He has joint pain, knee pain, back pain, foot and ankle pain, leg and foot swelling. He has hemorrhoids.,PAST SURGICAL HISTORY: , Includes orthopedic or knee surgery.,SOCIAL HISTORY: , He is currently single. He drinks alcohol ten to twelve drinks a week, but does not drink five days a week and then will binge drink. He smokes one and a half pack a day for 15 years, but he has recently stopped smoking for the past two weeks.,FAMILY HISTORY: , Obesity, heart disease, and diabetes. Family history is negative for hypertension and stroke.,CURRENT MEDICATIONS:, Include Diovan, Crestor, and Tricor.,MISCELLANEOUS/EATING HISTORY: ,He says a couple of friends of his have had heart attacks and have had died. He used to drink everyday, but stopped two years ago. He now only drinks on weekends. He is on his second week of Chantix, which is a medication to come off smoking completely. Eating, he eats bad food. He is single. He eats things like bacon, eggs, and cheese, cheeseburgers, fast food, eats four times a day, seven in the morning, at noon, 9 p.m., and 2 a.m. He currently weighs 344 pounds and 5’9″. His ideal body weight is 160 pounds. He is 184 pounds overweight. If he lost 70% of his excess body weight that would be 129 pounds and that would get him down to 215.,REVIEW OF SYSTEMS: , Negative for head, neck, heart, lungs, GI, GU, orthopedic, or skin. He also is positive for gout. He denies chest pain, heart attack, coronary artery disease, congestive heart failure, arrhythmia, atrial fibrillation, pacemaker, pulmonary embolism, or CVA. He denies venous insufficiency or thrombophlebitis. Denies shortness of breath, COPD, or emphysema. Denies thyroid problems, hip pain, osteoarthritis, rheumatoid arthritis, GERD, hiatal hernia, peptic ulcer disease, gallstones, infected gallbladder, pancreatitis, fatty liver, hepatitis, rectal bleeding, polyps, incontinence of stool, urinary stress incontinence, or cancer. He denies cellulitis, pseudotumor cerebri, meningitis, or encephalitis.,PHYSICAL EXAMINATION: ,He is alert and oriented x 3. Cranial nerves II-XII are intact. Neck is soft and supple. Lungs: He has positive wheezing bilaterally. Heart is regular rhythm and rate. His abdomen is soft. Extremities: He has 1+ pitting edema.,IMPRESSION/PLAN:, I have explained to him the risks and potential complications of laparoscopic gastric bypass in detail and these include bleeding, infection, deep venous thrombosis, pulmonary embolism, leakage from the gastrojejuno-anastomosis, jejunojejuno-anastomosis, and possible bowel obstruction among other potential complications. He understands. He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass. He will need to get a letter of approval from Dr. XYZ. He will need to see a nutritionist and mental health worker. He will need an upper endoscopy by either Dr. XYZ. He will need to go to Dr. XYZ as he previously had a sleep study. We will need another sleep study. He will need H. pylori testing, thyroid function tests, LFTs, glycosylated hemoglobin, and fasting blood sugar. After this is performed, we will submit him for insurance approval.

Restatement:

Consult for laparoscopic gastric bypass.

Although the raw transcript document is quite long, only a few of the sentences actually appear to be related to the restatement “Consult for laparoscopic gastric bypass.” We highlighted two sentences within the document that you might intuitively think best match the restatement. The approach we deployed quantifies the similarities and reports the sentences in the document that best match the restatement. We did this by using a pretrained BERT language model trained specifically on clinical texts (published by Alsentzer et. al. 2019). The model itself is hosted by HuggingFace, a platform for sharing open-source natural language processing (NLP) projects. We used this model to calculate sentence-by-sentence similarities using the sentence-transform Python library.

It is important to note that in this example and in this solution, we are performing the sentence ranking without explicitly extracting and detecting the medical entities. However, many applications rely on explicitly extracting and analyzing diagnoses, medications, and other health information. For detecting medical entities such as medical conditions, medications, and other medical information in medical text, consider using Amazon Comprehend Medical, a HIPAA-eligible service built to extract medical information from unstructured medical text.

More information about this approach is available in our technical write-up.

Architecture diagram

In this section, we go over the architecture diagram for this solution at a very high level. For more details and to see the step-by-step framework, see our technical write-up.

In the model development and testing phase, we use Amazon SageMaker Studio. Studio is a powerful integrated development environment (IDE) for building, training, testing, and deploying ML models. Because we use a prebuilt model for this solution, we don’t need to use Studio’s full ability to train algorithms at scale. Instead, we use it for development and deployment purposes.

We created a Jupyter notebook that you can import into Studio. This notebook walks you through the entire development and deployment process. We start by writing the code for our model to a file. The model is then built using an NGINX/Flask framework, so that new data can be passed to it at inference time. Prior to deploying the model, we package it as a Docker container, build it using AWS CodeBuild, and push it to Amazon Elastic Container Registry (Amazon ECR). Then we deploy the model using Amazon Elastic Container Service (Amazon ECS).

The final result is a model that you can query using a simple API call. This is an important point: the ability to query models via an API capability is an essential component of designing scalable, easy-to-use interfaces. For more information, see Implementing Microservices on AWS.

After we deploy our model, we create a graphical user interface (using Streamlit) so that our model can be easily accessed through a webpage. Streamlit is an open-source library used to create front ends for ML applications. After we create our webpage, we deploy it in a similar way to how we deployed our model: we package it as a separate Docker container, build it using CodeBuild, push it to Amazon ECR, and deploy it using Amazon ECS.

By creating and deploying this webpage, we provide users with no programming experience the ability to use our model to test their own documents and restatements. The following screenshot shows what the webpage looks like.

After the user inputs their restatement and corresponding document, the top five results (the five sentences that best match the statement) are returned. If you deploy the entire solution using our original MTSamples example, the final result looks like the following screenshot.

The solution reports the following results:

  • The top five sentences within the document that best match the restatement.
  • The similarity distance between each sentence and the restatement. A lower distance means closer similarities between that sentence and the restatement sentence.

In this example, the best matching sentence is “He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass” with a distance of .0672. Therefore, this approach has correctly identified a sentence within the document that matches the restatement.

Limitations

Like any algorithm, this approach has some limitations. For instance, this approach is not designed to handle cases where the restatement of the document is actually high-level metadata about the document not directly related to the text of the document itself. You can solve such use cases by using Amazon Comprehend custom models. For more information, see Comprehend Custom and Building a custom classifier using Amazon Comprehend.

Another limitation in our approach is that it doesn’t explicitly handle negation (words such as “not,” “no,” and “denies”), which may change the meaning of the text. AWS services such as Amazon Comprehend and Amazon Comprehend Medical use deep learning models to handle negation.

Conclusion

In this post, we walked through the high-level steps to deploy a pre-built NLP model to analyze medical texts. If you’re interested in deploying this yourself, see our step-by-step technical write-up.

References

For more information, see the following references:


About the Authors

Joshua Broyde is an AI/ML Specialist Solutions Architect on the Global Healthcare and Life Sciences team at Amazon Web Services. He works with customers in the healthcare and life sciences industry at all levels of the Machine Learning Lifecycle on a number of AI/ML fronts, including analyzing medical images and video, analyzing machine sensor data and performing natural language processing of medical and healthcare texts.

 

Claire Palmer is a Solutions Architect at Amazon Web Services. She is on the Global Account Development team, supporting healthcare and life sciences customers. Claire has a passion for driving innovation initiatives and developing solutions that are both secure and scalable. She is based out of Seattle, Washington and enjoys exploring the PNW in her free time.

Read More

HDR+ with Bracketing on Pixel Phones

Posted by Manfred Ernst and Bartlomiej Wronski, Software Engineers, Google Research

We’re continuously working to improve the Pixel — making it more helpful, more capable, and more fun — with regular updates, such as the recent V8.2 update to the Camera app. One such improvement (launched on Pixel 5 and Pixel 4a 5G in October) is a feature that operates “under the hood”, HDR+ with Bracketing. This feature works by merging images taken with different exposure times to improve image quality (especially in shadows), resulting in more natural colors, improved details and texture, and reduced noise.

Why Are HDR Scenes Hard to Capture?
The original HDR+ burst photography system is the engine behind high-quality mobile photography, which captures a rapid series of deliberately underexposed images, then combines and renders them in a way that preserves detail across the range of tones. But this system had one limitation: scenes with high dynamic range (HDR) like the one below were noisy in the shadows because all images captured are underexposed.

The same photo using HDR+ (red outline) and HDR+ with Bracketing (green outline). While the characteristic HDR+ look remains the same, bracketing improves image quality, especially in shadows, with more natural colors, improved details and texture, and reduced noise.

Capturing HDR scenes is difficult because of the physical constraints of image sensors combined with limited signal in the shadows. We can correctly expose either the shadows or the highlights, but not both at the same time.

The same scene shot with different exposure settings and tonemapped to similar overall brightness. Left/Top: Exposure set for the highlights. The bright blue sky is preserved, but the shadows are very noisy. Right/Bottom: Exposure set for the shadows. Noise in the shadows is reduced, but the sky is clipped (white).

Photographers sometimes work around these limitations by taking two different exposures and combining them. This approach, known as exposure bracketing, can deliver the best of both worlds, but it is time-consuming to do by hand. It is also challenging in computational photography because it requires:

  1. Capturing additional long exposure frames while maintaining the fast, predictable capture experience of the Pixel camera.
  2. Taking advantage of long exposure frames while avoiding ghosting artifacts caused by motion between frames.

To avoid these challenges, the original HDR+ system used a different approach to handle high dynamic range scenes.

The Limits of HDR+
The capture strategy used by HDR+ is based on underexposure, which avoids loss of detail in the highlights. While this strategy comes at the expense of noise in the shadows, HDR+ offsets the increased noise through the use of burst photography.

Using bursts to improve image quality. HDR+ starts from a burst of full-resolution raw images (left). Depending on conditions, between 2 and 15 images are aligned and merged into a computational raw image (middle). The merged image has reduced noise and increased dynamic range, leading to a higher quality final result (right).

This approach works well for scenes with moderate dynamic range, but breaks down for HDR scenes. To understand why, we need to take a closer look at how two types of noise get into an image.

Noise in Burst Photography
One important type of noise is called shot noise, which depends only on the total amount of light captured — the sum of N frames, each with E seconds of exposure time has the same amount of shot noise as a single frame exposed for N × E seconds. If this were the only type of noise present in captured images, burst photography would be as efficient as taking longer exposures. Unfortunately, a second type of noise, read noise, is introduced by the sensor every time a frame is captured. Read noise doesn’t depend on the amount of light captured but instead depends on the number of frames taken — that is, with each frame taken, an additional fixed amount of read noise is added.

This is why using burst photography to reduce total noise isn’t as efficient as simply taking longer exposures: taking multiple frames can reduce the effect of shot noise, but will also increase read noise. Even though read noise increases with the number of frames, it is still possible to reduce the overall noisiness with burst photography, but it becomes less efficient. If one were to break a long exposure into N shorter exposures, the ratio of signal to noise in the final image would be lower because of the additional read noise. In this case, to get back to the signal-to-noise ratio in the single long exposure, one would need to merge N2 short-exposure frames. In the example below, if a long exposure were divided into 12 short exposures, we’d have to capture 144 (12 × 12) short frames to match the signal-to-noise ratio in the shadows! Capturing and processing this many frames would be much more time consuming — burst capture and processing could take over a minute and result in a poor user experience. Instead, with bracketing one can capture both short and long exposures — combining highlight protection and noise reduction.

Left: The result of merging 12 short-exposure frames in Night Sight mode. Right: A single frame whose exposure time is 12 times longer than an individual short exposure. The longer exposure has significantly less noise in the shadows but sacrifices the highlights.

Solving with Bracketing
While the challenges of bracketing prevented the original HDR+ system from using it, incremental improvements since then, plus a recent concentrated effort, have made it possible in the Camera app. To start, adding bracketing to HDR+ required redesigning the capture strategy. Capturing is complicated by zero shutter lag (ZSL), which underpins the fast capture experience on Pixel. With ZSL, the frames displayed in the viewfinder before the shutter press are the frames we use for HDR+ burst merging. For bracketing, we capture an additional long exposure frame after the shutter press, which is not shown in the viewfinder. Note that holding the camera still for half a second after the shutter press to accommodate the long exposure can help improve image quality, even with a typical amount of handshake.

Capture strategy. Top: The original HDR+ method captures short exposures before the shutter press, six in this example. Bottom: HDR+ with Bracketing captures five short exposures before the shutter press and one long exposure after the shutter press.

For Night Sight, the capture strategy isn’t constrained by the viewfinder — because all frames are captured after the shutter press while the viewfinder is stopped, this mode easily accommodates capturing longer exposure frames. In this case, we capture three long exposures to further reduce noise.

Capture strategy for Night Sight. Top: The original Night Sight captured 15 short exposure frames. Bottom: Night Sight with bracketing captures 12 short and 3 long exposures.

The Merging Algorithm
When merging bracketed shots, we choose one of the short frames as the reference frame to avoid potentially clipped highlights and motion blur. All other frames are aligned to this frame before they are merged. This introduces a challenge — for complex scene motion or occluded regions, it is impossible to find exactly matching regions and a naïve merge algorithm would produce ghosting artifacts in these cases.

Left: Ghosting artifacts are visible around the silhouette of a moving person, when deghosting is disabled.
Right: Robust merging produces a clean image.

To address this, we designed a new spatial merge algorithm, similar to the one used for Super Res Zoom, that decides per pixel whether image content should be merged or not. This deghosting is more complicated for frames with different exposures. Long exposure frames have different noise characteristics, clipped highlights, and different amounts of motion blur, which makes comparisons with the short exposure reference frame more difficult. In addition, ghosting artifacts are more visible in bracketed shots, because noise that would otherwise mask these errors is reduced. Despite those challenges, our algorithm is as robust to these issues as the original HDR+ and Super Res Zoom and doesn’t produce ghosting artifacts. At the same time, it merges images 40% faster than its predecessors. Because it merges RAW images early in the photographic pipeline, we were able to achieve all of those benefits while keeping the rest of processing and the signature HDR+ look unchanged. Furthermore, users who prefer to use computational RAW images can take advantage of those image quality and performance improvements.

Bracketing on Pixel
HDR+ with Bracketing is available to users of Pixel 4a (5G) and 5 in the default camera, as well as in Night Sight and Portrait modes. For users of Pixel 4 and 4a, the Google Camera app supports bracketing in Night Sight mode. No user interaction is needed to activate HDR+ with Bracketing — depending on the dynamic range of the scene, and the presence of motion, HDR+ with bracketing chooses the best exposures to maximize image quality (examples).

Acknowledgements
HDR+ with Bracketing is the result of a collaboration across several teams at Google. The project would not have been possible without the joint efforts of Sam Hasinoff, Dillon Sharlet, Kiran Murthy, Mike Milne, Andy Radin, Nicholas Wilson, Navin Sarma‎, Gabriel Nava, Emily To, Sushil Nath, Alexander Schiffhauer, Isaac Reynolds, Bill Strathearn, Marius Renn, Alex Hong, Jose Ricardo Lima, Bob Hung, Ying Chen Lou, Joy Hsu, Blade Chiu, David Massoud, Jean Hsu, Ellie Yang, and Marc Levoy.

Read More

Cognitive document processing for automated mortgage processing

This post was guest authored by AWS Advanced Consulting Partner Quantiphi.

The mortgage industry is highly complex and largely dependent on documents for the information required across different stages in their business value chain. Day-to-day operations for mortgage underwriting, property appraisal, and mortgage insurance underwriting are heavily dependent on the comprehension of different types of documents. The slow pace of document transfer between different business units of an organization slows down the overall approval process, leading to poor customer experience.

The mortgage loan approval process usually takes multiple weeks because a multitude of user-submitted documents are scrutinized at each stage to assess the underlying risk. Organizations need the right information at the right time to increase operational efficiency and better document management.

In the wake of COVID-19, the mortgage industry is reeling under pressure to undergo a digital transformation to provide a better customer experience. Large companies are cutting down capital and operational expenditure to sustain operations. The need for operational efficiency is higher than ever.

This post analyzes the role of machine learning (ML) solutions in document extraction in the mortgage industry to enhance business operations.

We highlight the key aspects of Quantiphi’s document processing solution built on AWS, and unveil how it helped a US-based mortgage insurance company address document management challenges through artificial intelligence (AI) and ML techniques.

Quantiphi is an AWS Partner Network (APN) Advanced Consulting Partner with AWS competencies in Machine Learning, Financial Services, Data & Analytics, and DevOps. Quantiphi also has multiple AWS Service Delivery designations, recognizing its expertise in leveraging specific AWS services.

ML-based document extraction for the mortgage industry

Lenders usually have to manually sieve through large volumes of loan packages containing structured and unstructured information to classify documents and identify key information. The identified information is further used for risk assessment. Most of this key information is usually contained in paragraphs, key-value pairs, and tables.

These lenders usually receive loan packages in bulk containing different types of documents such as W2, tax statements, 1008 forms, and so on. Currently, people have to first classify these documents manually and extract the relevant information. Therefore, mortgage firms are looking for meaningful ways of incorporating cognitive capabilities and solutions into their existing mortgage processing pipeline to automate the identification of key information and facilitate easy risk scoring in order to develop operational excellence and reduce manual efforts.

Quantiphi’s cognitive document processing solution combines state-of-the-art AI and ML services from AWS with Quantiphi’s custom document processing models to digitize a wide variety of mortgage documents. Quantiphi’s solution leverages services like Amazon Textract, Amazon SageMaker, Amazon Comprehend, Amazon Kendra, and Amazon Augmented AI (Amazon A2I) to help mortgage firms extract information from structured and unstructured documents, classify them into document types, and further address needs around risk assessment through ML.

Document classification and information extraction

Mortgage underwriting is done to assess the underlying risk for each application by analyzing the multitudes of user-submitted documents, such as W2 or I9 forms, tax returns, loan application (1003) forms, underwriting transmittal (1008) forms, demographic addendum, credit reports, bank account statements, and paycheck stubs. For example, the underwriting transmittal (1008) form contains the summary of the key information used during the risk assessment such as monthly income, qualifying rate, property details, and occupancy status. Paycheck stubs are another example of such documents, used to understand a borrower’s income in order to be sure that the borrower is able to repay the loan.

Similarly, property appraisal documents such as chain of title document and deed documents (assignment, trust, quitclaim) along with the property appraisal report are used to complement the property valuation process. Deed documents are processed to establish ownership and legal rights to a property. For example, if the lender sells a mortgage loan to another lender, they need to issue an assignment of deed of trust to give the new lender the same legal rights to the property.

Based on the inherent structure of the different types of mortgage documents, we have defined three broad segments to classify these documents:

  • Structured documents, which are standard documents with some variations over the years and across different states. All values, check boxes, and tables are usually contained in predefined areas of the document.
  • Semi-structured documents, which don’t follow a standardized template strictly but have a similar format.
  • Unstructured documents, which don’t follow any defined format.

Structured documents

Examples of structured documents include the loan application (1003) form, underwriting transmittal summary (1008) form, verification of employment (1005) form, and W2 form.

Consider underwriting the transmittal summary (1008) form. Quantiphi’s solution uses the standardized 1008 document as a reference for training, which is then used for extraction (see the following screenshot).

Key information that can be extracted from 1008 includes borrower and co-borrower names, property address, SSN, sales price, and appraised value.

Semi-structured documents

Semi-structured documents include pay stubs, bank statements, credit reports, and loan estimates. Here, Quantiphi’s solution uses a generic key-value pair and table detection model to extract the relevant features. Searching for certain common keywords results in a more efficient extraction.

The following screenshot shows data extraction from a pay stub.

Key information that can be extracted from includes paid period, deductions, net pay, 401K summary, and more.

Unstructured documents

Unstructured documents include deeds documents, appraisal reports, and more. Consider the assignment of deed of trust. Quantiphi’s Solution uses custom NLP techniques like entity recognition and syntax analysis to extract information (see the following screenshot).

Key information that can be extracted from the assignment of deed of trust includes the date of assignment, assignor, assignee, executor name, principal sum, and more.

Quantiphi’s cognitive document processing solution

Quantiphi’s cognitive document processing solution works across all types of structured and unstructured mortgage documents. Some key aspects of Quantiphi’s solution are as follows:

  • Capture – Powered by Amazon Textract and SageMaker. This feature uses deep-learning based OCR to identify and extract information such as key-value pairs, check boxes, tables, and signatures for further consumption by risk scoring models.
  • Categorize – Powered by SageMaker and Amazon Comprehend. This feature includes the automated classification of various types of mortgage documents like loan application (1003) forms, underwriting transmittal summary (1008) forms, W2 forms, pay stubs, bank statements, and credit reports.
  • Call out – Powered by Amazon Textract. This feature converts scanned applications and supporting documents into digital (searchable) PDFs and highlights the important information in the document along with the bookmarks to assist the loan underwriters with quick navigation through the document and to consume the relevant information for the further decision-making process.
  • Redaction – Powered by Amazon Textract and Amazon Comprehend. This feature enables identification and redaction of PII and PCI data like addresses, phone numbers, and names.
  • Interpret – Powered by Amazon Kendra and Amazon Elasticsearch Service (Amazon ES). This feature offers tools for easy consumption like a contextual and keyword-based search engine. Users can directly perform a search on a repository of processed documents to retrieve the relevant information along with the corresponding document link.
  • Human in the loop – Powered by Amazon A2I. This feature allows you to review and edit the extracted information based on the confidence score against the ground truth through an augmented AI workflow. This manual feedback is then used for continuous improvement of the extraction results through active learning.

The extracted information can be further fed into a risk assessment module to enable risk scoring of submissions in which low-risk applications are auto-approved and high-risk applications are marked for human review.

Quantiphi’s cognitive document processing solution is capable of achieving over 90% accuracy, provides substantial cost reductions, and facilitates better visibility of the mortgage processing workflow while assuring faster processing.

Let’s look at how Quantiphi built this solution by using a combination of AI and ML services provided by AWS.

Components used in the architecture ensure that the complete solution remains robust and scalable while providing high performance and reliability to process the incoming workload of documents in a cost-effective manner.

Solution architecture

The following diagram illustrates the architecture of Quantiphi’s solution.

The architecture consists of the following elements:

  1. The UI is hosted on Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront is used for web distribution to ensure low latency.
  2. Through the UI, the user can upload multiple types of documents to Amazon S3 and select use cases like information extraction, document searchability, document classification, entity recognition, and insight generation.
  3. AWS Batch carries out preprocessing and stores these preprocessed images back into Amazon S3. The metadata information is captured in Amazon Aurora.
  4. AWS Lambda invokes Amazon Textract.
  5. Amazon Textract performs OCR to extract information. When the OCR job is complete, Amazon Textract triggers an Amazon Simple Notification Service (Amazon SNS) notification to add the completed job to an Amazon Simple Queue Service (Amazon SQS) queue.
  6. Lambda receives the Amazon Textract output and stores it in Amazon S3.
  7. An Auto Scaling Amazon Elastic Compute Cloud (Amazon EC2) instance converts these scanned images into a digital (searchable) PDF and writes the output to Amazon S3. Depending on the selected use cases, it writes the message into the respective four postprocessing SQS queues.
  8. If the uploaded PDFs are digital, AWS Batch skips the pipeline to write directly to the four postprocessing SQS queues.
  9. The solution uses a document classifier Docker container to classify pages and documents into categories.
  10. The enterprise search Docker enables the contextual search engine with the Q&A option on document content, content snippet generation, and document ranking.
  11. A document entity recognition and insights Docker is used for keyword and entity recognition and highlighting, masking of confidential data, chronological distribution of information, summarization, and topic modeling.
  12. The information extraction Docker has two functions:
    1. If the uploaded documents match any pre-trained documents, it extracts information from the documents and presents them to the user via the UI.
    2. If the documents don’t match any pre-trained model documents, it calls SageMaker via Amazon API Gateway and extracts key-value pairs, tables, check boxes, signatures, and stamps.

Customer use case: US-based mortgage insurance company

For this post, we present a use case in which the client is a leading US-based mortgage insurance company with a suite of mortgage, risk, real estate, and title services.

The client had millions of scanned pages of mortgage documents containing both handwritten and typed content, which were manually parsed to extract information and classify them accordingly. Processing of new mortgage loans was extremely time-consuming due to manual handling of over 400 different document types.

Quantiphi solution

Quantiphi developed an AI virtual assistant that takes user-uploaded documents and automatically classifies the documents and pages contained in them into different categories, such as bank statements, credit reports, tax returns, and property tax bills and statements.

To augment the consumption of information, the solution highlights key entities (with bounding boxes) present in them. The user can review and edit the extraction results via a custom reviewer UI tool, which is then used for accuracy benchmarking and re-training purposes.

Amazon QuickSight was used to build an interactive dashboard for presenting accuracy metrics and reconciliation.

The solution successfully digitized the processing of more than 50 million pages yearly, with an accuracy of over 97% in classification and 90% in the extraction of more than 40 different data points like borrower’s name, loan amount, and so on.

Quantiphi succeeded in expanding the customer’s profit margin by lowering its document processing costs. Their processing efficiency was enhanced through quick and accurate extraction and detection of data while eliminating manual efforts to greatly reduce the loan processing time.

Summary

Traditional methods of mortgage loan processing are manual in nature and highly time-consuming. Customers are often asked to provide a large number of documents that lenders have to manually go through for assessment.

Quantiphi’s cognitive document processing solution expedites the process by automating information extraction from the documents. Mortgage companies can use Quantiphi’s solution to increase their operational efficiency and significantly reduce their mortgage processing time.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Arnav Gupta is AWS Practice Head at Quantiphi.

Bhaskar Kalita is FSI Head at Quantiphi.

Read More

A Learning Theoretic Perspective on Local Explainability

Fig 1: A formal relationship between interpretability and complexity. Going from left to right, we consider increasingly complex functions. As the complexity increases, local linear explanations can approximate the function only in smaller and smaller neighborhoods. These neighborhoods, in other words, need to become more and more disjoint as the function becomes more complex. Indeed, we quantify “disjointedness” of the neighborhoods via a term denoted by (rho_S) and relate it to the complexity of the function class, and subsequently, its generalization properties.

Introduction

There has been a growing interest in interpretable machine learning (IML), towards helping users better understand how their ML models behave. IML has become a particularly relevant concern especially as practitioners aim to apply ML in important domains such as healthcare [Caruana et al., ’15], financial services [Chen et al., ’18], and scientific discovery [Karpatne et al., ’17].

While much of the work in IML has been qualitative and empirical, in our recent ICLR21 paper, we study how concepts in interpretability can be formally related to learning theory. At a high level, the connection between these two fields seems quite natural. Broadly, one can consider there to be a trade-off between notions of a model’s interpretability and its complexity. That is, as a model’s decision boundary gets more complicated (e.g., compare a sparse linear model vs. a neural network) it is harder in a general sense for a human to understand how it makes decisions. At the same time, learning theory commonly analyzes relationships between notions of complexity for a class of functions (e.g., the number of parameters required to represent those functions) and the functions’ generalization properties (i.e., their predictive accuracy on unseen test data). Therefore, it is natural to suspect that, through some appropriate notion of complexity, one can establish connections between interpretability and generalization.

How do we establish this connection? First, IML encompasses a wide range of definitions and problems, spanning both the design of inherently interpretable models as well as post-hoc explanations for black-boxes (e.g. including but not limited to approximation [Riberio et al., ’16], counterfactual [Dhurandhar et al., ’18], and feature importance-based explanations [Lundberg & Lee, ’17]). In our work, we focus on a notion of interpretability that is based on the quality of local approximation explanations. Such explanations are a common and flexible post-hoc approach for IML, used by popular methods such as LIME [Riberio et al., ’16] and MAPLE [Plumb et al., ‘18] which we’ll briefly outline later in the blog. We then answer two questions that relate this notion of local explainability to important statistical properties of the model:

  1. Performance Generalization: How can a model’s predictive accuracy on unseen test data be related to the interpretability of the learned model? 
  2. Explanation Generalization: We look at a novel statistical problem that arises in a growing subclass of local approximation algorithms (such as MAPLE and RL-LIM [Yoon et al., ‘19]). Since these algorithms learn explanations by fitting them on finite data, the explanations may not necessarily fit unseen data well. Hence, we ask, what is the quality of those explanations on unseen data?

In what follows, we’ll first provide a quick introduction to local explanations. Then, we’ll motivate and answer our two main questions by discussing a pair of corresponding generalization guarantees that are in terms of how “accurate” and “complex” the explanations are. Here, the “complexity” of the local explanations corresponds to how large of a local neighborhood the explanations span (the larger the neighborhood, the lower the complexity — see Fig 1 for a visualization). For question (1), this results in a bound that roughly captures the idea that an easier-to-locally-approximate (f) enjoys better performance generalization. For question (2), our bound tells us that, when the explanations can accurately fit all the training data that fall within large neighborhoods, the explanations are likely to fit unseen data better. Finally, we’ll examine our insights in practice by verifying that these guarantees capture non-trivial relationships in a few real-world datasets. 

Local Explainability

Local approximation explanations operate on a basic idea: use a model from a simple family (like a linear model) to locally mimic a model from a more complex family (like a neural network model). Then, one can directly inspect the approximation (e.g. by looking at the weights of the linear model). More formally, for a given black-box model (f : mathcal{X} rightarrow mathcal{Y}), the explanation system produces at any input (x in mathcal{X}) , a ”simple” function (g_x(cdot) : mathcal{X} rightarrow mathcal{Y}) that approximates (f) in a local neighborhood around (x) . Here, we assume (f in mathcal{F}) (the complex model class) and (g_x in mathcal{G}_{text{local}} ) (the simple model class). 

As an example, here’s what LIME (Local Interpretable Model-agnostic Explanations) does. At any point (x), in order to produce the corresponding explanation (g_x), LIME would sample a bunch of perturbed points in the neighborhood around (x) and label them using the complex function (f(x)). It would then learn a simple function that fits the resulting dataset. One can then use this simple function to better understand how (f) behaves in the locality around (x).

Performance Generalization

Our first result is a generalization bound on the squared error loss of the function (f). Now, a typical generalization bound would look something like

$$text{TestLoss}(f) ≤ text{TrainLoss}(f) + sqrt{ frac{text{Complexity}(mathcal{F})}{text{# of training examples}}  }$$

where the bound is in terms of how well (f) fits the training data, and also how “complex” the function class (mathcal{F}) is. In practice though, (mathcal{F}) can often be a very complex class, rendering these bounds too large to be meaningful.

Yet while (mathcal{F}) is complex, what if the function (f) might itself have been picked from a subset of (mathcal{F}) that is in some way much simpler? For example, this is the case in neural networks trained by gradient descent [Zhang et al., ‘17, Neyshabur et al., ‘15]). Capturing this sort of simplicity could lead to more interesting bounds that (a) aren’t as loose and/or (b) shed insight into what meaningful  properties of (f) can influence how well it generalizes. While there are many different notions of simplicity that different learning theory results have studied, here we are interested in quantifying simplicity in terms of how “interpretable” (f) is, and relate that to generalization. 

To state our  result, imagine that we have a training set (S = {(x_i,y_i)}_{i=1}^{m})sampled from the data distribution (D). Then, we show the following bound on the test-time squared error loss:

$$underbrace{mathbb{E}_{D}[(f(x)-y)^2]}_{text{Test Loss}} leq underbrace{hat{mathbb{E}}_{S}[(f(x)-y)^2]}_{text{Train Loss}} + underbrace{mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]}_{text{Explanation Quality (MNF)}} + underbrace{rho_S cdot mathcal{R}(mathcal{G}_{local})}_{substack{text{Complexity of} \ text{Explanation System}}}.$$

Let’s examine these terms one by one.

Train Loss: The first term, as is typical of many generalization bounds, is simply the training error of (f) on (S). 

Explanation Quality (MNF): The second term captures a notion of how interpretable (f) is, measuring how accurate the set of local explanations (g) is with a quantity that we call the “mirrored neighborhood fidelity” (MNF). This metric is actually a slight modification of a standard notion of local explanation quality used in IML literature, called neighborhood fidelity (NF) [Riberio et al., ’16; Plumb et al., ‘18]. More concretely, we explain how MNF and NF are calculated below in Fig 2.

Fig 2. How MNF is calculated: We use orange to denote “source points” (where explanations are generated) and teal to denote “target points” (where approximation error is computed). To compute the inner expectation for MNF, we sample a single target point (x sim D). Then, we sample a source point (x’) from a “neighborhood” distribution (N_x) (typically a distribution centered at (x)). We then measure how well (f) is approximated at (x) by the explanation generated at (x’). Averaging over (x) and (x’), we define (text{MNF} = mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]). To get NF, we simply need to swap (x) and (x’) in the innermost term: (text{NF} = mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x}(x’) – f(x’))^2]).

While notationally the differences between MNF and the standard notion of NF are slight, there are some noteworthy differences and potential (dis)advantages of using MNF over NF from an interpretability point of view. At a high level, we argue that MNF offers more robustness (when compared to NF) to any potential irregular behavior of (f) off the manifold of the data distribution. We discuss this in greater detail in the full paper.

Complexity of Explanation System: Finally, the third and perhaps the most interesting term measures how complex the infinite set of explanation functions ({g_x}_{x in mathcal{X}}) is. As it turns out, this system of explanations ({g_x}_{x in mathcal{X}}), which we will call (g), has a complexity that can be nicely decomposed into two factors. One factor, namely (mathcal{R}(mathcal{G}_{text{local}})), corresponds to the (Rademacher) complexity of the simple local class (mathcal{G}_{text{local}}), which is going to be a very small quantity, much smaller than the complexity of (mathcal{F}). Think of this factor as typically scaling linearly with the number of parameters for (mathcal{G}_{text{local}}) and also with the dataset size (m) as (1/sqrt{m}). The second factor is (rho_S), and is what we call the “neighborhood disjointedness” factor. This factor lies between ([1, sqrt{m}]) and is defined by how little overlap there is between the different local neighborhoods specified for each of the training datapoints in (S). When there is absolutely no overlap, (rho_S) can be as large as (sqrt{m}), but when all these neighborhoods are exactly the same, (rho_S) equals (1). 

Implications of the overall bound: Having unpacked all the terms, let us take a step back and ask: assuming that (f) has fit the training data well (i.e., the first term is small), when are the other two terms large or small? We visualize this in Fig 1. Consider the case where MNF can be made small by approximating (f) by (g) on very large neighborhoods (Fig 1 left). In such a case, the neighborhoods would overlap heavily, thus keeping (rho_S) small as well. Intuitively, this suggests good generalization when (f) is “globally simple”. On the other hand, when (f) is too complex, then we need to either shrink the neighborhoods or increase the complexity of (mathcal{G}_{text{local}}) to keep MNF small. Thus, one would either suffer from MNF or (rho_S) exploding, suggesting bad generalization. In fact, when (rho_S) is as large as (sqrt{m}), the bound is “vacuous” as the complexity term no longer decreases with the dataset size (m), suggesting no generalization!

Explanation Generalization

We’ll now turn to a different, novel statistical question which arises when considering a number of recent IML algorithms. Here we are concerned with how well explanations learned from finite data generalize to unseen data. 

To motivate this question more clearly, we need to understand a key difference between canonical and finite-sample-based IML approaches. In canonical approaches (e.g. LIME), at different values of (x’), the explanations (g_{x’}) are learned by fitting on a fresh bunch of points (S_{x’}) from a (user-defined) neighborhood distribution (N_{x’}) (see Fig 3, top). But a growing number of approaches such as MAPLE and RL-LIM learn their explanations by fitting (g_{x’}) on a “realistic” dataset (S) drawn from (D) (rather than from an arbitrary distribution) and then re-weighting the datapoints in (S) depending on a notion of their closeness to (x’) (see Fig 3, bottom).

Now, while the canonical approaches effectively train (g) on an infinite dataset (cup_{x’ in mathcal{X}} S_{x’}), recent approaches train (g) on only that finite dataset (S) (reused for every (x’)). 

Using a realistic (S) has certain advantages (as motivated in this blog post), but on the flip side, since (S) is finite, it can potentially result in a severe chance of overfitting (we visualize this in Fig 3 right). This makes it valuable to seek a guarantee on the approximation error of (g) on test data (“Test MNF”) in terms of its fit on the training data (S) (“Train MNF”). In our paper, we derive such a result below: 

$$underbrace{mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]}_{text{Test MNF}} leq underbrace{hat{mathbb{E}}_{x sim S} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]}_{text{Train MNF}} + rho_S cdot mathcal{R}(mathcal{G}_{local}). $$

As before, what this bound implies is that when the neighborhoods have very little overlap, there is poor generalization. This indeed makes sense: if the neighborhoods are too tiny, any explanation (g_{x’}) would have been trained on a very small subset of (S) that falls within its neighborhood. Thus the fit of (g_{x’}) won’t generalize to other neighboring points.

Fig 3. Difference between canonical local explanations (a) vs. finite-sample-based explanations (b and c): On the top panel (a), we visualize how one would go about generating explanations for different source points in a canonical method like LIME. In the bottom panels (b and c), we visualize the more recent approaches where one uses (and reuses) a single dataset for each explanation. Crucially, to learn an explanation at a particular source point, these procedures correspondingly re-weight this common dataset (visualized by the orange square boxes which are more opaque for points closer to each source point). In panel (b), the common dataset is large enough that it leads to good explanations; but in panel (c), the dataset is too small that the explanations do not generalize well to their neighborhoods.

Experiments

While our theoretical results offer insights that make sense qualitatively, we also want to make a case empirically that they indeed capture meaningful relationships between the quantities involved. Particularly, we explore this for neural networks trained on various regression tasks from the UCI suite and in the context of the “explanation generalization” bound. That is we learn explanations to fit a finite dataset (S) by minimizing Train MNF, and then evaluate what Test MNF is like. Here, there are two important relationships we empirically establish:

  1. Dependence on (rho_S): Given that our bounds may be vacuous for large (rho_S=sqrt{m}), does this quantity actually scale well in practice (i.e. less than (O(sqrt{m})))? Indeed, we observe that we can find reasonably large choices of the neighborhood size ( sigma ) without causing Train MNF to become too high (somewhere around (sigma = 1) in Fig 4 bottom) and for which we can also achieve a reasonably small (rho_S approx O(m^{0.2})) (Fig 4 top).
  2. Dependence on neighborhood size: Do wider neighborhoods actually lead to improved generalization gaps? From Fig 4 bottom, we do observe that as the neighborhood width increases, TrainMNF and TestMNF overall get closer to each other, indicating that the generalization gap decreases (Fig 4 bottom).

Fig 4. Empirical study of our bounds For various neighborhood widths (sigma), in the top, we plot the approximate exponent of (rho_S)‘s polynomial growth rate i.e., the exponent (c) in (rho_S = O(|S|^c)). Below, we plot train/test MNF. We observe a tradeoff here: increasing (rho_S) results in better values of (rho_S) but hurts the MNF terms.

Conclusion

We have shown how a model’s local explainability can be formally connected to some of its various important statistical properties. One direction for future work is to consider extending these ideas to high-dimensional datasets, a challenging setting where our current bounds become prohibitively large. Another direction would be to more thoroughly explore these bounds in the context of neural networks, for which researchers are in search of novel types of bounds [Zhang et al., ‘17; Nagarajan and Kolter ‘19].

Separately, when it comes to the interpretability community, it would be interesting to explore the advantages/disadvantages of evaluating and learning explanations via MNF rather than NF. As discussed here, MNF appears to have reasonable connections to generalization, and as we show in the paper, it may also promise more robustness to off-manifold behavior.

To learn more about our work, check out our upcoming ICLR paper. Moreover, for a broader discussion about IML and some of the most pressing challenges in the field, here is a link to a recent white paper we wrote.

References

Jeffrey Li, Vaishnavh Nagarajan, Gregory Plumb, and Ameet Talwalkar, 2021, “A Learning Theoretic Perspective on Local Explainability“, ICLR 2021.

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, 2015, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission.” ACM SIGKDD, 2015.

Chaofan Chen, Kancheng Lin, Cynthia Rudin, Yaron Shaposhnik, Sijia Wang, and Tong Wang, 2018, “An interpretable model with globally consistent explanations for credit risk.” NeurIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy, 2018.

Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar, 2017, “Theory-guided data science: A new paradigm for scientific discovery from data.” IEEE Transactions on Knowledge and Data Engineering, 2017.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 2016, “Why should I trust you?: Explaining the predictions of any classifier.” ACM SIGKDD, 2016.

Scott M. Lundberg, and Su-In Lee, 2017, “A unified approach to interpreting model predictions.” NeurIPS, 2017.

Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das, 2018, “Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives” NeurIPS, 2018.

Jinsung Yoon, Sercan O. Arik, and Tomas Pfister, 2019, “RL-LIM: Reinforcement learning-based locally interpretable modeling”, arXiv 2019 1909.12367.

Gregory Plumb, Denali Molitor and Ameet S. Talwalkar, 2018, “Model Agnostic Supervised Local Explanations“, NeurIPS 2018.

Vaishnavh Nagarajan and J. Zico Kolter, 2019, “Uniform convergence may be unable to explain generalization in deep learning”, NeurIPS 2019

Behnam Neyshabur, Ryota Tomioka, Nathan Srebro, 2015, “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning”, ICLR 2015 Workshop.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, 2017, “Understanding deep learning requires rethinking generalization”, ICLR’ 17.

Valerie Chen, Jeffrey Li, Joon Sik Kim, Gregory Plumb, and Ameet Talwalkar, 2021, “Towards Connecting Use Cases and Methods in Interpretable Machine Learning“, arXiv 2021 2103.06254.

Read More