Reviewing online fraud using Amazon Fraud Detector and Amazon A2I

Each year, organizations lose tens of billions of dollars to online fraud globally. Organizations such as ecommerce companies and credit card companies use machine learning (ML) to detect online fraud. Some of the most common types of online fraud include email account compromise (personal or business), new account fraud, and non-payment or non-delivery scams (including those involving compromised card numbers).

A common challenge with ML is the need for a large labeled dataset to create ML models for detecting fraud. Moreover, even if you have this dataset, you need the skill set and infrastructure to build, train, deploy, and scale your ML model to detect fraud with millions of events. In addition, you need humans to review the subset of high-risk fraud predictions to ensure that the results are highly accurate. Setting up a human review system with your fraud detection model requires provisioning complex workflows and managing a group of reviewers, which increases the time to market for your applications and overall costs.

In this post, we provide an approach to identify high-risk predictions from Amazon Fraud Detector and use Amazon Augmented AI (Amazon A2I) to set up a human review workflow to automatically trigger a review process for further investigation and validation.

Amazon Fraud Detector is a fully managed service that uses ML and more than 20 years of fraud detection expertise from Amazon to identify potential fraudulent activity so you can catch more online fraud faster. Amazon Fraud Detector automates the time-consuming and expensive steps to build, train, and deploy an ML model for fraud detection, making it easier for you to leverage the technology. Amazon Fraud Detector customizes each model it creates to your dataset, making the accuracy of models higher than current one-size-fits-all ML solutions. And because you pay only for what you use, you avoid large upfront expenses.

Amazon A2I is an ML service that makes it easy to build the workflows with ML models required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of reviewers.

Overview of the solution

The high-level solution is summarized through the following architecture.

The workflow contains the following steps:

  1. The client application sends information to the Amazon Fraud Detector endpoint.
  2. Amazon Fraud Detector predicts a risk score (in the range of 0–1,000) on the input data with an ML model that is trained using historical data. A score of 0 indicates that the prediction is considered to have the lowest possible risk, and a score of 1,000 indicates that the prediction is considered to have the highest possible risk.
  3. If the risk score for a particular prediction falls beneath a predefined threshold, there is no further action.
  4. If the risk score exceeds the predefined threshold (for example, a score of 900), the Amazon A2I loop starts automatically and sends predictions for human review to an Amazon A2I private workforce. A private workforce can be employees of your company. They open the Amazon A2I interface, review the case, and make an adjudication (approve, deny, or send it for further verification).
  5. The approval or rejection result from the private workforce is stored in Amazon Simple Storage Service (Amazon S3). From Amazon S3, it can be directly sent to the client application.

Solution walkthrough

In this post, we set up Amazon Fraud Detector using the AWS Management Console, and set up Amazon A2I using an Amazon SageMaker notebook. The following steps outline the detailed solution:

  1. Train and deploy the Amazon Fraud Detector model using historical data.
  2. Set up an Amazon A2I human loop with Amazon Fraud Detector.
  3. Use the model to predict the risk score for a given new input data.
  4. Set up an Amazon A2I human workflow and loop.

Prerequisites

Before getting started, you must complete the following prerequisite steps:

  1. Download the training data. For this post, we use synthetic training data.
  2. Create an S3 bucket named fraud-detector-a2i and upload the training data to the bucket (a Boto3 sketch of this step follows).
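
If you prefer to script this step, the following is a minimal sketch using Boto3 (training_data.csv is a placeholder for the file you downloaded; outside us-east-1, create_bucket also needs a LocationConstraint):

import boto3

s3 = boto3.client('s3')

bucket = 'fraud-detector-a2i'
s3.create_bucket(Bucket=bucket)  # in Regions other than us-east-1, pass CreateBucketConfiguration
s3.upload_file('training_data.csv', bucket, 'training_data.csv')  # placeholder file name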

Training and deploying the Amazon Fraud Detector model

This section covers the high-level steps for building the model and creating a fraud detector:

  1. Create an event to evaluate for fraud.
  2. Define the model and training details to train the model using the data previously uploaded to Amazon S3.
  3. Deploy the model.
  4. Create the detector. 

Creating an event

Navigate to the Amazon Fraud Detector console. You uploaded the training dataset to Amazon S3 in the prerequisite steps. In this step, we create an event. An event is a business activity that is evaluated for fraud risk, and the event type defines the structure for an event sent to Amazon Fraud Detector.

  1. On the Amazon Fraud Detector console, choose Create event.
  2. For Name, enter registration.
  3. For Entity, choose Create new entity.


The entity represents who is performing or triggering the event.

  4. For Entity type name, enter customer.


  5. For Choose how to define this event’s variables, choose Select variables from a training dataset.
  6. For IAM role (AWS Identity and Access Management), choose Create IAM role.


  7. In the Create IAM role section, enter the specific bucket name where you uploaded your training data.

The bucket name in the IAM role must be the S3 bucket where you uploaded your training data. Otherwise, you get an Access Denied exception error.

  8. Choose Create role.


  9. For Data location, enter the path to your training data.
  10. Choose Upload.

This pulls in the variables from the previously uploaded dataset. Choose the variable types as shown in the following screenshot.


You need to create at least two labels for the model to use.

  11. For Labels, choose fraud and legit.


  12. Choose Create event type.

Creating the model

When the event is successfully created, move on to create the model.

  1. On the Define model details page, for Model name, enter sample_fraud_detection.
  2. For Model type, choose Online Fraud Insights.
  3. For Event type, choose registration.
  4. For IAM role, choose the role you created earlier or create a new one.
  5. For Training data location, enter the path to your training data; for example, s3://<bucket-name>/<object name>.
  6. Choose Next.


  7. On the Configure training page, for Model inputs, select all the variables from your historical event dataset.
  8. For Fraud labels, choose fraud.
  9. For Legitimate labels, choose legit.
  10. Choose Next.


  11. Choose Create and train model.

The process of creating and training the model takes approximately 45 minutes to complete. When the model has stopped training, you can check model performance by choosing the model version. 

Amazon Fraud Detector validates model performance using 15% of your data that was not used to train the model, and provides performance metrics, including the confusion matrix and the area under the curve (AUC). You need to consider these metrics together with your business objectives (for example, minimizing false positives). For further details on the metrics and how to determine thresholds, see Training performance metrics in the Amazon Fraud Detector documentation.

The following screenshot shows our model performance.

Deploying the model

When the model is trained, you’re ready to deploy it.

  1. Choose your model (sample_fraud_detection) and the version you want to deploy.
  2. On the model version details page, on the Actions menu, choose Deploy model version.


Creating a detector

After you deploy your model, you need to create a detector to hold your deployed model and decision logic.

  1. On the Amazon Fraud Detector console, choose Detectors.
  2. Choose Create detector.
  3. For Detector name, enter fraud_detector.
  4. For Event type, choose registration.
  5. Choose Next.


  6. In the Add model section, for Model, choose your model and its version.


  7. Choose Next.

You need to create rules to interpret what is considered a high-risk event based on the model score produced by your detector.

  8. In the Add rules section, for Name, enter high_fraud_risk.
  9. For Expression, enter the following code:
    $sample_fraud_detection_insightscore > 900

Each rule must contain a single expression that captures your business logic. All expressions must evaluate to a Boolean value (true or false) and be less than 4,000 characters in length. If-else type conditions are not supported. All variables used in the expression must be predefined in the evaluated event type. For help with more advanced expressions, see Rule language reference.

  10. For Outcomes, choose the outcome you want for your rule.

An outcome is the result of a fraud prediction. Create an outcome for each possible fraud prediction result. For example, you may want outcomes to represent risk levels (high_risk, medium_risk, and low_risk) or actions (approve, review). You can add one or more outcomes to a rule.

  11. Choose Add rule to run the rule validation checker and save the rule.


  12. In the Configure rule execution section, for Rule execution modes, select First matched.
  13. Choose Next.


  14. In the Review and Create section, choose Create detector.

We have successfully created the detector.

Setting up an Amazon A2I human loop with Amazon Fraud Detector

In this section, we show you how to configure an Amazon A2I custom task type with Amazon Fraud Detector using the accompanying Jupyter notebook. We use a custom task type to integrate a human review loop into any ML workflow. You can use a custom task type to integrate Amazon A2I with other AWS services like Amazon Comprehend, Amazon Transcribe, and Amazon Translate, as well as your own custom ML workflows.

To get started, complete the following steps:

  1. Create a notebook instance in SageMaker.

Make sure your SageMaker notebook has AWS Identity and Access Management (IAM) roles and permissions for FraudDetectorFullAccess and SagemakerFullAccess, and Amazon S3 read and write access to the bucket you specified in BUCKET.

  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and then choose Terminal.
  4. In the terminal, clone the GitHub repo that accompanies this post.
  5. Open the notebook by choosing Amazon A2I and Amazon Fraud Detector.ipynb in the root folder.
  6. Run the Install and Setup steps to install the necessary libraries.
  7. To set up the S3 bucket in the notebook, enter the bucket you created in the prerequisite step in which you uploaded your training data:
    # Replace the following with your bucket name
    BUCKET = 'fraud-detector-a2i'

  8. Run the next cells to assert that your bucket is in the same Region in which you’re running this notebook.

For this post, you create a private work team and add only one user (you) to it.

  9. On the SageMaker console, create a private workforce.
  10. After you create the private workforce, find the workforce ARN and enter the ARN in the notebook:
    WORKTEAM_ARN = "your workteam arn"

  11. Run the notebook cells to complete the setup, such as initializing the Amazon Fraud Detector Python Boto3 APIs (see the sketch after this list).
  12. After you create your fraud detector model, replace the MODEL_NAME, DETECTOR_NAME, EVENT_TYPE, and ENTITY_TYPE values with your model values:
    MODEL_NAME = 'sample_fraud_detection'
    DETECTOR_NAME = 'fraud_detector'
    EVENT_TYPE = 'registration'
    ENTITY_TYPE = 'customer'
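
The setup cells referenced in step 11 initialize the Boto3 clients that the rest of the notebook relies on. The following is a minimal sketch of that initialization (the client variable names match the later snippets; the Region value is a placeholder):

import boto3
import json
import uuid

REGION = 'us-east-1'  # placeholder: use the Region of your notebook and S3 bucket

# Amazon Fraud Detector client used for get_event_prediction
client = boto3.client('frauddetector', region_name=REGION)

# SageMaker and Amazon A2I runtime clients used for the human review workflow
sagemaker = boto3.client('sagemaker', region_name=REGION)
a2i_runtime_client = boto3.client('sagemaker-a2i-runtime', region_name=REGION)

# S3 client used later to read the human review results
s3 = boto3.client('s3', region_name=REGION)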
    

Testing the fraud detector with a sample data record

Run the Amazon Fraud Detector Get Event Prediction API on sample data. This API provides a model score on the event and an outcome based on the designated detector. See the following code:

eventId = uuid.uuid1()
timestampStr = '2013-07-16T19:00:00Z'

# Construct a sample data record

rec = {
   'ip_address': '36.72.99.64',
   'email_address': 'fake_bakermichael@example.net',
   'billing_state' : 'NJ',
   'user_agent' : 'Mozilla',
   'billing_postal' : '32067',
   'phone_number' :'555-555-0100',
   'billing_address' :'12351 Amanda Knolls Fake St'
}


pred = client.get_event_prediction(detectorId=DETECTOR_NAME, 
                                   detectorVersionId='1',
                                   eventId = str(eventId),
                                   eventTypeName = EVENT_TYPE,
                                   eventTimestamp = timestampStr, 
                                   entities = [{'entityType': ENTITY_TYPE, 'entityId':str(eventId.int)}],
                                   eventVariables=rec)

The API provides the following output:

pred
{'modelScores': [{'modelVersion': {'modelId': 'sample_fraud_detection',
    'modelType': 'ONLINE_FRAUD_INSIGHTS',
    'modelVersionNumber': '1.0'},
   'scores': {'sample_fraud_detection_insightscore': 992.0}}],
 'ruleResults': [{'ruleId': 'high-risk', 'outcomes': ['verify']}],
 'ResponseMetadata': {'RequestId': '8902a475-df5b-470d-a990-ec217d5908cd',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 02 Nov 2020 17:22:11 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '250',
   'connection': 'keep-alive',
   'x-amzn-requestid': '8902a475-df5b-470d-a990-ec217d5908cd'},
  'RetryAttempts': 0}}

Run the following notebook cell to print the model score:

pred['modelScores'][0]['scores']['sample_fraud_detection_insightscore']

Creating a human task UI using a custom worker task template

Use HTML elements to create a custom worker template that Amazon A2I uses to generate your worker task UI. For instructions on creating a custom template, see Create Custom Worker Task Template. Amazon A2I provides over 70 pre-built UIs or worker task templates for various use cases. For this post, we use the following custom task template to flag the high-risk output as Fraudulent, Valid, or Needs further Review:

template="""<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
      <crowd-classifier
          name="category"
          categories="['Fraudulent', 'Valid', 'Needs further Review']"
          header="Select the most relevant category"
      >
      <classification-target>
        <h3><strong>Risk Score (out of 1000): </strong><span style="color: #ff9900;">{{ task.input.score.sample_fraud_detection_insightscore }}</span></h3>
        <hr>
        <h3> Claim Details </h3>
        <p style="padding-left: 50px;"><strong>Email Address   :  </strong>{{ task.input.taskObject.email_address }}</p>
        <p style="padding-left: 50px;"><strong>Billing Address :  </strong>{{ task.input.taskObject.billing_address }}</p>
        <p style="padding-left: 50px;"><strong>Billing State   :  </strong>{{ task.input.taskObject.billing_state }}</p>
        <p style="padding-left: 50px;"><strong>Billing Zip     :  </strong>{{ task.input.taskObject.billing_postal }}</p>
        <p style="padding-left: 50px;"><strong>Originating IP  :  </strong>{{ task.input.taskObject.ip_address }}</p>
        <p style="padding-left: 50px;"><strong>Phone Number    :  </strong>{{ task.input.taskObject.phone_number }}</p>
        <p style="padding-left: 50px;"><strong>User Agent      :  </strong>{{ task.input.taskObject.user_agent }}</p>
      </classification-target>
      
      <full-instructions header="Claim Verification instructions">
      <ol>
        <li><strong>Review</strong> the claim application and documents carefully.</li>
        <li>Mark the claim as valid or fraudulent</li>
      </ol>
      </full-instructions>

      <short-instructions>
           Choose the most relevant category that is expressed by the text. 
      </short-instructions>
    </crowd-classifier>

</crowd-form>
"""

You can create a worker task template using the SageMaker console and the SageMaker API operation CreateHumanTaskUi. Run the following cell to create the human task UI for fraud detection:

def create_task_ui(task_ui_name, template):
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=task_ui_name,
        UiTemplate={'Content': template})
    return response
taskUIName = 'fraud'+ str(uuid.uuid1())

# Create task UI
humanTaskUiResponse = create_task_ui(taskUIName, template)
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

Creating a human review workflow definition

Workflow definitions allow you to specify the following:

  • The worker template or human task UI you created in the previous step.
  • The workforce that your tasks are sent to. For this post, it’s the private workforce you created in the prerequisite steps.
  • The instructions that your workforce receives.

This post uses the Create Flow Definition API to create a workflow definition. Run the following cell in the notebook:

def create_flow_definition(flow_definition_name):
    '''
    Creates a Flow Definition resource

    Returns:
    struct: FlowDefinitionArn
    '''
    response = sagemaker.create_flow_definition(
            FlowDefinitionName= flow_definition_name,
            RoleArn= ROLE,
            HumanLoopConfig= {
                "WorkteamArn": WORKTEAM_ARN,
                "HumanTaskUiArn": humanTaskUiArn,
                "TaskCount": 1,
                "TaskDescription": "Please review the  data and flag for potential fraud",
                "TaskTitle": " Review and Approve / Reject Amazon Fraud detector predictions. "
            },
            OutputConfig={
                "S3OutputPath" : OUTPUT_PATH
            }
        )
    
    return response['FlowDefinitionArn']

Optionally, you can create this workflow definition on the Amazon A2I console. For instructions, see Create a Human Review Workflow.

Setting a threshold to start a human loop for high-risk scores from Amazon Fraud Detector predictions

As outlined earlier, you can invoke the Amazon Fraud Detector model endpoint to detect the risk score for given input data. If the risk score is greater than a certain threshold (for example, 900), you create and start the Amazon A2I human loop.

You can change the value of the SCORE_THRESHOLD depending on the risk level for triggering the human review. pred refers to the prediction from the sample record rec from the earlier code. Run the following cell to set up your threshold:

FraudScore = pred['modelScores'][0]['scores']['sample_fraud_detection_insightscore']
print(FraudScore)

SCORE_THRESHOLD = 900
if FraudScore > SCORE_THRESHOLD:

    # Create the human loop input JSON object
    humanLoopInput = {
        'score' : pred['modelScores'][0]['scores'],
        'taskObject': rec
    }

    print(json.dumps(humanLoopInput))

Below is the response:

996.0
{"score": {"sample_fraud_detection_insightscore": 996.0}, "taskObject": {"ip_address": "36.72.99.64", "email_address": "fake_bakermichael@example.net", "billing_state": "NJ", "user_agent": "Mozilla", "billing_postal": "32067", "phone_number": "'555-555-0100", "billing_address": "12351 Amanda Knolls Fake St"}}

Starting a human loop for high-risk Amazon Fraud Detector predictions

We send the human loop input for human review and start the Amazon A2I loop with the start-human-loop API. When using Amazon A2I for a custom task, a human loop starts when StartHumanLoop is called in your application. Run the following cell in the notebook to start the human loop:

# Create flow definition
uniqueId = str(int(round(time.time() * 1000)))
flowDefinitionName = f'fraud-detector-a2i-{uniqueId}'
flowDefinitionArn = create_flow_definition(flowDefinitionName)

# Start the human loop
humanLoopName = 'Fraud-detector-' + str(int(round(time.time() * 1000)))
print('Starting human loop - ' + humanLoopName)

response = a2i_runtime_client.start_human_loop(
                            HumanLoopName=humanLoopName,
                            FlowDefinitionArn= flowDefinitionArn,
                            HumanLoopInput={
                                'InputContent': json.dumps(humanLoopInput)
                                }
                            )

Checking the status of the human loop

Run the following accompanying notebook cell to get a login link to navigate to the private workforce portal:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Use the generated link to log in to the private worker portal. Choose Start working to review the results.


On the next page, you can review and classify the fraud detector’s response or send it for further reviews.


The private worker can review the results and submit a response by selecting an option (for example, Needs further Review) and choosing Submit.
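
You can also check the loop status programmatically before fetching results. The following is a minimal sketch (assuming the humanLoopName and a2i_runtime_client variables from the earlier cells) that builds the completed_loops list iterated over in the next section:

completed_loops = []

# Check whether the loop you started has been completed by the reviewer
resp = a2i_runtime_client.describe_human_loop(HumanLoopName=humanLoopName)
print(f'Human Loop Status: {resp["HumanLoopStatus"]}')

if resp['HumanLoopStatus'] == 'Completed':
    completed_loops.append(humanLoopName)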

Evaluating the results

When the labeling work is complete for each high-risk prediction, your results should be available in the S3 output path specified in the human review workflow definition. The human answers (labels) are returned and saved in a JSON file. Run the notebook cell to get the results from Amazon S3:

import re
import pprint
pp = pprint.PrettyPrinter(indent=2)

def retrieve_a2i_results_from_output_s3_uri(bucket, a2i_s3_output_uri):
    '''
    Gets the json file published by A2I and returns a deserialized object
    '''
    splitted_string = re.split('s3://' +  bucket + '/', a2i_s3_output_uri)
    output_bucket_key = splitted_string[1]

    response = s3.get_object(Bucket=bucket, Key=output_bucket_key)
    content = response["Body"].read()
    return json.loads(content)
    

for human_loop_name in completed_loops:

    describe_human_loop_response = a2i_runtime_client.describe_human_loop(
        HumanLoopName=human_loop_name
    )
    
    print(f'\nHuman Loop Name: {describe_human_loop_response["HumanLoopName"]}')
    print(f'Human Loop Status: {describe_human_loop_response["HumanLoopStatus"]}')
    print(f'Human Loop Output Location: {describe_human_loop_response["HumanLoopOutput"]["OutputS3Uri"]} \n')
    
    # Print the A2I human answers
    pp.pprint(retrieve_a2i_results_from_output_s3_uri(BUCKET, describe_human_loop_response['HumanLoopOutput']['OutputS3Uri']))

The following is the human-reviewed output with the label you just submitted:

Human Loop Name: Fraud-detector-1613589638354
Human Loop Status: Completed
Human Loop Output Location: s3://a2i-fd-demos-2020/a2i-results/fraud-detector-a2i-1613589635065/2021/02/17/19/20/38/Fraud-detector-1613589638354/output.json 

{ 'flowDefinitionArn': 'arn:aws:sagemaker:us-east-1:534095625703:flow-definition/fraud-detector-a2i-1613589635065',
  'humanAnswers': [ { 'acceptanceTime': '2021-02-17T19:20:52.563Z',
                      'answerContent': { 'category': { 'label': 'Needs furthur '
                                                                'review'}},
                      'submissionTime': '2021-02-17T19:23:38.092Z',
                      'timeSpentInSeconds': 165.529,
                      'workerId': '7fe4cd6b55282093',
                      'workerMetadata': { 'identityData': { 'identityProviderType': 'Cognito',
                                                            'issuer': 'https://cognito-idp.us-east-1.amazonaws.com/us-east-1_1DCLqiVmd',
                                                            'sub': 'ec69f8cb-3505-4fef-a2d7-56d1b974644a'}}}],
  'humanLoopName': 'Fraud-detector-1613589638354',
  'inputContent': { 'score': {'sample_fraud_detection_insightscore': 996},
                    'taskObject': { 'billing_address': '12351 Amanda Knolls '
                                                       'Fake St',
                                    'billing_postal': '32067',
                                    'billing_state': 'NJ',
                                    'email_address': 'fake_bakermichael@example.net',
                                    'ip_address': '36.72.99.64',
                                    'phone_number': '555-555-0100',
                                    'user_agent': 'Mozilla'}}}

To improve model performance of the existing Amazon Fraud Detector model, you can combine the preceding JSON response from Amazon A2I with your existing training dataset and retrain your model with a new version.
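
The following is one possible sketch of that feedback step. It assumes you map the reviewer's category to the fraud/legit labels used in training; the label column name and training file name are placeholders, not part of the original walkthrough:

import pandas as pd

# Illustrative mapping from the reviewer's category to the training labels;
# adjust to match your business process
label_map = {'Fraudulent': 'fraud', 'Valid': 'legit'}

a2i_output = retrieve_a2i_results_from_output_s3_uri(
    BUCKET, describe_human_loop_response['HumanLoopOutput']['OutputS3Uri'])
category = a2i_output['humanAnswers'][0]['answerContent']['category']['label']

if category in label_map:
    reviewed = dict(a2i_output['inputContent']['taskObject'])
    reviewed['EVENT_LABEL'] = label_map[category]      # placeholder label column name
    training_df = pd.read_csv('training_data.csv')     # placeholder training file name
    training_df = pd.concat([training_df, pd.DataFrame([reviewed])], ignore_index=True)
    training_df.to_csv('training_data.csv', index=False)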

Cleaning up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when not in use, including the Amazon Fraud Detector model and detector, the human review workflow, the SageMaker notebook instance, and the S3 bucket. For instructions, see the Amazon Fraud Detector and Amazon A2I documentation.

Conclusion

This post demonstrated how you can detect online fraud using Amazon Fraud Detector and set up human review workflows using Amazon A2I custom task type to review and validate high-risk predictions. If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!


About the Authors

Srinath Godavarthi is a Senior Solutions Architect at AWS and is based in the Washington, DC, area. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS. Prior to AWS, he worked with various systems integrators in healthcare, public safety, and telecom verticals. He focuses on innovative solutions using AI and ML technologies.

 

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML. Prior to AWS, she did her masters in Computer Information Systems with a major in Big Data Analytics, and has worked for various IT consultants in the global markets domain.

 

Pranusha Manchala is a Solutions Architect at AWS based in Virginia. She works with hundreds of EdTech customers and provides them with architectural guidance for building highly scalable and cost-optimized applications on AWS. She found her interests in machine learning and artificial intelligence and started to dive deep into this technology. Prior to AWS, she did her masters in Computer Science with double majors in Networking and Cloud Computing.


How Zopa enhanced their fraud detection application using Amazon SageMaker Clarify

This post is co-authored by Jiahang Zhong, Head of Data Science at Zopa

Zopa is a UK-based digital bank and peer to peer (P2P) lender. In 2005, Zopa launched the first ever P2P lending company to give people access to simpler, better-value loans and investments. In 2020, Zopa received a full bank license to offer people more ways to feel good about money. Since 2005, it has lent out over £5 billion to almost half a million borrowers and generated over £250 million in interest for investors on the platform. Zopa’s key business objectives are to identify quality borrowers, offer competitive credit products to them, and provide great customer experience. Technology and machine learning (ML) are at the core of their business, with applications ranging from credit risk modeling to fraud detection and customer service.

In this post, we use Zopa’s fraud detection system for loans to showcase how Amazon SageMaker Clarify can explain your ML models and improve your operational efficiency.

Business context

Every day, Zopa receives thousands of loan applications and lends out millions of pounds to their borrowers. Due to the nature of its products, Zopa is also a target for identity fraudsters. To combat this, Zopa uses advanced ML models to flag suspicious applications for human review, while leaving the majority of genuine applications to be approved by the highly automated system.

Although a primary objective of such models is to achieve great classification performance, another important concern at Zopa is the explainability of these models, for the following reasons:

  • As a financial service provider, Zopa is obligated to treat customers fairly and provide reasonable visibility into its automated decisions.
  • The data scientists at Zopa need to demonstrate the validity of the model and understand the impact of each input feature.
  • The manual review by the underwriters can be quicker if they know why the model has considered a case as suspicious. They can also be more focused in their investigations and reduce friction in the customer experience.

Technical challenge

The advanced ML algorithms used in Zopa’s fraud detector can learn the non-linear relationship and interactions between the input features. Instead of a constant proportional effect, an input feature can have different levels of impact on each model prediction.

The data scientists at Zopa often used several traditional feature importance methods to understand the impact of the input features in non-linear ML models, such as the Partial Dependence Plots and Permutation Feature Importance. However, these methods can only provide summary insights about the model for a specific population. For the purposes we described, Zopa needed to explain the contribution of each input feature into an individual model score. SHAP (SHapley Additive exPlanations), based on the concept of a Shapley value from the field of cooperative game theory, works well for such a scenario.

There are multiple explainability techniques for individual inference to choose from, each with their pros and cons. For example, Tree SHAP is only applicable to tree-based models, and Integrated Gradients are specific to deep learning models. LIME is model agnostic but not always robust, and Kernel SHAP is computationally expensive. Because Zopa uses an ensemble of models, including gradient boosted trees and neural networks, the choice of specific explainability technique needs to accommodate the range of models used.

As a contrastive explainability technique, SHAP values are calculated by evaluating the model on synthetic data generated against a baseline sample. The explanations of the same case can be different depending on the choices of this baseline sample. This can be partly due to the distinct distributions of the chosen baseline population, such as their demographics. It can also be mere statistical fluctuation due to the limited size of the baseline sample constrained by the computation expense. Therefore, it’s important for the data scientists at Zopa to try out various choices of baseline samples efficiently.

After the SHAP explanations are produced at the granularity of an individual inference, the data scientists at Zopa also want to have an aggregated view over a certain inference population to understand the overall impact. This allows them to spot common patterns and outliers and adjust the model accordingly.

Why SageMaker Clarify

SageMaker is a fully managed service to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker Clarify provides ML developers with greater visibility into their training data and models so they can identify and limit bias and explain predictions.

One of the key factors why Zopa chose SageMaker Clarify was due to the benefit of a fully managed service for model explanations with pay-as-you-go billing and the integration with the training and deployment phases of SageMaker.

Zopa trains its fraud detection model on SageMaker and can use SageMaker Clarify to view a feature attributions plot in SageMaker Experiments after the model has been trained. These details may be useful for compliance requirements or can help determine if a particular feature has more influence than it should on overall model behavior.

In addition, SageMaker Clarify uses a scalable and efficient implementation of Kernel SHAP, resulting in performance efficiency and cost savings for Zopa that would be incurred if it managed its own compute resources using the open-source algorithm.

Also, Kernel SHAP is model agnostic, and Clarify supports efficient processing of models with multiple outcomes via Spark-based parallelization. This is important to Zopa because it typically uses a combination of different frameworks like XGBoost and TensorFlow, and requires explainability for each model outcome. SHAP values of individual predictions can be computed via a SageMaker Clarify processing job and made available to the underwriting team to understand individual predictions.

SHAP explanations are contrastive and account for deviations from a baseline. Different baselines can generate different explanations, and SageMaker Clarify allows you to input a baseline of your choice. A non-informative baseline can be constructed as the average or random instance from the training dataset, or an informative baseline can be constructed by setting the non-actionable features to the same value as in the given instance. For more information about baseline choices and settings, see SHAP Baselines for Explainability.
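
To illustrate what this looks like in practice, the following is a minimal sketch of a Clarify explainability job with an explicit baseline. The bucket, model name, feature names, and label column are placeholders, not Zopa's actual configuration:

from sagemaker import Session, clarify, get_execution_role

session = Session()
role = get_execution_role()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type='ml.c5.xlarge', sagemaker_session=session)

# Baseline rows drawn from past approved, non-fraud applications (path is a placeholder)
shap_config = clarify.SHAPConfig(
    baseline='s3://my-bucket/clarify/baseline.csv',
    num_samples=100,
    agg_method='mean_abs')

model_config = clarify.ModelConfig(
    model_name='fraud-model',           # placeholder SageMaker model name
    instance_type='ml.m5.xlarge',
    instance_count=1,
    accept_type='text/csv')

data_config = clarify.DataConfig(
    s3_data_input_path='s3://my-bucket/clarify/inference-data.csv',
    s3_output_path='s3://my-bucket/clarify/output',
    label='is_fraud',                   # placeholder label column
    headers=['is_fraud', 'feature_1', 'feature_2'],
    dataset_type='text/csv')

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config)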

Solution overview

Zopa’s fraud detection models use a few dozen input features, such as application details, device information, interaction behavior, and demographics. For model building, the training dataset was extracted from their Amazon Redshift data warehouse and cleaned up before being stored into Amazon Simple Storage Service (Amazon S3). Because Zopa has its own in-house ML library for both feature engineering and ML framework support, it uses the bring your own container (BYOC) approach to leverage the SageMaker managed services and advanced functionalities such as hyperparameter optimization. The optimized models are then deployed through a Jenkins CI/CD pipeline to the existing production system and serve as a microservice for real-time fraud detection as part of Zopa’s customer-facing platform.

As previously mentioned, model explanations are carried out both during model training for model validation and after deployment for model monitoring and generating insights for underwriters. These are done in a non-customer-facing analytical environment due to heavy computation requirements and high tolerance of latency. Zopa uses SageMaker MMS model serving stack in a similar BYOC fashion to register the models for the SageMaker Clarify processing job. SageMaker Clarify spins up an ephemeral model endpoint and invokes it for millions of predictions on synthetic contrastive data. These predictions are then used to compute SHAP values for each individual case, which are stored in Amazon S3.

As mentioned above, an important parameter of the SHAP explainability technique is the choice of the baseline sample. For the fraud detection model, the primary concern of explanation is on those instances that are classified as suspicious. Zopa’s data scientists use an informative baseline sample from the population of past approved non-fraud applications, to explain why those flagged instances are considered suspicious by the model. With SageMaker Clarify, Zopa can also quickly experiment with baseline samples of different sizes, to determine the final baseline sample which gives low statistical uncertainty while keeping the computation cost reasonable.

For model validation and monitoring, the global feature impact can be examined by the aggregation of SHAP values on the training and monitoring data, which are available in the SageMaker Experiments panel. To give insights to operation, the data scientists filter out the features that contributed to the fraud score positively (a likely fraudster) for each individual case, and report them to the underwriting team in the order of the SHAP value of each feature.
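
As a simple illustration of that last reporting step, the following sketch (the file name and feature columns are hypothetical) keeps only the features that pushed a single flagged application toward fraud and orders them by their SHAP value:

import pandas as pd

# Hypothetical local copy of the per-instance SHAP values produced by the
# Clarify processing job; each column is a feature, each row an application
shap_df = pd.read_csv('local_shap_values.csv')

# Take one flagged application and keep only positive (toward-fraud) contributions
case = shap_df.iloc[0]
positive_contributions = case[case > 0].sort_values(ascending=False)
print(positive_contributions)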

The following diagram illustrates the solution architecture.

Conclusion

For a regulated financial service company like Zopa, it’s important to understand how each factor contributes to its ML model’s decision. Having visibility into the reasoning of the model gives confidence to its stakeholders, both internal and external. It also helps its operations team respond faster and provide a better service to their customers. With SageMaker Clarify, Zopa can now produce model explanations more quickly and seamlessly.

To learn more about SageMaker Clarify, see What Is Fairness and Model Explainability for Machine Learning Predictions?


About the Authors

Hasan Poonawala is a Machine Learning Specialist Solution Architect at AWS, based in London, UK. Hasan helps customers design and deploy machine learning applications in production on AWS. He is passionate about the use of machine learning to solve business problems across various industries. In his spare time, Hasan loves to explore nature outdoors and spend time with friends and family.

 

 

Jiahang Zhong is the Head of Data Science at Zopa. He is responsible for data science and machine learning projects across the business, with focus on credit risk, financial crime, operation optimization and customer engagement.

 

 

 


Training, debugging and running time series forecasting models with the GluonTS toolkit on Amazon SageMaker

Time series forecasting is an approach to predict future data values by analyzing the patterns and trends in past observations over time. Organizations across industries require time series forecasting for a variety of use cases, including seasonal sales prediction, demand forecasting, stock price forecasting, weather forecasting, financial planning, and inventory planning.

Various cutting edge algorithms are available for time series forecasting, such as DeepAR, the seq2seq family, and LSTNet (Long- and Short-term Time-series network). The machine learning (ML) process for time series forecasting is often time-consuming, resource intensive, and requires comparative analysis across multiple parameter combinations and datasets to reach the required precision and accuracy with your models. To determine the best model, developers and data scientists need to:

  1. Select algorithms and hyperparameters.
  2. Build, configure, train, and tune models.
  3. Evaluate these models and compare metrics captured at training and evaluation time.
  4. Visualize results.
  5. Repeat the preceding steps multiple times before choosing the optimal model.

The infrastructure management associated with the scaling required at training time for such an iterative process may lead to undifferentiated heavy lifting for the developers and data scientists involved.

In this post and the associated notebook, we show you how to address these challenges by providing an approach with detailed steps to set up and run time series forecasting models at scale using Gluon Time Series (GluonTS) on Amazon SageMaker. GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet. GluonTS provides utilities for loading and iterating over time series datasets, state-of-the-art models ready to be trained, and building blocks to define your own models and quickly experiment with different solutions.

Solution overview

We first show you how to set up GluonTS on SageMaker using the MXNet estimator, then train multiple models using SageMaker Experiments, use SageMaker Debugger to mitigate suboptimal training, evaluate model performance, and finally generate time series forecasts. We walk you through the following steps:

  1. Prepare the time series dataset.
  2. Create the algorithm and hyperparameters combinatorial matrix.
  3. Set up the GluonTS training script.
  4. Set up a SageMaker experiment and trials.
  5. Create the MXNet estimator.
  6. Set up an experiment with Debugger enabled to automatically stop suboptimal jobs.
  7. Train and validate models.
  8. Evaluate metrics and select a winning candidate.
  9. Run time series forecasts.

Prerequisites

Before getting started, you must set up your SageMaker notebook instance and install the required packages. Complete the following steps:

  1. Onboard to Amazon SageMaker Studio with the Quick start procedure.
  2. When you create an AWS Identity and Access Management (IAM) role for the notebook instance, be sure to specify access to Amazon Simple Storage Service (Amazon S3). You can choose any S3 bucket or specify the S3 bucket you want to enable access to. You can use the AWS managed policy AmazonSageMakerFullAccess to grant general access to SageMaker services.


  3. When the user is created and active, choose Open Studio.


  4. On the Studio landing page, on the File drop-down menu, choose New.
  5. Choose Terminal.


  6. In the terminal, enter the following code:
    git clone https://github.com/aws-samples/amazon-sagemaker-gluonts-timeseriesforecasting-with-debuggerandexperiments

  7. Open the notebook by choosing Amazon SageMaker GluonTS time series forecasting.ipynb.
  8. Install the required packages by entering the following code:
    ! pip install gluonts
    ! pip install --upgrade sagemaker
    ! pip install sagemaker-experiments
    ! pip install --upgrade smdebug-rulesconfig

Preparing the time series dataset

For this post, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.) The usage data is aggregated hourly.

Let’s download and store the usage data as a DataFrame:

import pandas as pd
url = "https://raw.githubusercontent.com/aws-samples/amazon-forecast-samples/master/notebooks/common/data/item-demand-time.csv"
df = pd.read_csv(url, header=None, names=["date", "usage", "client"])

Define the S3 bucket and folder locations to store the test and training data. This should be within the same Region as the notebook instance, training, and hosting.
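
A minimal sketch of those definitions, matching the bucket name and prefix variable used in the next cell (the prefix value here is a placeholder):

import boto3

# S3 client and key prefix used by the upload calls below; the bucket
# (glutonts-electricity in this example) must be in the same Region as the notebook
s3_client = boto3.client('s3')
pref = 'electricity-data'  # placeholder prefix for the train/test CSV keys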

Now let’s divide the raw data into train and test samples and save them in their respective S3 folder locations using the Pandas DataFrame query function. We can check the first few entries of the train and test datasets. Both datasets should have the same fields, as in the following code:

df_train = df.query('date <= "2014-31-10 11:00:00"').copy()
df_train.to_csv("train.csv")
s3_client.upload_file("train.csv", "glutonts-electricity", pref+"/train.csv")
df_train.head()

df_test = df.query('date >= "2014-1-11 12:00:00"').copy()
df_test.to_csv("test.csv")
s3_client.upload_file("test.csv", "glutonts-electricity", pref+"/test.csv")
df_test.head()

Creating the algorithm and hyperparameters combinatorial matrix

GluonTS comes with pre-built probabilistic forecasting models. Instead of simply predicting a single point estimate, probabilistic forecasting assigns a probability to every outcome. GluonTS provides a number of ready to use algorithm packages for training probabilistic forecasting models. When you select an algorithm, you can configure the hyperparameters to control the learning process during model training.

SageMaker supports bring your own model using Script mode. You can use SageMaker to train and deploy a model using custom MXNet code. The Amazon SageMaker Python SDK MXNet estimators and models and the SageMaker open-source MXNet container make writing a MXNet script and running it in SageMaker easier.

In this post, we train using four different models from the GluonTS toolkit:

  • DeepAR – A supervised learning algorithm for forecasting scalar time series using recurrent neural networks (RNNs)
  • SFeedFwd (Simple Feedforward) – A supervised learning algorithm where information moves in only one direction, forward, from the input nodes through the hidden nodes (if any) to the output nodes
  • LSTNet – A multivariate time series forecasting model that uses a combination of convolutional neural networks (CNNs) and RNNs to find short-term local dependency patterns among variables and then find long-term patterns for time series trends
  • seq2seq (sequence-to-sequence learning) – A family of architectures; for this post we use the MQCNNEstimator of the seq2seq family to set up our training

All these algorithms are already part of GluonTS; we use them to quickly iterate and experiment over different models.

A trainer defines how a network is going to be trained. Let’s define a trainer object using a Pandas DataFrame that has the base list of algorithms, different epochs, learning rate, and hyperparameter combinations that we want to define for our training runs. We use the product function to derive combinations of these parameters from the base set into separate rows in the DataFrame. Each row corresponds to a training job configuration that we subsequently pass to the MXNet estimator to run the training job. See the following code:

import pandas as pd
d = {'epochs': [5, 10, 15, 20], 'algo': ["DeepAR", "SFeedFwd", "lstnet", "seq2seq"], 'num_batches_per_epoch': [10, 15, 20, 25], 'learning_rate':[1e-2, 1e-3, 1e-3, 1e-3], 'hybridize':[False, True, True, True]}
df_hps = pd.DataFrame(data=d)
df_hps['prediction_length'] = [30, 60, 75, 100]

from itertools import product

prod = product(df_hps['epochs'].unique(), df_hps['algo'].unique(), df_hps['num_batches_per_epoch'].unique(), df_hps['learning_rate'].unique(), df_hps['hybridize'].unique(), df_hps['prediction_length'].unique())
df_hps_combo = pd.DataFrame([list(p) for p in prod],
               columns=list(['epochs', 'algo', 'num_batches_per_epoch', 'learning_rate', 'hybridize', 'prediction_length']))
df_hps_combo['jobnumber'] = df_hps_combo.index

Setting up the GluonTS training script

We use a Python entry script to import the necessary GluonTS libraries, set up the GluonTS estimators using the model packages for our algorithms of interest, and pass in our algorithm and hyperparameter preferences from the MXNet estimator we set up in the notebook. The script uses the train and test data files we uploaded to Amazon S3 to create the corresponding GluonTS datasets for training and evaluation. When training is complete, the script runs an evaluation to generate metrics and store them using the SageMaker Debugger hook function, which we use to choose a winning model. For further analysis, the metrics are also available via the SageMaker trial component analytics (which we discuss later in this post). The model is then serialized for storage and future retrieval.

For more details, refer to the entry script available in the GitHub repo. From the accompanying notebook, you can also run the cell in Step 3 to review the script.
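
As an illustration of the core GluonTS calls the entry script makes, the following is a simplified sketch that builds a GluonTS dataset from the training DataFrame created earlier and trains a DeepAR estimator. Import paths can vary slightly between GluonTS versions, and the hyperparameter values are examples only:

from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

# Build a GluonTS dataset from the hourly usage series
train_ds = ListDataset(
    [{"start": df_train["date"].iloc[0], "target": df_train["usage"].values}],
    freq="1H")

estimator = DeepAREstimator(
    freq="1H",
    prediction_length=30,
    trainer=Trainer(epochs=10, learning_rate=1e-3, num_batches_per_epoch=10))

# train() returns a fitted predictor, which the entry script serializes for later use
predictor = estimator.train(training_data=train_ds)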

Setting up a SageMaker experiment

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. SageMaker Experiments is integrated with SageMaker Studio, providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models. SageMaker Experiments comes with its own Experiments SDK, which makes the analytics capabilities easily accessible in SageMaker notebooks. Because SageMaker Experiments enables tracking of all the steps and artifacts that go into creating a model, you can quickly revisit the origins of a model when you’re troubleshooting issues in production or auditing your models for compliance verifications. You can create your experiment with the following code:

from smexperiments.experiment import Experiment
sagemaker_boto_client = boto3.client("sagemaker")

Experiment.create(
experiment_name=experiment_name,
description="Timeseries models",
sagemaker_boto_client=sagemaker_boto_client)

For each job, we define a new trial component within that experiment. Next we define an experiment config, which is a dictionary that we pass into the fit() method later on. This ensures that the training job is associated with that experiment and trial. For the full code block for this step, refer to the accompanying notebook.
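
The following is a minimal sketch of that per-job setup (the trial naming scheme is illustrative):

import time
from smexperiments.trial import Trial

# One trial per training job configuration from the combinatorial matrix
trial = Trial.create(
    trial_name="timeseries-trial-" + str(int(time.time())),  # illustrative naming scheme
    experiment_name=experiment_name,
    sagemaker_boto_client=sagemaker_boto_client)

# Passed to estimator.fit() so the training job is tracked as a trial component
experiment_config = {
    "ExperimentName": experiment_name,
    "TrialName": trial.trial_name,
    "TrialComponentDisplayName": "Training",
}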

Creating the MXNet estimator

You can run MXNet training scripts on SageMaker by creating an MXNet estimator. Before setting up the actual training runs with the parameter sweep, let’s test the MXNet estimator using a single set of an algorithm and hyperparameters, in this case DeepAR. See the following code:

import sagemaker
from sagemaker.mxnet import MXNet

mxnet_estimator = MXNet(entry_point='blog_train_algos.py',
                        role=sagemaker.get_execution_role(),
                        instance_type='ml.m5.large',
                        instance_count=1,
                        framework_version='1.7.0', 
                        py_version='py3',
                        hyperparameters={'bucket': bucket,
                            'seq': 10,
                            'algo': "DeepAR",             
                            'freq': "D", 
                            'prediction_length': 30, 
                            'epochs': 10,
                            'learning_rate': 1e-3,
                            'hybridize': False,
                            'num_batches_per_epoch': 10,
                         })

After specifying our estimator with all the necessary hyperparameters, we can train it using our training dataset. We train it by invoking the fit() method of the estimator. We pass the location of train and test data as well as the experiment configuration. The training algorithm returns a fitted model (or a predictor in GluonTS parlance) that we can use to construct forecasts. See the following code:

mxnet_estimator.fit({"train": s3_train_channel, "test": s3_test_channel}, 
                    experiment_config=experiment_config,
                    wait=False)

You can review the job parameters and metrics from the trial component view in SageMaker Studio (see the following screenshot).


Setting up an experiment with SageMaker Debugger enabled to automatically stop suboptimal jobs

We ran a parameter sweep and created lots of different configurations when we ran the product function to generate the hyperparameters combinatorial matrix in the second step above. Doing so may produce parameter combinations that lead to suboptimal models. We can use SageMaker Debugger to tune our experiment. Debugger automatically captures data from the model training and provides built-in rules that check for conditions such as overfitting and vanishing gradients. We can then specify actions to automatically stop training jobs ahead of time that would otherwise produce low-quality models. Some of the models in our experiment use RNNs that can suffer from the vanishing gradient problem. We use the Debugger tensor variance rule, which allows us to specify an upper and lower bound on the gradient values. We also specify the action StopTraining, which stops a training job when the rule triggers. By default, Debugger collects data with an interval of 500 steps. For this post, our training dataset is small and our models only train for a few minutes, so we can decrease the save interval. We create a custom collection, where we collect gradients at an interval of 5:

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

debugger_hook_config = DebuggerHookConfig(
      collection_configs=[ 
          CollectionConfig(
                name="custom_collection",
                parameters={ "include_regex": "(.*gradient)(?!.*featureembedder)(.*weight)",
                             "start_step": "10",
                             "save_interval": "5"})])

We then define a new SageMaker experiment to run the trials based on the combinatorial matrix we created earlier. When the experiment is complete, we can determine how many seconds it ran. We then define a helper function to compute the billable seconds and how many training jobs were stopped automatically. This setup is especially useful if you run a parameter sweep with training jobs that train for hours. In our case, each job only trained for less than 10 minutes. Until the Debugger data is uploaded, fetched, and downloaded into the processing job, a few minutes may pass, so the potential cost reduction is less for smaller training jobs.

#name of experiment
timestep = datetime.now()
timestep = timestep.strftime("%d-%m-%Y-%H-%M-%S")
experiment_name = timestep + "-timeseries-models"

#create experiment
Experiment.create(
    experiment_name=experiment_name, 
    description="Timeseries models", 
    sagemaker_boto_client=sagemaker_boto_client)
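
The billable-seconds helper can be sketched as follows (it assumes you keep a list of the training job names launched in the experiment):

def summarize_experiment_cost(job_names):
    """Sum billable seconds and count jobs that Debugger stopped automatically."""
    billable_seconds = 0
    stopped_jobs = 0
    for job_name in job_names:
        desc = sagemaker_boto_client.describe_training_job(TrainingJobName=job_name)
        billable_seconds += desc.get('BillableTimeInSeconds', 0)
        if desc['TrainingJobStatus'] == 'Stopped':
            stopped_jobs += 1
    print(f"Billable seconds: {billable_seconds}, jobs stopped by Debugger: {stopped_jobs}")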

See the accompanying notebook for the full code in this section.

Training and validating models

In a previous step, we trained one model. Now we iterate over all possible combinations of hyperparameters and algorithms that we generated using the product function with the SageMaker Debugger rules enabled to detect suboptimal training jobs and stop them automatically if the rule fails. A SageMaker experiment consists of multiple trials with a related objective. A trial consists of one or more trial components, such as a data preprocessing job and a training job. Each trial component within our experiment corresponds to one training job run. SageMaker Studio provides an experiments browser that you can use to view lists of experiments, trials, and trial components (see the following screenshot).


You can choose one of these entities to view detailed information about the entity or choose multiple entities for comparison (see the following screenshot).


For more information, see View Experiments, Trials, and Trial Components. For the code block for this step, refer to the accompanying notebook. If you would like to tune further, you can also run a hyperparameter tuning job. Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. Please refer to the SageMaker documentation for an example.

When our experiment completes its training run, we can check to see if any training jobs were stopped automatically. As we can see in the following screenshot, Debugger identified that the tensor variance of three jobs exceeded the gradient limits we set up in the rule and stopped them.


Evaluating metrics and selecting a winning candidate

When the training jobs are running, we can use the experiments view in Studio or the ExperimentAnalytics module to track the status of our training jobs and their metrics. In the training script, we used the SageMaker Debugger function save_scalar to store metrics such as mean absolute percentage error (MAPE), mean squared error (MSE), and root mean squared error (RMSE) in the experiment. We can access the recorded metrics via the ExperimentAnalytics function and convert it to a Pandas DataFrame:

from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(experiment_name=experiment_name)
df = trial_component_analytics.dataframe()
new_df = df[['epochs', 'learning_rate', 'hybridize', 'num_batches_per_epoch','prediction_length','scalar/MASE_GLOBAL - Min', 'scalar/MSE_GLOBAL - Min', 'scalar/RMSE_GLOBAL - Min', 'scalar/MAPE_GLOBAL - Min']]
mape_min = new_df['scalar/MAPE_GLOBAL - Min'].min()
df_winner = new_df[new_df['scalar/MAPE_GLOBAL - Min'] == mape_min]

Now let’s review the job summary and find the job with the best forecasting accuracy. Different metrics capture different aspects of forecast error; MAPE is a common statistical measure of forecast accuracy. From our results matrix, let’s find the training job with the lowest MAPE value to get the winning model. The following screenshot shows that the lowest MAPE in our run is 0.07373.

Download the winning model with the following code:

import os
import boto3

s3 = boto3.client("s3")
windir = "gluonts/blog-models/"+str(df_winner['jobnumber'].item())+"/"

def downloadDirectoryFroms3(bucket, windir):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket) 
    for obj in bucket.objects.filter(Prefix = windir):
        print(obj.key)
        if not os.path.exists(os.path.dirname(windir)):
            os.makedirs(os.path.dirname(windir))
        bucket.download_file(obj.key, obj.key) # save to the same relative path locally


downloadDirectoryFroms3(bucket, windir)

Restore the predictor with the following code:

import pathlib
from gluonts.model.predictor import Predictor

path = pathlib.Path(windir)
winning_predictor = Predictor.deserialize(path)

Running time series forecasts

When we use the GluonTS predictor to run our forecasts, we request predictions for the quantiles we’re interested in. A forecast at a specified quantile provides a prediction interval, which is a range of possible values that accounts for forecast uncertainty. For example, a forecast at the 0.5 quantile estimates a value that is lower than the observed value 50% of the time. Our prediction returns a QuantileForecast object that contains the forecast arrays ordered by quantile, along with the mean. See the following code:

import matplotlib.pyplot as plt
from gluonts.dataset.common import ListDataset
plt.rcParams['figure.figsize'] = (20.0, 6.0)

# run forecast
startdate = '2014-11-01 01:00:00'
test_pred = ListDataset(
    [{"start": startdate, "target": raw_df.query('date >= "2014-11-01 01:00:00" and client == "client_12"').copy()['usage'], "item_id": 'client_12'}],
    freq = "1H"
)

pred = winning_predictor.predict(test_pred)
for test_entry, forecast in zip(test_pred, pred):
    print(forecast.start_date)
    plt.plot(pd.date_range(start=startdate, periods=30), pd.DataFrame.from_dict(test_entry['target'])[0][:30],color='b')
    plt.plot(pd.date_range(start=forecast.start_date, periods=df_winner['prediction_length'].item()), forecast.quantile(.3), color='r') #samples contain all 100 quantiles
    plt.plot(pd.date_range(start=forecast.start_date, periods=df_winner['prediction_length'].item()), forecast.quantile(.5), color='g') #samples contain all 100 quantiles
    plt.plot(pd.date_range(start=forecast.start_date, periods=df_winner['prediction_length'].item()), forecast.quantile(.7), color='k') #samples contain all 100 quantiles
    x=pd.date_range(start=forecast.start_date, periods=df_winner['prediction_length'].item()) #samples contain all 100 quantiles
    y=forecast.quantile(.1) 
    z=forecast.quantile(.9)
    plt.fill_between(x,y,z,color='g', alpha=0.3)
plt.xticks(rotation=30)
plt.legend(['Usage'], loc = 'lower left')
plt.show()

The blue line in the following forecasted plot represents the historical energy usage for a specific client, and the red, green, and black lines indicate the predicted energy usage at the 30%, 50%, and 70% quantiles, respectively, for that client. The shaded green band is the range between the 10% and 90% quantiles.

For more details, see the GluonTS Model Forecast module.

Conclusion

With SageMaker, it’s easy for every developer and data scientist to set up time series forecasting at scale using the MXNet estimator with the GluonTS toolkit. SageMaker removes the undifferentiated heavy lifting from every step of our ML process, automates infrastructure management, enables us to improve the training efficiency with SageMaker Debugger, and accelerates adoption of ML workflows from months to days. Try out the notebook from our post and let us know your comments and feedback.

References

For more information about GluonTS and algorithms like DeepAR, see the following:


About the Authors

Prem Ranga is an Enterprise Solutions Architect based out of Atlanta, GA. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

 

Nathalie Rauschmayr is an Applied Scientist at AWS, where she helps customers develop deep learning applications.

 

 

 

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.

 

 

Jana Gnanachandran is an Enterprise Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and Serverless platforms. He helps AWS customers across numerous industries to design and build highly scalable, data-driven, analytical solutions to accelerate their cloud adoption. In his spare time, he enjoys playing tennis, 3D printing, and photography.

Read More

Applying voice classification in an Amazon Connect telemedicine contact flow

Given the rising demand for fast and effective COVID-19 detection, customers are exploring the usage of respiratory sound data, like coughing, breathing, and counting, to automatically diagnose COVID-19 based on machine learning (ML) models. University of Cambridge researchers built a COVID-19 sound application and demonstrated that a simple binary ML classifier can classify healthy and COVID-19 coughs with over 80% area under the curve (AUC) for all tasks. Massachusetts Institute of Technology (MIT) researchers published a similar open voice model, and their Convolutional Neural Network (CNN) based binary classifier achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC 0.97). Carnegie Mellon University also built a COVID voice detector to develop an automated AI system to diagnose a COVID-19 infection based on the human voice. The promising results of these preliminary studies based on crowdsourced audio signals show the power of AI in the medical industry for disease diagnosis and detection.

Although the research has shown a lot of promise, it’s still difficult to create a scalable solution that takes advantage of these promising models. In this post, we demonstrate a smart call center application workflow that integrates a voice classification model to detect COVID-19 infections or other types of respiratory diseases in people calling in to the call center. For the purposes of creating an end-to-end workflow, we train the model on the open-source Coswara data, which relies on a variety of sounds like deep or shallow breathing, coughing, and counting to distinguish healthy versus unhealthy sound. You can replace this model and training data with any other model or dataset to achieve the level of performance demonstrated in the research papers.

Overview of solution

This solution uses an Amazon Connect contact flow (Amazon Connect is an easy-to-use omnichannel cloud contact center) to make real-time inference requests against an ML model trained and deployed using Amazon SageMaker. The audio recordings are labeled as healthy (negative) or unhealthy (positive), meaning a COVID-19 infection or other respiratory illness. Because the distribution of positive and negative labels is highly imbalanced, we use the oversampling technique from the Python imbalanced-learn library to improve the ratio. We use a PyTorch acoustic classification model, which relies on a deep convolutional neural network (CNN), for this audio-based COVID prediction. The trained CNN model is deployed to a SageMaker inference endpoint. An AWS Lambda function triggered by the Amazon Connect contact flow makes real-time inferences based on the audio stream of the phone call, which Amazon Connect captures in Amazon Kinesis Video Streams.

The following is the architecture diagram for integrating online ML inference in a telemedicine contact flow via Amazon Connect.

Training and deploying a voice classification model using SageMaker

We first create a SageMaker notebook instance, on which we build a voice classification deep learning model to predict the likelihood of respiratory diseases using the open-source Coswara dataset. To deploy the AWS CloudFormation stack for the notebook instance, choose Launch Stack:

Feel free to change the notebook instance type if necessary. The deployment also clones the following two GitHub repositories:

Go to the Jupyter notebook coswara-audio-classification.ipynb under the applying-voice-classification-in-amazon-connect-contact-flow/sagemaker-voice-classification/notebook folder.

The notebook walks you through the following tasks:

  1. Preprocess the Coswara data, including uncompressing files and generating the metadata CSV files for each type of audio recording.
  2. Build and upload the Docker container image for SageMaker training and inference jobs to Amazon Elastic Container Registry (Amazon ECR).
  3. Upload Coswara data to an Amazon Simple Storage Service (Amazon S3) bucket for the SageMaker training job.
  4. Train a PyTorch CNN estimator for voice classification given the sample hyperparameters.
  5. Create a hyperparameter optimization (HPO) job (optional).
  6. Deploy the trained PyTorch estimator to the SageMaker inference endpoint.
  7. Test batch prediction and invoke the endpoint.

Because this dataset is highly imbalanced, we labeled healthy samples as negative and all non-healthy samples as positive, and over-sampled the positive ones using the imbalanced-learn library in the train.py file under the notebook folder:

import torch
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
for data, target in data_loader:
    data_resampled, target_resampled = ros.fit_resample(np.squeeze(data), target)
    data = torch.from_numpy(data_resampled)
    data = data.unsqueeze_(-2)
    target = torch.tensor(target_resampled)

In the preceding code, the data and target are torch tensors returned by the __getitem__ function defined in the CoswaraDataset class in the coswara_dataset.py file. The oversampling approach improved the prediction performance by approximately 40%. We implemented a very deep CNN for voice classification in the inference.py file, with the default number of classes set to two, and applied different metrics from the Scikit-learn Python library to evaluate the prediction performance:

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, fbeta_score, roc_auc_score
accuracy = accuracy_score(actuals, predictions)
rocauc = roc_auc_score(actuals, np.exp(prediction_probs))
precision = precision_score(actuals, predictions, average='weighted')
recall = recall_score(actuals, predictions, average='weighted')
f1 = f1_score(actuals, predictions, average='weighted')
f2 = fbeta_score(actuals, predictions, average='weighted', beta=0.5)

The tuning job tries to maximize the F-beta score, which is the weighted harmonic mean of precision and recall. When you’re satisfied with the prediction performance of the training job, you can deploy a SageMaker inference endpoint:

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(model_data=model_location, 
                             role=role, 
                             entry_point='inference.py',
                             source_dir='./',
                             py_version='py3',
                             framework_version='1.6.0',
                            )
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge', wait=True)

After deploying the estimator for online prediction, take note of the inference endpoint name, which you use in the next step.

The inference endpoint can be invoked with two types of request body, as defined in the inference.py file:

  • A text string for the S3 object of the audio recording WAV file
  • A pickled NumPy array

See the following code:

def input_fn(request_body, request_content_type):
    if request_content_type == 'text/csv':
        new_sr=8000
        audio_len=20
        sampling_ratio=5
        tmp=request_body[5:]
        bucket=tmp[:tmp.index('/')]
        print("bucket: {}".format(bucket))
        obj=tmp[tmp.index('/')+1:]
        print("object: {}".format(obj))
        s3.download_file(bucket, obj, '/audioinput.wav')
        print("audio input file size: {}".format(os.path.getsize('/audioinput.wav')))
        waveform, sample_rate = torchaudio.load('/audioinput.wav')
        waveform = torchaudio.transforms.Resample(sample_rate, new_sr)(waveform[0, :].view(1, -1))
        const_len = new_sr * audio_len
        tempData = torch.zeros([1, const_len])
        if waveform.shape[1] < const_len:
            tempData[0, : waveform.shape[1]] = waveform[:]
        else:
            tempData[0, :] = waveform[0, :const_len]
        sound = tempData
        tempData = torch.zeros([1, const_len])
        if sound.shape[1] < const_len:
            tempData[0, : sound.shape[1]] = sound[:]
        else:
            tempData[0, :] = sound[0, :const_len]
        sound = tempData
        new_const_len = const_len // sampling_ratio
        soundFormatted = torch.zeros([1, 1, new_const_len])
        soundFormatted[0, 0, :] = sound[0, ::5]
        return soundFormatted
    elif request_content_type in ['application/x-npy', 'application/python-pickle']:
        return torch.tensor(np.load(BytesIO(request_body), allow_pickle=True))
    else:
        print("unknown request content type: {}".format(request_content_type))
        return request_body

The output is the probability of the positive class (from 0 to 1), which indicates how likely it is that the voice is unhealthy in this use case. It’s defined in inference.py as well:

def predict_fn(input_data, model):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    with torch.no_grad():
        output = model(input_data.to(device))
        output = output.permute(1, 0, 2)[0]
        pred_prob = np.exp( output.cpu().detach().numpy()[:,1] )
        return pred_prob[0]
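
To illustrate how a client can call the endpoint, the following is a hedged sketch (the bucket and key are placeholders) that sends the S3 location of a WAV file as a text/csv request body, matching the input_fn logic shown earlier:

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,   # the inference endpoint name you noted earlier
    ContentType="text/csv",
    Body="s3://your-bucket/path/to/recording.wav",
)
print(response["Body"].read())    # probability that the voice is unhealthy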

Deploying a CloudFormation template for Lambda functions for audio streaming inference

You can deploy the Lambda functions with the following one-click CloudFormation stack deployment in the us-east-1 Region:

You need to fill in the S3 bucket name for the audio recording and the SageMaker inference endpoint as parameters.

If you want to deploy this stack in AWS Regions other than us-east-1, or if you want to change the Lambda functions, go to the connect-audio-stream-solution folder and follow the steps to build and deploy the Serverless Application Model (AWS SAM) stack. Take note of the CloudFormation stack outputs for the Lambda function ARNs, which you use in the next step.

Setting up an interactive voice response using Amazon Connect

We use an Amazon Connect contact flow to trigger the Lambda functions created in the previous step and process the captured audio recording in Kinesis Video Streams. This assumes you have an Amazon Connect instance ready to use; for instructions on setting one up, see Create an Amazon Connect instance. You also need to enable live audio streaming for your instance. Your instance should be created in the same AWS Region as your previous CloudFormation stack, because your video stream must be in the same Region for the Lambda functions to consume it.

You can create a new inbound contact flow by importing the flow configuration file. You need to claim a phone number and associate it with the newly created contact flow. There are two Lambda functions to be configured here: the ARNs of ContactFlowlambdaInitArn and ContactFlowlambdaTriggerArn, located on the Outputs tab of the CloudFormation stack you deployed in the previous step.

After changing the ARNs for the Lambda functions, save and publish the contact flow. Now you’re ready to test it by calling the associated phone number with this contact flow.

Cleaning up

To avoid unexpected future charges, clean up your resources:

  1. Delete the SageMaker inference endpoint.
  2. Empty and delete the S3 bucket DefaultS3Bucket.
  3. Delete the CloudFormation stack for the SageMaker notebook instances and Lambda functions used by Amazon Connect.

References

This solution was inspired and built upon the following GitHub repos:

Conclusion

In this post, we demonstrated how to predict the likelihood of COVID-19 or other respiratory diseases just based on voice classification. To further improve the ML prediction performance, you can incorporate other related information into the model, like age, gender, or existing symptoms. Audio data augmentation plus handcrafted features can help yield better prediction results, according to existing studies. You can use the audio-based diagnostic prediction in an Amazon Connect contact flow to triage the targeted group of incoming calls and escalate to a doctor to follow up if necessary. The intelligence provided by the acoustic classification can be used by call center agents in conjunction with Contact Lens for Amazon Connect, which provides a turn-by-turn transcript, real-time alerts, automated call categorization based on keywords and phrases, sentiment analysis, issue detection (the reason the customer contacted the call center), and sensitive data redaction.

To find the latest developments to this solution, check out the GitHub repo.


About the Authors

Gang Fu is a Senior Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over 10 years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

 

 

Ujjwal Ratan is a Principal Machine Learning Specialist in the Global Healthcare and Life Sciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real-world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials, and quality of care improvement. He has expertise in scaling machine learning and deep learning algorithms on the AWS Cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.

 

Wei Yih Yap is a Senior Data Scientist with AWS Professional Services, where he works with customers to address business challenges using machine learning on AWS. In his spare time, he enjoys spending time with his family.

Read More

Machine learning on distributed Dask using Amazon SageMaker and AWS Fargate

As businesses around the world embark on building innovative solutions, we’re seeing a growing trend of adopting data science workloads across various industries. Recently, we’ve seen a greater push towards reducing the friction between data engineers and data scientists. Data scientists can now run their experiments on their local machine and port them to powerful clusters that can scale, without rewriting the code.

You have many options for running data science workloads, such as running them on your own self-managed Spark cluster. Alternatively, there are cloud options such as Amazon SageMaker, Amazon EMR, and Amazon Elastic Kubernetes Service (Amazon EKS) clusters. We’re also seeing customers adopting Dask, a distributed data science computing framework that natively integrates with Python libraries such as Pandas, NumPy, and the Scikit-learn machine learning (ML) library. These libraries were developed in Python and originally optimized for single-machine processing. Dask was developed to help scale these widely used packages for big data processing. In the past few years, Dask has matured to solve CPU and memory-bound ML problems such as big data processing, regression modeling, and hyperparameter optimization.

Dask provides the distributed computing framework to scale your CPU and memory-bound ML problems beyond a single machine. However, the underlying infrastructure resources have to be provided to the framework. You can use AWS Fargate to provide those infrastructure resources. Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (Amazon ECS) and Amazon EKS. Fargate makes it easy for you to focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design. Fargate ensures that the infrastructure your containers run on is always up to date with the required patches.

In this post, we demonstrate how to solve common ML problems such as exploratory data analysis on large datasets, preprocessing (such as one-hot encoding), and linear regression modeling using Fargate as the backend. We show you how to connect to a distributed Dask Fargate cluster from a SageMaker notebook, scale out Dask workers, and perform exploratory data analysis work on large public New York taxi trip datasets, containing over 100 million trips. Then we demonstrate how you can run regression algorithms on a distributed Dask cluster. Finally, we demonstrate how you can monitor the operational metrics of a Dask cluster that is fronted by a Network Load Balancer to access the cluster-monitoring dashboard from the internet.

Solution overview

Our use case is to demonstrate how to perform exploratory data analysis on large datasets (over 10 GB with hundreds of millions of records) and run a linear regression algorithm on a distributed Dask cluster. For this use case, we use the publicly available New York City Taxi and Limousine Commission (TLC) Trip Record Data. We use a SageMaker notebook with the backend integrated with a scalable distributed Dask cluster running on Amazon ECS on Fargate.

The following diagram illustrates the solution architecture.

We provide an AWS CloudFormation template to provision the following resources:

You then complete the following manual setup steps:

  1. Register the Dask scheduler task's private IP as the target in the NLB.
  2. Upload the example notebook to the SageMaker notebook instance.
  3. Follow the instructions in the notebook, which we also walk through in this post.

Implementing distributed Dask on Fargate using AWS CloudFormation

To provision your resources with AWS CloudFormation, complete the following steps:

  1. Log in to your AWS account and choose your Region.
  2. On the AWS CloudFormation console, create a stack using the following template.
  3. Provide the stack parameters.
  4. Acknowledge that AWS CloudFormation might need additional resources and capabilities.
  5. Choose Create stack.

Implementing distributed Dask on Fargate using the AWS CLI

To implement distributed Dask using the AWS Command Line Interface (AWS CLI), complete the following steps:

  1. Install the AWS CLI.
  2. Run the following command to create the CloudFormation stack:
    aws cloudformation create-stack --template-url https://aws-ml-blog.s3.amazonaws.com/artifacts/machine-learning-on-distributed-dask/dask-fargate-main.template --stack-name dask-fargate --capabilities "CAPABILITY_AUTO_EXPAND" "CAPABILITY_IAM" "CAPABILITY_NAMED_IAM" --region us-east-1

Setting up Network Load Balancer to monitor the Dask cluster

To set up NLB to monitor your Fargate Dask cluster, complete the following steps:

  1. On the Amazon ECS console, choose Clusters.
  2. Choose your cluster.
  3. Choose the Dask scheduler service.
  4. On the Tasks tab, choose the running task.
  5. Copy the value for Private IP.
  6. On the Amazon Elastic Compute Cloud (Amazon EC2) console, choose Target groups.
  7. Choose dask-scheduler-tg1.
  8. On the Targets tab, choose Register targets.
  9. For Network, choose dask-vpc-main.
  10. For IP, enter the IP you copied earlier.
  11. For Ports, enter 8787.
  12. Choose Include as pending below.
  13. Wait until the targets are registered and then navigate to the Load Balancers page on the Amazon EC2 console.
  14. On the Description tab, copy the value for DNS name.
  15. Enter the DNS name into your browser to view the Dask dashboard.

For demo purposes, we set up our Network Load Balancer in a public subnet without certificates. We recommend securing the Network Load Balancer with certificates and appropriate firewall rules.

Machine learning using Dask on Fargate: Notebook overview

To walk through the accompanying notebook, complete the following steps:

  1. On the Amazon ECS console, choose Clusters.
  2. Ensure that Fargate-Dask-Cluster is running with one task each for Dask-Scheduler and Dask-Workers.
  3. On the SageMaker console, choose Notebook instances.
  4. Open Jupyter and upload dask-sm-fargate-example.ipynb.
  5. Run each cell of the notebook and observe the results (see the following sections for details on each cell).
  6. Use the Network Load Balancer public DNS to monitor the performance of the cluster as you run the notebook cells.

Setting up conda package dependencies

The SageMaker notebook’s conda_python3 environment doesn’t include everything we need for distributed Dask. You need to bring in a few new packages as well as updated versions of existing ones. Run conda install for these four packages: scikit-learn 0.23.2, dask-ml 1.6.0, cloudpickle 1.6.0, and s3fs 0.4.0. See the following code:

!conda install scikit-learn=0.23.2 -c conda-forge -n python3 -y
!conda install -n python3 dask-ml=1.6.0 -c conda-forge -y
!conda install cloudpickle=1.6.0 -c conda-forge  -y
!conda install s3fs=0.4.0 -c conda-forge  -y

Each command takes about 5 minutes to install because it needs to resolve dependencies.

Setting up the Dask client

The client is the primary entry point for users of distributed Dask. You register the client to the distributed Dask scheduler that is deployed and running in the Fargate cluster. See the following code:

from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786')

Scaling Dask workers

Distributed Dask is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients. Internally, the scheduler tracks all work as a constantly changing directed acyclic graph of tasks.

You can scale the Dask workers to add more compute and memory capacity for your data science workload. See the following code:

!sudo aws ecs update-service --service Dask-Workers --desired-count 20 --cluster Fargate-Dask-Cluster
client.restart()

Alternatively, you can use the Service Auto Scaling feature of Fargate to automatically scale the resources (number of tasks).
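
The following is a hedged sketch of how that could be configured with the boto3 Application Auto Scaling client (the capacity limits and target value are illustrative):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the Dask-Workers ECS service as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/Fargate-Dask-Cluster/Dask-Workers",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Track average CPU utilization and let ECS adjust the number of worker tasks
autoscaling.put_scaling_policy(
    PolicyName="dask-workers-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/Fargate-Dask-Cluster/Dask-Workers",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)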

After client restart, you now have 40 cores with over 80 GB of memory, as shown in the following screenshot.

You can verify this on the Amazon ECS console by checking the Fargate Dask workers.

Exploratory data analysis of New York taxi trips with Dask DataFrame

A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames. The following diagram illustrates a Dask DataFrame.

Lazy loading taxi trips from Amazon S3 to a Dask DataFrame

Use the s3fs and dask.dataframe libraries to load the taxi trips from the public Amazon Simple Storage Service (Amazon S3) bucket into a distributed Dask DataFrame. S3FS builds on botocore to provide a convenient Python file system interface for Amazon S3. See the following code:

import s3fs
import dask.dataframe as dd

df = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2018-*.csv', storage_options={'anon': True}, parse_dates=['tpep_pickup_datetime','tpep_dropoff_datetime']
)

Dask lazily loads the records as computations are performed, unlike the Pandas DataFrame. The datasets contain over 100 million trips, which are loaded into the distributed Dask DataFrame, as shown in the following screenshot.

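Because the DataFrame is partitioned, you can quickly inspect how the collection is split across the cluster. The following is a small illustrative snippet (not from the notebook):

print(df.npartitions)                     # number of underlying Pandas partitions
rows_per_partition = df.map_partitions(len).compute()
print(rows_per_partition.sum())           # total number of records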

Taxi trip data structure

You can determine the data structure of the loaded Dask DataFrame using df.dtypes:

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
RatecodeID                        int64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
trip_dur_secs                     int64
pickup_date                      object

Calculate the maximum trip duration

You have over 100 million trips loaded into the Dask DataFrame, and now you need to determine the longest trip duration across all of them. This is a memory- and compute-intensive operation that is hard to perform on a single machine with limited resources. Dask lets you use the familiar DataFrame operations, but on the backend runs them on a scalable, distributed Fargate cluster of nodes. See the following code:

df['trip_dur_secs'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.seconds
max_trip_duration = df.trip_dur_secs.max().compute()

The following screenshot shows the output.

Calculating the average number of passengers by pickup date

Now you want to know the average number of passengers across all trips for each pickup date. You can easily compute that across millions of trips using the following code:

df['pickup_date'] = df['tpep_dropoff_datetime'].dt.date
df_mean_psngr_pickup_date = df.groupby('pickup_date').passenger_count.mean().compute()

The following screenshot shows the output.

You can also perform aggregate operations across the entire dataset to identify the total number of trips and trip distance for each vendor, as shown in the following screenshot.

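The following is a hedged sketch of such an aggregation (not the exact notebook code):

trips_by_vendor = (df.groupby('VendorID')
                     .agg({'trip_distance': 'sum', 'passenger_count': 'count'})
                     .compute())   # total distance and number of trips per vendor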

You can determine the memory usage across the Dask cluster nodes using the Dask workers dashboard, as shown in the following screenshot. For the preceding operation, the consumed memory across workers amounts to 31 GB.

As you run the preceding code, navigate to the Dask dashboard to see how Dask performs those operations. The following screenshot shows an example visualization of the Dask dashboard.

The visualization shows from-delayed in the progress pane. Sometimes we face problems that are parallelizable, but don’t fit into high-level abstractions like Dask Array or Dask DataFrame. For those problems, the Dask delayed function decorates your functions so they operate lazily. Rather than running your function immediately, it defers running and places the function and its arguments into a task graph.
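
The following toy example (not from the notebook) illustrates the idea behind dask.delayed:

import dask

@dask.delayed
def double(x):
    return 2 * x

# Nothing runs yet; each call only adds a node to the task graph
total = sum(double(i) for i in range(10))
print(total.compute())  # the Dask cluster executes the graph here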

Persisting collections into memory

You can pin the data in distributed memory of the worker nodes using the distributed Dask’s persist API:

df_persisted = client.persist(df)

Typically, we use asynchronous methods like client.persist to set up large collections and then use df.compute() for fast analyses. For example, you can compute the maximum trip distance across all trips:

max_trip_dist = df_persisted.trip_distance.max().compute()

The following screenshot shows the output.

In the preceding cell, this computation across 102 million trips took just 344 milliseconds. This is because the entire Dask DataFrame was persisted into the memory of the Fargate worker nodes, thereby enabling computations to run significantly faster.

Visualizing trips with Dask DataFrame

Dask DataFrame supports visualization with matplotlib, which is similar to Pandas DataFrame.

Imagine you want to know the top 10 most expensive rides by pickup location. You can visualize that as in the following screenshot.

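The following is a hedged sketch of one way to produce such a chart (not the exact notebook code):

import matplotlib.pyplot as plt

top10 = (df_persisted.groupby('PULocationID')
                     .total_amount.max()
                     .nlargest(10)
                     .compute())
top10.plot(kind='bar', title='Top 10 most expensive rides by pickup location')
plt.show()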

Predicting trip duration with Dask ML linear regression

Now that we have explored the taxi trips dataset, we predict the duration of trips for which no pickup and drop-off times are available. We use the historical trips that have a labeled trip duration to train a linear regression model, and then use that model to predict the trip duration for new trips that have no pickup and drop-off times.

We use LinearRegression from dask_ml.linear_model and the Dask Fargate cluster as the backend for training the model. Dask ML provides scalable ML in Python using Dask alongside popular ML libraries like Scikit-learn, XGBoost, and others.

The dask_ml.linear_model module implements linear models for classification and regression. Dask ML algorithms integrate with joblib to submit jobs to the distributed Dask cluster.

For training the model, we need to prepare the training and testing datasets. We use the dask_ml.model_selection library to create those datasets, as shown in the following screenshot.

After you train the model, you can run a prediction against the model to predict the trip duration using the testing dataset, as shown in the following screenshot.

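The following is a hedged sketch of the train/test split, training, and prediction steps (the feature columns are illustrative; the actual notebook code may differ):

from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LinearRegression

X = df_persisted[['trip_distance', 'passenger_count', 'PULocationID', 'DOLocationID']].to_dask_array(lengths=True)
y = df_persisted['trip_dur_secs'].to_dask_array(lengths=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)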

Logging and monitoring

You can monitor, troubleshoot, and set alarms for all your Dask cluster resources running on Fargate using Amazon CloudWatch Container Insights. This fully managed service collects, aggregates, and summarizes Amazon ECS metrics and logs. The following screenshot shows an example of the Container Insights UI for the preceding notebook run.

Cleaning up

To clean up your resources, delete your main stack on the CloudFormation console or use the following AWS CLI command:

aws cloudformation delete-stack --stack-name dask-fargate --region us-east-1

Conclusion

In this post, we showed how to stand up a highly scalable infrastructure for performing ML on distributed Dask with a SageMaker notebook and Fargate. We demonstrated the core concepts of how to use a Dask DataFrame for big data processing, such as aggregating records by pickup date and finding the longest of over 100 million taxi trips. Then we performed exploratory data analysis and visualized expensive rides. Finally, we showed how to use the Dask ML library to run a linear regression algorithm to predict trip duration.

Give distributed Dask a try for your ML use cases, such as performing exploratory data analysis, preprocessing, and linear regression, and leave your feedback in the comments section. You can also access the resources for this post in the GitHub repo.

References

For resources and more information about the tools and services in this post, see the following:


About the Authors

Ram Vittal is an enterprise solutions architect at AWS. His current focus is to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis, photography, and movies.

 

 

Sireesha Muppala is an AI/ML Specialist Solutions Architect at AWS, providing guidance to customers on architecting and implementing machine learning solutions at scale. She received her Ph.D. in Computer Science from University of Colorado, Colorado Springs. In her spare time, Sireesha loves to run and hike Colorado trails.

 

Read More

Solving numerical optimization problems like scheduling, routing, and allocation with Amazon SageMaker Processing

In this post, we discuss solving numerical optimization problems using the very flexible Amazon SageMaker Processing API. Optimization is the process of finding the minimum (or maximum) of a function that depends on some inputs, called design variables. This pattern is relevant to solving business-critical problems such as scheduling, routing, allocation, shape optimization, trajectory optimization, and others. Several commercial and open-source solvers are available for solving such problems. We demonstrate this solution with three popular Python libraries and solvers that are free to use, and provide a sample notebook that shows how to solve these optimization problems using SageMaker Processing.

Solution overview

SageMaker Processing lets data scientists and ML engineers easily run preprocessing, postprocessing, and model evaluation workloads on SageMaker. The SDK provides built-in containers for scikit-learn and Spark, but you can also use your own Docker images without having to conform to any Docker image specification. This gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS), or even on premises; in this post, we use SageMaker Processing. First, we build and push a Docker image that includes several popular optimization packages and solvers, and then we use this Docker image to solve three example problems:

  • Minimizing the cost of shipping goods through a distribution network
  • Scheduling shifts of a set of nurses in a hospital
  • Finding a trajectory for landing the Apollo 11 Lunar Module with the least amount of fuel

We solve each use case using a different interface that connects to a different solver. We complete the following high-level steps (as in the provided example notebook) for each problem:

  1. Build a Docker container that contains useful Python interfaces (such as Pyomo and PuLP) to optimization solvers (such as GLPK and CBC).
  2. Build and push the image to a repository in Amazon Elastic Container Registry (Amazon ECR).
  3. Use the SageMaker Python SDK (from a notebook or elsewhere with the right permissions) to point to the Docker image in Amazon ECR and send in a Python file with the actual optimization problem.
  4. Monitor the logs in a notebook or Amazon CloudWatch Logs, and obtain any outputs you need in a dedicated output folder that you specify in Amazon Simple Storage Service (Amazon S3).

Schematically, this process looks like the following diagram.

Let’s get started!

Building and pushing a Docker container

Start with the following Dockerfile:

FROM continuumio/anaconda3

RUN pip install boto3 pandas scikit-learn pulp pyomo inspyred ortools scipy deap 

RUN conda install -c conda-forge ipopt coincbc glpk

ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python"]

In this code, we install Python optimization interfaces and libraries such as PuLP, Pyomo, Inspyred, OR-Tools, SciPy, and DEAP, along with the Ipopt, CBC, and GLPK solvers. For more information about these packages, see the References section at the end of this post.

We then use the following commands from the notebook to build and push this container to Amazon ECR:

import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name   # AWS Region of the current session
ecr_repository = 'sagemaker-opt-container'
tag = ':latest'
processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

Sample output for this command looks like the following code:

Sending build context to Docker daemon  2.048kB
Step 1/5 : FROM continuumio/anaconda3
 ---> 5fbf7bac70a0
Step 2/5 : RUN pip install boto3 pandas scikit-learn pulp pyomo inspyred ortools scipy deap
 ---> Using cache
 ---> 98864164a472
Step 3/5 : RUN conda install -c conda-forge ipopt coincbc glpk
 ---> Using cache
 ---> 1fde58988350
Step 4/5 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> 06cc27c84a9a
Step 5/5 : ENTRYPOINT ["python"]
 ---> Using cache
 ---> 0ae65a2ad5b9
Successfully built <image id>
Successfully tagged sagemaker-opt-container:latest
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'sagemaker-opt-container' already exists in the registry with id '<account number>'
The push refers to repository [<account number>.dkr.ecr.<region>.amazonaws.com/sagemaker-opt-container]

8611989d: Preparing 
d960633d: Preparing 
db96c31c: Preparing 
ea6160d7: Preparing 
4bce66cd: Layer already exists latest: digest: sha256:<hash> size: 1379

Using the SageMaker Python SDK to start a job

Typically, we first initialize a SageMaker Processing ScriptProcessor as follows:

from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(command=['python'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

Then we write a file (for this post, we always use a file called preprocessing.py) and run a processing job on SageMaker as follows:

from sagemaker.processing import ProcessingInput, ProcessingOutput

script_processor.run(code='preprocessing.py',
                      outputs=[ProcessingOutput(output_name='data',
                                                source='/opt/ml/processing/data')])

script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)

Use case 1: Minimizing the cost of shipping goods through a distribution network

In this use case, American Steel, an Ohio-based steel manufacturing company, produces steel at its two steel mills located at Youngstown and Pittsburgh. The company distributes finished steel to its retail customers through the distribution network of regional and field warehouses.

The network represents shipment of finished steel from American Steel—two steel mills located at Youngstown (node 1) and Pittsburgh (node 2) to their field warehouses at Albany, Houston, Tempe, and Gary (nodes 6, 7, 8, and 9) through three regional warehouses located at Cincinnati, Kansas City, and Chicago (nodes 3, 4, and 5). Also, some field warehouses can be directly supplied from the steel mills.

The following table presents the minimum and maximum flow amounts of steel that may be shipped between different cities, along with the cost per 1,000 tons per month of shipping the steel. For example, the shipment from Youngstown to Kansas City is contracted out to a railroad company with a minimal shipping clause of 1,000 tons per month. However, the railroad can’t ship more than 5,000 tons per month due to the shortage of rail cars.

From node     To node       Cost    Minimum    Maximum
Youngstown    Albany         500          0       1000
Youngstown    Cincinnati     350          0       3000
Youngstown    Kansas City    450       1000       5000
Youngstown    Chicago        375          0       5000
Pittsburgh    Cincinnati     350          0       2000
Pittsburgh    Kansas City    450       2000       3000
Pittsburgh    Chicago        400          0       4000
Pittsburgh    Gary           450          0       2000
Cincinnati    Albany         350       1000       5000
Cincinnati    Houston        550          0       6000
Kansas City   Houston        375          0       4000
Kansas City   Tempe          650          0       4000
Chicago       Tempe          600          0       2000
Chicago       Gary           120          0       4000

The objective of transshipment problems in general and The American Steel Problem in particular is to minimize the cost of shipping goods through the network.

All the nodes have supply and demand values: demand = 0 for supply nodes, supply = 0 for demand nodes, and supply = demand = 0 for transshipment nodes. The only constraints in the transshipment problem are flow conservation constraints. These constraints state that a node's supply plus the flow of goods into the node must be greater than or equal to its demand plus the flow of goods out of the node.

This problem can be formulated as follows:

%%writefile preprocessing.py

import argparse
import os
import warnings

"""
The American Steel Problem for the PuLP Modeller

Authors: Antony Phillips, Dr Stuart Mitchell  2007
"""

# Import PuLP modeller functions
from pulp import *

# List of all the nodes
Nodes = ["Youngstown",
         "Pittsburgh",
         "Cincinatti",
         "Kansas City",
         "Chicago",
         "Albany",
         "Houston",
         "Tempe",
         "Gary"]

nodeData = {# NODE        Supply Demand
         "Youngstown":    [10000,0],
         "Pittsburgh":    [15000,0],
         "Cincinatti":    [0,0],
         "Kansas City":   [0,0],
         "Chicago":       [0,0],
         "Albany":        [0,3000],
         "Houston":       [0,7000],
         "Tempe":         [0,4000],
         "Gary":          [0,6000]}

# List of all the arcs
Arcs = [("Youngstown","Albany"),
        ("Youngstown","Cincinatti"),
        ("Youngstown","Kansas City"),
        ("Youngstown","Chicago"),
        ("Pittsburgh","Cincinatti"),
        ("Pittsburgh","Kansas City"),
        ("Pittsburgh","Chicago"),
        ("Pittsburgh","Gary"),
        ("Cincinatti","Albany"),
        ("Cincinatti","Houston"),
        ("Kansas City","Houston"),
        ("Kansas City","Tempe"),
        ("Chicago","Tempe"),
        ("Chicago","Gary")]

arcData = { #      ARC                Cost Min Max
        ("Youngstown","Albany"):      [0.5,0,1000],
        ("Youngstown","Cincinatti"):  [0.35,0,3000],
        ("Youngstown","Kansas City"): [0.45,1000,5000],
        ("Youngstown","Chicago"):     [0.375,0,5000],
        ("Pittsburgh","Cincinatti"):  [0.35,0,2000],
        ("Pittsburgh","Kansas City"): [0.45,2000,3000],
        ("Pittsburgh","Chicago"):     [0.4,0,4000],
        ("Pittsburgh","Gary"):        [0.45,0,2000],
        ("Cincinatti","Albany"):      [0.35,1000,5000],
        ("Cincinatti","Houston"):     [0.55,0,6000],
        ("Kansas City","Houston"):    [0.375,0,4000],
        ("Kansas City","Tempe"):      [0.65,0,4000],
        ("Chicago","Tempe"):          [0.6,0,2000],
        ("Chicago","Gary"):           [0.12,0,4000]}

# Splits the dictionaries to be more understandable
(supply, demand) = splitDict(nodeData)
(costs, mins, maxs) = splitDict(arcData)

# Creates the boundless Variables as Integers
vars = LpVariable.dicts("Route",Arcs,None,None,LpInteger)

# Creates the upper and lower bounds on the variables
for a in Arcs:
    vars[a].bounds(mins[a], maxs[a])

# Creates the 'prob' variable to contain the problem data    
prob = LpProblem("American Steel Problem",LpMinimize)

# Creates the objective function
prob += lpSum([vars[a]* costs[a] for a in Arcs]), "Total Cost of Transport"

# Creates all problem constraints - this ensures the amount going into each node is at least equal to the amount leaving
for n in Nodes:
    prob += (supply[n]+ lpSum([vars[(i,j)] for (i,j) in Arcs if j == n]) >=
             demand[n]+ lpSum([vars[(i,j)] for (i,j) in Arcs if i == n])), "Steel Flow Conservation in Node %s"%n

# The problem data is written to an .lp file
prob.writeLP('/opt/ml/processing/data/' + 'AmericanSteelProblem.lp')

# The problem is solved using PuLP's choice of Solver
prob.solve()

# The status of the solution is printed to the screen
print("Status:", LpStatus[prob.status])

# Each of the variables is printed with it's resolved optimum value
for v in prob.variables():
    print(v.name, "=", v.varValue)

# The optimised objective function value is printed to the screen    
print("Total Cost of Transportation = ", value(prob.objective))

We solve this problem with the PuLP interface and its default solver, GLPK, using script_processor.run. Logs from this optimization job provide these solutions:

Status: Optimal
Route_('Chicago',_'Gary') = 4000.0
Route_('Chicago',_'Tempe') = 2000.0
Route_('Cincinatti',_'Albany') = 2000.0
Route_('Cincinatti',_'Houston') = 3000.0
Route_('Kansas_City',_'Houston') = 4000.0
Route_('Kansas_City',_'Tempe') = 2000.0
Route_('Pittsburgh',_'Chicago') = 3000.0
Route_('Pittsburgh',_'Cincinatti') = 2000.0
Route_('Pittsburgh',_'Gary') = 2000.0
Route_('Pittsburgh',_'Kansas_City') = 3000.0
Route_('Youngstown',_'Albany') = 1000.0
Route_('Youngstown',_'Chicago') = 3000.0
Route_('Youngstown',_'Cincinatti') = 3000.0
Route_('Youngstown',_'Kansas_City') = 3000.0
Total Cost of Transportation =  15005.0

Use case 2: Scheduling shifts of a set of nurses in a hospital

In the next example, a hospital supervisor must create a schedule for four nurses over a 3-day period, subject to the following conditions:

  • Each day is divided into three 8-hour shifts
  • Every day, each shift is assigned to a single nurse, and no nurse works more than one shift
  • Each nurse is assigned to at least two shifts during the 3-day period

For more information about this scheduling use case, see Employee Scheduling.

This problem can be formulated as follows:

%%writefile preprocessing.py


from __future__ import print_function
from ortools.sat.python import cp_model



class NursesPartialSolutionPrinter(cp_model.CpSolverSolutionCallback):
    """Print intermediate solutions."""

    def __init__(self, shifts, num_nurses, num_days, num_shifts, sols):
        cp_model.CpSolverSolutionCallback.__init__(self)
        self._shifts = shifts
        self._num_nurses = num_nurses
        self._num_days = num_days
        self._num_shifts = num_shifts
        self._solutions = set(sols)
        self._solution_count = 0

    def on_solution_callback(self):
        if self._solution_count in self._solutions:
            print('Solution %i' % self._solution_count)
            for d in range(self._num_days):
                print('Day %i' % d)
                for n in range(self._num_nurses):
                    is_working = False
                    for s in range(self._num_shifts):
                        if self.Value(self._shifts[(n, d, s)]):
                            is_working = True
                            print('  Nurse %i works shift %i' % (n, s))
                    if not is_working:
                        print('  Nurse {} does not work'.format(n))
            print()
        self._solution_count += 1

    def solution_count(self):
        return self._solution_count




def main():
    # Data.
    num_nurses = 4
    num_shifts = 3
    num_days = 3
    all_nurses = range(num_nurses)
    all_shifts = range(num_shifts)
    all_days = range(num_days)
    # Creates the model.
    model = cp_model.CpModel()

    # Creates shift variables.
    # shifts[(n, d, s)]: nurse 'n' works shift 's' on day 'd'.
    shifts = {}
    for n in all_nurses:
        for d in all_days:
            for s in all_shifts:
                shifts[(n, d,
                        s)] = model.NewBoolVar('shift_n%id%is%i' % (n, d, s))

    # Each shift is assigned to exactly one nurse in the schedule period.
    for d in all_days:
        for s in all_shifts:
            model.Add(sum(shifts[(n, d, s)] for n in all_nurses) == 1)

    # Each nurse works at most one shift per day.
    for n in all_nurses:
        for d in all_days:
            model.Add(sum(shifts[(n, d, s)] for s in all_shifts) <= 1)

    # min_shifts_per_nurse is the largest integer such that every nurse
    # can be assigned at least that many shifts. If the number of nurses doesn't
    # divide the total number of shifts over the schedule period,
    # some nurses have to work one more shift, for a total of
    # min_shifts_per_nurse + 1.
    min_shifts_per_nurse = (num_shifts * num_days) // num_nurses
    max_shifts_per_nurse = min_shifts_per_nurse + 1
    for n in all_nurses:
        num_shifts_worked = sum(
            shifts[(n, d, s)] for d in all_days for s in all_shifts)
        model.Add(min_shifts_per_nurse <= num_shifts_worked)
        model.Add(num_shifts_worked <= max_shifts_per_nurse)

    # Creates the solver and solve.
    solver = cp_model.CpSolver()
    solver.parameters.linearization_level = 0
    # Display the first five solutions.
    a_few_solutions = range(5)
    solution_printer = NursesPartialSolutionPrinter(shifts, num_nurses,
                                                    num_days, num_shifts,
                                                    a_few_solutions)
    solver.SearchForAllSolutions(model, solution_printer)

    # Statistics.
    print()
    print('Statistics')
    print('  - conflicts       : %i' % solver.NumConflicts())
    print('  - branches        : %i' % solver.NumBranches())
    print('  - wall time       : %f s' % solver.WallTime())
    print('  - solutions found : %i' % solution_printer.solution_count())


if __name__ == '__main__':
    main()

We solve this problem using the OR-Tools interface and its CP-SAT solver with script_processor.run. Logs from this optimization job provide these solutions:

Solution 0
Day 0
  Nurse 0 does not work
  Nurse 1 works shift 0
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 2
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 1
Day 0
  Nurse 0 works shift 0
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 does not work
  Nurse 1 works shift 2
  Nurse 2 works shift 1
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 2
Day 0
  Nurse 0 works shift 0
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 1
  Nurse 1 works shift 2
  Nurse 2 does not work
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 3
Day 0
  Nurse 0 works shift 0
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 does not work
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 4
Day 0
  Nurse 0 does not work
  Nurse 1 works shift 0
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 does not work
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work


Statistics
  - conflicts       : 37
  - branches        : 41231
  - wall time       : 0.367511 s
  - solutions found : 5184

Use case 3: Finding a trajectory for landing the Apollo 11 Lunar Module with the least amount of fuel

This example uses Pyomo and a simple model of a rocket to compute a control policy for a soft landing. The parameters used correspond to the descent of the Apollo 11 Lunar Module to the moon on July 20, 1969. For a rocket with a mass 𝑚 in vertical flight at altitude ℎ, a momentum balance yields the following model:

$$m \frac{d^2 h}{dt^2} = -m g + v_e u$$

In this model, 𝑢 is the mass flow of propellant and 𝑣𝑒 is the velocity of the exhaust relative to the rocket. In this first attempt at modeling and control, we neglect the change in rocket mass due to fuel burn.

Fuel consumption over a burn of duration 𝑇 can be calculated as the following:

$$\text{fuel} = \int_0^T u(t)\,dt$$

We want to find a trajectory that minimizes fuel consumption:

$$\min_{u}\ \int_0^T u(t)\,dt$$

subject to the momentum balance above and the boundary conditions ℎ(0) = ℎ₀, 𝑣(0) = −𝑣₀, ℎ(𝑇) = 0, and 𝑣(𝑇) = 0, where 𝑣 = dℎ/dt is the vertical velocity (a soft landing requires both altitude and velocity to reach zero at touchdown).

This problem can be formulated as follows:

%%writefile preprocessing.py

import numpy as np

from pyomo.environ import *
from pyomo.dae import *

# Define constants
# lunar module
m_ascent_dry = 2445.0          # kg mass of ascent stage without fuel
m_ascent_fuel = 2376.0         # kg mass of ascent stage fuel
m_descent_dry = 2034.0         # kg mass of descent stage without fuel
m_descent_fuel = 8248.0        # kg mass of descent stage fuel

m_fuel = m_descent_fuel
m_dry = m_ascent_dry + m_ascent_fuel + m_descent_dry
m_total = m_dry + m_fuel

# descent engine characteristics
v_exhaust = 3050.0             # m/s
u_max = 45050.0/v_exhaust      # 45050 newtons / exhaust velocity

# landing mission specifications
h_initial = 100000.0           # meters
v_initial = 1520               # orbital velocity m/s
g = 1.62                       # m/s**2

m = ConcreteModel()
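# Time is scaled to the unit interval [0, 1]; m.T is the free final time of
# the descent burn, so the momentum balance constraint below divides the
# scaled-time acceleration m.a by T**2 to express it in real time, and the
# fuel integral multiplies m.u by T for the same reason.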
m.t = ContinuousSet(bounds=(0, 1))
m.h = Var(m.t)
m.u = Var(m.t, bounds=(0, u_max))
m.T = Var(domain=NonNegativeReals)

m.v = DerivativeVar(m.h, wrt=m.t)
m.a = DerivativeVar(m.v, wrt=m.t)

m.fuel = Integral(m.t, wrt=m.t, rule = lambda m, t: m.u[t]*m.T)
m.obj = Objective(expr=m.fuel, sense=minimize)

m.ode1 = Constraint(m.t, rule = lambda m, t: m_total*m.a[t]/m.T**2 == -m_total*g + v_exhaust*m.u[t])

m.h[0].fix(h_initial)
m.v[0].fix(-v_initial)

m.h[1].fix(0)    # land on surface
m.v[1].fix(0)    # soft landing

def solve(m):
    TransformationFactory('dae.finite_difference').apply_to(m, nfe=50, scheme='FORWARD')
    SolverFactory('ipopt').solve(m, tee=True)
    
solve(m)

We use the Pyomo interface and the nonlinear optimization solver Ipopt to solve this continuous-time trajectory optimization problem. Logs from ScriptProcessor.run show the following solution:

Ipopt 3.12.13: 

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

This is Ipopt version 3.12.13, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:      448
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:      154

Error in an AMPL evaluation. Run with "halt_on_ampl_error yes" to see details.
Error evaluating Jacobian of equality constraints at user provided starting point.
  No scaling factors for equality constraints computed!
Total number of variables............................:      201
                     variables with only lower bounds:        1
                variables with lower and upper bounds:       51
                     variables with only upper bounds:        0
Total number of equality constraints.................:      151
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0  9.9999800e-05 5.00e+06 9.90e-01  -1.0 0.00e+00    -  0.00e+00 0.00e+00   0
   1r 9.9999800e-05 5.00e+06 9.99e+02   6.7 0.00e+00    -  0.00e+00 4.29e-14R  4
   2r 2.1397987e+02 5.00e+06 4.78e+08   6.7 2.14e+05    -  1.00e+00 6.83e-05f  1
   3r 2.1342176e+02 5.00e+06 1.36e+08   3.2 4.37e+04    -  7.16e-01 6.16e-01f  1
   4r 1.7048263e+02 4.99e+06 4.67e+07   3.2 1.60e+04    -  9.85e-01 4.16e-01f  1
   5r 1.5143799e+02 4.99e+06 2.50e+07   3.2 3.57e+03    -  5.88e-01 7.62e-01f  1
   6r 1.3041897e+02 4.99e+06 2.08e+07   3.2 1.89e+03    -  2.75e-01 8.14e-01f  1
   7r 1.1452223e+02 4.99e+06 3.17e+04   3.2 1.97e+03    -  9.78e-01 8.18e-01f  1
   8r 1.1168709e+02 4.99e+06 2.72e+05   3.2 3.36e-01   4.0 9.78e-01 1.00e+00f  1
   9r 1.0774716e+02 4.99e+06 1.66e+05   3.2 4.28e+03    -  9.36e-01 9.70e-02f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  10r 8.7784873e+01 5.00e+06 5.08e+04   3.2 3.69e+03    -  8.74e-01 7.24e-01f  1
  11r 7.9008215e+01 5.00e+06 1.88e+04   2.5 1.09e+03    -  1.22e-01 8.35e-01h  1
  12r 1.1960245e+02 5.00e+06 4.34e+03   2.5 1.81e+03    -  6.76e-01 1.00e+00f  1
  13r 1.2344166e+02 5.00e+06 1.35e+03   1.8 1.66e+02    -  8.23e-01 1.00e+00f  1
  14r 2.0065756e+02 4.99e+06 6.85e+02   1.1 4.28e+03    -  4.26e-01 1.00e+00f  1
  15r 3.0115879e+02 4.99e+06 4.78e+01   1.1 9.69e+03    -  7.64e-01 1.00e+00f  1
  16r 3.0355974e+02 4.99e+06 5.30e+00   1.1 4.92e+00    -  1.00e+00 1.00e+00f  1
  17r 3.0555655e+02 4.99e+06 6.83e+02   0.4 7.49e+00    -  1.00e+00 1.00e+00f  1
  18r 4.4494526e+02 4.97e+06 2.28e+01   0.4 2.17e+04    -  8.05e-01 1.00e+00f  1
  19r 3.9588385e+02 4.97e+06 3.77e+00   0.4 4.73e+00    -  1.00e+00 1.00e+00f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  20r 4.0158949e+02 4.97e+06 7.79e-02   0.4 5.70e-01    -  1.00e+00 1.00e+00h  1
  21r 4.0076180e+02 4.97e+06 9.88e+02  -1.0 1.80e+00    -  1.00e+00 1.00e+00f  1
  22r 5.4964501e+02 4.95e+06 7.59e+02  -1.0 1.57e+05    -  2.48e-01 2.32e-01f  1
  23r 5.5056601e+02 4.95e+06 7.57e+02  -1.0 1.21e+05    -  1.00e+00 3.02e-03f  1
  24r 5.5057553e+02 4.95e+06 7.57e+02  -1.0 1.09e+05    -  8.13e-01 3.34e-05f  1
  25r 5.5898777e+02 4.95e+06 7.00e+02  -1.0 3.82e+04    -  1.00e+00 7.48e-02f  1
  26r 6.0274077e+02 4.96e+06 3.93e+02  -1.0 3.53e+04    -  1.00e+00 4.39e-01f  1
  27r 6.0301192e+02 4.96e+06 3.90e+02  -1.0 1.98e+04    -  1.00e+00 7.83e-03f  1
  28r 6.0301418e+02 4.96e+06 3.89e+02  -1.0 1.61e+04    -  1.00e+00 9.62e-05f  1
  29r 5.9834909e+02 4.96e+06 3.71e+02  -1.0 3.63e+03    -  1.00e+00 1.85e-01f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  30r 5.7601446e+02 4.95e+06 1.67e+00  -1.0 2.96e+03    -  1.00e+00 1.00e+00f  1
  31r 5.6977301e+02 4.95e+06 6.41e-02  -1.0 1.22e+00    -  1.00e+00 1.00e+00h  1
  32r 5.7024128e+02 4.95e+06 9.05e-05  -1.0 4.89e-02    -  1.00e+00 1.00e+00h  1
  33r 5.6989454e+02 4.95e+06 6.84e+02  -2.5 9.30e-02    -  1.00e+00 1.00e+00f  1
  34r 5.7613459e+02 4.94e+06 5.38e+02  -2.5 5.65e+04    -  4.67e-01 2.13e-01f  1
  35r 5.7617358e+02 4.94e+06 5.37e+02  -2.5 4.45e+04    -  1.00e+00 9.52e-04f  1
  36r 6.6264177e+02 4.90e+06 3.78e+01  -2.5 4.45e+04    -  6.62e-01 9.30e-01f  1
  37r 7.5101828e+02 4.90e+06 7.59e+01  -2.5 3.12e+03    -  1.25e-02 1.00e+00f  1
  38r 7.5705424e+02 4.90e+06 8.60e-02  -2.5 7.04e-01    -  1.00e+00 1.00e+00h  1
  39r 7.5713736e+02 4.90e+06 2.85e-05  -2.5 9.02e-03    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  40r 7.5713093e+02 4.90e+06 4.90e+02  -5.7 6.76e-03    -  1.00e+00 9.99e-01f  1
  41r 1.0909809e+03 4.78e+06 4.67e+02  -5.7 2.54e+06    -  6.15e-02 4.62e-02f  1
  42r 1.0909867e+03 4.78e+06 4.67e+02  -5.7 2.42e+06    -  1.00e+00 9.55e-07f  1
  43r 1.5672936e+03 4.59e+06 8.15e+03  -5.7 2.42e+06    -  3.36e-03 7.69e-02f  1
  44r 1.7598365e+03 4.50e+06 8.17e+03  -5.7 2.24e+06    -  4.43e-08 4.23e-02f  1
  45r 5.7264420e+03 2.36e+06 4.60e+03  -5.7 2.14e+06    -  7.07e-02 1.00e+00f  1
  46  4.3546591e+03 2.35e+06 1.50e+01  -1.0 2.51e+08    -  3.52e-03 2.97e-03f  1
  47  3.7700543e+03 2.16e+06 1.94e+01  -1.0 2.87e+06    -  3.27e-01 8.10e-02f  1
  48  3.9963720e+03 1.02e+06 7.97e+00  -1.0 3.70e+05    -  3.47e-01 5.26e-01h  1
  49  4.0601733e+03 5.28e+05 5.09e+00  -1.0 1.57e+06    -  5.24e-03 4.85e-01h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  50  4.0596593e+03 5.27e+05 3.53e+00  -1.0 4.32e+06    -  7.60e-01 1.81e-03h  1
  51  4.1577305e+03 9.40e+04 7.32e-01  -1.0 4.01e+05    -  9.09e-01 8.22e-01h  1
  52  4.1754490e+03 1.27e+01 4.74e-02  -1.0 5.08e+04    -  8.32e-01 1.00e+00h  1
  53  4.1752565e+03 7.78e-02 8.68e-07  -1.0 1.49e+04    -  1.00e+00 1.00e+00h  1
  54  4.1704409e+03 1.60e+00 3.18e-05  -2.5 1.16e+04    -  1.00e+00 1.00e+00f  1
  55  4.1704236e+03 6.98e-04 2.83e-08  -2.5 1.41e+03    -  1.00e+00 1.00e+00h  1
  56  4.1702897e+03 1.15e-03 2.31e-08  -3.8 2.98e+02    -  1.00e+00 1.00e+00f  1
  57  4.1702823e+03 3.63e-06 5.75e-11  -5.7 1.67e+01    -  1.00e+00 1.00e+00h  1
  58  4.1702822e+03 1.28e-09 1.62e-14  -8.6 2.04e-01    -  1.00e+00 1.00e+00h  1

Number of Iterations....: 58

                                   (scaled)                 (unscaled)
Objective...............:   4.1702822027548118e+03    4.1702822027548118e+03
Dual infeasibility......:   1.6235231869939369e-14    1.6235231869939369e-14
Constraint violation....:   1.2805685400962830e-09    1.2805685400962830e-09
Complementarity.........:   2.5079038009909822e-09    2.5079038009909822e-09
Overall NLP error.......:   2.5079038009909822e-09    2.5079038009909822e-09


Number of objective function evaluations             = 63
Number of objective gradient evaluations             = 16
Number of equality constraint evaluations            = 63
Number of inequality constraint evaluations          = 0
Number of equality constraint Jacobian evaluations   = 60
Number of inequality constraint Jacobian evaluations = 0
Number of Lagrangian Hessian evaluations             = 58
Total CPU secs in IPOPT (w/o function evaluations)   =      0.682
Total CPU secs in NLP function evaluations           =      0.002

EXIT: Optimal Solution Found.
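
After Ipopt reports an optimal solution, the computed trajectory can be pulled out of the Pyomo model for inspection or plotting. The following is a minimal sketch under the assumption that the model object m from the script above is still in memory; the matplotlib calls are illustrative only:

import matplotlib.pyplot as plt

# Recover the optimal burn duration and rescale the normalized time points
T_opt = m.T()
t = [tau * T_opt for tau in m.t]

# Altitude and propellant mass-flow profiles along the optimal trajectory
altitude = [m.h[tau]() for tau in m.t]
burn_rate = [m.u[tau]() for tau in m.t]

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(t, altitude)
ax1.set_ylabel('altitude (m)')
ax2.plot(t, burn_rate)
ax2.set_ylabel('mass flow u (kg/s)')
ax2.set_xlabel('time (s)')
plt.show()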

Summary

We used various examples, front ends, and solvers to solve numerical optimization problems using SageMaker Processing. Next, try using scipy.optimize, DEAP, or Inspyred to explore other examples (a small scipy.optimize sketch follows at the end of this section). See the references in the next section for documentation and other examples to help solve your own business problems using SageMaker Processing. If you already use SageMaker APIs for your machine learning projects, using SageMaker Processing to run optimization jobs is a natural extension. However, other AWS compute options, such as AWS Lambda or AWS Fargate, may be a better fit for running some of these open-source libraries in a serverless fashion, especially if your team has that expertise. Lastly, open-source libraries are provided as is, with minimal support, whereas commercial solvers such as CPLEX and Gurobi are continuously tuned for higher performance and usually come with premium support.
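
As a quick starting point for scipy.optimize, the following is a minimal sketch of a small unconstrained problem (the Rosenbrock function); the same ScriptProcessor pattern shown earlier could be used to run it as a processing job:

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function; its minimum value of 0 is attained at (1, 1)
def rosenbrock(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method='Nelder-Mead')
print(result.x, result.fun)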

References


About the Author

Shreyas Subramanian is an AI/ML Specialist Solutions Architect who helps customers solve their business challenges using machine learning on the AWS platform.

Read More

Building an omnichannel Q&A chatbot with Amazon Connect, Amazon Lex, Amazon Kendra, and the open-source QnABot project

For many students, embarking on a higher education journey is an exciting time filled with new experiences. However, like anything new, it can also bring plenty of questions to answer and obstacles to overcome. Oklahoma State University, Oklahoma City (OSU-OKC) recognized this and was intent on providing a better solution to address student questions using machine learning (ML) technology from AWS.

They knew that if they could develop a solution that accurately anticipated their students’ needs and delivered timely and relevant information, they could boost their chances of attracting future students. After all, universities need students the same way businesses need customers.

“The first thing we wanted to address was the lack of visibility we had into customer sentiment at any given time,” says Michael Widell, Interim President at OSU-OKC. “Building on that, we also had a real focus on consistency and accuracy of information—it mattered to us that current and future students could rely on the information they were getting across school and faculty communication channels.”

The team identified conversational chatbots as a way to address the information gap that students face. ML-powered chatbots are dynamic and help connect with students through the communication channels they prefer, whether that’s a website, a phone call, a chatbot, or an Alexa-enabled device.

With this in mind, OSU-OKC began working with AWS Professional Services in January 2020, and became the first university to deploy a call center using Amazon Connect and the QnABot.

Amazon Connect is a cloud contact center that provides a seamless experience across voice and chat for customers and agents. The QnABot is an open-source project that uses Amazon Lex to provide a conversational interface for your questions and answers, and can be applied to a host of communication channels, including websites, contact centers, chatbots, collaboration tools like Slack, and Amazon Alexa-enabled devices.

Deploying QnABot in the call center

Although OSU-OKC’s use of the QnABot evolved throughout 2020, its initial focus was on boosting call center efficiency. They achieved this by automating answers to student FAQs, thereby delivering accurate and up-to-date information, reducing call hold times, and enabling human call center agents to focus on handling higher-value interactions.

The following diagram illustrates the solution architecture.


For OSU-OKC, QnABot simplified bot deployment and administration, allowing even non-technical users to maximize the impact of the solution.

Extending the QnABot to the website

After implementing the QnABot to assist agents inside their call center, OSU-OKC decided to extend the bot’s reach to the university’s website. They used the open-source Amazon Lex Web UI project from AWS, a sample web client that provides a full-featured chat interface for Amazon Lex chatbots.

After content was gathered from across the campus, creating question and answer responses for the bot was an easy process. The Content Designer UI provided customization options that helped with organization and readability, and the built-in test features aided tuning and development by assigning a matching score to each response.

Shortly after expanding the QnABot to their website, OSU-OKC realized that providing more channels for students to interact with didn’t dilute engagement levels. In fact, the additional channels increased overall engagement from the student body and doubled the average number of conversations with students.

Adding the QnABot to the university website wasn’t a replacement for human interaction; it was an aid to increase quality interactions by reducing repetitive phone traffic. Try asking OSU-OKC’s bot, OKC Pete, some questions of your own via the university website.


OKC Pete on the university website

Equipping the QnABot with more responses

While QnABot answered high volumes of questions for students and delivered consistent service at scale, the OSU-OKC team learned a great deal about student sentiment by observing which questions the QnABot couldn’t answer.

For example, some questions highlighted how much prospective students knew about the campus and its resources. Incoming students asked about dorms when in fact the campus doesn’t have any student housing.

The team could use the QnABot’s Content Designer UI to continuously enhance the bot, and equip it with appropriate responses about student housing or any other campus resources. This helped students avoid a phone call, which freed call center agents to focus on more critical or higher-quality interactions.

This flexibility proved particularly helpful during the onset of the COVID-19 pandemic in the spring of 2020. OSU-OKC was able to rapidly expand the newly deployed QnABot’s knowledge base to include answers to many pandemic-related questions. Students and parents could quickly get answers to the questions that mattered to them via the QnABot-assisted university call center or via the website chatbot.

Scaling QnABot’s knowledge with Amazon Kendra

QnABot’s Content Designer UI allowed OSU-OKC to add new questions and answers to the bot when they identified a gap. However, the team also wanted to ensure that customers could still get answers when a question had not yet been added.

To achieve this, they used Amazon Kendra, a highly accurate intelligent search service. In the summer of 2020, the team at OSU-OKC integrated the QnABot with Amazon Kendra to enhance the accuracy and relevance of responses in the following ways:

  • Use the document index in Amazon Kendra as an additional source of answers when a question and answer isn’t found in QnABot’s knowledge base. This allows QnABot to find answers to questions that may not have been added to its knowledge base, including answers contained in unstructured content such as Word documents or PDFs that have been indexed by Amazon Kendra (see the sketch after this list).
  • Without extensive QnABot tuning, the natural language processing and reading comprehension capabilities of Amazon Kendra more accurately understand user queries, and its ML models handle variations in how users phrase their questions, increasing search accuracy and returning more relevant responses.
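
To illustrate the fallback in the first bullet, the following is a minimal sketch of how an application can query an Amazon Kendra index directly; the index ID and question are placeholders, and in practice QnABot performs this lookup for you once the integration is configured:

import boto3

kendra = boto3.client('kendra')

# Placeholder index ID and user question
response = kendra.query(
    IndexId='00000000-0000-0000-0000-000000000000',
    QueryText='Does the campus offer student housing?')

# Each result can be a document excerpt, an FAQ match, or an extracted answer
for item in response['ResultItems'][:3]:
    print(item['Type'], item.get('DocumentTitle', {}).get('Text', ''))
    print(item.get('DocumentExcerpt', {}).get('Text', ''))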

By using ML to automate the handling of common customer questions via their call center and website, OSU-OKC ensured consistent service levels even during their busiest time of year. Widell says, “During peak we can receive over 2,000 calls, which is too many for one agent to handle—however, since launching the QnABot, it’s supported over 34,000 conversations and saved 833 hours in staff time, while ensuring every customer received the same level of service and accuracy.”

Creating your own QnABot and integrating with Amazon Kendra

To get started on your own QnABot journey, see Create a Question and Answer Bot with Amazon Lex and Amazon Alexa. The section Turbocharging QnABot with Amazon Kendra outlines how to integrate QnABot with Amazon Kendra. If you want to follow OSU-OKC’s lead and add the QnABot to your website, you can take advantage of our companion Chatbot UI project.

As you think about configuration and deployment, consider the following options:

  • Deploy the QnABot and Chatbot UI yourself (self-serve), using the project as is
  • Make your own customizations and enhancements to the open-source code
  • Follow OSU-OKC’s example and contact AWS Professional Services for expert help to customize and enhance QnABot, and to integrate with your own communication channels

For more information, watch the team at OSU-OKC present their QnABot solution at AWS re:Invent 2020.

Conclusion

The team at OSU-OKC is excited to build on the early success they have seen from deploying the QnABot, Amazon Kendra, and Amazon Lex. “For customers and students, this has been the most impactful technology that we’ve implemented,” Widell says.

OSU-OKC’s overarching vision is for ML technology to evolve student interactions from transactional exchanges into more meaningful experiences, allowing the university to connect with customers easily, understand their needs, and serve them better. Widell adds, “In the future, we hope to expand our use of QnABot to provide personalized information to students as it relates to their academic schedules, advisement, and other relevant information related to their course of study.”


About the Authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Michael Widell is the Interim President at OSU-OKC. As an innovative agent of change, he has worked to strengthen organizations through redesign and resource optimization, allowing individuals to excel and deliver transformative products and services. Earlier in his career, Widell held leadership positions in the private sector at AT&T and key roles within the General Office of Walmart Inc., where he began his post-collegiate career.
