Monitoring in-production ML models at large scale using Amazon SageMaker Model Monitor

Machine learning (ML) models are impacting business decisions of organizations around the globe, from retail and financial services to autonomous vehicles and space exploration. For these organizations, training and deploying ML models into production is only one step towards achieving business goals. Model performance may degrade over time for several reasons, such as changing consumer purchase patterns in the retail industry and changing economic conditions in the financial industry. Degrading model quality has a negative impact on business outcomes. To proactively address this problem, monitoring the performance of a deployed model is a critical process. Continuous monitoring of production models allows you to identify the right time and frequency to retrain and update the model. Although retraining too frequently can be too expensive, not retraining enough could result in less-than-optimal predictions from your model.

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. After you train an ML model, you can deploy it on SageMaker endpoints that are fully managed and can serve inferences in real time with low latency. After you deploy your model, you can use Amazon SageMaker Model Monitor to continuously monitor the quality of your ML model in real time. You can also configure alerts to notify and trigger actions if any drift in model performance is observed. Early and proactive detection of these deviations enables you to take corrective actions, such as collecting new ground truth training data, retraining models, and auditing upstream systems, without having to manually monitor models or build additional tooling.

In this post, we discuss monitoring the quality of a classification model through classification metrics such as accuracy and precision.

Solution overview

The following diagram illustrates the high-level workflow of Model Monitor. You start with an endpoint to monitor and configure a fraction of inference data to be captured in real time and stored in an Amazon Simple Storage Service (Amazon S3) bucket of your choice. Model Monitor allows you to capture both input data sent to an endpoint and predictions made by the model. After that, you can create a baseline job to generate statistical rules and constraints that serve as the basis for your model analysis later. Then, you define a monitoring job and attach it to an endpoint through a schedule.

Model Monitor starts monitoring jobs to analyze the model prediction data collected during a given period. For monitoring model performance characteristics such as accuracy or precision in real time, Model Monitor allows you to ingest the ground truth labels collected from your applications. Model Monitor automatically merges the ground truth information with prediction data to compute the model performance metrics.

Model Monitor offers four different types of monitoring capabilities to detect and mitigate model drift in real time:

  • Data quality – Helps detect change in statistical properties of independent variables and alerts you when a drift is detected.
  • Model quality – Monitors model performance characteristics such as accuracy and precision in real time and alerts you when there is a degradation in model performance.
  • Model bias – Helps you identify unwanted bias in your ML models and notifies you when a bias is detected.
  • Model explainability – Drift detection alerts you when there is a change in the relative importance of feature attributions.

For more information, see Amazon SageMaker Model Monitor.

The rest of this post dives into a notebook with the various steps involved in monitoring a pre-trained and deployed XGBoost customer churn binary classification model. You can use a similar approach for monitoring a regression model for increased error rates.

For detailed notebooks on other Model Monitor capabilities, see the data drift and bias notebook examples on GitHub.

Beyond the steps discussed in this post, the notebook includes additional steps, such as importing libraries, setting up AWS Identity and Access Management (IAM) permissions, and defining utility functions, that this post doesn’t cover. You can walk through and run the code with the accompanying notebook in the GitHub repo. A minimal sketch of that setup follows.
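The following is only a sketch of the kind of setup the notebook performs, not a substitute for it; the bucket and prefix values are placeholders, and the imports shown are the ones the later snippets in this post rely on (the published notebook may organize them differently):

    import json
    from datetime import datetime
    from threading import Thread
    from time import sleep

    import boto3
    import pandas as pd
    import sagemaker
    from sagemaker import image_uris
    from sagemaker.model import Model
    from sagemaker.model_monitor import (CronExpressionGenerator, DataCaptureConfig,
                                         EndpointInput, ModelQualityMonitor)
    from sagemaker.model_monitor.dataset_format import DatasetFormat
    from sagemaker.s3 import S3Downloader, S3Uploader
    from sagemaker.serializers import CSVSerializer

    session = sagemaker.Session()
    region = session.boto_region_name
    role = sagemaker.get_execution_role()   # IAM role with SageMaker and S3 permissions
    bucket = session.default_bucket()       # or an S3 bucket of your choice
    prefix = "sagemaker/DEMO-ModelMonitor"  # placeholder prefix for the post's artifacts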

Monitoring model quality

To monitor our model quality, we complete two high-level steps:

  • Deploy a pre-trained model with data capture enabled
  • Generate a baseline for model quality performance

Deploying a pre-trained model

In this step, you deploy a pre-trained XGBoost churn prediction model to a SageMaker endpoint. The model was trained using the XGB Churn Prediction Notebook. If you have a pre-trained model that you want to monitor, you can use your own model in this step.

  1. Upload a trained model artifact to an S3 bucket:
    s3_key = f"s3://{bucket}/{prefix}"
    model_url = S3Uploader.upload("model/xgb-churn-prediction-model.tar.gz", s3_key)
    model_url

You should see output similar to the following code:

s3://sagemaker-us-west-2-xxxxxxxxxxxx/sagemaker/DEMO-ModelMonitor-20200901/xgb-churn-prediction-model.tar.gz
  1. Create a SageMaker model object:
    model_name = f"DEMO-xgb-churn-pred-model-monitor-{datetime.utcnow():%Y-%m-%d-%H%M}"
    image_uri = image_uris.retrieve(framework="xgboost", version="0.90-1", region=region)
    model = Model(image_uri=image_uri, model_data=model_url, role=role, sagemaker_session=session)

  1. Create a variable to specify the data capture parameters. To enable data capture for monitoring the model data quality, you specify the capture option called DataCaptureConfig. You can capture the request payload, the response payload, or both with this configuration.
    endpoint_name = f"DEMO-xgb-churn-model-quality-monitor-{datetime.utcnow():%Y-%m-%d-%H%M}"
    print("EndpointName =", endpoint_name)
    
    data_capture_config = DataCaptureConfig(
                            enable_capture=True,
                            sampling_percentage=100,
                            destination_s3_uri=s3_capture_upload_path)
    
    model.deploy(initial_instance_count=1,
                 instance_type='ml.m4.xlarge',
                 endpoint_name=endpoint_name,
                 data_capture_config=data_capture_config)

  1. Create the SageMaker Predictor object from the endpoint to use for invoking the model:
    from sagemaker.predictor import Predictor
    
    predictor = Predictor(endpoint_name=endpoint_name, sagemaker_session=session, serializer=CSVSerializer())

Generating a baseline for model quality performance

In this step, you generate a model quality baseline against which you can continuously monitor model quality. To generate the model quality baseline, you first invoke the endpoint created earlier using validation data. Predictions from the deployed model on this validation data are used as the baseline dataset. You can use either the training or validation dataset to create the baseline. You then use Model Monitor to run a baseline job that computes model performance data and suggests model quality constraints based on the baseline dataset.

  1. Invoke the endpoint with the following code:
    limit = 200  # Need at least 200 samples to compute standard deviations
    i = 0
    with open(f"test_data/{validate_dataset}", "w") as baseline_file:
        baseline_file.write("probability,prediction,label\n")  # our header
        with open('test_data/validation.csv', 'r') as f:
            for row in f:
                (label, input_cols) = row.split(",", 1)
                probability = float(predictor.predict(input_cols))
                prediction = "1" if probability > churn_cutoff else "0"
                baseline_file.write(f"{probability},{prediction},{label}\n")
                i += 1
                if i > limit:
                    break
                print(".", end="", flush=True)
                sleep(0.5)

  1. Examine the predictions from the model:
    !head test_data/validation_with_predictions.csv

You see output similar to the following code:

probability,prediction,label
0.01516005303710699,0,0
0.1684480607509613,0,0
0.21427156031131744,0,0
0.06330718100070953,0,0
0.02791607193648815,0,0
0.014169521629810333,0,0
0.00571369007229805,0,0
0.10534518957138062,0,0
0.025899196043610573,0,0

Next, you configure a processing job to generate statistical rules and constraints (referred to as your baseline) against which the model quality drift can be detected. Model Monitor suggests a set of default baseline statistics and constraints. You can also bring in custom baseline constraints.

  1. Start by uploading the validation data and predictions to Amazon S3:
    baseline_dataset_uri = S3Uploader.upload(f"test_data/{validate_dataset}", baseline_data_uri)
    baseline_dataset_uri

  1. Create the model quality monitor:
    churn_model_quality_monitor = ModelQualityMonitor(
        role=role,
        instance_count=1,
        instance_type='ml.m5.xlarge',
        volume_size_in_gb=20,
        max_runtime_in_seconds=1800,
        sagemaker_session=session
    )

  1. Run the baseline suggestion processing job:
    job = churn_model_quality_monitor.suggest_baseline(
        job_name=baseline_job_name,
        baseline_dataset=baseline_dataset_uri,
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri = baseline_results_uri,
        problem_type='BinaryClassification',
        inference_attribute= "prediction",
        probability_attribute= "probability",
        ground_truth_attribute= "label"
    )
    job.wait(logs=False)

When the baseline job is complete, you can explore the generated metrics and constraints.

  1. View the binary classification metrics with the following code:
    baseline_job = churn_model_quality_monitor.latest_baselining_job  # the job started by suggest_baseline
    binary_metrics = baseline_job.baseline_statistics().body_dict["binary_classification_metrics"]
    pd.json_normalize(binary_metrics).T

The following screenshot shows your results.

  1. View the constraints generated:
    constraints = json.loads(S3Downloader.read_file(constraints_file))
    constraints["binary_classification_constraints"]
    {'recall': {'threshold': 0.5714285714285714, 'comparison_operator': 'LessThanThreshold'},
     'precision': {'threshold': 1.0, 'comparison_operator': 'LessThanThreshold'},
     'accuracy': {'threshold': 0.9402985074626866, 'comparison_operator': 'LessThanThreshold'},
     'true_positive_rate': {'threshold': 0.5714285714285714, 'comparison_operator': 'LessThanThreshold'},
     'true_negative_rate': {'threshold': 1.0, 'comparison_operator': 'LessThanThreshold'},
     'false_positive_rate': {'threshold': 0.0, 'comparison_operator': 'GreaterThanThreshold'},
     'false_negative_rate': {'threshold': 0.4285714285714286, 'comparison_operator': 'GreaterThanThreshold'},
     'auc': {'threshold': 1.0, 'comparison_operator': 'LessThanThreshold'},
     'f0_5': {'threshold': 0.8695652173913042, 'comparison_operator': 'LessThanThreshold'},
     'f1': {'threshold': 0.7272727272727273, 'comparison_operator': 'LessThanThreshold'},
     'f2': {'threshold': 0.625, 'comparison_operator': 'LessThanThreshold'}}

From the generated constraints, you can see that Model Monitor alerts you if the recall score of your model regresses and drops below 0.571. Similarly, it alerts you when precision falls below 1.0. Alerting whenever precision dips below 1.0 may be too aggressive, but you can modify the generated constraints based on your use case and business needs, as shown in the sketch that follows.
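For example, one way to relax the precision constraint is to edit the generated constraints dictionary and store the modified copy in Amazon S3; the relaxed threshold below is arbitrary, and you should check the SDK documentation for the exact form the monitoring schedule's constraints argument accepts:

    # Relax the suggested precision constraint before using it for monitoring
    constraints["binary_classification_constraints"]["precision"]["threshold"] = 0.9

    modified_constraints_uri = f"{baseline_results_uri}/constraints_modified.json"
    S3Uploader.upload_string_as_file_body(json.dumps(constraints), modified_constraints_uri)
    # You could then point the monitoring schedule at this file instead of
    # baseline_job.suggested_constraints()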

Setting up continuous model monitoring

Now that you have the baseline of the model quality, you set up a continuous model monitoring job that monitors the quality of the deployed model against the baseline to identify model quality drift.

In addition to the generated baseline, Model Monitor needs two additional inputs: predictions made by the deployed model endpoint and the ground truth data provided by the model-consuming application. Because you already enabled data capture on the endpoint, prediction data is captured in Amazon S3. The ground truth data depends on what your model is predicting and the business use case. In this case, because the model is predicting customer churn, ground truth data may indicate whether the customer actually left the company. For the purposes of this notebook, you generate synthetic data as ground truth.

  1. First generate traffic to the deployed endpoint. If there is no traffic, the monitoring jobs are marked as Failed because there is no data to process. See the following code:
    def invoke_endpoint(ep_name, file_name):
        with open(file_name, 'r') as f:
            i = 0
            for row in f:
                payload = row.rstrip('\n')
                response = session.sagemaker_runtime_client.invoke_endpoint(
                    EndpointName=ep_name,
                    ContentType='text/csv',
                    Body=payload,
                    InferenceId=str(i),  # unique ID per row
                )["Body"].read()
                i += 1
                sleep(1)

    def invoke_endpoint_forever():
        while True:
            invoke_endpoint(endpoint_name, 'test_data/test-dataset-input-cols.csv')

    thread = Thread(target=invoke_endpoint_forever)
    thread.start()

  1. View the data captured with the following code:
    for _ in range(120):
        capture_files = sorted(S3Downloader.list(f"{s3_capture_upload_path}/{endpoint_name}"))
        if capture_files:
            capture_file = S3Downloader.read_file(capture_files[-1]).split("\n")
            capture_record = json.loads(capture_file[0])
            if "inferenceId" in capture_record["eventMetadata"]:
                break
        print(".", end="", flush=True)
        sleep(1)
    print()
    print("Found Capture Files:")
    print("\n ".join(capture_files[-5:]))

You see output similar to the following:

Found Capture Files:
s3://sagemaker-us-west-2-303008809627/sagemaker/Churn-ModelQualityMonitor-20201129/datacapture/DEMO-xgb-churn-model-quality-monitor-2020-12-01-2214/AllTraffic/2020/12/01/22/23-36-108-9df12912-2696-431e-a4ef-a76b3c3f7d32.jsonl
 s3://sagemaker-us-west-2-303008809627/sagemaker/Churn-ModelQualityMonitor-20201129/datacapture/DEMO-xgb-churn-model-quality-monitor-2020-12-01-2214/AllTraffic/2020/12/01/22/24-36-254-df884bcb-405c-4277-9cc8-517f3f31b56f.jsonl
  1. View the contents of a single file:
    print(json.dumps(capture_record, indent=2))

You see output similar to the following:

{
  "captureData": {
    "endpointInput": {
      "observedContentType": "text/csv",
      "mode": "INPUT",
      "data": "75,0,109.0,88,259.3,120,182.1,119,13.3,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0n",
      "encoding": "CSV"
    },
    "endpointOutput": {
      "observedContentType": "text/csv; charset=utf-8",
      "mode": "OUTPUT",
      "data": "0.7990730404853821",
      "encoding": "CSV"
    }
  },
  "eventMetadata": {
    "eventId": "01e27fce-a00a-4707-847e-9748d6a8e580",
    "inferenceTime": "2020-12-01T22:24:36Z"
  },
  "eventVersion": "0"
}

Next, you generate synthetic ground truth. Model Monitor allows you to ingest the ground truth data collected periodically from your application and merge it with the prediction data to compute model performance metrics. You can upload the ground truth labels to Amazon S3 periodically as they arrive. Model Monitor automatically merges the ground truth with the prediction data and evaluates model performance against it. The merged data is stored in Amazon S3 and can be accessed later for retraining your models. You can encrypt the data in this bucket and configure fine-grained security, access control mechanisms, and data retention policies.

  1. Enter the following code to generate ground truth in the way that the SageMaker first party merge container expects:
    import random
    def ground_truth_with_id(inference_id):
        random.seed(inference_id) # to get consistent results
        rand = random.random()
        return {
            'groundTruthData': {
                'data': "1" if rand < 0.7 else "0", # randomly generate positive labels 70% of the time
                'encoding': 'CSV'
            },
            'eventMetadata': {
                'eventId': str(inference_id),
            },
            'eventVersion': '0',
        }
    def upload_ground_truth(records, upload_time):
        fake_records = [ json.dumps(r) for r in records ]
        data_to_upload = "\n".join(fake_records)
        target_s3_uri = f"{ground_truth_upload_path}/{upload_time:%Y/%m/%d/%H/%M%S}.jsonl"
        print(f"Uploading {len(fake_records)} records to", target_s3_uri)
        S3Uploader.upload_string_as_file_body(data_to_upload, target_s3_uri)

The model quality job fails if either the data capture or the ground truth data is missing.
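As an illustration, the helpers above could be called periodically (for example, once per hour) to label the captured inferences. The record count below is an assumption and should match the number of inferences your traffic generator sends:

    from datetime import timedelta

    NUM_GROUND_TRUTH_RECORDS = 334  # assumed size of the test dataset sent to the endpoint

    def generate_fake_ground_truth(upload_time):
        # One synthetic label per InferenceId produced by the traffic generator
        records = [ground_truth_with_id(i) for i in range(NUM_GROUND_TRUTH_RECORDS)]
        upload_ground_truth(records, upload_time)

    # Backfill the previous hour and label the current hour
    generate_fake_ground_truth(datetime.utcnow() - timedelta(hours=1))
    generate_fake_ground_truth(datetime.utcnow())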

Next, you set up a monitoring schedule that monitors the real-time performance of the model against the baseline.

  1. Set the name of the monitoring scheduler:
    churn_monitor_schedule_name = f"DEMO-xgb-churn-monitoring-schedule-{datetime.utcnow():%Y-%m-%d-%H%M}"
    

You now create the EndpointInput object. For the monitoring schedule, you need to specify how to interpret an endpoint’s output. Because the endpoint in this notebook outputs CSV data, the following code specifies that the first column of the output, 0, contains a probability (of churn in this example). You further specify 0.8 as the cutoff used to determine a positive label (that is, predict that a customer will churn).

  1. Create the EndpointInput object with the following code:
    endpointInput = EndpointInput(endpoint_name=predictor.endpoint_name, 
                                  probability_attribute="0", 
                                  probability_threshold_attribute=0.8,
                                  destination='/opt/ml/processing/input_data')

  1. Create the monitoring schedule. You specify how frequently the monitoring job runs using ScheduleExpression. In the following code, we set the schedule to one time per hour. For MonitoringType, you specify ModelQuality.
    response = churn_model_quality_monitor.create_monitoring_schedule(
        monitor_schedule_name=churn_monitor_schedule_name,
        endpoint_input=endpointInput,
        output_s3_uri = baseline_results_uri,
        problem_type='BinaryClassification',
        ground_truth_input=ground_truth_upload_path,
        constraints=baseline_job.suggested_constraints(),
        schedule_cron_expression=CronExpressionGenerator.hourly(), 
        enable_cloudwatch_metrics=True
          )

Each time the model quality monitoring job runs, it first runs a merge job and then a monitoring job. The merge job combines two different datasets: the inference data collected by data capture on the endpoint and the ground truth labels provided by your application.

  1. Examine a single run of the scheduled monitoring job:
    executions = churn_model_quality_monitor.list_executions()
    latest_execution = executions[-1]
    latest_execution.describe()
    execution = churn_model_quality_monitor.describe_schedule()["LastMonitoringExecutionSummary"]
    status = execution['MonitoringExecutionStatus']
    
    while status in ["Pending", "InProgress"]:
        print("Waiting for execution to finish", end="")
        latest_execution.wait(logs=False)
        latest_job = latest_execution.describe()
        print()
        print(f"{latest_job['ProcessingJobName']} job status:", latest_job['ProcessingJobStatus'])
        print(f"{latest_job['ProcessingJobName']} job exit message, if any:", latest_job.get('ExitMessage'))
        print(f"{latest_job['ProcessingJobName']} job failure reason, if any:", latest_job.get('FailureReason'))
        sleep(30) # model quality executions consist of two Processing jobs, wait for second job to start
        latest_execution = churn_model_quality_monitor.list_executions()[-1]
        execution = churn_model_quality_monitor.describe_schedule()["LastMonitoringExecutionSummary"]
        status = execution['MonitoringExecutionStatus']
    
    print("Execution status is:", status)
        
    if status != 'Completed':
        print(execution)
        print("====STOP==== \n No completed executions to inspect further. Please wait till an execution completes or investigate previously reported failures.")

  1. Check the violations against the baseline constraints:
    pd.options.display.max_colwidth = None
    violations = latest_execution.constraint_violations().body_dict["violations"]
    violations_df = pd.json_normalize(violations)
    violations_df.head(10)

The following screenshot shows the various violations generated.

From this list, you can see the false positive rate and false negative rate are both greater than the constraints generated or modified during the baselining step. Similarly, the accuracy and precision metrics are less than expected, indicating model quality degradation.

Analyzing model quality with Amazon CloudWatch metrics

In addition to the violations, the monitoring schedule also emits Amazon CloudWatch metrics. In this step, you view the metrics generated and set up a CloudWatch alarm to trigger when the model quality drifts from the baseline thresholds. You can also use CloudWatch alarms to trigger remedial actions such as retraining your model or updating the training dataset.

  1. To view the list of the CloudWatch metrics generated, enter the following code:
    cw_client = boto3.Session().client('cloudwatch')
    namespace='aws/sagemaker/Endpoints/model-metrics'
    cw_dimensions=[
            {
                'Name': 'Endpoint',
                'Value': endpoint_name
            },
            {
                'Name': 'MonitoringSchedule',
                'Value': churn_monitor_schedule_name
            }
    ]
    
    paginator = cw_client.get_paginator('list_metrics')
    for response in paginator.paginate(Dimensions=cw_dimensions, Namespace=namespace):
        model_quality_metrics = response['Metrics']
        
        for metric in model_quality_metrics:
            print(metric['MetricName'])

You see output similar to the following:

f0_5_best_constant_classifier
f2_best_constant_classifier
f1_best_constant_classifier
auc
precision
accuracy_best_constant_classifier
true_positive_rate
f1
accuracy
false_positive_rate
f0_5
true_negative_rate
false_negative_rate
recall_best_constant_classifier
precision_best_constant_classifier
recall
f2
  1. Create an alarm for when a specific metric doesn’t meet the threshold configured. In the following code, we create an alarm if the F2 value of the model falls below the threshold suggested by the baseline constraints:
    alarm_name='MODEL_QUALITY_F2_SCORE'
    alarm_desc='Trigger a CloudWatch alarm when the f2 score drifts away from the baseline constraints'
    model_quality_f2_drift_threshold=0.625 ##Setting this threshold purposefully low to see the alarm quickly.
    metric_name='f2'
    namespace='aws/sagemaker/Endpoints/model-metrics'
    
    #endpoint_name=endpoint_name
    #monitoring_schedule_name=mon_schedule_name
    
    cw_client.put_metric_alarm(
        AlarmName=alarm_name,
        AlarmDescription=alarm_desc,
        ActionsEnabled=True,
       #AlarmActions=[sns_notifications_topic],
        MetricName=metric_name,
        Namespace=namespace,
        Statistic='Average',
        Dimensions=[
            {
                'Name': 'Endpoint',
                'Value': endpoint_name
            },
            {
                'Name': 'MonitoringSchedule',
                'Value': churn_monitor_schedule_name
            }
        ],
        Period=600,
        EvaluationPeriods=1,
        DatapointsToAlarm=1,
        Threshold=model_quality_f2_drift_threshold,
        ComparisonOperator='LessThanOrEqualToThreshold',
        TreatMissingData='breaching'
    )

In a few minutes, you should see the CloudWatch alarm created. The alarm first shows the status Insufficient data and then changes to In alarm. You can view its status on the CloudWatch console.

After the alarm is in place, you can decide what actions to take when it fires. A possible action could be updating the training data and retraining the model; for example, you could notify an Amazon SNS topic that kicks off that workflow, as sketched below.
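One way to wire this up is to create an SNS topic (the topic name here is hypothetical) and re-create the alarm with a notification action attached; you could subscribe an email address or an AWS Lambda function that starts a retraining pipeline to the topic:

    sns_client = boto3.Session().client('sns')

    # Hypothetical topic; subscribe email, Lambda, or a retraining workflow trigger to it
    topic_arn = sns_client.create_topic(Name='model-quality-f2-drift')['TopicArn']

    cw_client.put_metric_alarm(
        AlarmName=alarm_name,
        AlarmDescription=alarm_desc,
        ActionsEnabled=True,
        AlarmActions=[topic_arn],  # notify the SNS topic when the alarm fires
        MetricName=metric_name,
        Namespace=namespace,
        Statistic='Average',
        Dimensions=[
            {'Name': 'Endpoint', 'Value': endpoint_name},
            {'Name': 'MonitoringSchedule', 'Value': churn_monitor_schedule_name}
        ],
        Period=600,
        EvaluationPeriods=1,
        DatapointsToAlarm=1,
        Threshold=model_quality_f2_drift_threshold,
        ComparisonOperator='LessThanOrEqualToThreshold',
        TreatMissingData='breaching'
    )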

Visualizing the reports in Amazon SageMaker Studio

You can collect all the metrics that Model Monitor emits and view them in Amazon SageMaker Studio, a fully integrated development environment (IDE) for ML, so you can visually analyze your model performance without writing code or using third-party tools. You can also run ad-hoc analysis on the reports generated in a SageMaker notebook instance.
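As one example of ad-hoc analysis, the following sketch pulls the hourly f2 metric emitted by the monitoring schedule from CloudWatch into a pandas DataFrame for plotting; it reuses the CloudWatch client, endpoint name, and schedule name from the earlier steps:

    from datetime import timedelta

    response = cw_client.get_metric_statistics(
        Namespace='aws/sagemaker/Endpoints/model-metrics',
        MetricName='f2',
        Dimensions=[
            {'Name': 'Endpoint', 'Value': endpoint_name},
            {'Name': 'MonitoringSchedule', 'Value': churn_monitor_schedule_name}
        ],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=['Average']
    )

    f2_over_time = pd.DataFrame(response['Datapoints']).sort_values('Timestamp')
    f2_over_time.plot(x='Timestamp', y='Average', title='f2 score per monitoring run')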

The following figure shows sample metrics and charts in Studio. Run the notebook in the Studio environment to view all metrics and charts related to the customer churn example.

Conclusion

SageMaker Model Monitor is a powerful tool that enables organizations employing ML models to create a continuous monitoring and model update cycle. This post discussed that monitoring capability with a focus on monitoring the quality of a deployed ML model. The notebook included with the post provides detailed instructions on monitoring an XGBoost binary classification model, a view into the baseline constraints generated and the violations against them, and a way to configure automated responses to those violations using CloudWatch alarms. This end-to-end workflow enables you to build continuous model training, monitoring, and model update pipelines. Give Model Monitor a try and leave your feedback in the comments.


About the Authors

Sireesha Muppala is an AI/ML Specialist Solutions Architect at AWS, providing guidance to customers on architecting and implementing machine learning solutions at scale. She received her Ph.D. in Computer Science from University of Colorado, Colorado Springs. In her spare time, Sireesha loves to run and hike Colorado trails.

David Nigenda is a Software Development Engineer in the Amazon SageMaker team. His current work focuses on providing useful insights on production machine learning workflows. In his spare time he tries to keep up with his kids.

Archana Padmasenan is a Senior Product Manager at Amazon SageMaker. She enjoys building products that delight customers.

End-to-End, Transferable Deep RL for Graph Optimization

Posted by Yanqi Zhou and Sudip Roy, Research Scientists, Google Research

An increasing number of applications are driven by large and complex neural networks trained on diverse sets of accelerators. This process is facilitated by ML compilers that map high-level computational graphs to low-level, device-specific executables. In doing so, ML compilers need to solve many optimization problems, including graph rewriting, assignment of operations on devices, operation fusion, layout and tiling of tensors, and scheduling. For example, in a device placement problem, the compiler needs to determine the mapping between operations in the computational graph to the target physical devices so that an objective function, such as training step time, can be minimized. The placement performance is determined by a mixture of intricate factors, including inter-device network bandwidth, peak device memory, co-location constraints, etc., making it challenging for heuristics or search-based algorithms, which typically settle for fast, but sub-optimal, solutions. Furthermore, heuristics are hard to develop and maintain, especially as newer model architectures emerge.

Recent attempts at using learning-based approaches have demonstrated promising results, but they have a number of limitations that make them infeasible to deploy in practice. First, these approaches do not easily generalize to unseen graphs, especially those arising from newer model architectures; second, they have poor sample efficiency, leading to high resource consumption during training. Finally, they are only able to solve a single optimization task, and consequently do not capture the dependencies across the tightly coupled optimization problems in the compilation stack.

In “Transferable Graph Optimizers for ML Compilers”, recently published as an oral paper at NeurIPS 2020, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO) that overcomes all of the above limitations. We demonstrate 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization. On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves an average 21% improvement over expert optimization and an 18% improvement over the prior state of the art with 15x faster convergence.

Graph Optimization Problems in ML Compilers
There are three coupled optimization tasks that frequently arise in ML compilers, which we formulate as decision problems that can be solved using a learned policy. The decision problems for each of the tasks can be reframed as making a decision for each node in the computational graph.

The first optimization task is device placement, where the goal is to determine how best to assign the nodes of the graph to the physical devices on which it runs such that the end-to-end run time is minimized.

The second optimization task is operation scheduling. An operation in a computational graph is ready to run when its incoming tensors are present in the device memory. A frequently used scheduling strategy is to maintain a ready queue of operations for each device and schedule operations in first-in-first-out order. However, this scheduling strategy does not take into account the downstream operations placed on other devices that might be blocked by an operation, and often leads to schedules with underutilized devices. To find schedules that can keep track of such cross-device dependencies, our approach uses a priority-based scheduling algorithm that schedules operations in the ready queue based on the priority of each. Similar to device placement, operation scheduling can then be formulated as the problem of learning a policy that assigns a priority for each node in the graph to maximize a reward based on run time.
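To make the priority-based strategy concrete, the following toy sketch (not the paper's implementation) schedules a small computational graph with one ready queue per device, ordered by a per-node priority; in GO, those priorities would come from the learned policy rather than being hand-assigned as they are here:

    import heapq
    from collections import defaultdict

    def priority_schedule(ops, priority, device_of):
        """ops: {op: [predecessor ops]}; priority: {op: float}; device_of: {op: device}."""
        indegree = {op: len(preds) for op, preds in ops.items()}
        successors = defaultdict(list)
        for op, preds in ops.items():
            for pred in preds:
                successors[pred].append(op)

        # One ready queue per device; higher-priority ops run first (negate for a min-heap)
        ready = {device: [] for device in set(device_of.values())}
        for op, deg in indegree.items():
            if deg == 0:
                heapq.heappush(ready[device_of[op]], (-priority[op], op))

        schedule = []
        while any(ready.values()):
            for device, queue in ready.items():
                if not queue:
                    continue
                _, op = heapq.heappop(queue)
                schedule.append((device, op))
                for succ in successors[op]:
                    indegree[succ] -= 1
                    if indegree[succ] == 0:
                        heapq.heappush(ready[device_of[succ]], (-priority[succ], succ))
        return schedule

    # Toy graph: a -> {b, c} -> d, with c given a higher priority than b
    ops = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    priorities = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}
    devices = {"a": "gpu0", "b": "gpu0", "c": "gpu1", "d": "gpu0"}
    print(priority_schedule(ops, priorities, devices))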

The third optimization task is operation fusion. For brevity we omit a detailed discussion of this problem here, and instead just note that similar to priority-based scheduling, operation fusion can also use a priority-based algorithm to decide which nodes to fuse. The goal of the policy network in this case is again to assign a priority for each node in the graph.

Finally, it is important to recognize that the decisions taken in each of the three optimization problems can affect the optimal decision for the other problems. For example, placing two nodes on two different devices effectively disables fusion and introduces a communication delay that can influence scheduling.

RL Policy Network Architecture
Our research presents GO, a deep RL framework that can be adapted to solve each of the aforementioned optimization problems — both individually as well as jointly. There are three key aspects of the proposed architecture:

First, we use graph neural networks (specifically GraphSAGE) to capture the topological information encoded in the computational graph. The inductive network of GraphSAGE leverages node attribute information to generalize to previously unseen graphs, which enables decision making for unseen data without incurring significant cost on training.

Second, computational graphs for many models often contain more than 10k nodes. Solving the optimization problems effectively over such large scales requires that the network is able to capture long-range dependencies between nodes. GO’s architecture includes a scalable attention network that uses segment-level recurrence to capture such long-range node dependencies.

Third, ML compilers need to solve optimization problems over a wide variety of graphs from different application domains. A naive strategy of training a shared policy network with heterogeneous graphs is unlikely to capture the idiosyncrasies of a particular class of graphs. To overcome this, GO uses a feature modulation mechanism that allows the network to specialize for specific graph types without increasing the number of parameters.

Overview of GO: An end-to-end graph policy network that combines graph embedding and sequential attention.

To jointly solve multiple dependent optimization tasks, GO has the ability to add additional recurrent attention layers for each task with parameters shared across different tasks. The recurrent attention layers with residual connections of actions enable the tracking of inter-task dependencies.

Multi-task policy network that extends GO’s policy network with additional recurrent attention layers for each task and residual connections. GE: Graph Embedding, FC: Fully-Connected Layer, Nxf: fusion action dimension, Fxd: placement action dimension, Nxs: scheduling action dimension.

Results
Next, we present evaluation results on a single-task speedup on a device placement task based on real-hardware measurements, generalization to unseen graphs with different GO variants, and multi-task performance jointly optimizing operations fusion, device placement, and scheduling.

Speedup:
To evaluate the performance of this architecture, we apply GO to a device placement problem based on real-hardware evaluation, where we first train the model separately on each of our workloads. This approach, called GO-one, consistently outperforms expert manual placement (HP), TensorFlow METIS placement, and Hierarchical Device Placement (HDP) — the current state-of-the-art reinforcement learning-based device placement. Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.

Our empirical results show that GO-one consistently outperforms expert placement, TensorFlow METIS placement, and hierarchical device placement (HDP). Because GO is designed in a way to scale up to extremely large graphs consisting of over 80,000 nodes like an 8-layer Google Neural Machine Translation (GNMT) model, it outperforms previous approaches, including HDP, REGAL, and Placeto. GO achieves optimized graph runtimes for large graphs like GNMT that are 21.7% and 36.5% faster than HP and HDP, respectively. Overall, GO-one achieves on average 20.5% and 18.2% run time reduction across a diverse set of 14 graphs, compared to HP and HDP respectively.

Generalization:
GO generalizes to unseen graphs using offline pre-training followed by fine-tuning on the unseen graphs. During pre-training, we train GO on heterogeneous subsets of graphs from the training set. We train GO for 1000 steps on each such batch of graphs before switching to the next. This pretrained model is then fine-tuned (GO-generalization+finetune) on hold-out graphs for fewer than 50 steps, which typically takes less than one minute. GO-generalization+finetune for hold-out graphs outperforms both expert placement and HDP consistently on all datasets, and on average matches GO-one.

We also run inference directly on just the pre-trained model without any fine-tuning for the target hold-out graphs, and name this GO-generalization-zeroshot. The performance of this untuned model is only marginally worse than GO-generalization+finetune, while being slightly better than expert placement and HDP. This indicates that both graph embedding and the learned policies transfer efficiently, allowing the model to generalize to the unseen data.

Generalization across heterogeneous workload graphs. The figure shows a comparison of two different generalization strategies for GO when trained with graphs from 5 (except the held-out one) of the 6 workloads (Inception-v3, AmoebaNet, recurrent neural network language model (RNNLM), Google Neural Machine Translation (GNMT), Transformer-XL (TRFXL), WaveNet), and evaluated on the held-out workload (x-axis).

Co-optimizing placement, scheduling, and fusion (pl+sch+fu):
Optimizing simultaneously for placement, scheduling, and fusion provides a 30%-73% speedup compared to the single-GPU unoptimized case and a 33%-60% speedup compared to TensorFlow default placement, scheduling, and fusion. Compared to optimizing each task individually, multi-task GO (pl+sch+fu) outperforms single-task GO (p | sch | fu) — optimizing all tasks, one at a time — by an average of 7.8%. Furthermore, for all workloads, co-optimizing all three tasks offers faster run time than optimizing any two of them and using the default policy for the third.

Run time for various workloads on multi-task optimizations. TF-default: TF GPU default placement, fusion, and scheduling. hp-only: human placement only with default scheduling and fusion. pl-only: GO placement only with default scheduling and fusion. pl | sch: GO optimizes placement and scheduling individually with default fusion. pl+sch: multi-task GO co-optimizes placement and scheduling with default fusion. sch+fu: multi-task GO co-optimizes scheduling and fusion with human placement. pl | sch | fu: GO optimizes placement, scheduling, and fusion separately. pl+sch+fu: multi-task GO co-optimizes placement, scheduling, and fusion.

Conclusion
The increasing complexity and diversity of hardware accelerators has made the development of robust and adaptable ML frameworks onerous and time-consuming, often requiring multiple years of effort from hundreds of engineers. In this article, we demonstrated that many of the optimization problems in such frameworks can be solved efficiently and optimally using a carefully designed learned approach.

Acknowledgements
This is joint work with Daniel Wong, Amirali Abdolrashidi, Peter Ma, Qiumin Xu, Hanxiao Liu, Mangpo Phitchaya Phothilimthana, Shen Wang, Anna Goldie, Azalia Mirhoseini, and James Laudon.

Sustainable and Attainable: Zoox Unveils Autonomous Robotaxi Powered by NVIDIA

When it comes to future mobility, you may not have to pave as many paradises for personal car parking lots.

This week, autonomous mobility company Zoox unveiled its much-anticipated purpose-built robotaxi. Designed for everyday urban mobility, the vehicle is powered by NVIDIA and is one of the first level 5 robotaxis featuring bi-directional capabilities, providing a concrete view into the next generation of intelligent transportation.

Zoox and NVIDIA first announced their partnership in 2017, with the innovative startup leveraging the high-performance, energy-efficient compute of NVIDIA to build a level 5 vehicle from the ground up. It was a significant milestone toward an autonomous future. Zoox is also an alumnus of NVIDIA Inception, our accelerator program for startups transforming industries with AI and data science.

Robotaxis are set to transform the way we move. Experts at UBS estimate these vehicles could create a $2 trillion market globally by 2030, while reducing the cost of daily travel for riders by more than 80 percent. With greater affordability, robotaxis are expected to decrease car ownership in urban areas — a recent survey of 6,500 U.S. drivers showed nearly half would be willing to give up car ownership if robotaxis became widespread.

With Zoox and the openness and scalability of NVIDIA AI technology, this vision of safer and more efficient mobility is no longer a faraway future, but a close reality.

Autonomy Forwards and Backwards

Unlike current passenger vehicles that focus on the driver, Zoox is designed for riders. The vehicle was built from the start to optimize features necessary for autonomous, electric mobility, such as sensor placement and large batteries.

Each vehicle features four-wheel steering, allowing it to pull into tight curb spaces without parallel parking. This capability makes it easy for Zoox to pick up and drop off riders, quickly getting to the curb and out of the flow of traffic to provide a better and safer experience.

The vehicle is bidirectional, so there is no fixed front or back end. It can pull forward into a driveway and forward out onto the road without reversing. In the case of an unexpected road closure, the vehicle can simply flip directions or use four-wheel steering to turn around. No reversing required.

Inside the vehicle, carriage seating facilitates clear visibility of the vehicle’s surroundings as well as socializing. Each seat has the same amount of space and delivers the same experience — there’s no bad seat in the house. Carriage seating also makes room for a wider aisle, allowing passengers to easily pass by each other without getting up or contorting into awkward positions.

All together, these design details give riders the freedom of seamless mobility, backed by safety innovations not featured in conventional cars.

One Solution

NVIDIA provides the only end-to-end platform for developing software-defined vehicles with a centralized architecture, spanning from the data center to the vehicle.

For robotaxis, achieving level 5 autonomy requires compute with enough headroom to continuously add new features and capabilities. NVIDIA enables this level of performance, starting with the infrastructure for training and validation and extending to in-vehicle compute.

These vehicles can be continuously updated over the air with deep neural networks that are developed and improved in the data center.

The open and modular nature of the NVIDIA platform enables robotaxi companies to create custom configurations to accommodate new designs, such as Zoox’s symmetrical layout, with cameras, radar and lidar that achieve a 270-degree field of view on all four corners of the vehicle.

With the ability to use as many processors as needed to analyze data from the dozens of onboard sensors, developers can ensure safety through diversity and redundancy of systems and algorithms.

By leveraging NVIDIA, Zoox is using the only proven, high-performance solution for robotaxis, putting the vision of on-demand autonomous mobility within reach.

All AIs on Quality: Startup’s NVIDIA Jetson-Enabled Inspections Boost Manufacturing

Once the founder of a wearable computing startup, Arye Barnehama understands the toils of manufacturing consumer devices. He moved to Shenzhen in 2014 to personally oversee production lines for his brain wave-monitoring headband, Melon.

It was an experience that left an impression: manufacturing needed automation.

His next act is Elementary Robotics, which develops robotics for manufacturing. Elementary Robotics, based in Los Angeles, was incubated at Pasadena’s Idealab.

Founded in 2017, Elementary Robotics recently landed a $12.7 million Series A round of funding, including investment from customer Toyota.

Elementary Robotics is in deployment with customers who track thousands of parts. Its system is constantly retraining algorithms for improvements to companies’ inspections.

“Using the NVIDIA Jetson edge AI platform, we put quite a bit of engineering effort into tracking for 100 percent of inferences, at high frame rates,” said Barnehama, the company’s CEO.

Jetson for Inspections

Elementary Robotics has developed its own hardware and software for inspections used in manufacturing. It offers a Jetson-powered robot that can examine parts for defects. It aims to improve quality with better tracking of parts and problems.

Detecting the smallest of defects on a fast moving production line requires processing of high-resolution camera data with AI in real time. This is made possible with the embedded CUDA-enabled GPU and the CUDA-X AI software on Jetson. As the Jetson platform makes decisions from video streams, these are all ingested into its cloud database so that customers are able to observe and query the data.

The results, along with the live video, are also then published to the Elementary Robotics web application, which can be accessed from anywhere.

Elementary Robotics’ system also enables companies to inspect parts from suppliers before putting them into the production line, avoiding costly failures. It is used for inspections of assemblies on production lines as well as for quality control at post-production.

Its applications include inspections of electronic printed circuit boards and assemblies, automotive components, and gears for light industrial use. Elementary Robotics customers also use its platform in packaging and consumer goods such as bottles, caps and labels.

“Everyone’s demand for quality is always going up,” said Barnehama. “We run real-time inference on the edge with NVIDIA systems for inspections to help improve quality.”

The Jetson platform recently demonstrated leadership in MLPerf AI inference benchmarks in SoC-based edge devices for computer vision and conversational AI use cases.

Elementary Robotics is a member of NVIDIA Inception, a virtual accelerator program that helps startups in AI and data science get to market faster.

Traceability of Operations

The startup’s Jetson-enabled machine learning system can handle split-second anomaly detection to catch mistakes on the production lines. And when there’s a defective part returned, companies that rely on Elementary Robotics can try to understand how it happened. Use cases include electronics, automotive, medical, consumer packaged goods, logistics and other applications.

For manufacturers, such traceability of operations is important so that companies can go back and find and fix the causes of problems for improved reliability, said Barnehama.

“You want to be able to say, ‘OK, this defective item got returned, let me look up when it was inspected and make sure I have all the inspection data,’”  added Barnehama.

NVIDIA Jetson is used by enterprise customers, developers and DIY enthusiasts for creating AI applications, as well as students and educators for learning and teaching AI.

Training a reinforcement learning Agent with Unity and Amazon SageMaker RL

Unity is one of the most popular game engines that has been adopted not only for video game development but also by industries such as film and automotive. Unity offers tools to create virtual simulated environments with customizable physics, landscapes, and characters. The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables developers to train reinforcement learning (RL) agents against the environments created on Unity.

Reinforcement learning is an area of machine learning (ML) that teaches a software agent how to take actions in an environment in order to maximize a long-term objective. For more information, see Amazon SageMaker RL – Managed Reinforcement Learning with Amazon SageMaker. ML-Agents is becoming an increasingly popular tool among many gaming companies for use cases such as game level difficulty design, bug fixing, and cheat detection. Currently, ML-Agents is used to train agents locally, and can’t scale to efficiently use more computing resources. You have to train RL agents on a local Unity engine for an extensive amount of time before obtaining the trained model. The process is time-consuming and not scalable for processing large amounts of data.

In this post, we demonstrate a solution by integrating the ML-Agents Unity interface with Amazon SageMaker RL, allowing you to train RL agents on Amazon SageMaker in a fully managed and scalable fashion.

Overview of solution

SageMaker is a fully managed service that enables fast model development. It provides many built-in features to assist you with training, tuning, debugging, and model deployment. SageMaker RL builds on top of SageMaker, adding pre-built RL libraries and making it easy to integrate with different simulation environments. You can use built-in deep learning frameworks such as TensorFlow and PyTorch with various built-in RL algorithms from the RLlib library to train RL policies. Infrastructures for training and inference are fully managed by SageMaker, so you can focus on RL formulation. SageMaker RL also provides a set of Jupyter notebooks, demonstrating varieties of domain RL applications in robotics, operations research, finance, and more.

The following diagram illustrates our solution architecture.

In this post, we walk through the specifics of training an RL agent on SageMaker by interacting with the sample Unity environment. To access the complete notebook for this post, see the SageMaker notebook example on GitHub.

Setting up your environments

To get started, we import the needed Python libraries and set up environments for permissions and configurations. The following code contains the steps to set up an Amazon Simple Storage Service (Amazon S3) bucket, define the training job prefix, specify the training job location, and create an AWS Identity and Access Management (IAM) role:

import sagemaker
import boto3
 
# set up the linkage and authentication to the S3 bucket
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

# create a descriptive job name
job_name_prefix = 'rl-unity-ray'

# configure where training happens – local or SageMaker instance
local_mode = False

if local_mode:
    instance_type = 'local'
else:
    # If on SageMaker, pick the instance type
    instance_type = "ml.c5.2xlarge"

# create an IAM role
try:
    role = sagemaker.get_execution_role()
except:
    role = get_execution_role()

print("Using IAM role arn: {}".format(role))

Building a Docker container

SageMaker uses Docker containers to run scripts, train algorithms, and deploy models. A Docker container is a standalone package of software that manages all the code and dependencies, and it includes everything needed to run an application. We start by building on top of a pre-built SageMaker Docker image that contains dependencies for Ray, then install the required core packages:

  • gym-unity – Unity provides a wrapper to wrap Unity environment into a gym interface, an open-source library that gives you access to a set of classic RL environments
  • mlagents-envs – Package that provides a Python API to allow direct interaction with the Unity game engine

Depending on the status of the machine, the Docker building process may take up to 10 minutes. For all pre-built SageMaker RL Docker images, see the GitHub repo.

Unity environment example

In this post, we use a simple example Unity environment called Basic. In the following visualization, the agent we’re controlling is the blue box that moves left or right. For each step it takes, it costs the agent some energy, incurring small negative rewards (-0.01). Green balls are targets with fixed locations. The agent is randomly initialized between the green balls, and collects rewards when it collides with the green balls. The large green ball offers a reward of +1, and the small green ball offers a reward of +0.1. The goal of this task is to train the agent to move towards the ball that offers the most cumulative rewards.
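If you want to poke at this environment locally before training, a quick sanity check along the following lines loads Basic from the ML-Agents registry and wraps it in a gym interface; it assumes the mlagents-envs and gym-unity packages from the Docker step are installed locally:

from mlagents_envs.registry import default_registry
from gym_unity.envs import UnityToGymWrapper

unity_env = default_registry["Basic"].make(no_graphics=True)
env = UnityToGymWrapper(unity_env)

print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())  # take one random action
env.close()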

Model training, evaluation, and deployment

In this section, we walk you through the steps to train, evaluate, and deploy models.

Writing a training script

Before launching the SageMaker RL training job, we need to specify the configurations of the training process. It’s usually achieved in a single script outside the notebook. The training script defines the input (the Unity environment) and the algorithm for RL training. The following code shows what the script looks like:

import json
import os

import gym
import ray
from ray.tune import run_experiments
from ray.tune.registry import register_env

from sagemaker_rl.ray_launcher import SageMakerRayLauncher
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.exception import UnityWorkerInUseException
from mlagents_envs.registry import default_registry
from gym_unity.envs import UnityToGymWrapper

class UnityEnvWrapper(gym.Env):
    def __init__(self, env_config):
        self.worker_index = env_config.worker_index
        if 'SM_CHANNEL_TRAIN' in os.environ:
            env_name = os.environ['SM_CHANNEL_TRAIN'] +'/'+ env_config['env_name']
            os.chmod(env_name, 0o755)
            print("Changed environment binary into executable mode.")
            # Try connecting to the Unity3D game instance.
            while True:
                try:
                    unity_env = UnityEnvironment(
                                    env_name, 
                                    no_graphics=True, 
                                    worker_id=self.worker_index, 
                                    additional_args=['-logFile', 'unity.log'])
                except UnityWorkerInUseException:
                    self.worker_index += 1
                else:
                    break
        else:
            env_name = env_config['env_name']
            while True:
                try:
                    unity_env = default_registry[env_name].make(
                        no_graphics=True,
                        worker_id=self.worker_index,
                        additional_args=['-logFile', 'unity.log'])
                except UnityWorkerInUseException:
                    self.worker_index += 1
                else:
                    break
            
        self.env = UnityToGymWrapper(unity_env) 
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)

class MyLauncher(SageMakerRayLauncher):

    def register_env_creator(self):
        register_env("unity_env", lambda config: UnityEnvWrapper(config))

    def get_experiment_config(self):
        return {
          "training": {
            "run": "PPO",
            "stop": {
              "timesteps_total": 10000,
            },
            "config": {
              "env": "unity_env",
              "gamma": 0.995,
              "kl_coeff": 1.0,
              "num_sgd_iter": 20,
              "lr": 0.0001,
              "sgd_minibatch_size": 100,
              "train_batch_size": 500,
              "monitor": True,  # Record videos.
              "model": {
                "free_log_std": True
              },
              "env_config":{
                "env_name": "Basic"
              },
              "num_workers": (self.num_cpus-1),
              "ignore_worker_failures": True,
            }
          }
        }

if __name__ == "__main__":
    MyLauncher().train_main()

The training script has two components:

  • UnityEnvWrapper – The Unity environment is stored as a binary file. To load the environment, we need to use the Unity ML-Agents Python API. UnityEnvironment takes the name of the environment and returns an interactive environment object. We then wrap the object with UnityToGymWrapper and return an object that is trainable using Ray-RLLib and SageMaker RL.
  • MyLauncher – This class inherits the SageMakerRayLauncher base class for SageMaker RL applications to use Ray-RLLib. Inside the class, we register the environment to be recognized by Ray and specify the configurations we want during training. Example hyperparameters include the name of the environment, discount factor in cumulative rewards, learning rate of the model, and number of iterations to run the model. For a full list of commonly used hyperparameters, see Common Parameters.

Training the model

After setting up the configuration and model customization, we’re ready to start the SageMaker RL training job. See the following code:

metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)
    
estimator = RLEstimator(entry_point="train-unity.py",
                        source_dir='src',
                        dependencies=["common/sagemaker_rl"],
                        image_name=custom_image_name,
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        metric_definitions=metric_definitions,
                        hyperparameters={
                            # customize Ray parameters here
                        })

estimator.fit(wait=local_mode)
job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)

Inside the code, we specify a few parameters:

  • entry_point – The path to the training script we wrote that specifies the training process
  • source_dir – The path to the directory with other training source code dependencies aside from the entry point file
  • dependencies – A list of paths to directories with additional libraries to be exported to the container

In addition, we specify the container image name, training instance information, output path, and selected metrics. We can also override any Ray-related parameters through the hyperparameters argument. Calling estimator.fit launches the SageMaker RL training job and starts the model training process based on the specifications in the training script.
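As an illustration, an override passed through the hyperparameters argument might look like the following sketch. The dotted-key convention follows the SageMaker RL Ray sample notebooks, and the specific keys and values here are illustrative; check the launcher version in your container for the exact names it accepts.

hyperparameters = {
    # Hypothetical overrides using the "rl.<section>.<key>" convention
    # from the SageMaker RL Ray samples.
    "rl.training.stop.timesteps_total": 20000,   # train for longer
    "rl.training.config.lr": 0.00005,            # lower the learning rate
}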

At a high level, the training job initiates a neural network and updates the network gradually towards the direction in which the agent collects higher reward. Through multiple trials, the agent eventually learns how to navigate to the high-rewarding location efficiently. SageMaker RL handles the entire process and allows you to view the training job status in the Training jobs page on the SageMaker console.

It’s also possible to monitor model performance by examining the training logs recorded in Amazon CloudWatch. Due to the simplicity of the task, the model completes training (10,000 agent movements), spanning roughly 800 episodes (the number of times the agent reaches a target ball), in under 1 minute. The following plot shows the average reward collected converging around 0.9. The maximum reward the agent can earn in this environment is 1, and each step costs 0.01, so a mean reward around 0.9 indicates the agent has learned a near-optimal policy and our training process was successful.

Evaluating the model

When model training is complete, we can load the trained model to evaluate its performance. Similar to the setup in the training script, we wrap the Unity environment with a gym wrapper. We then create an agent by loading the trained model.

To evaluate the model, we run the trained agent against the environment multiple times with fixed agent and target initializations, and sum the rewards the agent collects at each step to obtain a cumulative reward for each episode.
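The notebook referenced at the end of this post contains the complete evaluation code; the following is only a minimal sketch of the idea. It assumes the Ray checkpoint produced during training has been downloaded locally, the container's Ray version (pre-2.0) exposes PPOTrainer, the UnityEnvWrapper class from the training script is importable, and checkpoint_path is a placeholder.

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from mlagents_envs.registry import default_registry
from gym_unity.envs import UnityToGymWrapper

ray.init(ignore_reinit_error=True)
register_env("unity_env", lambda config: UnityEnvWrapper(config))

# Rebuild the trainer with the same model settings used during training,
# then load the downloaded checkpoint (checkpoint_path is a placeholder).
checkpoint_path = "ray_checkpoints/checkpoint-XX"  # illustrative local path
agent = PPOTrainer(env="unity_env",
                   config={"env_config": {"env_name": "Basic"},
                           "model": {"free_log_std": True},
                           "num_workers": 0})
agent.restore(checkpoint_path)

# Roll out five episodes against a separate copy of the Basic environment
# (worker_id=1 avoids a port clash with the trainer's own environment).
env = UnityToGymWrapper(
    default_registry["Basic"].make(no_graphics=True, worker_id=1))
episode_rewards = []
for _ in range(5):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.compute_action(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    episode_rewards.append(total_reward)
env.close()

print("Average episode reward:", sum(episode_rewards) / len(episode_rewards))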

Out of five episodes, the average episode reward is 0.92 with the maximum reward of 0.93 and minimum reward of 0.89, suggesting the trained model indeed performs well.

Deploying the model

We can deploy the trained RL policy with just a few lines of code using the SageMaker model deployment API. You can pass an input and get out the optimal actions based on the policy. The input shape needs to match the observation input shape from the environment.

For the Basic environment, we deploy the model and pass an input to the predictor:

import numpy as np
from sagemaker.tensorflow.model import TensorFlowModel

model = TensorFlowModel(model_data=estimator.model_data,
                        framework_version='2.1.0',
                        role=role)

predictor = model.deploy(initial_instance_count=1, 
                         instance_type=instance_type)

input = {"inputs": {'observations': np.ones(shape=(1, 20)).tolist(),
                    'prev_action': [0, 0],
                    'is_training': False,
                    'prev_reward': -1,
                    'seq_lens': -1
                   }
        }    

result = predictor.predict(input)
print(result['outputs']['actions'])

The model predicts an indicator corresponding to moving left or right. The recommended direction of movement for the blue box agent always points towards the larger green ball.

Cleaning up

When you’re finished running the model, call predictor.delete_endpoint() to delete the model deployment endpoint to avoid incurring future charges.

Customizing training algorithms, models, and environments

In addition to the preceding use case, we encourage you to explore the customization capabilities this solution supports.

In the preceding code example, we specify Proximal Policy Optimization (PPO) as the training algorithm. PPO is a popular RL algorithm that performs comparably to state-of-the-art approaches but is much simpler to implement and tune. Depending on your use case, you can choose the algorithm best suited to your training task, either by selecting from the comprehensive list of algorithms already implemented in RLLib or by building a custom algorithm from scratch.

By default, RLLib applies a pre-defined convolutional neural network or fully connected neural network. However, you can create a custom model for training and testing. Following the examples from RLLib, you can register the custom model by calling ModelCatalog.register_custom_model, then refer to the newly registered model using the custom_model argument.
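A minimal sketch of that registration flow might look like the following. The class name MyUnityModel is purely illustrative, and the forward pass is left out.

from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_modelv2 import TFModelV2

class MyUnityModel(TFModelV2):
    """Placeholder custom model; implement forward() and value_function()."""
    ...

# Make the model known to RLLib under a name of your choosing ...
ModelCatalog.register_custom_model("my_unity_model", MyUnityModel)

# ... and reference that name from the training configuration.
config = {
    "model": {
        "custom_model": "my_unity_model",
    },
}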

In our code example, we invoke a predefined Unity environment called Basic, but you can experiment with other pre-built Unity environments. However, as of this writing, our solution only supports single-agent environments. When you build a new environment, register it by calling register_env and refer to it with the env parameter.

Conclusion

In this post, we walk through how to train an RL agent to interact with Unity game environments using SageMaker RL. We use a pre-built Unity environment example for the demonstration, but encourage you to explore using custom or other pre-built Unity environments.

SageMaker RL offers a scalable and efficient way of training RL gaming agents to play game environments powered by Unity. For the notebook containing the complete code, see Unity 3D Game with Amazon SageMaker RL.

If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.

 


About the Authors

Yohei Nakayama is a Deep Learning Architect at Amazon Machine Learning Solutions Lab, where he works with customers across different verticals to accelerate their use of artificial intelligence and AWS Cloud services to solve their business challenges. He is interested in applying ML/AI technologies to the space industry.

 

 

Henry Wang is a Data Scientist at Amazon Machine Learning Solutions Lab. Prior to joining AWS, he was a graduate student at Harvard in Computational Science and Engineering, where he worked on healthcare research with reinforcement learning. In his spare time, he enjoys playing tennis and golf, reading, and watching StarCraft II tournaments.

 

 

Yijie Zhuang is a Software Engineer with Amazon SageMaker. He did his MS in Computer Engineering from Duke. His interests lie in building scalable algorithms and reinforcement learning systems. He contributed to Amazon SageMaker built-in algorithms and Amazon SageMaker RL.

Read More

Pinterest Trains Visual Search Faster with Optimized Architecture on NVIDIA GPUs

Pinterest Trains Visual Search Faster with Optimized Architecture on NVIDIA GPUs

Pinterest now has more than 440 million reasons to offer the best visual search experience. That’s how many monthly active users its popular image sharing and social media service now counts.

Visual search enables Pinterest users to search for images using text, screenshots or camera photos. It’s the core AI behind how people build their Boards of Pins — collections of images by themes —  around their interests and plans. It’s also how people on Pinterest can take action on the inspiration they discover, such as shopping and making purchases based on the products within scenes.

But tracking more than 240 billion images and 5 billion Boards is no small data trick.

This requires visual embeddings — mathematical representations of objects in a scene. Visual embeddings use models for automatically generating and evaluating visualizations to show how similar two images are — say, a sofa in a TV show’s living room compared to ones for sale at retailers.

Pinterest is improving its search results by pretraining its visual embeddings on a smaller dataset. The overall goal is to arrive at one unified visual embedding that performs well across its key business features.

Powered by NVIDIA V100 Tensor Core GPUs, this technique pre-trains Pinterest’s neural nets on a subset of about 1.3 billion images to yield improved relevancy across the wider set of hundreds of billions of images.

Improving results on the unified visual embedding in this fashion can benefit all applications on Pinterest, said Josh Beal, a machine learning researcher for Visual Search at the company.

“This model is fine-tuned on various multitask datasets. And the goal of this project was to scale the model to a large scale,” he said.

Benefitting Shop the Look 

With so many visuals, and new ones coming in all the time, Pinterest is continuously training its neural networks to identify them in relation to others.

A popular visual search feature, Pinterest’s Shop the Look enables people to shop for home and fashion items. By tapping into visual embeddings, Shop the Look can identify items in Pins and connect Pinners to those products online.

Product matches are key to its visual-driven commerce. And it isn’t an easy problem to solve at Pinterest scale.

Yet it matters. Another Pinterest visual feature is the ability to search for specific products within an image, or Pin. Improving the accuracy of recommendations with visual embeddings improves the magic factor in matches, boosting people’s experience of discovering relevant products and ideas.

An additional feature, Pinterest’s Lens camera search, aims to recommend visually relevant Pins based on the photos Pinners take with their cameras.

“Unified embedding for visual search benefits all these downstream applications,” said Beal.

Making Visual Search More Powerful

Several Pinterest teams have been working to improve visual search across the hundreds of billions of images within Pins. But given the massive scale of the effort and its cost and engineering resource constraints, Pinterest wanted to optimize its existing architecture.

With some suggested ResNeXt-101 architecture optimizations and by simply upgrading to the latest releases of NVIDIA libraries, including cuDNN v8, automatic mixed precision, and NCCL, Pinterest was able to improve the training performance of its models by over 60 percent.

NVIDIA’s GPU-accelerated libraries are constantly being updated to enable companies such as Pinterest to get more performance out of their existing hardware investment.

“It has improved the quality of the visual embedding, so that leads to more relevant results in visual search,” said Beal.

The post Pinterest Trains Visual Search Faster with Optimized Architecture on NVIDIA GPUs appeared first on The Official NVIDIA Blog.

Read More

How Rasa Open Source Gained Layers of Flexibility with TensorFlow 2.x

How Rasa Open Source Gained Layers of Flexibility with TensorFlow 2.x

A guest post by Vincent D. Warmerdam and Vladimir Vlasov, Rasa


At Rasa, we are building infrastructure for conversational AI, used by developers to build chat- and voice-based assistants. Rasa Open Source, our cornerstone product offering, provides a framework for NLU (Natural Language Understanding) and dialogue management. On the NLU side, we offer models for intent classification and entity detection built with TensorFlow 2.x.

In this article, we would like to discuss the benefits of migrating to the latest version of TensorFlow and also give insight into how some of the Rasa internals work.

A Typical Rasa Project Setup

When you’re building a virtual assistant with Rasa Open Source, you’ll usually begin by defining stories, which represent conversations users might have with your agent. These stories will serve as training data and you can configure them as yaml files. If we pretend that we’re making an assistant that allows you to buy pizzas online then we might have stories in our configuration that look like this:

version: "2.0"

stories:

- story: happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_great
  - action: utter_happy

- story: purchase path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: purchase
    entities:
    - product: "pizza"
  - action: confirm_purchase
  - intent: affirm
  - action: confirm_availability

These stories consist of intents and actions. Actions can be simple text replies, or they can trigger custom Python code (that checks a database, for instance). To define training data for each intent, you supply the assistant with example user messages, which might look something like:

version: "2.0"

nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning

- intent: purchase
  examples: |
    - i'd like to buy a [veggie pizza](product) for [tomorrow](date_ref)
    - i want to order a [pizza pepperoni](product)
    - i'd want to buy a [pizza](product) and a [cola](product)
    - ...

When you train an assistant using Rasa you’ll supply configuration files like those shown above. You can be very expressive in the types of conversations your agent can handle. Intents and actions are like lego bricks and can be combined expressively to cover many conversational paths. Once these files are defined they are combined to create a training dataset that the agent will learn from.

Rasa allows users to build custom machine learning pipelines to fit their datasets. That means you can incorporate your own (pre-trained) models for natural language understanding if you’d like. But Rasa also provides models, written in TensorFlow, that are specialized for these tasks.

Specific Model Requirements

You may have noticed that our examples include not just intents but also entities. When a user is interested in making a purchase, they (usually) also say what they’re interested in buying. This information needs to be detected when the user provides it. It’d be a bad experience if we needed to supply the user with a form to retrieve this information.

[Figure: example user messages for the greet intent vs. the purchase intent]

If you take a step back and think about what kind of model could work well here, you’ll soon recognize that it’s not a standard task. It’s not just that we have numerous labels at each utterance; we have multiple *types* of labels too. That means that we need models that have two outputs.

[Figure: types of labels, from texts to tokens]

Rasa Open Source offers a model that can detect both intents and entities, called DIET. It uses a transformer architecture that allows the system to learn from the interaction between intents and entities. Because it needs to handle these two tasks at once, the typical machine learning pattern won’t work:

model.fit(X, y).predict(X)

You need a different abstraction.

Abstraction

This is where TensorFlow 2.x has made an improvement to the Rasa codebase. It is now much easier to customize TensorFlow classes. In particular, we’ve made a custom abstraction on top of Keras to suit our needs. One example of this is Rasa’s own internal `RasaModel`. We’ve added the base class’s signature below. The full implementation can be found here.

class RasaModel(tf.keras.models.Model):

    def __init__(
        self,
        random_seed: Optional[int] = None,
        tensorboard_log_dir: Optional[Text] = None,
        tensorboard_log_level: Optional[Text] = "epoch",
        **kwargs,
    ) -> None:
        ...

    def fit(
        self,
        model_data: RasaModelData,
        epochs: int,
        batch_size: Union[List[int], int],
        evaluate_on_num_examples: int,
        evaluate_every_num_epochs: int,
        batch_strategy: Text,
        silent: bool = False,
        eager: bool = False,
    ) -> None:
        ...

This object is customized to allow us to pass in our own `RasaModelData` object. The benefit is that we can keep all the existing features that the Keras model object offers while we can override a few specific methods to suit our needs. We can run the model with our preferred data format while maintaining manual control over “eager mode,” which helps us debug.
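As an illustration of what that control can look like (a simplified sketch, not Rasa's exact implementation), the train step can be wrapped in tf.function only when eager mode is switched off:

import tensorflow as tf

def make_train_function(train_step, eager: bool = False):
    """Return the train step as plain Python (debuggable) or as a compiled graph (fast)."""
    if eager:
        return train_step            # breakpoints and print() work as expected
    return tf.function(train_step)   # traced once, then executed as a graph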

These Keras objects are now a central API in TensorFlow 2.x, which made it very easy for us to integrate and customize.

Training Loop

To give another impression of how the code became simpler, let’s look at the training loop inside the Rasa model.

Python Pseudo-Code for TensorFlow 1.8

We’ve got a part of the code used for our old training loop listed below (see here for the full implementation). Note that it is using `session.run` to calculate the loss as well as the accuracy.

def train_tf_dataset(
    train_init_op: "tf.Operation",
    eval_init_op: "tf.Operation",
    batch_size_in: "tf.Tensor",
    loss: "tf.Tensor",
    acc: "tf.Tensor",
    train_op: "tf.Tensor",
    session: "tf.Session",
    epochs: int,
    batch_size: Union[List[int], int],
    evaluate_on_num_examples: int,
    evaluate_every_num_epochs: int,
):
    session.run(tf.global_variables_initializer())
    pbar = tqdm(range(epochs), desc="Epochs", disable=is_logging_disabled())

    for ep in pbar:
        ep_batch_size = linearly_increasing_batch_size(ep, batch_size, epochs)
        session.run(train_init_op, feed_dict={batch_size_in: ep_batch_size})

        ep_train_loss = 0
        ep_train_acc = 0
        batches_per_epoch = 0
        while True:
            try:
                _, batch_train_loss, batch_train_acc = session.run(
                    [train_op, loss, acc]
                )
                batches_per_epoch += 1
                ep_train_loss += batch_train_loss
                ep_train_acc += batch_train_acc
            except tf.errors.OutOfRangeError:
                break

The train_tf_dataset function requires a lot of tensors as input. In TensorFlow 1.8, you need to keep track of these tensors because they contain all the operations you intend to run. In practice, this can lead to cumbersome code because it is hard to separate concerns.

Python Pseudo-Code for TensorFlow 2.x

In TensorFlow 2, all of this has been made much easier because of the Keras abstraction. We can inherit from a Keras class that allows us to compartmentalize the code much better. Here is the `train` method from Rasa’s DIET classifier (see here for the full implementation).

def train(
    self,
    training_data: TrainingData,
    config: Optional[RasaNLUModelConfig] = None,
    **kwargs: Any,
) -> None:
    """Train the embedding intent classifier on a data set."""

    model_data = self.preprocess_train_data(training_data)

    self.model = self.model_class()(
        config=self.component_config,
    )

    self.model.fit(
        model_data,
        self.component_config[EPOCHS],
        self.component_config[BATCH_SIZES],
        self.component_config[EVAL_NUM_EXAMPLES],
        self.component_config[EVAL_NUM_EPOCHS],
        self.component_config[BATCH_STRATEGY],
    )

The object-oriented style of programming from Keras allows us to customize more. We’re able to implement our own `self.model.fit` in such a way that we don’t need to worry about the `session` anymore. We don’t even need to keep track of the tensors because the Keras API abstracts everything away for you.

If you’re interested in the full code, you can find the old loop here and the new loop here.

An Extra Layer of Features

It’s not just the Keras models where we apply this abstraction; we’ve also developed some neural network layers using a similar technique.

We’ve implemented a few custom layers ourselves. For example, we’ve got a layer called `DenseWithSparseWeights`. It behaves just like a dense layer, but we drop many weights beforehand to make it more sparse. Again, we only need to inherit from the right class (tf.keras.layers.Dense) to create it.
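A simplified sketch of the idea looks like this; the class name and masking logic are illustrative rather than Rasa's exact implementation.

import tensorflow as tf

class SparseDense(tf.keras.layers.Dense):
    """A dense layer that keeps only a random subset of its kernel weights."""

    def __init__(self, sparsity: float = 0.8, **kwargs) -> None:
        super().__init__(**kwargs)
        self.sparsity = sparsity

    def build(self, input_shape) -> None:
        super().build(input_shape)
        # Choose once, at build time, which weights survive; the rest stay at zero.
        keep = tf.random.uniform(tf.shape(self.kernel)) >= self.sparsity
        self.kernel_mask = tf.Variable(
            tf.cast(keep, self.kernel.dtype), trainable=False, name="kernel_mask"
        )

    def call(self, inputs):
        # Zero out the dropped weights before the usual dense computation.
        self.kernel.assign(self.kernel * self.kernel_mask)
        return super().call(inputs)

# Usage: layer = SparseDense(units=64, sparsity=0.8)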

[Figure: a normal dense layer vs. a dense layer with sparse weights]

We’ve grown so fond of customizing that we’ve even implemented a loss function as a layer. This made a lot of sense for us, considering that losses can get complex in NLP. Many NLP tasks require you to sample negative examples so that their labels are also available during training, and you may need to mask tokens during the process. We’re also interested in recording the similarity loss as well as the label accuracy. By making our own layer, we build components that are reusable and easy to maintain.
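A toy version of that pattern, much simpler than Rasa's real loss layers, could look like this:

import tensorflow as tf

class SimilarityLossLayer(tf.keras.layers.Layer):
    """Computes a similarity-based loss and registers it with the enclosing model."""

    def call(self, inputs_embed, labels_embed):
        # Cosine similarity between input embeddings and label embeddings.
        sim = tf.reduce_sum(
            tf.nn.l2_normalize(inputs_embed, axis=-1)
            * tf.nn.l2_normalize(labels_embed, axis=-1),
            axis=-1,
        )
        loss = tf.reduce_mean(1.0 - sim)
        # add_loss makes the value visible to Keras training loops and metrics.
        self.add_loss(loss)
        return loss, sim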

[Figure: a loss implemented as a custom layer]

Lessons Learned

Discovering this opportunity for customization made a massive difference for Rasa. We like to design our algorithms to be flexible and applicable in many circumstances, and we were happy to learn that the underlying technology stack allowed us to do so. We do have some advice for folks who are working on their TensorFlow migration:

  1. Start by thinking about what “lego bricks” you need in your application. This mental design step will make it much easier to recognize how you can leverage existing Keras/TensorFlow objects for your use-case.
  2. It can be tempting to immerse yourself by going for a deep dive immediately. Instead, it may help to start from a working example and drill down from there. TensorFlow is not an average Python package, and the internals can get complex. The Python code that you write has to interoperate with C++ to keep the tensor operations performant. Once the code works, you’re in a much better place to start tuning and optimizing with the new TensorFlow version’s performance features.

Read More