Accelerate disaster response with computer vision for satellite imagery using Amazon SageMaker and Amazon Augmented AI

Accelerate disaster response with computer vision for satellite imagery using Amazon SageMaker and Amazon Augmented AI

Flooding Image

In recent years, advances in computer vision have enabled researchers, first responders, and governments to tackle the challenging problem of processing global satellite imagery to understand our planet and our impact on it. AWS recently released Amazon SageMaker geospatial capabilities to provide you with satellite imagery and geospatial state-of-the-art machine learning (ML) models, reducing barriers for these types of use cases. For more information, refer to Preview: Use Amazon SageMaker to Build, Train, and Deploy ML Models Using Geospatial Data.

Many agencies, including first responders, are using these offerings to gain large-scale situational awareness and prioritize relief efforts in geographical areas that have been struck by natural disasters. Often these agencies are dealing with disaster imagery from low altitude and satellite sources, and this data is often unlabeled and difficult to use. State-of-the-art computer vision models often underperform when looking at satellite images of a city that a hurricane or wildfire has hit. Given the lack of these datasets, even state-of-the-art ML models are often unable to deliver the accuracy and precision required to predict standard FEMA disaster classifications.

Geospatial datasets contain useful metadata such as latitude and longitude coordinates, and timestamps, which can provide context for these images. This is especially helpful in improving the accuracy of geospatial ML for disaster scenes, because these images are inherently messy and chaotic. Buildings are less rectangular, vegetation has sustained damage, and linear roads have been interrupted by flooding or mudslides. Because labeling these massive datasets is expensive, manual, and time-consuming, the development of ML models that can automate image labeling and annotation is critical.

To train this model, we need a labeled ground truth subset of the Low Altitude Disaster Imagery (LADI) dataset. This dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2019. These LADI datasets focus on the Atlantic hurricane seasons and coastal states along the Atlantic Ocean and Gulf of Mexico. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets. The teams used existing FEMA categories for damage such as flooding, debris, fire and smoke, or landslides, which standardized the label categories. The solution is then able to make predictions on the rest of the training data, and route lower-confidence results for human review.

In this post, we describe our design and implementation of the solution, best practices, and the key components of the system architecture.

Solution overview

In brief, the solution involved building three pipelines:

  • Data pipeline – Extracts the metadata of the images
  • Machine learning pipeline – Classifies and labels images
  • Human-in-the-loop review pipeline – Uses a human team to review results

The following diagram illustrates the solution architecture.

Solution Architecture

Given the nature of a labeling system like this, we designed a horizontally scalable architecture that would handle ingestion spikes without over-provisioning by using a serverless architecture. We use a one-to-many pattern from Amazon Simple Queue Service (Amazon SQS) to AWS Lambda in multiple spots to support these ingestion spikes, offering resiliency.

Using an SQS queue for processing Amazon Simple Storage Service (Amazon S3) events helps us control the concurrency of downstream processing (Lambda functions, in this case) and handle the incoming spikes of data. Queuing incoming messages also acts as a buffer storage in case of any failures downstream.

Given the highly parallel needs, we chose Lambda to process our images. Lambda is a serverless compute service that lets us run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, and managing runtimes.

We use Amazon OpenSearch Service as our central data store to take advantage of its highly scalable, fast searches and integrated visualization tool, OpenSearch Dashboards. It enables us to iteratively add context to the image, without having to recompile or rescale, and handle schema evolution.

Amazon Rekognition makes it easy to add image and video analysis into our applications, using proven, highly scalable, deep learning technology. With Amazon Rekognition, we get a good baseline of detected objects.

In the following sections, we dive into each pipeline in more detail.

Data pipeline

The following diagram shows the workflow of the data pipeline.

Data Pipeline

The LADI data pipeline starts with the ingestion of raw data images from the FEMA Common Alerting Protocol (CAP) into an S3 bucket. As we ingest the images into the raw data bucket, they are processed in near-real time in two steps:

  1. The S3 bucket triggers event notifications for all object creations, creating messages in the SQS queue for each image ingested.
  2. The SQS queue concurrently invokes the preprocessing Lambda functions on the image.

The Lambda functions perform the following preprocessing steps:

  1. Compute the UUID for each image, providing a unique identifier for each image. This ID will identify the image for its entire lifecycle.
  2. Extract metadata such as GPS coordinates, image size, GIS information, and S3 location from the image and persist it into OpenSearch.
  3. Based on a lookup against FIPS codes, the function moves the image into the curated data S3 bucket. We partition the data by the image’s FIPS-State-code/FIPS-County-code/Year/Month.

Machine learning pipeline

The ML pipeline starts from the images landing in the curated data S3 bucket in the data pipeline step, which triggers the following steps:

  1. Amazon S3 generates a message into another SQS queue for each object created in the curated data S3 bucket.
  2. The SQS queue concurrently triggers Lambda functions to run the ML inference job on the image.

The Lambda functions perform the following actions:

  1. Send each image to Amazon Rekognition for object detection, storing the returned labels and respective confidence scores.
  2. Compose the Amazon Rekognition output into input parameters for our Amazon SageMaker multi-model endpoint. This endpoint hosts our ensemble of classifiers, which are trained for specific sets of damage labels.
  3. Pass the results of the SageMaker endpoint to Amazon Augmented AI (Amazon A2I).

The following diagram illustrates the pipeline workflow.

Machine Learning Pipeline

Human-in-the-loop review pipeline

The following diagram illustrates the human-in-the-loop (HIL) pipeline.

HIL pipeline

With Amazon A2I, we can configure thresholds that will trigger a human review by a private team when a model yields a low confidence prediction. We can also use Amazon A2I to provide an ongoing audit of our model’s predictions. The workflow steps are as follows:

  1. Amazon A2I routes high confidence predictions into OpenSearch Service, updating the image’s label data.
  2. Amazon A2I routes low confidence predictions to the private team to annotate images manually.
  3. The human reviewer completes the annotation, generating a human annotation output file that is stored in the HIL Output S3 bucket.
  4. The HIL Output S3 bucket triggers a Lambda function that parses the human annotations output and updates the image’s data in OpenSearch Service.

By routing the human annotation results back to the data store, we can retrain the ensemble models and iteratively improve the model’s accuracy.

With our high-quality results now stored in OpenSearch Service, we’re able to perform geospatial and temporal search via a REST API, using Amazon API Gateway and Geoserver. OpenSearch Dashboard also enables users to search and run analytics with this dataset.

Results

The following code shows an example of our results.

Output snapshot

With this new pipeline, we create a human backstop for models that are not yet fully performant. This new ML pipeline has been put into production for use with a Civil Air Patrol Image Filter microservice that allows for filtering of Civil Air Patrol images in Puerto Rico. This enables first responders to view the extent of damage and view images associated with that damage following hurricanes. The AWS Data Lab, AWS Open Data Program, Amazon Disaster Response team, and AWS human-in-the-loop team worked with customers to develop an open-source pipeline that can be used to analyze Civil Air Patrol data stored in the Open Data Program registry on demand following any natural disaster. For more information about the pipeline architecture and an overview of the collaboration and impact, check out the video Focusing on disaster response with Amazon Augmented AI, the AWS Open Data Program and AWS Snowball.

Conclusion

As climate change continues to increase the frequency and intensity of storms and wildfires, we continue to see the importance of using ML to understand the impact of these events on local communities. These new tools can accelerate disaster response efforts and allow us to use the data from these post-event analyses to improve the prediction accuracy of these models with active learning. These new ML models can automate data annotation, which enables us to infer the extent of damage from each of these events as we overlay damage labels with map data. That cumulative data can also help improve our ability to predict damage for future disaster events, which can inform mitigation strategies. This in turn can improve the resilience of communities, economies, and ecosystems by giving decision-makers the information they need to develop data-driven polices to address these emerging environmental threats.

In this blog post we discussed using computer vision on satellite imagery. This solution is meant to be a reference architecture or a quick start guide that you can customize for your own needs.

Give it a whirl and let us know how this solved your use case by leaving feedback in the comments section. For more information, see Amazon SageMaker geospatial capabilities.


About the Authors

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Morgan Dutton is a Senior Technical Program Manager with the Amazon Augmented AI and Amazon SageMaker Ground Truth team. She works with enterprise, academic and public sector customers to accelerate adoption of machine learning and human-in-the-loop ML services.

Sandeep Verma is a Sr. Prototyping Architect with AWS. He enjoys diving deep into customer challenges and building prototypes for customers to accelerate innovation. He has a background in AI/ML, founder of New Knowledge, and generally passionate about tech. In his free time, he loves traveling and skiing with his family.

Read More

Achieve high performance at scale for model serving using Amazon SageMaker multi-model endpoints with GPU

Achieve high performance at scale for model serving using Amazon SageMaker multi-model endpoints with GPU

Amazon SageMaker multi-model endpoints (MMEs) provide a scalable and cost-effective way to deploy a large number of machine learning (ML) models. It gives you the ability to deploy multiple ML models in a single serving container behind a single endpoint. From there, SageMaker manages loading and unloading the models and scaling resources on your behalf based on your traffic patterns. You will benefit from sharing and reusing hosting resources and a reduced operational burden of managing a large quantity of models.

In November 2022, MMEs added support for GPUs, which allows you to run multiple models on a single GPU device and scale GPU instances behind a single endpoint. This satisfies the strong MME demand for deep neural network (DNN) models that benefit from accelerated compute with GPUs. These include computer vision (CV), natural language processing (NLP), and generative AI models. The reasons for the demand include the following:

  • DNN models are typically large in size and complexity and continue growing at a rapid pace. Taking NLP models as an example, many of them exceed billions of parameters, which requires GPUs to satisfy low latency and high throughput requirements.
  • We have observed an increased need for customizing these models to deliver hyper-personalized experiences to individual users. As the quantity of these models increases, there is a need for an easier solution to deploy and operationalize many models at scale.
  • GPU instances are expensive and you want to reuse these instances as much as possible to maximize the GPU utilization and reduce operating cost.

Although all these reasons point to MMEs with GPU as an ideal option for DNN models, it’s advised to perform load testing to find the right endpoint configuration that satisfies your use case requirements. Many factors can influence the load testing results, such as instance type, number of instances, model size, and model architecture. In addition, load testing can help guide the auto scaling strategies using the right metrics rather than iterative trial and error methods.

For those reasons, we put together this post to help you perform proper load testing on MMEs with GPU and find the best configuration for your ML use case. We share our load testing results for some of the most popular DNN models in NLP and CV hosted using MMEs on different instance types. We summarize the insights and conclusion from our test results to help you make an informed decision on configuring your own deployments. Along the way, we also share our recommended approach to performing load testing for MMEs on GPU. The tools and technique recommended determine the optimum number of models that can be loaded per instance type and help you achieve the best price-performance.

Solution overview

For an introduction to MMEs and MMEs with GPU, refer to Create a Multi-Model Endpoint and Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints. For the context of load testing in this post, you can download our sample code from the GitHub repo to reproduce the results or use it as a template to benchmark your own models. There are two notebooks provided in the repo: one for load testing CV models and another for NLP. Several models of varying sizes and architectures were benchmarked on different type of GPU instances: ml.g4dn.2xlarge, ml.g5.2xlarge, and ml.p3.2xlarge. This should provide a reasonable cross section of performance across the following metrics for each instance and model type:

  • Max number of models that can be loaded into GPU memory
  • End-to-end response latency observed on the client side for each inference query
  • Max throughput of queries per second that the endpoint can process without error
  • Max current users per instances before a failed request is observed

The following table lists the models tested.

Use Case Model Name Size On Disk Number of Parameters
CV resnet50 100Mb 25M
CV convnext_base 352Mb 88M
CV vit_large_patch16_224 1.2Gb 304M
NLP bert-base-uncased 436Mb 109M
NLP roberta-large 1.3Gb 335M

The following table lists the GPU instances tested.

Instance Type GPU Type Num of GPUs GPU Memory (GiB)
ml.g4dn.2xlarge NVIDIA T4 GPUs 1 16
ml.g5.2xlarge NVIDIA A10G Tensor Core GPU 1 24
ml.p3.2xlarge NVIDIA® V100 Tensor Core GPU 1 16

As previously mentioned, the code example can be adopted to other models and instance types.

Note that MMEs currently only support single GPU instances. For the list of supported instance types, refer to Supported algorithms, frameworks, and instances.

The benchmarking procedure is comprised of the following steps:

  1. Retrieve a pre-trained model from a model hub.
  2. Prepare the model artifact for serving on SageMaker MMEs (see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints for more details).
  3. Deploy a SageMaker MME on a GPU instance.
  4. Determine the maximum number of models that can be loaded into the GPU memory within a specified threshold.
  5. Use the Locust Load Testing Framework to simulate traffic that randomly invokes models loaded on the instance.
  6. Collect data and analyze the results.
  7. Optionally, repeat Steps 2–6 after compiling the model to TensorRT.

Steps 4 and 5 warrant a deeper look. Models within a SageMaker GPU MME are loaded into memory in a dynamic fashion. Therefore, in Step 4, we upload an initial model artifact to Amazon Simple Storage Service (Amazon S3) and invoke the model to load it into memory. After the initial invocation, we measure the amount of GPU memory consumed, make a copy of the initial model, invoke the copy of the model to load it into memory, and again measure the total amount of GPU memory consumed. This process is repeated until a specified percent threshold of GPU memory utilization is reached. For the benchmark, we set the threshold to 90% to provide a reasonable memory buffer for inferencing on larger batches or leaving some space to load other less-frequently used models.

Simulate user traffic

After we have determined the number of models, we can run a load test using the Locust Load Testing Framework. The load test simulates user requests to random models and automatically measures metrics such as response latency and throughput.

Locust supports custom load test shapes that allow you to define custom traffic patterns. The shape that was used in this benchmark is shown in the following chart. In the first 30 seconds, the endpoint is warmed up with 10 concurrent users. After 30 seconds, new users are spawned at a rate of two per second, reaching 20 concurrent users at the 40-second mark. The endpoint is then benchmarked steadily with 20 concurrent users until the 60-second mark, at which point Locust again begins to ramp up users at two per second until 40 concurrent users. This pattern of ramping up and steady testing is repeated until the endpoint is ramped up to 200 concurrent users. Depending on your use case, you may want to adjust the load test shape in the locust_benchmark_sm.py to more accurately reflect your expected traffic patterns. For example, if you intend to host larger language models, a load test with 200 concurrent users may not be feasible for a model hosted on a single instance, and you may therefore want to reduce the user count or increase the number of instances. You may also want to extend the duration of the load test to more accurately gauge the endpoint’s stability over a longer period of time.

stages = [
{"duration": 30, "users": 10, "spawn_rate": 5},
{"duration": 60, "users": 20, "spawn_rate": 1},
{"duration": 90, "users": 40, "spawn_rate": 2},
…
]

Note that we have only benchmarked the endpoint with homogeneous models all running on a consistent serving bases using either PyTorch or TensorRT. This is because MMEs are best suited for hosting many models with similar characteristics, such as memory consumption and response time. The benchmarking templates provided in the GitHub repo can still be used to determine whether serving heterogeneous models on MMEs would yield the desired performance and stability.

Benchmark results for CV models

Use the cv-benchmark.ipynb notebook to run load testing for computer vision models. You can adjust the pre-trained model name and instance type parameters to performance load testing on different model and instance type combinations. We purposely tested three CV models in different size ranges from smallest to largest: resnet50 (25M), convnext_base (88M), and vit_large_patch16_224 (304M). You may need to adjust to code if you pick a model outside of this list. additionally, the notebook defaults the input image shape to a 224x224x3 image tensor. Remember to adjust the input shape accordingly if you need to benchmark models that take a different-sized image.

After running through the entire notebook, you will get several performance analysis visualizations. The first two detail the model performance with respect to increasing concurrent users. The following figures are the example visualizations generated for the ResNet50 model running on ml.g4dn.2xlarge, comparing PyTorch (left) vs. TensorRT (right). The top line graphs show the model latency and throughput on the y-axis with increasing numbers of concurrent client workers reflected on the x-axis. The bottom bar charts show the count of successful and failed requests.

Looking across all the computer vision models we tested, we observed the following:

  • Latency (in milliseconds) is higher, and throughput (requests per second) is lower for bigger models (resnet50 > convnext_base > vit_large_patch16_224).
  • Latency increase is proportional with the number of users as more requests are queued up on the inference server.
  • Large models consume more compute resources and can reach their maximum throughput limits with fewer users than a smaller model. This is observed with the vit_large_patch16_224 model, which recorded the first failed request at 140 concurrent users. Being significantly larger than the other two models tested, it had the most overall failed requests at higher concurrency as well. This is a clear signal that the endpoint would need to scale beyond a single instance if the intent is to support more than 140 concurrent users.

At the end of the notebook run, you also get a summary comparison of PyTorch vs. TensorRT models for each of the four key metrics. From our benchmark testing, the CV models all saw a boost in model performance after TensorRT compilation. Taking our ResNet50 model as the example again, latency decreased by 32% while throughput increased by 18%. Although the maximum number of concurrent users stayed the same for ResNet50, the other two models both saw a 14% improvement in the number of concurrent users that they can support. The TensorRT performance improvement, however, came at the expense of higher memory utilization, resulting in fewer models loaded by MMEs. The impact is more for models using a convolutional neural network (CNN). In fact, our ResNet50 model consumed approximately twice the GPU memory going from PyTorch to TensorRT, resulting in 50% fewer models loaded (46 vs. 23). We diagnose this behavior further in the following section.

Benchmark results for NLP models

For the NLP models, use the nlp-benchmark.ipynb notebook to run the load test. The setup of the notebook should look very similar. We tested two NLP models: bert-base-uncased (109M) and roberta-large (335M). The pre-trained model and the tokenizer are both downloaded from the Hugging Face hub, and the test payload is generated from the tokenizer using a sample string. Max sequence length is defaulted at 128. If you need to test longer strings, remember to adjust that parameter. Running through the NLP notebook generates the same set of visualizations: Pytorch (left) vs TensorRT (right).


From these, we observed even more performance benefit of TensorRT for NLP models. Taking the roberta-large model on an ml.g4dn.2xlarge instance for example, inference latency decreased dramatically from 180 milliseconds to 56 milliseconds (a 70% improvement), while throughput improved by 406% from 33 requests per second to 167. Additionally, the maximum number of concurrent users increased by 50%; failed requests were not observed until we reached 180 concurrent users, compared to 120 for the original PyTorch model. In terms of memory utilization, we saw one fewer model loaded for TensorRT (from nine models to eight). However, the negative impact is much smaller compared to what we observed with the CNN-based models.

Analysis on memory utilization

The following table shows the full analysis on memory utilization impact going from PyTorch to TensorRT. We mentioned earlier that CNN-based models are impacted more negatively. The ResNet50 model had an over 50% reduction in number of models loaded across all three GPU instance types. Convnext_base had an even larger reduction at approximately 70% across the board. On the other hand, the impact to the transformer models is small or mixed. vit_large_patch16_224 and roberta-large had an average reduction of approximately 20% and 3%, respectively, while bert-base-uncased had an approximately 40% improvement.

Looking at all the data points as a whole in regards to the superior performance in latency, throughput, and reliability, and the minor impact on the maximum number of models loaded, we recommend the TensorRT model for transformer-based model architectures. For CNNs, we believe further cost performance analysis is needed to make sure the performance benefit outweighs the cost of additional hosting infrastructure.

ML Use Case Architecture Model Name Instance Type Framework Max Models Loaded Diff (%) Avg. Diff (%)
CV CNN Resnet50 ml.g4dn.2xlarge PyTorch 46 -50% -50%
TensorRT 23
ml.g5.2xlarge PyTorch 70 -51%
TensorRT 34
ml.p3.2xlarge PyTorch 49 -51%
TensorRT 24
Convnext_base ml.g4dn.2xlarge PyTorch 33 -50% -70%
TensorRT 10
ml.g5.2xlarge PyTorch 50 -70%
TensorRT 16
ml.p3.2xlarge PyTorch 35 -69%
TensorRT 11
Transformer vit_large_patch16_224 ml.g4dn.2xlarge PyTorch 10 -30% -20%
TensorRT 7
ml.g5.2xlarge PyTorch 15 -13%
TensorRT 13
ml.p3.2xlarge PyTorch 11 -18%
TensorRT 9
NLP Roberta-large ml.g4dn.2xlarge PyTorch 9 -11% -3%
TensorRT 8
ml.g5.2xlarge PyTorch 13 0%
TensorRT 13
ml.p3.2xlarge PyTorch 9 0%
TensorRT 9
Bert-base-uncased ml.g4dn.2xlarge PyTorch 26 62% 40%
TensorRT 42
ml.g5.2xlarge PyTorch 39 28%
TensorRT 50
ml.p3.2xlarge PyTorch 28 29%
TensorRT 36

The following tables list our complete benchmark results for all the metrics across all three GPU instances types.

ml.g4dn.2xlarge

Use Case Architecture Model Name Number of Parameters Framework Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users Diff (%)
CV CNN resnet50 25M PyTorch 46 -50% 164 -32% 120 18% 180 NA
TensorRT 23 . 111 . 142 . 180 .
convnext_base 88M PyTorch 33 -70% 154 -22% 64 102% 140 14%
TensorRT 10 . 120 . 129 . 160 .
Transformer vit_large_patch16_224 304M PyTorch 10 -30% 425 -69% 26 304% 140 14%
TensorRT 7 . 131 . 105 . 160 .
NLP bert-base-uncased 109M PyTorch 26 62% 70 -39% 105 142% 140 29%
TensorRT 42 . 43 . 254 . 180 .
roberta-large 335M PyTorch 9 -11% 187 -70% 33 406% 120 50%
TensorRT 8 . 56 . 167 . 180 .

ml.g5.2xlarge

Use Case Architecture Model Name Number of Parameters Framework Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users Diff (%)
CV CNN resnet50 25M PyTorch 70 -51% 159 -31% 146 14% 180 11%
TensorRT 34 . 110 . 166 . 200 .
convnext_base 88M PyTorch 50 -68% 149 -23% 134 13% 180 0%
TensorRT 16 . 115 . 152 . 180 .
Transformer vit_large_patch16_224 304M PyTorch 15 -13% 149 -22% 105 35% 160 25%
TensorRT 13 . 116 . 142 . 200 .
NLP bert-base-uncased 109M PyTorch 39 28% 65 -29% 183 38% 180 11%
TensorRT 50 . 46 . 253 . 200 .
roberta-large 335M PyTorch 13 0% 97 -38% 121 46% 140 14%
TensorRT 13 . 60 . 177 . 160 .

ml.p3.2xlarge

Use Case Architecture Model Name Number of Parameters Framework Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users Diff (%)
CV CNN resnet50 25M PyTorch 49 -51% 197 -41% 94 18% 160 -12%
TensorRT 24 . 117 . 111 . 140 .
convnext_base 88M PyTorch 35 -69% 178 -23% 89 11% 140 14%
TensorRT 11 .137 137 . 99 . 160 .
Transformer vit_large_patch16_224 304M PyTorch 11 -18% 186 -28% 83 23% 140 29%
TensorRT 9 . 134 . 102 . 180 .
NLP bert-base-uncased 109M PyTorch 28 29% 77 -40% 133 59% 140 43%
TensorRT 36 . 46 . 212 . 200 .
roberta-large 335M PyTorch 9 0% 108 -44% 88 60% 160 0%
TensorRT 9 . 61 . 141 . 160 .

The following table summarizes the results across all instance types. The ml.g5.2xlarge instance provides the best performance, whereas the ml.p3.2xlarge instance generally underperforms despite being the most expensive of the three. The g5 and g4dn instances demonstrate the best value for inference workloads.

Use Case Architecture Model Name Number of Parameters Framework Instance Type Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users
CV CNN resnet50 25M PyTorch ml.g5.2xlarge 70 . 159 . 146 . 180
. . . . . ml.p3.2xlarge 49 . 197 . 94 . 160
. . . . . ml.g4dn.2xlarge 46 . 164 . 120 . 180
CV CN resnet50 25M TensorRT ml.g5.2xlarge 34 -51% 110 -31% 166 14% 200
. . . . . ml.p3.2xlarge 24 -51% 117 -41% 111 18% 200
. . . . . ml.g4dn.2xlarge 23 -50% 111 -32% 142 18% 180
NLP Transformer bert-base-uncased 109M Pytorch ml.g5.2xlarge 39 . 65 . 183 . 180
. . . . . ml.p3.2xlarge 28 . 77 . 133 . 140
. . . . . ml.g4dn.2xlarge 26 . 70 . 105 . 140
NLP Transformer bert-base-uncased 109M TensorRT ml.g5.2xlarge 50 28% 46 -29% 253 38% 200
. . . . . ml.p3.2xlarge 36 29% 46 -40% 212 59% 200
. . . . . ml.g4dn.2xlarge 42 62% 43 -39% 254 142% 180

Clean up

After you complete your load test, clean up the generated resources to avoid incurring additional charges. The main resources are the SageMaker endpoints and model artifact files in Amazon S3. To make it easy for you, the notebook files have the following cleanup code to help you delete them:

delete_endpoint(sm_client, sm_model_name, endpoint_config_name, endpoint_name)

! aws s3 rm --recursive {trt_mme_path}

Conclusion

In this post, we shared our test results and analysis for various deep neural network models running on SageMaker multi-model endpoints with GPU. The results and insights we shared should provide a reasonable cross section of performance across different metrics and instance types. In the process, we also introduced our recommended approach to run benchmark testing for SageMaker MMEs with GPU. The tools and sample code we provided can help you quickstart your benchmark testing and make a more informed decision on how to cost-effectively host hundreds of DNN models on accelerated compute hardware. To get started with benchmarking your own models with MME support for GPU, refer to Supported algorithms, frameworks, and instances and the GitHub repo for additional examples and documentation.


About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS. helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Read More

A vision-language approach for foundational UI understanding

A vision-language approach for foundational UI understanding

The computational understanding of user interfaces (UI) is a key step towards achieving intelligent UI behaviors. Previously, we investigated various UI modeling tasks, including widget captioning, screen summarization, and command grounding, that address diverse interaction scenarios such as automation and accessibility. We also demonstrated how machine learning can help user experience practitioners improve UI quality by diagnosing tappability confusion and providing insights for improving UI design. These works along with those developed by others in the field have showcased how deep neural networks can potentially transform end user experiences and the interaction design practice.

With these successes in addressing individual UI tasks, a natural question is whether we can obtain foundational understandings of UIs that can benefit specific UI tasks. As our first attempt to answer this question, we developed a multi-task model to address a range of UI tasks simultaneously. Although the work made some progress, a few challenges remain. Previous UI models heavily rely on UI view hierarchies — i.e., the structure or metadata of a mobile UI screen like the Document Object Model for a webpage — that allow a model to directly acquire detailed information of UI objects on the screen (e.g., their types, text content and positions). This metadata has given previous models advantages over their vision-only counterparts. However, view hierarchies are not always accessible, and are often corrupted with missing object descriptions or misaligned structure information. As a result, despite the short-term gains from using view hierarchies, it may ultimately hamper the model performance and applicability. In addition, previous models had to deal with heterogeneous information across datasets and UI tasks, which often resulted in complex model architectures that were difficult to scale or generalize across tasks.

In “Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus”, accepted for publication at ICLR 2023, we present a vision-only approach that aims to achieve general UI understanding completely from raw pixels. We introduce a unified approach to represent diverse UI tasks, the information for which can be universally represented by two core modalities: vision and language. The vision modality captures what a person would see from a UI screen, and the language modality can be natural language or any token sequences related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding and tappability prediction.

Spotlight Model

The Spotlight model input includes a tuple of three items: the screenshot, the region of interest on the screen, and the text description of the task. The output is a text description or response about the region of interest. This simple input and output representation of the model is expressive to capture various UI tasks and allows scalable model architectures. This model design allows a spectrum of learning strategies and setups, from task-specific fine-tuning, to multi-task learning and to few-shot learning. The Spotlight model, as illustrated in the above figure, leverages existing architecture building blocks such as ViT and T5 that are pre-trained in the high-resourced, general vision-language domain, which allows us to build on top of the success of these general domain models.

Because UI tasks are often concerned with a specific object or area on the screen, which requires a model to be able to focus on the object or area of interest, we introduce a Focus Region Extractor to a vision-language model that enables the model to concentrate on the region in light of the screen context.

In particular, we design a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings by using attention queries generated from the bounding box of the region (see paper for more details). Specifically, each coordinate (a scalar value, i.e., the left, top, right or bottom) of the bounding box, denoted as a yellow box on the screenshot, is first embedded via a multilayer perceptron (MLP) as a collection of dense vectors, and then fed to a Transformer model along their coordinate-type embedding. The dense vectors and their corresponding coordinate-type embeddings are color coded to indicate their affiliation with each coordinate value. Coordinate queries then attend to screen encodings output by ViT via cross attention, and the final attention output of the Transformer is used as the region representation for the downstream decoding by T5.

A target region on the screen is summarized by using its bounding box to query into screen encodings from ViT via attentional mechanisms.

Results

We pre-train the Spotlight model using two unlabeled datasets (an internal dataset based on C4 corpus and an internal mobile dataset) with 2.5 million mobile UI screens and 80 million web pages. We then separately fine-tune the pre-trained model for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For widget captioning and screen summarization tasks, we report CIDEr scores, which measure how similar a model text description is to a set of references created by human raters. For command grounding, we report accuracy that measures the percentage of times the model successfully locates a target object in response to a user command. For tappability prediction, we report F1 scores that measure the model’s ability to tell tappable objects from untappable ones.

In this experiment, we compare Spotlight with several benchmark models. Widget Caption uses view hierarchy and the image of each UI object to generate a text description for the object. Similarly, Screen2Words uses view hierarchy and the screenshot as well as auxiliary features (e.g., app description) to generate a summary for the screen. In the same vein, VUT combines screenshots and view hierarchies for performing multiple tasks. Finally, the original Tappability model leverages object metadata from view hierarchy and the screenshot to predict object tappability. Taperception, a follow-up model of Tappability, uses a vision-only tappability prediction approach. We examine two Spotlight model variants with respect to the size of its ViT building block, including B/16 and L/16. Spotlight drastically exceeded the state-of-the-art across four UI modeling tasks.

Model       Captioning       Summarization       Grounding       Tappability      
Baselines    Widget Caption       97                        
Screen2Words             61.3                  
VUT       99.3       65.6       82.1            
Taperception                         85.5      
Tappability                         87.9      
Spotlight    B/16       136.6       103.5       95.7       86.9      
L/16       141.8       106.7       95.8       88.4      

We then pursue a more challenging setup where we ask the model to learn multiple tasks simultaneously because a multi-task model can substantially reduce model footprint. As shown in the table below, the experiments showed that our model still performs competitively.

Model       Captioning       Summarization       Grounding       Tappability
VUT multi-task       99.3       65.1       80.8            
Spotlight B/16       140       102.7       90.8       89.4      
Spotlight L/16       141.3       99.2       94.2       89.5      

To understand how the Region Summarizer enables Spotlight to focus on a target region and relevant areas on the screen, we analyze the attention weights (which indicate where the model attention is on the screenshot) for both widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts “select Chelsea team” for the checkbox on the left side, highlighted with a red bounding box. We can see from its attention heatmap (which illustrates the distribution of attention weights) on the right that the model learns to attend to not only the target region of the check box, but also the text “Chelsea” on the far left to generate the caption. For the screen summarization example, the model predicts “page displaying the tutorial of a learning app” given the screenshot on the left. In this example, the target region is the entire screen, and the model learns to attend to important parts on the screen for summarization.

For the widget captioning task, the attention heatmap shows the model attending to the checkbox, i.e., the target object, and the text label on its left when generating a caption for the object. The red bounding box in the figure is for illustration purposes.
For the screen summarization task that the target region encloses the entire screen, the attention heatmap shows the model attending to various locations on the screen that contribute to generating the summary.

Conclusion

We demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as the input, and establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility, automation to interaction design and evaluation. Our vision-only approach for mobile UI understanding alleviates the need to use view hierarchy, allows the architecture to easily scale and benefits from the success of large vision-language models pre-trained for the general domain. Compared to recent large vision-language model efforts such as Flamingo and PaLI, Spotlight is relatively small and our experiments show the trend that larger models yield better performance. Spotlight can be easily applied to more UI tasks and potentially advance the fronts of many interaction and user experience tasks.

Acknowledgment

We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback for proofreading the paper. Thanks to Tom Small for his help in creating animated figures in this post.

Read More

Planning for AGI and beyond

Planning for AGI and beyond

Planning for AGI and beyond

Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.

If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility.

AGI has the potential to give everyone incredible new capabilities; we can imagine a world where all of us have access to help with almost any cognitive task, providing a great force multiplier for human ingenuity and creativity.

On the other hand, AGI would also come with serious risk of misuse, drastic accidents, and societal disruption. Because the upside of AGI is so great, we do not believe it is possible or desirable for society to stop its development forever; instead, society and the developers of AGI have to figure out how to get it right.[1]

Although we cannot predict exactly what will happen, and of course our current progress could hit a wall, we can articulate the principles we care about most:

  1. We want AGI to empower humanity to maximally flourish in the universe. We don’t expect the future to be an unqualified utopia, but we want to maximize the good and minimize the bad, and for AGI to be an amplifier of humanity.
  2. We want the benefits of, access to, and governance of AGI to be widely and fairly shared.
  3. We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios.

The short term

There are several things we think are important to do now to prepare for AGI.

First, as we create successively more powerful systems, we want to deploy them and gain experience with operating them in the real world. We believe this is the best way to carefully steward AGI into existence—a gradual transition to a world with AGI is better than a sudden one. We expect powerful AI to make the rate of progress in the world much faster, and we think it’s better to adjust to this incrementally.

A gradual transition gives people, policymakers, and institutions time to understand what’s happening, personally experience the benefits and downsides of these systems, adapt our economy, and to put regulation in place. It also allows for society and AI to co-evolve, and for people collectively to figure out what they want while the stakes are relatively low.

We currently believe the best way to successfully navigate AI deployment challenges is with a tight feedback loop of rapid learning and careful iteration. Society will face major questions about what AI systems are allowed to do, how to combat bias, how to deal with job displacement, and more. The optimal decisions will depend on the path the technology takes, and like any new field, most expert predictions have been wrong so far. This makes planning in a vacuum very difficult.[2]

Generally speaking, we think more usage of AI in the world will lead to good, and want to promote it (by putting models in our API, open-sourcing them, etc.). We believe that democratized access will also lead to more and better research, decentralized power, more benefits, and a broader set of people contributing new ideas.

As our systems get closer to AGI, we are becoming increasingly cautious with the creation and deployment of our models. Our decisions will require much more caution than society usually applies to new technologies, and more caution than many users would like. Some people in the AI field think the risks of AGI (and successor systems) are fictitious; we would be delighted if they turn out to be right, but we are going to operate as if these risks are existential.


As our systems get closer to AGI, we are becoming increasingly cautious with the creation and deployment of our models.

At some point, the balance between the upsides and downsides of deployments (such as empowering malicious actors, creating social and economic disruptions, and accelerating an unsafe race) could shift, in which case we would significantly change our plans around continuous deployment.

Second, we are working towards creating increasingly aligned and steerable models. Our shift from models like the first version of GPT-3 to InstructGPT and ChatGPT is an early example of this.

In particular, we think it’s important that society agree on extremely wide bounds of how AI can be used, but that within those bounds, individual users have a lot of discretion. Our eventual hope is that the institutions of the world agree on what these wide bounds should be; in the shorter term we plan to run experiments for external input. The institutions of the world will need to be strengthened with additional capabilities and experience to be prepared for complex decisions about AGI.

The “default setting” of our products will likely be quite constrained, but we plan to make it easy for users to change the behavior of the AI they’re using. We believe in empowering individuals to make their own decisions and the inherent power of diversity of ideas.

We will need to develop new alignment techniques as our models become more powerful (and tests to understand when our current techniques are failing). Our plan in the shorter term is to use AI to help humans evaluate the outputs of more complex models and monitor complex systems, and in the longer term to use AI to help us come up with new ideas for better alignment techniques.

Importantly, we think we often have to make progress on AI safety and capabilities together. It’s a false dichotomy to talk about them separately; they are correlated in many ways. Our best safety work has come from working with our most capable models. That said, it’s important that the ratio of safety progress to capability progress increases.

Third, we hope for a global conversation about three key questions: how to govern these systems, how to fairly distribute the benefits they generate, and how to fairly share access.

In addition to these three areas, we have attempted to set up our structure in a way that aligns our incentives with a good outcome. We have a clause in our Charter about assisting other organizations to advance safety instead of racing with them in late-stage AGI development. We have a cap on the returns our shareholders can earn so that we aren’t incentivized to attempt to capture value without bound and risk deploying something potentially catastrophically dangerous (and of course as a way to share the benefits with society). We have a nonprofit that governs us and lets us operate for the good of humanity (and can override any for-profit interests), including letting us do things like cancel our equity obligations to shareholders if needed for safety and sponsor the world’s most comprehensive UBI experiment.


We have attempted to set up our structure in a way that aligns our incentives with a good outcome.

We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year. At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models. We think public standards about when an AGI effort should stop a training run, decide a model is safe to release, or pull a model from production use are important. Finally, we think it’s important that major world governments have insight about training runs above a certain scale.

The long term

We believe that future of humanity should be determined by humanity, and that it’s important to share information about progress with the public. There should be great scrutiny of all efforts attempting to build AGI and public consultation for major decisions.

The first AGI will be just a point along the continuum of intelligence. We think it’s likely that progress will continue from there, possibly sustaining the rate of progress we’ve seen over the past decade for a long period of time. If this is true, the world could become extremely different from how it is today, and the risks could be extraordinary. A misaligned superintelligent AGI could cause grievous harm to the world; an autocratic regime with a decisive superintelligence lead could do that too.

AI that can accelerate science is a special case worth thinking about, and perhaps more impactful than everything else. It’s possible that AGI capable enough to accelerate its own progress could cause major changes to happen surprisingly quickly (and even if the transition starts slowly, we expect it to happen pretty quickly in the final stages). We think a slower takeoff is easier to make safe, and coordination among AGI efforts to slow down at critical junctures will likely be important (even in a world where we don’t need to do this to solve technical alignment problems, slowing down may be important to give society enough time to adapt).

Successfully transitioning to a world with superintelligence is perhaps the most important—and hopeful, and scary—project in human history. Success is far from guaranteed, and the stakes (boundless downside and boundless upside) will hopefully unite all of us.

We can imagine a world in which humanity flourishes to a degree that is probably impossible for any of us to fully visualize yet. We hope to contribute to the world an AGI aligned with such flourishing.


Acknowledgments

Thanks to Brian Chesky, Paul Christiano, Jack Clark, Holden Karnofsky, Tasha McCauley, Nate Soares, Kevin Scott, Brad Smith, Helen Toner, Allan Dafoe, and the OpenAI team for reviewing drafts of this.


Footnotes

  1. We seem to have been given lots of gifts relative to what we expected earlier: for example, it seems like creating AGI will require huge amounts of compute and thus the world will know who is working on it, it seems like the original conception of hyper-evolved RL agents competing with each other and evolving intelligence in a way we can’t really observe is less likely than it originally seemed, almost no one predicted we’d make this much progress on pre-trained language models that can learn from the collective preferences and output of humanity, etc.

    AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt. ↩︎

  2. For example, when we first started OpenAI, we didn’t expect scaling to be as important as it has turned out to be. When we realized it was going to be critical, we also realized our original structure wasn’t going to work—we simply wouldn’t be able to raise enough money to accomplish our mission as a nonprofit—and so we came up with a new structure.

    As another example, we now believe we were wrong in our original thinking about openness, and have pivoted from thinking we should release everything (though we open source some things, and expect to open source more exciting things in the future!) to thinking that we should figure out how to safely share access to and benefits of the systems. We still believe the benefits of society understanding what is happening are huge and that enabling such understanding is the best way to make sure that what gets built is what society collectively wants (obviously there’s a lot of nuance and conflict here). ↩︎

OpenAI

How AI Is Transforming Genomics

How AI Is Transforming Genomics

Advancements in whole genome sequencing have ignited a revolution in digital biology.

Genomics programs across the world are gaining momentum as the cost of high-throughput, next-generation sequencing has declined.

Whether used for sequencing critical-care patients with rare diseases or in population-scale genetics research, whole genome sequencing is becoming a fundamental step in clinical workflows and drug discovery.

But genome sequencing is just the first step. Analyzing genome sequencing data requires accelerated compute, data science and AI to read and understand the genome. With the end of Moore’s law, the observation that there’s a doubling every two years in the number of transistors in an integrated circuit, new computing approaches are necessary to lower the cost of data analysis, increase the throughput and accuracy of reads, and ultimately unlock the full potential of the human genome.

An Explosion in Bioinformatics Data

Sequencing an individual’s whole genome generates roughly 100 gigabytes of raw data. That more than doubles after the genome is sequenced using complex algorithms and applications such as deep learning and natural language processing.

As the cost of sequencing a human genome continues to decrease, volumes of sequencing data are exponentially increasing.

An estimated 40 exabytes will be required to store all human genome data by 2025. As a reference, that’s 8x more storage than would be required to store every word spoken in history.

Many genome analysis pipelines are struggling to keep up with the expansive levels of raw data being generated.

Accelerated Genome Sequencing Analysis Workflows

Sequencing analysis is complicated and computationally intensive, with numerous steps required to identify genetic variants in a human genome.

Deep learning is becoming important for base calling right within the genomic instrument using RNN- and convolutional neural network (CNN)-based models. Neural networks interpret image and signal data generated by instruments and infer the 3 billion nucleotide pairs of the human genome. This is improving the accuracy of the reads and ensuring that base calling occurs closer to real time, further hastening the entire genomics workflow, from sample to variant call format to final report.

For secondary genomic analysis, alignment technologies use a reference genome to assist with piecing a genome back together after the sequencing of DNA fragments.

BWA-MEM, a leading algorithm for alignment, is helping researchers rapidly map DNA sequence reads to a reference genome. STAR is another gold-standard alignment algorithm used for RNA-seq data that delivers accurate, ultrafast alignment to better understand gene expressions.

The dynamic programming algorithm Smith-Waterman is also widely used for alignment, a step that’s accelerated 35x on the NVIDIA H100 Tensor Core GPU, which includes a dynamic programming accelerator.

Uncovering Genetic Variants

One of the most critical stages of sequencing projects is variant calling, where researchers identify differences between a patient’s sample and the reference genome. This helps clinicians determine what genetic disease a critically ill patient might have, or helps researchers look across a population to discover new drug targets. These variants can be single-nucleotide changes, small insertions and deletions, or complex rearrangements.

GPU-optimized and -accelerated callers such as the Broad Institute’s GATK — a genome analysis toolkit for germline variant calling — increase speed of analysis. To help researchers remove false positives in GATK results, NVIDIA collaborated with the Broad Institute to introduce NVScoreVariants, a deep learning tool for filtering variants using CNNs.

Deep learning-based variant callers such as Google’s DeepVariant increase accuracy of calls, without the need for a separate filtering step. DeepVariant uses a CNN architecture to call variants. It can be retrained to fine-tune for enhanced accuracy with each genomic platform’s outputs.

Secondary analysis software in the NVIDIA Clara Parabricks suite of tools has accelerated these variant callers up to 80x. For example, germline HaplotypeCaller’s runtime is reduced from 16 hours in a CPU-based environment to less than five minutes with GPU-accelerated Clara Parabricks.

Accelerating the Next Wave of Genomics

NVIDIA is helping to enable the next wave of genomics by powering both short- and long-read sequencing platforms with accelerated AI base calling and variant calling. Industry leaders and startups are working with NVIDIA to push the boundaries of whole genome sequencing.

For example, biotech company PacBio recently announced the Revio system, a new long-read sequencing system featuring NVIDIA Tensor Core GPUs. Enabled by a 20x increase in computing power relative to prior systems, Revio is designed to sequence human genomes with high-accuracy long reads at scale for under $1,000.

Oxford Nanopore Technologies offers the only single technology that can sequence any-length DNA or RNA fragments in real time. These features allow the rapid discovery of more genetic variation. Seattle Children’s Hospital recently used the high-throughput nanopore sequencing instrument PromethION to understand a genetic disorder in the first few hours of a newborn’s life.

Ultima Genomics is offering high-throughput whole genome sequencing at just $100 per sample, and Singular Genomics’ G4 is the most powerful benchtop system.

Learn More

At NVIDIA GTC, a free AI conference taking place online March 20-23, speakers from PacBio, Oxford Nanopore, Genomic England, KAUST, Stanford, Argonne National Labs and other leading institutions will share the latest AI advances in genomic sequencing, analysis and genomic large language models for understanding gene expression.

The conference features a keynote from NVIDIA founder and CEO Jensen Huang on Tuesday, March 21, at 8 a.m. PT.

NVIDIA Clara Parabricks is free for students and researchers. Get started today or try a free hands-on lab to experience the toolkit in action.

Read More

Sun in Their AIs: Nonprofit Forecasts Solar Energy for UK Grid

Sun in Their AIs: Nonprofit Forecasts Solar Energy for UK Grid

Cloudy British weather is the butt of many jokes — but the United Kingdom’s national power grid is making the most of its sunshine.

With the help of Open Climate Fix, a nonprofit product lab, the control room of the National Grid Electricity System Operator (ESO) is testing AI models that provide granular, near-term forecasts of sunny and cloudy conditions over the country’s solar panels.

These insights can help ESO, the U.K.’s electric grid operator, address a key challenge in renewable energy: Sudden cloud cover can cause a significant dip in solar power generation, so grid operators ask fossil fuel plants to overproduce energy as backup.

With better forecasts, ESO could cut down on the extra fossil fuel energy held as reserve — improving efficiency while decreasing carbon footprint.

“Traditional weather models aren’t very good at predicting clouds, but using AI and satellite imagery, we can bring a lot more accuracy to solar forecasting,” said Dan Travers, co-founder of Open Climate Fix, a U.K.-based startup. “Solar energy is really effective at displacing coal, but grid operators need accurate forecasts to make it possible to integrate large amounts of solar generation — so we see a lot of opportunity in bringing this solution to coal-heavy electric grids worldwide.”

Open Climate Fix is a member of NVIDIA Inception, a global program that offers cutting-edge startups expertise, technology and go-to-market support. The team publishes its datasets, dozens of models and open-source code to HuggingFace and GitHub.

Each colored dot on the map represents a solar photovoltaic system. Blue dots represent low solar-energy output, yellow dots signify high output and black dots are systems with no data.

AI to Catch a Cloud and Pin It Down

Before the advent of renewable energy, the experts managing the electric grid day-to-day only had to worry about the variability of demand across the network — making sure there was enough power generated to keep up with air conditioners during a heat wave, or electric stoves and appliances on weekday evenings.

By adding renewables such as wind and solar energy to the mix, the energy grid must also account for weather-related variation in the level of supply. Satellite images provide the most up-to-date view to determine when clouds are coming between photovoltaic panels and the sun.

Open Climate Fix’s AI models are trained on terabytes of satellite data captured at five-minute intervals over Europe, the Middle East and North Africa. Additional data sources include years’ worth of hourly weather predictions at ten-kilometer resolution, topographic maps, information about the time of day and the sun’s position in the sky, and live readings from around solar panels across the U.K.

The team is using some of the most recent deep learning models for weather modeling including MetNet, GraphCast and Deep Generative Model of Radar. They’ve shown that their transformer-based AI models are 3x better at predicting solar energy generation than the forecasts generated by ESO’s traditional methods. The increased precision can help ESO reach its goal of being able to operate a zero-carbon electric grid by 2025.

“The physics-based forecasting models are powerful for predicting weather on the scale of days and weeks, but take hours to produce — making them ill-suited for predictions at the hour or minute level,” said Travers. “But with satellite images captured at intervals of a few minutes, we can get closer to a live view of cloud cover.”

AI’s Working on Sunshine

Cloud cover is of particular concern in the U.K., where cities including London, Birmingham and Glasgow receive an average of 1,400 or fewer hours of sunshine each year — less than half that of Los Angeles. But even in desert climates where cloudy days are rare, Open Climate Fix’s AI models could be repurposed to detect when solar panels would be covered by dust from a sandstorm.

In addition to forecasting for the entire U.K., the nonprofit is also developing models that can forecast how much energy individual solar panels will capture. This data could help large solar farm operators understand and maximize their energy output. Smart home companies, too, could use the information to optimize energy use from solar panels on customers’ roofs — giving homeowners insights about when to run power-hungry devices or schedule electric vehicle charging.

Open Climate Fix uses a cluster of NVIDIA RTX A6000 GPUs granted through an NVIDIA Hardware Grant to power its work. When training multiple models at the same time, the team shifts its overflow workload to NVIDIA A100 Tensor Core GPUs available through cloud service providers.

“The hardware grants have helped us develop and iterate on our models more easily,” said Jacob Bieker, a machine learning researcher at Open Climate Fix. “When our team is first debugging and training a model, it’s two or three times faster to do so locally.”

To learn more about AI accelerating decarbonization, boosting grid resiliency and driving energy efficiency, register free for NVIDIA GTC, which takes place online, March 20-23.

Read about NVIDIA’s work in power and utilities and apply to join NVIDIA Inception.

Main image of National Grid ESO Electricity National Control Center, courtesy of ESO Media Center

Read More

Pre-training generalist agents using offline reinforcement learning

Pre-training generalist agents using offline reinforcement learning

Reinforcement learning (RL) algorithms can learn skills to solve decision-making tasks like playing games, enabling robots to pick up objects, or even optimizing microchip designs. However, running RL algorithms in the real world requires expensive active data collection. Pre-training on diverse datasets has proven to enable data-efficient fine-tuning for individual downstream tasks in natural language processing (NLP) and vision problems. In the same way that BERT or GPT-3 models provide general-purpose initialization for NLP, large RL–pre-trained models could provide general-purpose initialization for decision-making. So, we ask the question: Can we enable similar pre-training to accelerate RL methods and create a general-purpose “backbone” for efficient RL across various tasks?

In “Offline Q-learning on Diverse Multi-Task Data Both Scales and Generalizes”, to be published at ICLR 2023, we discuss how we scaled offline RL, which can be used to train value functions on previously collected static datasets, to provide such a general pre-training method. We demonstrate that Scaled Q-Learning using a diverse dataset is sufficient to learn representations that facilitate rapid transfer to novel tasks and fast online learning on new variations of a task, improving significantly over existing representation learning approaches and even Transformer-based methods that use much larger models.

Scaled Q-learning: Multi-task pre-training with conservative Q-learning

To provide a general-purpose pre-training approach, offline RL needs to be scalable, allowing us to pre-train on data across different tasks and utilize expressive neural network models to acquire powerful pre-trained backbones, specialized to individual downstream tasks. We based our offline RL pre-training method on conservative Q-learning (CQL), a simple offline RL method that combines standard Q-learning updates with an additional regularizer that minimizes the value of unseen actions. With discrete actions, the CQL regularizer is equivalent to a standard cross-entropy loss, which is a simple, one-line modification on standard deep Q-learning. A few crucial design decisions made this possible:

  • Neural network size: We found that multi-game Q-learning required large neural network architectures. While prior methods often used relatively shallow convolutional networks, we found that models as large as a ResNet 101 led to significant improvements over smaller models.
  • Neural network architecture: To learn pre-trained backbones that are useful for new games, our final architecture uses a shared neural network backbone, with separate 1-layer heads outputting Q-values of each game. This design avoids interference between the games during pre-training, while still providing enough data sharing to learn a single shared representation. Our shared vision backbone also utilized a learned position embedding (akin to Transformer models) to keep track of spatial information in the game.
  • Representational regularization: Recent work has observed that Q-learning tends to suffer from representational collapse issues, where even large neural networks can fail to learn effective representations. To counteract this issue, we leverage our prior work to normalize the last layer features of the shared part of the Q-network. Additionally, we utilized a categorical distributional RL loss for Q-learning, which is known to provide richer representations that improve downstream task performance.

The multi-task Atari benchmark

We evaluate our approach for scalable offline RL on a suite of Atari games, where the goal is to train a single RL agent to play a collection of games using heterogeneous data from low-quality (i.e., suboptimal) players, and then use the resulting network backbone to quickly learn new variations in pre-training games or completely new games. Training a single policy that can play many different Atari games is difficult enough even with standard online deep RL methods, as each game requires a different strategy and different representations. In the offline setting, some prior works, such as multi-game decision transformers, proposed to dispense with RL entirely, and instead utilize conditional imitation learning in an attempt to scale with large neural network architectures, such as transformers. However, in this work, we show that this kind of multi-game pre-training can be done effectively via RL by employing CQL in combination with a few careful design decisions, which we describe below.

Scalability on training games

We evaluate the Scaled Q-Learning method’s performance and scalability using two data compositions: (1) near optimal data, consisting of all the training data appearing in replay buffers of previous RL runs, and (2) low quality data, consisting of data from the first 20% of the trials in the replay buffer (i.e., only data from highly suboptimal policies). In our results below, we compare Scaled Q-Learning with an 80-million parameter model to multi-game decision transformers (DT) with either 40-million or 80-million parameter models, and a behavioral cloning (imitation learning) baseline (BC). We observe that Scaled Q-Learning is the only approach that improves over the offline data, attaining about 80% of human normalized performance.

Further, as shown below, Scaled Q-Learning improves in terms of performance, but it also enjoys favorable scaling properties: just as how the performance of pre-trained language and vision models improves as network sizes get bigger, enjoying what is typically referred as “power-law scaling”, we show that the performance of Scaled Q-learning enjoys similar scaling properties. While this may be unsurprising, this kind of scaling has been elusive in RL, with performance often deteriorating with larger model sizes. This suggests that Scaled Q-Learning in combination with the above design choices better unlocks the ability of offline RL to utilize large models.

Fine-tuning to new games and variations

To evaluate fine-tuning from this offline initialization, we consider two settings: (1) fine-tuning to a new, entirely unseen game with a small amount of offline data from that game, corresponding to 2M transitions of gameplay, and (2) fine-tuning to a new variant of the games with online interaction. The fine-tuning from offline gameplay data is illustrated below. Note that this condition is generally more favorable to imitation-style methods, Decision Transformer and behavioral cloning, since the offline data for the new games is of relatively high-quality. Nonetheless, we see that in most cases Scaled Q-learning improves over alternative approaches (80% on average), as well as dedicated representation learning methods, such as MAE or CPC, which only use the offline data to learn visual representations rather than value functions.

In the online setting, we see even larger improvements from pre-training with Scaled Q-learning. In this case, representation learning methods like MAE yield minimal improvement during online RL, whereas Scaled Q-Learning can successfully integrate prior knowledge about the pre-training games to significantly improve the final score after 20k online interaction steps.

These results demonstrate that pre-training generalist value function backbones with multi-task offline RL can significantly boost performance of RL on downstream tasks, both in offline and online mode. Note that these fine-tuning tasks are quite difficult: the various Atari games, and even variants of the same game, differ significantly in appearance and dynamics. For example, the target blocks in Breakout disappear in the variation of the game as shown below, making control difficult. However, the success of Scaled Q-learning, particularly as compared to visual representation learning techniques, such as MAE and CPC, suggests that the model is in fact learning some representation of the game dynamics, rather than merely providing better visual features.

Fine-tuning with online RL for variants of the game Freeway, Hero, and Breakout. The new variant used in fine-tuning is shown in the bottom row of each figure, the original game seen in pre-training is in the top row. Fine-tuning from Scaled Q-Learning significantly outperforms MAE (a visual representation learning method) and learning from scratch with single-game DQN.

Conclusion and takeaways

We presented Scaled Q-Learning, a pre-training method for scaled offline RL that builds on the CQL algorithm, and demonstrated how it enables efficient offline RL for multi-task training. This work made initial progress towards enabling more practical real-world training of RL agents as an alternative to costly and complex simulation-based pipelines or large-scale experiments. Perhaps in the long run, similar work will lead to generally capable pre-trained RL agents that develop broadly applicable exploration and interaction skills from large-scale offline pre-training. Validating these results on a broader range of more realistic tasks, in domains such as robotics (see some initial results) and NLP, is an important direction for future research. Offline RL pre-training has a lot of potential, and we expect that we will see many advances in this area in future work.

Acknowledgements

This work was done by Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Special thanks to Sherry Yang, Ofir Nachum, and Kuang-Huei Lee for help with the multi-game decision transformer codebase for evaluation and the multi-game Atari benchmark, and Tom Small for illustrations and animation.

Read More