Differentially private median and more

Differential privacy (DP) is a rigorous mathematical definition of privacy. DP algorithms are randomized to protect user data by ensuring that the probability of any particular output is nearly unchanged when a data point is added or removed. Therefore, the output of a DP algorithm does not disclose the presence of any one data point. There has been significant progress in both foundational research and adoption of differential privacy with contributions such as the Privacy Sandbox and Google Open Source Library.

ML and data analytics algorithms can often be described as performing multiple basic computation steps on the same dataset. When each such step is differentially private, so is the output, but with multiple steps the overall privacy guarantee deteriorates, a phenomenon known as the cost of composition. Composition theorems bound the increase in privacy loss with the number k of computations: In the general case, the privacy loss increases with the square root of k. This means that we need much stricter privacy guarantees for each step in order to meet our overall privacy guarantee goal. But in that case, we lose utility. One way to improve the privacy vs. utility trade-off is to identify when the use cases admit a tighter privacy analysis than what follows from composition theorems.
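To make the cost of composition concrete, here is a small Python sketch that compares basic composition (privacy loss grows linearly in k) with the standard advanced composition bound (privacy loss grows roughly with the square root of k). The function names and the bisection helper are illustrative, not from the paper; the advanced bound used is the well-known one of Dwork, Rothblum and Vadhan, which spends an extra `delta_slack` in the delta parameter.

```python
import math


def basic_composition(eps_step: float, k: int) -> float:
    """Total epsilon of k eps_step-DP steps under basic composition."""
    return k * eps_step


def advanced_composition(eps_step: float, k: int, delta_slack: float) -> float:
    """Total epsilon of k eps_step-DP steps under advanced composition.

    The leading term grows with sqrt(k); delta_slack is the additional
    failure probability the bound pays for.
    """
    return (math.sqrt(2 * k * math.log(1 / delta_slack)) * eps_step
            + k * eps_step * (math.exp(eps_step) - 1))


def per_step_budget(eps_total: float, k: int, delta_slack: float) -> float:
    """Largest per-step epsilon whose advanced-composition total fits eps_total.

    Found by bisection, since the bound is increasing in eps_step.
    """
    lo, hi = 0.0, eps_total
    for _ in range(100):
        mid = (lo + hi) / 2
        if advanced_composition(mid, k, delta_slack) <= eps_total:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, `per_step_budget(1.0, 100, 1e-6)` shows how small each of 100 steps must be to keep a total epsilon of 1, which is the utility loss the rest of this post tries to avoid.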

Good candidates for such improvement are when each step is applied to a disjoint part (slice) of the dataset. When the slices are selected in a data-independent way, each point affects only one of the k outputs and the privacy guarantees do not deteriorate with k. However, there are applications in which we need to select the slices adaptively (that is, in a way that depends on the output of prior steps). In these cases, a change of a single data point may cascade — changing multiple slices and thus increasing composition cost.

In “Õptimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization”, presented at STOC 2023, we describe a new paradigm that allows for slices to be selected adaptively and yet avoids composition cost. We show that DP algorithms for multiple fundamental aggregation and learning tasks can be expressed in this Reorder-Slice-Compute (RSC) paradigm, gaining significant improvements in utility.

The Reorder-Slice-Compute (RSC) paradigm

An algorithm A falls in the RSC paradigm if it can be expressed in the following general form (see visualization below). The input is a sensitive set D of data points. The algorithm then performs a sequence of k steps as follows:

  1. Select an ordering over data points, a slice size m, and a DP algorithm M. The selection may depend on the output of A in prior steps (and hence is adaptive).
  2. Slice out the (approximately) top m data points according to the order from the dataset D, apply M to the slice, and output the result.
A visualization of three Reorder-Slice-Compute (RSC) steps.
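The two steps above can be sketched as a short Python loop. This is a simplified illustration only: `run_rsc`, `choose_step` and `noisy_count_above` are hypothetical names, and the sketch slices exactly m points, whereas the paradigm slices out approximately the top m points. The sketch therefore shows the control flow and does not by itself carry the privacy guarantee proved in the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One adaptive choice: (ordering key, slice size m, DP algorithm M).
Step = Tuple[Callable[[float], float], int, Callable[[Sequence[float]], float]]


def noisy_count_above(threshold: float, epsilon: float,
                      rng: random.Random) -> Callable[[Sequence[float]], float]:
    """Example mechanism M: a count with Laplace(1/epsilon) noise."""
    def mechanism(slice_: Sequence[float]) -> float:
        true_count = sum(1 for p in slice_ if p > threshold)
        # The difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        return true_count + noise
    return mechanism


def run_rsc(data: Sequence[float],
            choose_step: Callable[[List[float]], Step],
            k: int) -> Tuple[List[float], List[float]]:
    """Run k Reorder-Slice-Compute steps; return the outputs and unused points."""
    remaining = list(data)
    outputs: List[float] = []
    for _ in range(k):
        # The selection may depend on prior outputs, so it is adaptive.
        key, m, mechanism = choose_step(outputs)
        remaining.sort(key=key, reverse=True)              # Reorder
        slice_, remaining = remaining[:m], remaining[m:]   # Slice (exactly m here)
        outputs.append(mechanism(slice_))                  # Compute
    return outputs, remaining
```

Each data point lands in at most one slice, which is the property the tighter analysis builds on.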

If we analyze the overall privacy loss of an RSC algorithm using DP composition theorems, the privacy guarantee suffers from the expected composition cost, i.e., it deteriorates with the square root of the number of steps k. To eliminate this composition cost, we provide a novel analysis that removes the dependence on k altogether: the overall privacy guarantee is close to that of a single step! The idea behind our tighter analysis is a novel technique that limits the potential cascade of affected steps when a single data point is modified (details in the paper).

Tighter privacy analysis means better utility. The effectiveness of DP algorithms is often stated in terms of the smallest input size (number of data points) that suffices in order to release a correct result that meets the privacy requirements. We describe several problems with algorithms that can be expressed in the RSC paradigm and for which our tighter analysis improved utility.

Private interval point

We start with the following basic aggregation task. The input is a dataset D of n points from an ordered domain X (think of the domain as the natural numbers between 1 and |X|). The goal is to return a point y in X that is in the interval of D, that is between the minimum and the maximum points in D.

The solution to the interval point problem is trivial without the privacy requirement: simply return any point in the dataset D. But this solution is not privacy-preserving as it discloses the presence of a particular datapoint in the input. We can also see that if there is only one point in the dataset, a privacy-preserving solution is not possible, as it must return that point. We can therefore ask the following fundamental question: What is the smallest input size N for which we can solve the private interval point problem?

It is known that N must increase with the domain size |X| and that this dependence is at least the iterated log function log* |X| [1, 2]. On the other hand, the best prior DP algorithm required the input size to be at least (log* |X|)^1.5. To close this gap, we designed an RSC algorithm that requires only an order of log* |X| points.

The iterated log function is extremely slow growing: It is the number of times we need to take a logarithm of a value before we reach a value that is equal to or smaller than 1. How did this function naturally come out in the analysis? Each step of the RSC algorithm remapped the domain to a logarithm of its prior size. Therefore there were log* |X| steps in total. The tighter RSC analysis eliminated a square root of the number of steps from the required input size.
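To get a feel for how slowly this function grows, the following Python snippet computes it directly from the definition above. Base 2 is an assumption made here for concreteness; the choice of base only changes the value by a small amount.

```python
import math


def log_star(x: float) -> int:
    """Number of times log2 must be applied before the value is <= 1."""
    steps = 0
    while x > 1:
        x = math.log2(x)
        steps += 1
    return steps


# Even a domain whose size has nearly 20,000 decimal digits needs only 5 steps.
print(log_star(16), log_star(65536), log_star(2 ** 65536))
```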

Even though the interval point task seems very basic, it captures the essence of the difficulty of private solutions for common aggregation tasks. We next describe two of these tasks and express the required input size to these tasks in terms of N.

Private approximate median

One of these common aggregation tasks is approximate median: The input is a dataset D of n points from an ordered domain X. The goal is to return a point y that is between the ⅓ and ⅔ quantiles of D. That is, at least a third of the points in D are smaller than or equal to y and at least a third of the points are larger than or equal to y. Note that returning an exact median is not possible with differential privacy, since it discloses the presence of a datapoint. Hence we consider the relaxed requirement of an approximate median (shown below).

We can compute an approximate median by finding an interval point: We slice out the N smallest points and the N largest points and then compute an interval point of the remaining points. The latter must be an approximate median. This works when the dataset size is at least 3N.

An example of a dataset D over domain X, the set of interval points, and the set of approximate medians.
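The reduction from approximate median to interval point can be sketched in a few lines of Python. Two things here are assumptions for illustration. First, `placeholder_interval_point` is a non-private stand-in for the private interval point algorithm, which is not reproduced here. Second, the sketch trims a third of the points from each end: for n = 3N this is exactly the N smallest and N largest points, and for larger n it keeps the result between the ⅓ and ⅔ quantiles.

```python
from typing import Callable, Sequence


def placeholder_interval_point(points: Sequence[float]) -> float:
    """NOT private: stand-in that returns some point inside [min, max]."""
    ordered = sorted(points)
    return ordered[len(ordered) // 2]


def approximate_median(
    data: Sequence[float],
    n_required: int,
    interval_point: Callable[[Sequence[float]], float] = placeholder_interval_point,
) -> float:
    """Return a point between the 1/3 and 2/3 quantiles of data.

    n_required plays the role of N: the smallest input size for which the
    interval point routine works.
    """
    if len(data) < 3 * n_required:
        raise ValueError("need at least 3N points")
    ordered = sorted(data)
    trim = len(ordered) // 3
    # At least N points remain, so the interval point routine can run on them.
    middle = ordered[trim:len(ordered) - trim]
    return interval_point(middle)
```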

Private learning of axis-aligned rectangles

For the next task, the input is a set of n labeled data points, where each point x = (x_1, …, x_d) is a d-dimensional vector over a domain X. As displayed below, the goal is to learn values a_i, b_i for the axes i = 1, …, d that define a d-dimensional rectangle, so that for each example x

  • If x is positively labeled (shown as red plus signs below) then it lies within the rectangle, that is, for all axes i, x_i is in the interval [a_i, b_i], and
  • If x is negatively labeled (shown as blue minus signs below) then it lies outside the rectangle, that is, for at least one axis i, x_i is outside the interval [a_i, b_i].
A set of 2-dimensional labeled points and a respective rectangle.

Any DP solution for this problem must be approximate in that the learned rectangle must be allowed to mislabel some data points, with some positively labeled points outside the rectangle or negatively labeled points inside it. This is because an exact solution could be very sensitive to the presence of a particular data point and would not be private. The goal is a DP solution that keeps this necessary number of mislabeled points small.

We first consider the one-dimensional case (d = 1). We are looking for an interval [a,b] that covers all positive points and none of the negative points. We show that we can do this with at most 2N mislabeled points. We focus on the positively labeled points. In the first RSC step we slice out the N smallest points and compute a private interval point as a. We then slice out the N largest points and compute a private interval point as b. The solution [a,b] correctly labels all negatively labeled points and mislabels at most 2N of the positively labeled points. Thus, at most ~2N points are mislabeled in total.

Illustration for d = 1, we slice out N left positive points and compute an interval point a, slice out N right positive points and compute an interval point b.

With d > 1, we iterate over the axes i = 1, …, d and apply the above for the i-th coordinates of input points to obtain the values a_i, b_i. In each iteration, we perform two RSC steps and slice out 2N positively labeled points. In total, we slice out 2dN points and all remaining points were correctly labeled. That is, all negatively-labeled points are outside the final d-dimensional rectangle and all positively-labeled points, except perhaps ~2dN, lie inside the rectangle. Note that this algorithm uses the full flexibility of RSC in that the points are ordered differently by each axis. Since we perform d steps, the RSC analysis shaves off a factor of square root of d from the number of mislabeled points.
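The per-axis procedure can be sketched as follows. As before, `placeholder_interval_point` is a non-private stand-in used only to make the sketch runnable, and `learn_rectangle` and `inside` are illustrative names. The sketch works on the positively labeled points only, as in the description above.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def placeholder_interval_point(values: Sequence[float]) -> float:
    """NOT private: stand-in that returns some point inside [min, max]."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def learn_rectangle(
    positives: Sequence[Vector],
    n_slice: int,
    interval_point: Callable[[Sequence[float]], float] = placeholder_interval_point,
) -> Tuple[List[Tuple[float, float]], List[Vector]]:
    """Return per-axis bounds (a_i, b_i) and the positives that were never sliced out."""
    remaining = list(positives)
    dims = len(remaining[0])
    if len(remaining) < 2 * dims * n_slice:
        raise ValueError("need at least 2dN positive points")
    bounds = []
    for axis in range(dims):
        # Each axis imposes its own ordering on the remaining points.
        remaining.sort(key=lambda x: x[axis])
        # RSC step 1: slice out the N smallest points, compute a_i from them.
        low, remaining = remaining[:n_slice], remaining[n_slice:]
        a = interval_point([x[axis] for x in low])
        # RSC step 2: slice out the N largest points, compute b_i from them.
        cut = len(remaining) - n_slice
        high, remaining = remaining[cut:], remaining[:cut]
        b = interval_point([x[axis] for x in high])
        bounds.append((a, b))
    return bounds, remaining


def inside(bounds: Sequence[Tuple[float, float]], x: Vector) -> bool:
    """True if x lies in the rectangle defined by bounds."""
    return all(a <= xi <= b for (a, b), xi in zip(bounds, x))
```

Every point that is never sliced out lies inside the learned rectangle, so only the 2dN sliced-out points can be mislabeled.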

Training ML models with adaptive selection of training examples

The training efficiency or performance of ML models can sometimes be improved by selecting training examples in a way that depends on the current state of the model, e.g., self-paced curriculum learning or active learning.

The most common method for private training of ML models is DP-SGD, where noise is added to the gradient update from each minibatch of training examples. Privacy analysis with DP-SGD typically assumes that training examples are randomly partitioned into minibatches. But if we impose a data-dependent selection order on training examples, and further modify the selection criteria k times during training, then analysis through DP composition results in deterioration of the privacy guarantees of a magnitude equal to the square root of k.

Fortunately, example selection with DP-SGD can be naturally expressed in the RSC paradigm: each selection criteria reorders the training examples and each minibatch is a slice (for which we compute a noisy gradient). With RSC analysis, there is no privacy deterioration with k, which brings DP-SGD training with example selection into the practical domain.
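As a rough illustration of how example selection maps onto reorder, slice, and compute, here is a pure-Python toy version of DP-SGD for a linear model with squared loss. Everything specific in it is an assumption made for the sketch: the model, the "highest loss first" selection criterion, the hyperparameter names, and the exact slice size. It shows the structure only and is not a calibrated private training procedure.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target)


def _error(w: Sequence[float], x: Sequence[float], y: float) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) - y


def example_loss(w: Sequence[float], x: Sequence[float], y: float) -> float:
    return 0.5 * _error(w, x, y) ** 2


def clipped_gradient(w, x, y, clip_norm: float) -> List[float]:
    """Per-example gradient of the squared loss, clipped to L2 norm clip_norm."""
    err = _error(w, x, y)
    grad = [err * xi for xi in x]
    norm = sum(g * g for g in grad) ** 0.5
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_with_selection(data: Sequence[Example], steps: int, batch_size: int,
                          lr: float, clip_norm: float, noise_multiplier: float,
                          rng: random.Random) -> Tuple[List[float], List[Example]]:
    remaining = list(data)
    dim = len(remaining[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        if len(remaining) < batch_size:
            break
        # Reorder: the criterion depends on the current model state.
        remaining.sort(key=lambda ex: example_loss(w, ex[0], ex[1]), reverse=True)
        # Slice: the minibatch is removed, so each example is used at most once.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Compute: sum of clipped gradients plus Gaussian noise.
        total = [0.0] * dim
        for x, y in batch:
            grad = clipped_gradient(w, x, y, clip_norm)
            total = [t + g for t, g in zip(total, grad)]
        noisy = [t + rng.gauss(0.0, noise_multiplier * clip_norm) for t in total]
        w = [wi - lr * n / batch_size for wi, n in zip(w, noisy)]
    return w, remaining
```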

Conclusion

The RSC paradigm was introduced in order to tackle an open problem that is primarily of theoretical significance, but turns out to be a versatile tool with the potential to enhance data efficiency in production environments.

Acknowledgments

The work described here was done jointly with Xin Lyu, Jelani Nelson, and Tamas Sarlos.

NVIDIA Partners with India Giants to Advance AI in World’s Most Populous Nation

The world’s largest democracy is poised to transform itself and the world, embracing AI on an enormous scale.

Speaking with the press Friday in Bengaluru, in the context of announcements from two of India’s largest conglomerates, Reliance Industries Limited and Tata Group, NVIDIA founder and CEO Jensen Huang detailed plans to bring AI technology and skills to address the world’s most populous nation’s greatest challenges.

“I think this is going to be one of the largest AI markets in the world,” said Huang, who was wrapping up a week of high-level meetings across the nation, including with Prime Minister Narendra Modi, leading AI researchers, top business leaders, the press and the country’s 4,000-some NVIDIA employees.

The companies will work together to create an AI computing infrastructure and platforms for developing AI solutions. It will be based on NVIDIA technology like the NVIDIA GH200 Grace Hopper Superchip and NVIDIA DGX Cloud.

GH200 marks a fundamental shift in computing architecture that provides exceptional performance and massive memory bandwidth, while DGX Cloud, an AI supercomputing service in the cloud, makes it easier for enterprises to train their employees in AI technology, access the technology internally and provide generative AI services to customers.

In his exchange with more than a dozen of India’s top tech journalists following the announcement, Huang said computer science expertise is a core competency for India, and that with access to technology and capital India is poised to build AI to be able to solve challenges at home and abroad.

“You have the data, you have the talent,” Huang said. “We are open for business and bring great expertise on building supercomputers.”

During the freewheeling back and forth with the media, Huang emphasized India’s strength in information technology and the potential for AI to accelerate the development of India’s IT industry.

“IT is one of your natural resources. You produce it at an incredible scale. You’re incredibly good at it. You export it all over the world,” Huang said.

India’s “AI Moment”

Earlier, after meeting with many of the region’s top technology leaders, including startup pioneers, AI proponents, and key players in India’s digital public infrastructure, Huang hailed “India’s moment,” saying the nation is on the cusp of becoming a global AI powerhouse.

NVIDIA CEO Jensen Huang with Nandan Nilekani, founder of Infosys and founding chairman of UIDAI during a meeting with key Indian tech leaders.

While India has well-known technical capabilities — distinguished technical universities, 2,500 engineering colleges and an estimated 1.5 million engineers — many of its 1.4 billion people, located across sprawling metropolitan areas and some 650,000 villages, collectively speaking dozens of languages, have yet to fully benefit from this progress.

Applied in the Indian context, AI can help rural farmers interact via cell phones in their local language to get weather information and crop prices. It can help provide, at a massive scale, expert diagnosis of medical symptoms and imaging scans where doctors may not be immediately available. It can better predict cyclonic storms using decades of atmospheric data, enabling those at risk to more quickly evacuate and find shelter.

Reliance Industries and Tata Communications will build and operate state-of-the-art AI supercomputing data centers based on such technology, utilizing it for internal AI development and infrastructure-as-a-service for India’s AI researchers, companies and burgeoning AI startup ecosystem.

That effort, Huang said, during his conversation with the Indian technology press, promises to be part of a process that will turn India into a beacon for AI technology.

“AI could be built in India, used in India, and exported from India,” Huang said.

Implement smart document search index with Amazon Textract and Amazon OpenSearch

For modern companies that deal with enormous volumes of documents such as contracts, invoices, resumes, and reports, efficiently processing and retrieving pertinent data is critical to maintaining a competitive edge. However, traditional methods of storing and searching for documents can be time-consuming and often result in a large effort to find a specific document, especially when they include handwriting. What if there was a way to process documents intelligently and make them searchable with high accuracy?

This is made possible with Amazon Textract, AWS’s Intelligent Document Processing service, coupled with the fast search capabilities of OpenSearch. In this post, we’ll take you on a journey to rapidly build and deploy a document search indexing solution that helps your organization to better harness and extract insights from documents.

Whether you’re in Human Resources looking for specific clauses in employee contracts, or a financial analyst sifting through a mountain of invoices to extract payment data, this solution is tailored to empower you to access the information you need with unprecedented speed and accuracy.

With the proposed solution, your documents are automatically ingested, their content parsed and subsequently indexed into a highly responsive and scalable OpenSearch index.

We’ll cover how technologies such as Amazon Textract, AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service can be integrated into a workflow that seamlessly processes documents. Then we dive into indexing this data into OpenSearch and demonstrate the search capabilities that become available at your fingertips.

Whether your organization is taking the first steps into the digital transformation era or is an established giant seeking to turbocharge information retrieval, this guide is your compass to navigating the opportunities that AWS Intelligent Document Processing and OpenSearch offer.

The implementation used in this post utilizes the Amazon Textract IDP CDK constructs – AWS Cloud Development Kit (CDK) components to define infrastructure for Intelligent Document Processing (IDP) workflows – which allow you to build use case specific customizable IDP workflows. The IDP CDK constructs and samples are a collection of components to enable definition of IDP processes on AWS and published to GitHub. The main concepts used are the AWS Cloud Development Kit (CDK) constructs, the actual CDK stacks and AWS Step Functions. The workshop Use machine learning to automate and process documents at scale is a good starting point to learn more about customizing workflows and using the other sample workflows as a base for your own.

Solution overview

In this solution, we focus on indexing documents into an OpenSearch index for quick search-and-retrieval of information and documents. Documents in PDF, TIFF, JPEG or PNG format are put in an Amazon Simple Storage Service (Amazon S3) bucket and subsequently indexed into OpenSearch using this Step Functions workflow.

Figure 1: The Step Functions OpenSearch workflow

The OpenSearchWorkflow-Decider looks at the document and verifies that the document is one of the supported mime types (PDF, TIFF, PNG or JPEG). It consists of one AWS Lambda function.
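To illustrate the kind of check a decider performs, here is a minimal Python sketch that identifies the four supported types from the first bytes of a file. This is an assumption-based illustration using well-known file signatures; it is not the implementation used by the construct, and `detect_supported_mime` is a hypothetical name.

```python
from typing import Optional

# Well-known leading byte signatures for the supported document types.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),   # little-endian TIFF
    (b"MM\x00*", "image/tiff"),   # big-endian TIFF
)


def detect_supported_mime(head: bytes) -> Optional[str]:
    """Return the mime type if the leading bytes match a supported format, else None."""
    for signature, mime in _SIGNATURES:
        if head.startswith(signature):
            return mime
    return None
```

An empty file returns `None`, so a check like this would also reject zero-size uploads before they reach Amazon Textract.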

The DocumentSplitter splits documents into chunks of at most 2,500 pages. This means that even though Amazon Textract supports documents of up to 3,000 pages, you can pass in documents with many more pages; the process still works, puts the pages into OpenSearch, and creates correct page numbers. The DocumentSplitter is implemented as an AWS Lambda function.
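The page arithmetic behind the splitting can be sketched as follows. `page_chunks` is a hypothetical helper, not part of the construct; it only shows how 1-based page ranges of at most 2,500 pages could be derived so that each chunk knows its starting page number.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, max_pages: int = 2500) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (start_page, end_page) ranges covering the document."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 3,755-page document becomes two chunks.
print(page_chunks(3755))
```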

The Map State processes each chunk in parallel.

The TextractAsync task calls Amazon Textract using the asynchronous Application Programming Interface (API) following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and OutputConfig to store the Amazon Textract JSON output to a customer Amazon S3 bucket. It consists of two AWS Lambda functions: one to submit the document for processing and one that is triggered by the Amazon SNS notification.
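For orientation, an asynchronous Textract request with an SNS notification channel and OutputConfig looks roughly like the sketch below. It uses the boto3 `start_document_text_detection` call; whether the construct uses text detection or document analysis, and all bucket names, prefixes and ARNs, are assumptions to be replaced with your own values. The helper function names are hypothetical.

```python
from typing import Any, Dict


def build_text_detection_request(bucket: str, key: str,
                                 sns_topic_arn: str, sns_role_arn: str,
                                 output_bucket: str, output_prefix: str) -> Dict[str, Any]:
    """Build keyword arguments for the asynchronous Textract text detection call."""
    return {
        # The document to process, read directly from Amazon S3.
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes job completion to this SNS topic using this role.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": sns_role_arn},
        # Textract writes its paginated JSON output to this S3 location.
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
    }


def submit(request: Dict[str, Any]) -> str:
    """Submit the job and return its JobId (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it

    client = boto3.client("textract")
    response = client.start_document_text_detection(**request)
    return response["JobId"]
```

The second Lambda function would then react to the SNS message for that JobId rather than polling for the result.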

Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.

The Step Functions context is enriched with information that should also be searchable in the OpenSearch index in the SetMetaData step. The sample implementation adds ORIGIN_FILE_NAME, START_PAGE_NUMBER, and ORIGIN_FILE_URI. You can add any information to enrich the search experience, like information from other backend systems, specific IDs or classification information.

The GenerateOpenSearchBatch takes the generated Amazon Textract output JSON, combines it with the information from the context set by SetMetaData and prepares a file that is optimized for batch import into OpenSearch.
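A file optimized for batch import typically follows the OpenSearch bulk format: newline-delimited JSON in which each document line is preceded by an action line. The sketch below shows that shape. The field names `content` and `page` and the helper `build_bulk_body` are assumptions for illustration; the actual fields produced by the workflow may differ.

```python
import json
from typing import Dict, Iterable, Tuple


def build_bulk_body(pages: Iterable[Tuple[int, str]], metadata: Dict[str, str],
                    start_page_number: int = 1,
                    index_name: str = "papers-index") -> str:
    """Build a bulk request body with one indexed document per page.

    pages yields (page_number_within_chunk, page_text), numbered from 1.
    start_page_number shifts chunk-local page numbers to document page numbers.
    """
    lines = []
    for chunk_page, text in pages:
        # Action line: index the following document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        doc = {
            "content": text,
            "page": start_page_number + chunk_page - 1,
            **metadata,
        }
        lines.append(json.dumps(doc))
    # The bulk format requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)
```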

In the OpenSearchPushInvoke step, this batch import file is sent to the OpenSearch index and becomes available for search. This AWS Lambda function is connected with the aws-lambda-opensearch construct from the AWS Solutions library, using m6g.large.search instances and OpenSearch version 2.7, with the Amazon Elastic Block Store (Amazon EBS) volume configured as General Purpose 2 (GP2) with 200 GB. You can change the OpenSearch configuration according to your requirements.

The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions Quota of Maximum input or output size for a task, state, or execution.

Prerequisites

To deploy the samples, you need an AWS account, the AWS Cloud Development Kit (AWS CDK), a current Python version, and Docker. You need permissions to deploy AWS CloudFormation templates, push to the Amazon Elastic Container Registry (Amazon ECR), and create AWS Identity and Access Management (IAM) roles, AWS Lambda functions, Amazon S3 buckets, AWS Step Functions workflows, an Amazon OpenSearch cluster, and an Amazon Cognito user pool. Make sure your AWS CLI environment is set up with the corresponding permissions.

You can also spin up an AWS Cloud9 instance with AWS CDK, Python and Docker pre-installed to initiate the deployment.

Walkthrough

Deployment

  1. After you set up the prerequisites, clone the repository:
git clone https://github.com/aws-solutions-library-samples/guidance-for-low-code-intelligent-document-processing-on-aws.git
  2. Then cd into the repository folder and install the dependencies:
cd guidance-for-low-code-intelligent-document-processing-on-aws/

pip install -r requirements.txt
  3. Deploy the OpenSearchWorkflow stack:
cdk deploy OpenSearchWorkflow

The deployment takes around 25 minutes with the default configuration settings from the GitHub samples. It creates a Step Functions workflow, which is invoked when a document is put in an Amazon S3 bucket/prefix and which processes the document until its content is indexed in an OpenSearch cluster.

The following is a sample output including useful links and information generated from the cdk deploy OpenSearchWorkflow command:

OpenSearchWorkflow.CognitoUserPoolLink = https://us-east-1.console.aws.amazon.com/cognito/v2/idp/user-pools/us-east-1_1234abcdef/users?region=us-east-1
OpenSearchWorkflow.DocumentQueueLink = https://us-east-1.console.aws.amazon.com/sqs/v2/home?region=us-east-1#/queues/https%3A%2F%2Fsqs.us-east-1.amazonaws.com%2F123412341234%2FOpenSearchWorkflow-ExecutionThrottleDocumentQueueABC1234-ABCDEFG1234.fifo
OpenSearchWorkflow.DocumentUploadLocation = s3://opensearchworkflow-opensearchworkflowbucketabcdef1234/uploads/
OpenSearchWorkflow.OpenSearchDashboard = https://search-idp-cdk-opensearch-abcdef1234.us-east-1.es.amazonaws.com/states/_dashboards
OpenSearchWorkflow.OpenSearchLink = https://us-east-1.console.aws.amazon.com/aos/home?region=us-east-1#/opensearch/domains/idp-cdk-opensearch
OpenSearchWorkflow.StepFunctionFlowLink = https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/statemachines/view/arn:aws:states:us-east-1:123412341234:stateMachine:OpenSearchWorkflow12341234

This information is also available in the AWS CloudFormation Console.

When a new document is placed under the OpenSearchWorkflow.DocumentUploadLocation, a new Step Functions workflow is started for this document.

To check the status of this document, the OpenSearchWorkflow.StepFunctionFlowLink provides a link to the list of Step Functions executions in the AWS Management Console, displaying the status of the document processing for each document uploaded to Amazon S3. The tutorial Viewing and debugging executions on the Step Functions console provides an overview of the components and views in the AWS Console.

Testing

  1. First test using a sample file.
aws s3 cp s3://amazon-textract-public-content/idp-cdk-samples/moby-dick-hidden-paystub-and-w2.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`OpenSearchWorkflow-DocumentUploadLocation`].Value' --output text)
  2. After selecting the link to the Step Functions workflow, or opening the AWS Management Console and going to the Step Functions service page, you can look at the different workflow invocations.

Figure 2: The Step Functions executions list

  3. Take a look at the currently running sample document execution, where you can follow the execution of the individual workflow tasks.

Figure 3: One document Step Functions workflow execution

Search

Once the process has finished, we can validate that the document is indexed in the OpenSearch index.

  1. To do so, first we create an Amazon Cognito user. Amazon Cognito is used for authentication of users against the OpenSearch index. Select the link in the output from the cdk deploy (or look at the AWS CloudFormation output in the AWS Management Console) named OpenSearchWorkflow.CognitoUserPoolLink.
Figure 4: The Cognito user pool

  2. Next, select the Create user button, which directs you to a page to enter a username and a password for accessing the OpenSearch Dashboard.
Figure 5: The Cognito create user dialog

  3. After choosing Create user, you can continue to the OpenSearch Dashboard by selecting the OpenSearchWorkflow.OpenSearchDashboard link from the CDK deployment output. Log in using the previously created username and password. The first time you log in, you have to change the password.
  4. Once logged in to the OpenSearch Dashboard, select the Stack Management section, followed by Index Patterns to create a search index.
Figure 6: OpenSearch Dashboards Stack Management

Figure 7: OpenSearch Index Patterns overview

  5. The default name for the index is papers-index and an index pattern name of papers-index* will match that.
Figure 8: Define the OpenSearch index pattern

  6. After clicking Next step, select timestamp as the Time field and Create index pattern.
Figure 9: OpenSearch index pattern time field

  7. Now, from the menu, select Discover.
Figure 10: OpenSearch Discover

In most cases, you need to change the time span according to your last ingest. The default is 15 minutes, and often there was no activity in the last 15 minutes. In this example, it was changed to 15 days to visualize the ingest.

Figure 11: OpenSearch timespan change

  8. Now you can start to search. Because a novel was indexed, you can search for terms such as call me Ishmael and see the results.
Figure 12: OpenSearch search term

In this case, the term call me Ishmael appears on page 6 of the document at the given Uniform Resource Identifier (URI), which points to the Amazon S3 location of the file. This makes it faster to identify documents and find information across a large corpus of PDF, TIFF or image documents, compared to manually skipping through them.
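The same search can also be expressed as an OpenSearch query body rather than through the Dashboard. The sketch below builds a `match_phrase` query with highlighting. The field name `content` is an assumption; check the fields of your papers-index documents in Discover and adjust it. `phrase_query` is a hypothetical helper name.

```python
import json
from typing import Any, Dict


def phrase_query(phrase: str, field: str = "content", size: int = 10) -> Dict[str, Any]:
    """Build a search body that matches the exact phrase in the given field."""
    return {
        "size": size,
        "query": {"match_phrase": {field: phrase}},
        # Ask OpenSearch to return the matching snippets as well.
        "highlight": {"fields": {field: {}}},
    }


# The body would be sent to the index's _search endpoint, for example
# POST /papers-index/_search, using an authenticated client.
print(json.dumps(phrase_query("call me Ishmael"), indent=2))
```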

Running at scale

In order to estimate scale and duration of an indexing process, the implementation was tested with 93,997 documents and a total sum of 1,583,197 pages (average 16.84 pages/document and the largest file having 3755 pages), which all got indexed into OpenSearch. Processing all files and indexing them into OpenSearch took 5.5 hours in the US East (N. Virginia – us-east-1) region using default Amazon Textract Service Quotas. The graph below shows an initial test at 18:00 followed by the main ingest at 21:00 and all done by 2:30.

Figure 13: OpenSearch indexing overview

Figure 13: OpenSearch indexing overview

For the processing, the tcdk.SFExecutionsStartThrottle was set to an executions_concurrency_threshold=550, which means that concurrent document processing workflows are capped at 550 and excess requests are queued to an Amazon SQS First-In-First-Out (FIFO) queue, which is subsequently drained when current workflows finish. The threshold of 550 is based on the Textract Service quota of 600 in the us-east-1 region. Therefore, the queue depth and age of oldest message are metrics worth monitoring.

Figure 14: Amazon SQS monitoring

Figure 14: Amazon SQS monitoring

In this test, all documents were uploaded to Amazon S3 at once, therefore the Approximate Number of Messages Visible has a steep increase and then a slow decline as no new documents are ingested. The Approximate Age Of Oldest Message increases until all messages are processed. The Amazon SQS MessageRetentionPeriod is set to 14 days. For very long running backlog processing that could exceed 14 days processing, start with processing a smaller subset of representative documents and monitor the duration of execution to estimate how many documents you can pass in before exceeding 14 days. The Amazon SQS CloudWatch metrics look similar for a use case of processing a large backlog of documents, which is ingested at once then processed fully. If your use case is a steady flow of documents, both metrics, the Approximate Number of Messages Visible and the Approximate Age Of Oldest Message will be more linear. You can also use the threshold parameter to mix a steady load with backlog processing and allocate capacity according to your processing needs.
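A back-of-the-envelope estimate for the backlog size can be derived from a measured test run. The sketch below assumes, as a simplification, that throughput stays constant and scales linearly with the backlog; the `safety_factor` is an arbitrary margin chosen for illustration, and `max_backlog_pages` is a hypothetical helper.

```python
def max_backlog_pages(measured_pages: int, measured_hours: float,
                      retention_days: int = 14, safety_factor: float = 0.8) -> int:
    """Estimate how many pages can be queued at once without exceeding retention.

    The last queued message waits for the entire backlog to drain, so the total
    processing time has to stay below the SQS message retention period.
    """
    pages_per_hour = measured_pages / measured_hours
    return int(pages_per_hour * retention_days * 24 * safety_factor)


# Using the numbers from the test run above: 1,583,197 pages in 5.5 hours.
print(max_backlog_pages(1_583_197, 5.5))
```

Treat the result as a starting point and confirm it by monitoring the Approximate Age Of Oldest Message metric on a representative subset.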

Another metric to monitor is the health of the OpenSearch cluster, which you should set up according to the Operational best practices for Amazon OpenSearch Service. The default deployment uses m6g.large.search instances.

Figure 15: OpenSearch monitoring

Figure 15: OpenSearch monitoring

Here is a snapshot of the Key Performance Indicators (KPI) for the OpenSearch cluster. It shows no errors and a constant indexing data rate and latency.

The Step Functions workflow executions show the state of processing for each individual document. If you see executions in Failed state, then select the details. A good resource to monitor is the Amazon CloudWatch Automatic Dashboard for Step Functions, which exposes some of the Step Functions CloudWatch metrics.

Figure 16: Step Functions monitoring executions succeeded

Figure 16: Step Functions monitoring executions succeeded

In this AWS CloudWatch Dashboard graph, you see the successful Step Functions executions over time.

Figure 17: OpenSearch monitoring executions failed

Figure 17: OpenSearch monitoring executions failed

And this one shows the failed executions. These are worth investigating through the AWS Console Step Functions overview.

The following screenshot shows one example of a failed execution due to the origin file being of 0 size, which makes sense because the file has no content and could not be processed. It is important to filter failed processes and visualize failures, so that you can go back to the source document and validate the root cause.

Figure 18: Step Functions failed workflow

Figure 18: Step Functions failed workflow

Other failures might include documents that are not of mime type: application/pdf, image/png, image/jpeg, or image/tiff because other document types are not supported by Amazon Textract.

Cost

The total cost of ingesting 1,583,278 pages was split across AWS services used for the implementation. The following list serves as approximate numbers, because your actual cost and processing duration vary depending on the size of documents, the number of pages per document, the density of information in the documents, and the AWS Region. Amazon DynamoDB was consuming $0.55, Amazon S3 $3.33, OpenSearch Service $14.71, Step Functions $17.92, AWS Lambda $28.95, and Amazon Textract $1,849.97. Also, keep in mind that the deployed Amazon OpenSearch Service cluster is billed by the hour and will accumulate higher cost when run over a period of time.
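To put these figures in perspective, the following snippet sums the listed costs and derives a cost per 1,000 pages and the share attributable to Amazon Textract. It uses only the numbers quoted above; the derived values are approximations for this particular test run and will differ for other workloads.

```python
from typing import Dict

# Costs in USD as listed above for ingesting 1,583,278 pages.
COSTS: Dict[str, float] = {
    "Amazon DynamoDB": 0.55,
    "Amazon S3": 3.33,
    "OpenSearch Service": 14.71,
    "Step Functions": 17.92,
    "AWS Lambda": 28.95,
    "Amazon Textract": 1849.97,
}
PAGES = 1_583_278


def total_cost(costs: Dict[str, float]) -> float:
    return sum(costs.values())


def cost_per_thousand_pages(costs: Dict[str, float], pages: int) -> float:
    return total_cost(costs) / pages * 1000


def share(costs: Dict[str, float], service: str) -> float:
    """Fraction of the total cost attributable to one service."""
    return costs[service] / total_cost(costs)


print(f"total: ${total_cost(COSTS):,.2f}")
print(f"per 1,000 pages: ${cost_per_thousand_pages(COSTS, PAGES):.2f}")
print(f"Textract share: {share(COSTS, 'Amazon Textract'):.1%}")
```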

Modifications

Most likely, you want to modify the implementation and customize it for your use case and documents. The workshop Use machine learning to automate and process documents at scale presents a good overview of how to manipulate the actual workflows, change the flow, and add new components. To add custom fields to the OpenSearch index, look at the SetMetaData task in the workflow, which uses the set-manifest-meta-data-opensearch AWS Lambda function to add meta-data to the context. Any meta-data information added this way becomes a field in the OpenSearch index.

Cleaning up

To avoid incurring future costs, delete the example resources if you no longer need them by running the following command in the same environment as the cdk deploy command:

cdk destroy OpenSearchWorkflow

Beware that this removes everything, including the OpenSearch cluster, all documents, and the Amazon S3 bucket. If you want to keep that information, back up your Amazon S3 bucket and create an index snapshot from your OpenSearch cluster. If you processed many files, the cleanup function can time out, so you may have to empty the Amazon S3 bucket first using the AWS Management Console (after taking a backup or syncing the files to a different bucket if you want to retain them) and then destroy the AWS CloudFormation stack.

Conclusion

In this post, we showed you how to deploy a full stack solution to ingest a large number of documents into an OpenSearch index, where they are ready to be used for search use cases. We discussed the individual components of the implementation as well as scaling considerations, cost, and modification options. All code is available as open source on GitHub as IDP CDK samples and as IDP CDK constructs to build your own solutions from scratch. As a next step, you can start to modify the workflow, add information to the documents in the search index, and explore the IDP workshop. Please comment below on your experience and ideas to expand the current solution.


About the Author

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. Currently, he’s obsessed with extracting information from documents.


Semantic image search for articles using Amazon Rekognition, Amazon SageMaker foundation models, and Amazon OpenSearch Service


Digital publishers are continuously looking for ways to streamline and automate their media workflows in order to generate and publish new content as rapidly as they can.

Publishers can have repositories containing millions of images and in order to save money, they need to be able to reuse these images across articles. Finding the image that best matches an article in repositories of this scale can be a time-consuming, repetitive, manual task that can be automated. It also relies on the images in the repository being tagged correctly, which can also be automated (for a customer success story, refer to Aller Media Finds Success with KeyCore and AWS).

In this post, we demonstrate how to use Amazon Rekognition, Amazon SageMaker JumpStart, and Amazon OpenSearch Service to solve this business problem. Amazon Rekognition makes it easy to add image analysis capability to your applications without any machine learning (ML) expertise and comes with various APIs to fulfill use cases such as object detection, content moderation, face detection and analysis, and text and celebrity recognition, which we use in this example. SageMaker JumpStart is a low-code service that comes with pre-built solutions, example notebooks, and many state-of-the-art, pre-trained models from publicly available sources that are straightforward to deploy with a single click into your AWS account. These models have been packaged to be securely and easily deployable via Amazon SageMaker APIs. The new SageMaker JumpStart Foundation Hub allows you to easily deploy large language models (LLMs) and integrate them with your applications. OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch. OpenSearch Service allows you to store vectors and other data types in an index, and offers rich functionality that allows you to search for documents using vectors and measure their semantic relatedness, which we use in this post.

The end goal of this post is to show how we can surface a set of images that are semantically similar to some text, be that an article or TV synopsis.

The following screenshot shows an example of taking a mini article as your search input, rather than using keywords, and being able to surface semantically similar images.

Overview of solution

The solution is divided into two main sections. First, you extract label and celebrity metadata from the images using Amazon Rekognition. You then generate an embedding of the metadata using an LLM. You store the celebrity names and the embedding of the metadata in OpenSearch Service. In the second main section, you have an API to query your OpenSearch Service index for images using OpenSearch’s intelligent search capabilities to find images that are semantically similar to your text.

This solution uses our event-driven services Amazon EventBridge, AWS Step Functions, and AWS Lambda to orchestrate the process of extracting metadata from the images using Amazon Rekognition. The workflow makes two Amazon Rekognition API calls to extract labels and known celebrities from the image.

The Amazon Rekognition celebrity detection API returns a number of elements in the response. For this post, you use the following:

  • Name, Id, and Urls – The celebrity name, a unique Amazon Rekognition ID, and list of URLs such as the celebrity’s IMDb or Wikipedia link for further information.
  • MatchConfidence – A match confidence score that can be used to control API behavior. We recommend applying a suitable threshold to this score in your application to choose your preferred operating point. For example, by setting a threshold of 99%, you can eliminate more false positives but may miss some potential matches.

The second API call, to the Amazon Rekognition label detection API, also returns a number of elements in the response. You use the following:

  • Name – The name of the detected label
  • Confidence – The level of confidence in the label assigned to a detected object
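The thresholding described above can be sketched as two small filters over the API responses. This is an illustrative sketch rather than the solution's actual code: the function names and the 80% label threshold are assumptions, and the sample responses are hand-written stand-ins that contain only the fields listed above. In a real workflow, the responses would come from the boto3 Rekognition client's recognize_celebrities and detect_labels calls.

```python
def confident_celebrities(response, min_match_confidence=99.0):
    """Names of celebrities whose MatchConfidence meets the threshold."""
    return [
        face["Name"]
        for face in response.get("CelebrityFaces", [])
        if face.get("MatchConfidence", 0.0) >= min_match_confidence
    ]


def confident_labels(response, min_confidence=80.0):
    """Names of labels whose Confidence meets the (assumed) threshold."""
    return [
        label["Name"]
        for label in response.get("Labels", [])
        if label.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-written sample data for illustration only, not real API output.
sample_celebrity_response = {
    "CelebrityFaces": [
        {"Name": "Person A", "Id": "hypothetical-id-1", "Urls": [], "MatchConfidence": 99.6},
        {"Name": "Person B", "Id": "hypothetical-id-2", "Urls": [], "MatchConfidence": 71.2},
    ]
}
sample_label_response = {
    "Labels": [
        {"Name": "Car", "Confidence": 97.3},
        {"Name": "Road", "Confidence": 88.0},
        {"Name": "Tree", "Confidence": 54.9},
    ]
}

print(confident_celebrities(sample_celebrity_response))
print(confident_labels(sample_label_response))
```

Raising either threshold trades missed matches for fewer false positives, so the right operating point depends on how costly a wrong tag is in your image repository.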

A key concept in semantic search is embeddings. A word embedding is a numerical representation of a word or group of words, in the form of a vector. When you have many vectors, you can measure the distance between them, and vectors that are close in distance are semantically similar. Therefore, if you generate an embedding of all of your images’ metadata, and then generate an embedding of your text (an article or TV synopsis, for example) using the same model, you can then find images that are semantically similar to your given text.
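To make the idea of distance between vectors concrete, the following sketch computes cosine similarity in plain Python and ranks a few images against an article vector. The three-dimensional vectors and file names are toy values invented for illustration; real embeddings have thousands of dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_images(query_vector, image_vectors, k):
    """Return the k image names most similar to the query vector."""
    ranked = sorted(
        image_vectors,
        key=lambda name: cosine_similarity(query_vector, image_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, invented for illustration.
image_vectors = {
    "car_on_road.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "garage.jpg": [0.7, 0.3, 0.1],
}
article_vector = [1.0, 0.2, 0.0]

print(nearest_images(article_vector, image_vectors, k=2))
```

The two car-related images point in nearly the same direction as the article vector, so they rank ahead of the beach image even though no keyword is compared.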

There are many models available within SageMaker JumpStart to generate embeddings. For this solution, you use GPT-J 6B Embedding from Hugging Face. It produces high-quality embeddings and has one of the top performance metrics according to Hugging Face’s evaluation results. Amazon Bedrock is another option, still in preview, where you could choose the Amazon Titan Text Embeddings model to generate the embeddings.

You use the GPT-J pre-trained model from SageMaker JumpStart to create an embedding of the image metadata and store this as a k-NN vector in your OpenSearch Service index, along with the celebrity name in another field.

The second part of the solution is to return the top 10 images that are semantically similar to the user’s text, be this an article or TV synopsis, including any celebrities if present. When choosing an image to accompany an article, you want the image to resonate with the pertinent points from the article. SageMaker JumpStart hosts many summarization models that can take a long body of text and reduce it to its main points. For the summarization model, you use the AI21 Labs Summarize model. This model provides high-quality recaps of news articles, and the source text can contain roughly 10,000 words, which allows the user to summarize the entire article in one go.

To detect if the text contains any names, potentially known celebrities, you use Amazon Comprehend which can extract key entities from a text string. You then filter by the Person entity, which you use as an input search parameter.

Then you take the summarized article and generate an embedding to use as another input search parameter. It’s important to note that you use the same model deployed on the same infrastructure to generate the embedding of the article as you did for the images. You then use exact k-NN with a scoring script so that you can search by two fields: celebrity names and the vector that captures the semantic information of the article. For details on the scalability of the score script approach, and how it may lead to high latencies on large indexes, refer to the post Amazon OpenSearch Service’s vector database capabilities explained.

Walkthrough

The following diagram illustrates the solution architecture.

Following the numbered labels:

  1. You upload an image to an Amazon S3 bucket
  2. Amazon EventBridge listens to this event, and then triggers an AWS Step Functions execution
  3. The Step Functions workflow takes the image as input and extracts the label and celebrity metadata
  4. The AWS Lambda function takes the image metadata and generates an embedding
  5. The Lambda function then inserts the celebrity name (if present) and the embedding as a k-NN vector into an OpenSearch Service index
  6. Amazon S3 hosts a simple static website, served by an Amazon CloudFront distribution. The front-end user interface (UI) allows you to authenticate with the application using Amazon Cognito to search for images
  7. You submit an article or some text via the UI
  8. Another Lambda function calls Amazon Comprehend to detect any names in the text
  9. The function then summarizes the text to get the pertinent points from the article
  10. The function generates an embedding of the summarized article
  11. The function then searches the OpenSearch Service image index for any image matching the celebrity name and for the k-nearest neighbors of the vector using cosine similarity
  12. Amazon CloudWatch and AWS X-Ray give you observability into the end-to-end workflow to alert you of any issues

Extract and store key image metadata

The Amazon Rekognition DetectLabels and RecognizeCelebrities APIs give you the metadata from your images—text labels you can use to form a sentence to generate an embedding from. The article gives you a text input that you can use to generate an embedding.
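As a rough illustration of forming a sentence from the extracted labels, the following sketch joins celebrity names and labels into a single string that could be sent to the embedding model. The function name and the exact wording of the sentence are assumptions made for this example; the post does not specify the template the solution uses.

```python
def metadata_to_sentence(labels, celebrities):
    """Join extracted metadata into one text string for embedding.

    Hypothetical helper: the sentence template is an assumption.
    """
    parts = []
    if celebrities:
        parts.append("Celebrities: " + ", ".join(celebrities))
    if labels:
        parts.append("Labels: " + ", ".join(labels))
    return ". ".join(parts)


print(metadata_to_sentence(["Car", "Road", "Driving"], ["Person A"]))
```

Whatever template you choose, keep it consistent across all images so that differences between embeddings reflect the metadata rather than the wording around it.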

Generate and store word embeddings

The following figure demonstrates plotting the vectors of our images in a 2-dimensional space, where for visual aid, we have classified the embeddings by their primary category.

You also generate an embedding of this newly written article, so that you can search OpenSearch Service for the nearest images to the article in this vector space. Using the k-nearest neighbors (k-NN) algorithm, you define how many images to return in your results.

Zooming in on the preceding figure, the vectors are ranked based on their distance from the article, and the K-nearest images are returned, where K is 10 in this example.

OpenSearch Service offers the capability to store large vectors in an index, and also offers the functionality to run queries against the index using k-NN, such that you can query with a vector to return the k-nearest documents that have vectors in close distance using various measurements. For this example, we use cosine similarity.
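An index that stores such vectors needs a field of type knn_vector with a fixed dimension. The following sketch builds a possible index body as a Python dictionary. The field names image_vector and celebrities match the query snippet shown later in this post, but the helper name and the dimension of 4096 are assumptions; use the vector length that your embedding endpoint actually returns.

```python
def build_image_index_body(dimension):
    """Build an index definition with a k-NN vector field.

    Hypothetical helper; field names follow the search snippet in this post.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Embedding of the image metadata.
                "image_vector": {"type": "knn_vector", "dimension": dimension},
                # Celebrity names found in the image, searched with a match query.
                "celebrities": {"type": "text"},
            }
        },
    }


# 4096 is an assumed embedding length, not a value stated in this post.
index_body = build_image_index_body(dimension=4096)
print(index_body["mappings"]["properties"]["image_vector"])
```

The dimension is fixed when the index is created, so every document and every query vector must come from the same embedding model.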

Detect names in the article

You use Amazon Comprehend, an AI natural language processing (NLP) service, to extract key entities from the article. In this example, you use Amazon Comprehend to extract entities and filter by the entity Person, which returns any names that Amazon Comprehend can find in the journalist’s story, with just a few lines of code:

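import boto3

comprehend_client = boto3.client("comprehend")

def get_celebrities(payload):
    # Detect all entities in the article text.
    response = comprehend_client.detect_entities(
        Text=" ".join(payload["text_inputs"]),
        LanguageCode="en",
    )
    # Keep only PERSON entities and return them as one space-separated string.
    names = [e["Text"] for e in response["Entities"] if e["Type"] == "PERSON"]
    return " ".join(names)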

In this example, you upload an image to Amazon Simple Storage Service (Amazon S3), which triggers a workflow where you are extracting metadata from the image including labels and any celebrities. You then transform that extracted metadata into an embedding and store all of this data in OpenSearch Service.

Summarize the article and generate an embedding

Summarizing the article is an important step to make sure that the word embedding is capturing the pertinent points of the article, and therefore returning images that resonate with the theme of the article.

The AI21 Labs Summarize model is simple to use, requiring no prompt and just a few lines of code:

import os

import ai21

def summarise_article(payload):
    # Name of the SageMaker endpoint hosting the AI21 Summarize model.
    sagemaker_endpoint_summarise = os.environ["SAGEMAKER_ENDPOINT_SUMMARIZE"]
    response = ai21.Summarize.execute(
        source=payload,
        sourceType="TEXT",
        destination=ai21.SageMakerDestination(sagemaker_endpoint_summarise),
    )
    return response.summary

You then use the GPT-J model to generate the embedding:

import json
import os

import boto3

sm_runtime_client = boto3.client("sagemaker-runtime")

def get_vector(payload_summary):
    # Name of the SageMaker endpoint hosting the GPT-J embedding model.
    sagemaker_endpoint = os.environ["SAGEMAKER_ENDPOINT_VECTOR"]
    response = sm_runtime_client.invoke_endpoint(
        EndpointName=sagemaker_endpoint,
        ContentType="application/json",
        Body=json.dumps(payload_summary).encode("utf-8"),
    )
    response_body = json.loads(response["Body"].read())
    return response_body["embedding"][0]

You then search OpenSearch Service for your images

The following is an example snippet of that query:

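import awswrangler as wr

# os_client is assumed to be an OpenSearch client created elsewhere,
# for example with wr.opensearch.connect().
def search_document_celeb_context(person_names, vector):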
    results = wr.opensearch.search(
        client=os_client,
        index="images",
        search_body={
            "size": 10,
            "query": {
                "script_score": {
                    "query": {
                        "match": {"celebrities": person_names }
                    },
                    "script": {
                        "lang": "knn",
                        "source": "knn_score",
                        "params": {
                            "field": "image_vector",
                            "query_value": vector,
                            "space_type": "cosinesimil"
                        }
                    }
                }
            }
        },
    )
    return results.drop(columns=["image_vector"]).to_dict()

The architecture contains a simple web app to represent a content management system (CMS).

For an example article, we used the following input:

“Werner Vogels loved travelling around the globe in his Toyota. We see his Toyota come up in many scenes as he drives to go and meet various customers in their home towns.”

None of the images have any metadata with the word “Toyota,” but the semantics of the word “Toyota” are synonymous with cars and driving. Therefore, with this example, we can demonstrate how we can go beyond keyword search and return images that are semantically similar. In the above screenshot of the UI, the caption under the image shows the metadata Amazon Rekognition extracted.

You could include this solution in a larger workflow where you use the metadata you already extracted from your images to start using vector search along with other key terms, such as celebrity names, to return the best resonating images and documents for your search query.

Conclusion

In this post, we showed how you can use Amazon Rekognition, Amazon Comprehend, SageMaker, and OpenSearch Service to extract metadata from your images and then use ML techniques to discover them automatically using celebrity and semantic search. This is particularly important within the publishing industry, where speed matters in getting fresh content out quickly and to multiple platforms.

For more information about working with media assets, refer to Media intelligence just got smarter with Media2Cloud 3.0.


About the Author

Mark Watkins is a Solutions Architect within the Media and Entertainment team, helping his customers solve many data and ML problems. Away from professional life, he loves spending time with his family and watching his two little ones growing up.


Improving asset health and grid resilience using machine learning


This post is co-written with Travis Bronson and Brian L Wilkerson from Duke Energy.

Machine learning (ML) is transforming every industry, process, and business, but the path to success is not always straightforward. In this blog post, we demonstrate how Duke Energy, a Fortune 150 company headquartered in Charlotte, NC, collaborated with the AWS Machine Learning Solutions Lab (MLSL) to use computer vision to automate the inspection of wooden utility poles and help prevent power outages, property damage and even injuries.

The electric grid is made up of poles, lines and power plants to generate and deliver electricity to millions of homes and businesses. These utility poles are critical infrastructure components and subject to various environmental factors such as wind, rain and snow, which can cause wear and tear on assets. It’s critical that utility poles are regularly inspected and maintained to prevent failures that can lead to power outages, property damage and even injuries. Most power utility companies, including Duke Energy, use manual visual inspection of utility poles to identify anomalies related to their transmission and distribution network. But this method can be costly and time-consuming, and it requires that power transmission lineworkers follow rigorous safety protocols.

Duke Energy has used artificial intelligence in the past to create efficiencies in day-to-day operations to great success. The company has used AI to inspect generation assets and critical infrastructure and has been exploring opportunities to apply AI to the inspection of utility poles as well. Over the course of the AWS Machine Learning Solutions Lab engagement with Duke Energy, the utility progressed its work to automate the detection of anomalies in wood poles using advanced computer vision techniques.

Goals and use case

The goal of this engagement between Duke Energy and the Machine Learning Solutions Lab is to leverage machine learning to inspect hundreds of thousands of high-resolution aerial images to automate the identification and review process of all wood pole-related issues across 33,000 miles of transmission lines. This goal will further help Duke Energy to improve grid resiliency and comply with government regulations by identifying the defects in a timely manner. It will also reduce fuel and labor costs, as well as reduce carbon emissions by minimizing unnecessary truck rolls. Finally, it will also improve safety by minimizing miles driven, poles climbed and physical inspection risks associated with compromising terrain and weather conditions.

In the following sections, we present the key challenges associated with developing robust and efficient models for anomaly detection related to wood utility poles. We also describe the key challenges and suppositions associated with various data preprocessing techniques employed to achieve the desired model performance. Next, we present the key metrics used for evaluating the model performance along with the evaluation of our final models. And finally, we compare various state-of-the-art supervised and unsupervised modeling techniques.

Challenges

One of the key challenges associated with training a model for detecting anomalies using aerial images is the non-uniform image sizes. The following figure shows the distribution of image height and width of a sample data set from Duke Energy. The images vary widely in size. The absolute size of the images also poses a significant challenge: the input images are thousands of pixels wide and thousands of pixels tall, which is not ideal for training a model to identify small anomalous regions in the image.

Distribution of image height and width for a sample data set


Also, the input images contain a large amount of irrelevant background information such as vegetation, cars, farm animals, etc. The background information could result in suboptimal model performance. Based on our assessment, only 5% of the image contains the wood poles, and the anomalies are even smaller. This is a major challenge for identifying and localizing anomalies in the high-resolution images. Anomalies are also rare: only 0.12% of images in the entire data set are anomalous (i.e., 1.2 anomalous images out of every 1,000). Finally, there is no labeled data available for training a supervised machine learning model. Next, we describe how we address these challenges and explain our proposed method.

Solution overview

Modeling techniques

The following figure demonstrates our image processing and anomaly detection pipeline. We first imported the data into Amazon Simple Storage Service (Amazon S3) using Amazon SageMaker Studio. We further employed various data processing techniques to address some of the challenges highlighted above to improve the model performance. After data preprocessing, we employed Amazon Rekognition Custom Labels for data labeling. The labeled data is further used to train supervised ML models such as Vision Transformer, Amazon Lookout for Vision, and AutoGluon for anomaly detection.

Image processing and anomaly detection pipeline


The following figure demonstrates the detailed overview of our proposed approach that includes the data processing pipeline and various ML algorithms employed for anomaly detection. First, we will describe the steps involved in the data processing pipeline. Next, we will explain the details and intuition related to various modeling techniques employed during this engagement to achieve the desired performance goals.

Data preprocessing

The proposed data preprocessing pipeline includes data standardization, identification of region of interest (ROI), data augmentation, data segmentation, and finally data labeling. The purpose of each step is described below:

Data standardization

The first step in our data processing pipeline is data standardization. In this step, each image is cropped and divided into non-overlapping patches of 224 x 224 pixels. The goal of this step is to generate patches of uniform size that can be used to train an ML model and to localize the anomalies in high-resolution images.
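The patching step can be sketched as computing a grid of crop boxes for each image. This is an illustrative sketch, not the code used in the engagement: the function name is hypothetical, and discarding partial patches at the right and bottom edges is an assumption, because the post does not say how image borders were handled.

```python
def patch_boxes(width, height, patch=224):
    """Return (left, top, right, bottom) boxes for non-overlapping patches.

    Partial patches at the right and bottom edges are dropped (an assumption).
    The box layout matches what image libraries such as Pillow expect for crop().
    """
    boxes = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            boxes.append((left, top, left + patch, top + patch))
    return boxes


# Example: a 1000 x 500 pixel image yields a 4 x 2 grid of full patches.
boxes = patch_boxes(1000, 500)
print(len(boxes), boxes[0], boxes[-1])
```

Because every patch has a known position in the original image, a patch flagged as anomalous can be mapped back to its location on the pole.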

Identification of region of interest (ROI)

The input data consists of high-resolution images containing a large amount of irrelevant background information (i.e., vegetation, houses, cars, horses, cows, etc.). Our goal is to identify anomalies related to wood poles. To identify the ROI (i.e., patches containing the wood pole), we employed Amazon Rekognition Custom Labels. We trained an Amazon Rekognition Custom Labels model using 3k labeled images containing both ROI and background images. The goal of the model is to do a binary classification between the ROI and background images. The patches identified as background information are discarded, while the crops predicted as ROI are used in the next step. The following figure demonstrates the pipeline that identifies the ROI. We cut a sample of 1,110 wood pole images into non-overlapping crops, which produced 244,673 crops. We then used these crops as input to an Amazon Rekognition custom model that identified 11,356 crops as ROI. Finally, we manually verified each of these 11,356 patches. During the manual inspection, we found that the model correctly predicted 10,969 wood patches out of 11,356 as ROI. In other words, the model achieved roughly 96% precision (10,969 / 11,356).

Identification of region of interest


Data labeling

During the manual inspection of the images, we also labeled each image. The labels include wood patch, non-wood patch, non-structure, and wood patch with anomalies. The following figure demonstrates the nomenclature of the images using Amazon Rekognition Custom Labels.

Data augmentation

Given the limited amount of labeled data that was available for training, we augmented the training data set by making horizontal flips of all of the patches. This had the effective impact of doubling the size of our data set.
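A horizontal flip simply reverses the pixel order within each row of a patch. The following minimal sketch shows the operation and the resulting doubling on patches represented as nested lists; in practice an image library would do this, and the function names here are hypothetical.

```python
def horizontal_flip(patch):
    """Mirror a patch left to right; a patch is a list of pixel rows."""
    return [list(reversed(row)) for row in patch]


def augment_with_flips(patches):
    """Return the original patches followed by their mirrored copies."""
    return patches + [horizontal_flip(p) for p in patches]


# Two tiny 2 x 3 "patches" with made-up pixel values.
patches = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 0, 1]],
]
augmented = augment_with_flips(patches)
print(len(patches), len(augmented))
```

Each mirrored patch keeps the label of its original, since flipping a pole patch does not change whether it contains an anomaly.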

Segmentation

We labeled the objects in 600 images (poles, wires, and metal railing) using the bounding box object detection labeling tool in Amazon Rekognition Custom Labels and trained a model to detect the three main objects of interest. We used the trained model to remove the background from all the images, by identifying and extracting the poles in each image while removing all other objects as well as the background. The resulting dataset had fewer images than the original data set, as a result of removing all images that don’t contain wood poles. In addition, a false positive image was also removed from the dataset.

Anomaly detection

Next, we use the preprocessed data to train machine learning models for anomaly detection. We employed three different approaches: AWS managed machine learning services (Amazon Lookout for Vision [L4V] and Amazon Rekognition), AutoGluon, and a Vision Transformer-based self-distillation method.

AWS Services

Amazon Lookout for Vision (L4V)

Amazon Lookout for Vision is a managed AWS service that enables swift training and deployment of ML models and provides anomaly detection capabilities. It requires fully labelled data, which we provided by pointing to the image paths in Amazon S3. Training the model is as simple as a single API (application programming interface) call or console button click, and L4V takes care of model selection and hyperparameter tuning under the hood.

Amazon Rekognition

Amazon Rekognition is a managed AI/ML service similar to L4V, which hides modelling details and provides many capabilities such as image classification, object detection, custom labelling, and more. Its built-in models can be applied to previously known entities in images (e.g., from ImageNet or other large open datasets). However, we used the Amazon Rekognition Custom Labels functionality to train the ROI detector, as well as an anomaly detector, on Duke Energy’s specific images. We also used Amazon Rekognition Custom Labels to train a model to put bounding boxes around wood poles in each image.

AutoGluon

AutoGluon is an open-source machine learning library developed by Amazon. AutoGluon includes a multi-modal component that allows easy training on image data. We used AutoGluon Multi-modal to train models on the labelled image patches to establish a baseline for identifying anomalies.

Vision Transformer

Many of the most exciting new AI breakthroughs have come from two recent innovations: self-supervised learning, which allows machines to learn from random, unlabeled examples; and Transformers, which enable AI models to selectively focus on certain parts of their input and thus reason more effectively. Both methods have been a sustained focus for the machine learning community, and we’re pleased to share that we used them in this engagement.

In particular, working in collaboration with researchers at Duke Energy, we used pre-trained self-distillation ViT (Vision Transformer) models as feature extractors for the downstream anomaly detection application using Amazon SageMaker. The pre-trained self-distillation vision transformer models are trained in a self-supervised manner, using Amazon SageMaker, on a large amount of training data stored on Amazon S3. We leverage the transfer learning capabilities of ViT models pre-trained on large-scale datasets (e.g., ImageNet). This helped us achieve a recall of 83% on an evaluation set using only a few thousand labeled images for training.

Evaluation metrics

The following figure shows the key metrics used to evaluate model performance and its impacts. The key goal of the model is to maximize anomaly detection (i.e., true positives) and minimize the number of false negatives, that is, cases where anomalies that could lead to outages are misclassified as normal.

Once the anomalies are identified, technicians can address them, preventing future outages and ensuring compliance with government regulations. There’s another benefit to minimizing false positives: you avoid the unnecessary effort of going through images again.

Keeping these metrics in mind, we track model performance using the following metrics, which encapsulate all four metrics defined above.

Precision

The percent of anomalies detected that are actual anomalies for objects of interest. Precision measures how well our algorithm identifies only anomalies. For this use case, high precision means low false alarms (i.e., the algorithm falsely identifies a woodpecker hole while there isn’t any in the image).

Recall

The percent of all anomalies that are recovered for each object of interest. Recall measures how well we identify all anomalies. This set captures some percentage of the full set of anomalies, and that percentage is the recall. For this use case, high recall means that we’re good at catching woodpecker holes when they occur. Recall is therefore the right metric to focus on in this POC because false alarms are at best annoying while missed anomalies could lead to serious consequences if left unattended.

Lower recall can lead to outages and government regulation violations, while lower precision leads to wasted human effort. The primary goal of this engagement is to identify all the anomalies to comply with government regulation and avoid any outage, hence we prioritize improving recall over precision.

Evaluation and model comparison

In the following section, we compare the modeling techniques employed during this engagement. We evaluated the performance of two AWS services: Amazon Rekognition and Amazon Lookout for Vision. We also evaluated various modeling techniques using AutoGluon. Finally, we compare the performance with the state-of-the-art ViT-based self-distillation method.

The following figure shows the improvement of the AutoGluon model using different data processing techniques over the course of this engagement. The key observation is that as we improved data quality and quantity, recall improved from below 30% to 78%.

Next, we compare the performance of AutoGluon with AWS services. We also employed various data processing techniques that helped improve the performance. However, the major improvement came from increasing the data quantity and quality. We increased the dataset size from 11K images in total to 60K images.

Next, we compare the performance of AutoGluon and AWS services with the ViT-based method. The following figure demonstrates that the ViT-based method, AutoGluon, and AWS services performed on par in terms of recall. One key observation is that, beyond a certain point, increasing data quality and quantity does not improve recall. However, we do observe improvements in precision.

Precision versus recall comparison

Amazon AutoGluon      Predicted anomalies    Predicted normal
Anomalies             15,600                 4,400
Normal                3,659                  38,341

Next, we present the confusion matrices for AutoGluon, Amazon Rekognition, and the ViT-based method on our dataset of 62 K samples. Of these, 20 K samples are anomalous and the remaining 42 K are normal. The ViT-based method captures the largest number of anomalies (16,600), followed by Amazon Rekognition (16,000) and Amazon AutoGluon (15,600). Similarly, Amazon AutoGluon has the fewest false positives (3,659 images), followed by Amazon Rekognition (5,918) and ViT (15,323). These results demonstrate that Amazon Rekognition achieves the highest AUC (area under the curve).

Amazon Rekognition    Predicted anomalies    Predicted normal
Anomalies             16,000                 4,000
Normal                5,918                  36,082

ViT                   Predicted anomalies    Predicted normal
Anomalies             16,600                 3,400
Normal                15,323                 26,677
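
The recall and precision figures for each method follow directly from the confusion matrices above. The following is a minimal sketch of that arithmetic in plain Python; the dictionary layout and the function names are illustrative choices and are not taken from the project code.

```python
# Confusion-matrix counts copied from the tables above.
# tp: anomalies predicted as anomalies, fn: anomalies predicted as normal,
# fp: normal images predicted as anomalies, tn: normal images predicted as normal.
confusion = {
    "AutoGluon": {"tp": 15600, "fn": 4400, "fp": 3659, "tn": 38341},
    "Amazon Rekognition": {"tp": 16000, "fn": 4000, "fp": 5918, "tn": 36082},
    "ViT": {"tp": 16600, "fn": 3400, "fp": 15323, "tn": 26677},
}


def recall(counts):
    """Fraction of true anomalies that the model flagged."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def precision(counts):
    """Fraction of flagged images that are true anomalies."""
    return counts["tp"] / (counts["tp"] + counts["fp"])


metrics = {
    name: {"recall": recall(counts), "precision": precision(counts)}
    for name, counts in confusion.items()
}

for name, m in metrics.items():
    print(f"{name}: recall={m['recall']:.0%}, precision={m['precision']:.0%}")
```

Because each method is evaluated on the same 20 K anomalous and 42 K normal images, the recall values are directly comparable across the three tables.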

Conclusion

In this post, we showed you how the MLSL and Duke Energy teams worked together to develop a computer vision-based solution to automate anomaly detection in wood poles using high-resolution images collected via helicopter flights. The proposed solution employed a data processing pipeline to crop the high-resolution images for size standardization. The cropped images are further processed using Amazon Rekognition Custom Labels to identify the region of interest (i.e., crops containing the patches with poles). Amazon Rekognition achieved 96% precision in correctly identifying the patches with poles. The ROI crops are then used for anomaly detection with a ViT-based self-distillation model, AutoGluon, and AWS services. We used a standard dataset to evaluate the performance of all three methods. The ViT-based model achieved 83% recall and 52% precision. AutoGluon achieved 78% recall and 81% precision. Finally, Amazon Rekognition achieved 80% recall and 73% precision. The goal of using three different methods is to compare the performance of each method with different numbers of training samples, training time, and deployment time. All these methods take less than 2 hours to train and deploy using a single A100 GPU instance or managed services on AWS. Next steps for further improving model performance include adding more training data to improve model precision.

Overall, the end-to-end pipeline proposed in this post helps achieve significant improvements in anomaly detection while minimizing operations costs, safety incidents, regulatory risks, carbon emissions, and potential power outages.

The solution developed can be employed for other anomaly detection and asset health-related use cases across transmission and distribution networks, including defects in insulators and other equipment. For further assistance in developing and customizing this solution, please feel free to get in touch with the MLSL team.


About the Authors

Travis Bronson is a Lead Artificial Intelligence Specialist with 15 years of experience in technology and 8 years specifically dedicated to artificial intelligence. Over his 5-year tenure at Duke Energy, Travis has advanced the application of AI for digital transformation by bringing unique insights and creative thought leadership to his company’s leading edge. Travis currently leads the AI Core Team, a community of AI practitioners, enthusiasts, and business partners focused on advancing AI outcomes and governance. Travis gained and refined his skills in multiple technological fields, starting in the US Navy and US Government, then transitioning to the private sector after more than a decade of service.

Brian Wilkerson is an accomplished professional with two decades of experience at Duke Energy. With a degree in computer science, he has spent the past 7 years excelling in the field of Artificial Intelligence. Brian is a co-founder of Duke Energy’s MADlab (Machine Learning, AI and Deep learning team). He currently holds the position of Director of Artificial Intelligence & Transformation at Duke Energy, where he is passionate about delivering business value through the implementation of AI.

Ahsan Ali is an Applied Scientist at the Amazon Generative AI Innovation Center, where he works with customers from different domains to solve their urgent and expensive problems using Generative AI.

Tahin Syed is an Applied Scientist with the Amazon Generative AI Innovation Center, where he works with customers to help realize business outcomes with generative AI solutions. Outside of work, he enjoys trying new food, traveling, and teaching taekwondo.

Dr. Nkechinyere N. Agu is an Applied Scientist in the Generative AI Innovation Center at AWS. Her expertise is in Computer Vision AI/ML methods, Applications of AI/ML to healthcare, as well as the integration of semantic technologies (Knowledge Graphs) in ML solutions. She has a Masters and a Doctorate in Computer Science.

Aldo Arizmendi is a Generative AI Strategist in the AWS Generative AI Innovation Center based out of Austin, Texas. Having received his B.S. in Computer Engineering from the University of Nebraska-Lincoln, over the last 12 years, Mr. Arizmendi has helped hundreds of Fortune 500 companies and start-ups transform their business using advanced analytics, machine learning, and generative AI.

Stacey Jenks is a Principal Analytics Sales Specialist at AWS, with more than two decades of experience in Analytics and AI/ML. Stacey is passionate about diving deep on customer initiatives and driving transformational, measurable business outcomes with data. She is especially enthusiastic about the mark that utilities will make on society, via their path to a greener planet with affordable, reliable, clean energy.

Mehdi Noor is an Applied Science Manager at the Generative AI Innovation Center. With a passion for bridging technology and innovation, he assists AWS customers in unlocking the potential of Generative AI, turning potential challenges into opportunities for rapid experimentation and innovation by focusing on scalable, measurable, and impactful uses of advanced AI technologies, and streamlining the path to production.

Understanding social biases through the text-to-image generation lens

This research paper was presented at the Sixth AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES), a premier forum for discussion on the societal and ethical aspects of artificial intelligence.

The rise of text-to-image (T2I) generation has ushered in a new era of innovation, offering a broad spectrum of possibilities for creators, designers, and the everyday users of productivity software. This technology can transform descriptive text into remarkably realistic visual content, empowering users to enrich their work with vivid illustrative elements. However, beneath this innovation lies a notable concern—the potential inclusion of harmful societal biases.

These T2I models create images from the extensive web data on which they had been trained, and this data often lacks representation of different demographic groups and cultures and can even harbor harmful content. When these societal biases seep into AI-generated content, they perpetuate and amplify pre-existing societal problems, reinforcing them and creating a disconcerting cycle that undermines previous and current mitigation efforts.

Representation of gender, race, and age across occupations and personality traits

To tackle this problem, it is essential to rigorously evaluate these models across a variety of demographic factors and scenarios. In our paper, “Social Biases through the Text-to-Image Generation Lens,” presented at AIES 2023, we conduct a thorough analysis for studying and quantifying common societal biases reflected in generated images. We focus on the portrayal of occupations, personality traits, and everyday situations across representations of gender, age, race, and geographical location.

For example, consider images that reinforce societal biases for the roles of CEO and housekeeper. These professions have been extensively studied as examples of stereotypical gender biases—where predominantly men are CEOs and women are housekeepers. For all such cases, we observed three different perspectives: 

  1. Real-world distribution: Relies on labor statistics, presenting distribution across various dimensions, such as gender, race, and age.
  2. Search engine results: Captures the distribution evident in search engine outcomes, reflecting contemporary portrayals.
  3. Image generation results: Emphasizes the distribution observed in image generation outputs.

We tested two T2I generators, DALLE-v2 and Stable Diffusion, and compared them with 2022 data from the U.S. Bureau of Labor Statistics and results for a Google image search conducted in 2020, examining how women are represented across five different occupations. The analysis of generation models revealed a significant setback in representational fairness compared with data sourced from the U.S. Bureau of Labor Statistics (BLS) and a Google image search (GIS) based on geographically referenced information. Notably, images generated by DALLE-v2 provide minimal representation of women in the professions of CEO and computer programmer. Conversely, in images generated by Stable Diffusion, women are represented in the roles of nurses and housekeepers 100% of the time. Figure 1 illustrates our findings, and Figure 2 shows examples of images generated to show different occupations.

Figure 1. Gender representation for DALLE-v2, Stable Diffusion, Google Image Search 2020, and BLS data. 
Figure 2. A sample of the first four images generated for the professions of “computer programmer” and “housekeeper” using the DALL-E v2 and Stable Diffusion models. Notably, one gender is conspicuously absent across a distribution of 500 generated images. 

Even when using basic prompts like “person” without including an occupation, we observed that models can underrepresent certain demographic groups across age, race, and gender. When we analyzed DALLE-v2 and Stable Diffusion, both offered a limited representation of races other than white across a set of 500 generated images. Furthermore, the DALLE-v2 outputs revealed a remarkable lack of age diversity, with over 80% of the images depicting either adults who appeared to be between the ages 18 and 40 or children. This is illustrated in Figure 3.

Three charts showing gender, race, and age distribution as interpreted by human annotators for DALL-E v2 and Stable Diffusion models.
Figure 3. Gender, race, and age distribution as interpreted by human annotators and automated face processing within the context of image generation for the prompt “person.” 

Our study also examines biases of similar representations across positive and negative personality traits, revealing the subtleties of how these traits are depicted. While individuals of nonwhite races appear linked with positive attributes such as vigor, ambition, striving, and independence, they are also associated with negative traits like detachment, hardheartedness, and conceitedness. 

Representation of geographical locations in everyday scenarios 

Another aspect of bias that we studied pertains to the representation of diverse geographical locations in how models interpret everyday scenarios. We did this using such prompts as “a photo of a birthday party” or “a photo of a library.” Although it is difficult to discern the precise location of a generated photo, distinctions in these representations can still be measured between using a general prompt and a prompt that specifies a location, for example, “a photo of a birthday party in Colombia.” In the paper, we describe this experiment for the two most populous countries in each inhabited continent, considering everyday scenarios centering around events, places, food, institutions, community, and clothing. When models were given a general prompt, overall results indicated that images generated for countries like Nigeria, Papua New Guinea, and Ethiopia had the greatest difference between the prompt and the image, while images generated for Germany, the US, and Russia were the closest aligned to the general prompt. 

Subtle effects of using expanded prompts 

Many bias mitigation techniques rely on expanding the prompt to enrich and diversify the images that models generate. To tackle bias in AI-generated images, we applied prompt engineering to increase the likelihood that the image will reflect what’s specified in the prompt. We used prompt expansion, a type of prompt engineering, to add further descriptors to the initial general prompts and guide the model towards unbiased content. An example of prompt expansion would be “a portrait of a female doctor” instead of “a portrait of a doctor.” Our experiments proved that prompt expansion is predominantly effective in creating more specified content in AI-generated images. However, there are also unintended outcomes, particularly in terms of decreased diversity and image quality, as shown in Figure 4.

Examples of generation output from DALL-E v2 for two prompts: “a portrait of an announcer” and “a portrait of a female announcer.”
Figure 4. Expanded prompts using descriptors like “female” can indeed yield more diverse depictions, but often at the cost of image variety and quality. 

Safeguarding against bias in T2I models

As T2I generation models become increasingly integrated into our digital ecosystems, it is paramount that we remain vigilant to the biases they may inadvertently perpetuate. This research underscores the profound importance of continually evaluating and refining these models. We hope that the outcomes and methodology presented in this study provide valuable insights for evaluating and building new generative models. We would like to emphasize the importance of fostering responsible development and ensuring representational fairness in this process. 

The post Understanding social biases through the text-to-image generation lens appeared first on Microsoft Research.

A novel computational fluid dynamics framework for turbulent flow research

Turbulence is ubiquitous in environmental and engineering fluid flows, and is encountered routinely in everyday life. A better understanding of these turbulent processes could provide valuable insights across a variety of research areas — improving the prediction of cloud formation by atmospheric transport and the spreading of wildfires by turbulent energy exchange, understanding sedimentation of deposits in rivers, and improving the efficiency of combustion in aircraft engines to reduce emissions, to name a few. However, despite its importance, our current understanding and our ability to reliably predict such flows remains limited. This is mainly attributed to the highly chaotic nature and the enormous spatial and temporal scales these fluid flows occupy, ranging from energetic, large-scale movements on the order of several meters on the high-end, where energy is injected into the fluid flow, all the way down to micrometers (μm) on the low-end, where the turbulence is dissipated into heat by viscous friction.

A powerful tool to understand these turbulent flows is the direct numerical simulation (DNS), which provides a detailed representation of the unsteady three-dimensional flow-field without making any approximations or simplifications. More specifically, this approach utilizes a discrete grid with small enough grid spacing to capture the underlying continuous equations that govern the dynamics of the system (in this case, variable-density Navier-Stokes equations, which govern all fluid flow dynamics). When the grid spacing is small enough, the discrete grid points are enough to represent the true (continuous) equations without the loss of accuracy. While this is attractive, such simulations require tremendous computational resources in order to capture the correct fluid-flow behaviors across such a wide range of spatial scales.

The actual span in spatial resolution to which direct numerical calculations must be applied depends on the task and is determined by the Reynolds number, which compares inertial to viscous forces. Typically, the Reynolds number can range from 10^2 up to 10^7 (even larger for atmospheric or interstellar problems). In 3D, the grid size for the resolution required scales roughly with the Reynolds number to the power of 4.5! Because of this strong scaling dependency, simulating such flows is generally limited to flow regimes with moderate Reynolds numbers, and typically requires access to high-performance computing systems with millions of CPU/GPU cores.
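
To get a feel for how steep this scaling is, the short sketch below evaluates the growth in grid size implied by the stated exponent of 4.5. The function name is illustrative, and the calculation takes the power law at face value while ignoring any constant prefactor.

```python
def grid_growth_factor(reynolds_ratio, exponent=4.5):
    """Factor by which the 3D grid size grows when the Reynolds number
    is multiplied by `reynolds_ratio`, assuming grid size ~ Re ** exponent."""
    return reynolds_ratio ** exponent


# Doubling the Reynolds number vs. increasing it tenfold.
double_re = grid_growth_factor(2.0)
tenfold_re = grid_growth_factor(10.0)

print(f"2x Reynolds number  -> grid grows by about {double_re:,.1f}x")
print(f"10x Reynolds number -> grid grows by about {tenfold_re:,.0f}x")
```

Under this scaling, doubling the Reynolds number multiplies the grid size by more than twenty, which is why simulations are usually restricted to moderate Reynolds numbers.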

In “A TensorFlow simulation framework for scientific computing of fluid flows on tensor processing units”, we introduce a new simulation framework that enables the computation of fluid flows with TPUs. By leveraging the latest advances in TensorFlow software and TPU hardware architecture, this software tool allows detailed large-scale simulations of turbulent flows at unprecedented scale, pushing the boundaries of scientific discovery and turbulence analysis. We demonstrate that this framework scales efficiently, either to accommodate larger problems or, alternatively, to improve run times, which is remarkable since most large-scale distributed computation frameworks exhibit reduced efficiency with scaling. The software is available as an open-source project on GitHub.

Large-scale scientific computation with accelerators

The software solves variable-density Navier-Stokes equations on TPU architectures using the TensorFlow framework. The single-instruction, multiple-data (SIMD) approach is adopted for parallelization of the TPU solver implementation. The finite difference operators on a colocated structured mesh are cast as filters of the convolution function of TensorFlow, leveraging TPU’s matrix multiply unit (MXU). The framework takes advantage of the low-latency, high-bandwidth inter-chip interconnect (ICI) between the TPU accelerators. In addition, by leveraging single-precision floating-point computations and highly optimized executables generated by the accelerated linear algebra (XLA) compiler, it’s possible to perform large-scale simulations with excellent scaling on TPU hardware architectures.
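
The idea of expressing a finite difference operator as a convolution filter can be illustrated without any accelerator. The sketch below is plain Python rather than the framework's TensorFlow code: it applies a three-point central-difference stencil as a sliding filter over a periodic 1-D signal. The function names, the periodic boundary treatment, and the choice of a second-order stencil are assumptions made for illustration.

```python
import math


def periodic_correlate(values, kernel):
    """Slide an odd-length kernel over a periodic 1-D signal.

    Each output point is the weighted sum of its neighbors, which is the
    operation a convolution layer applies with a fixed filter.
    """
    n, half = len(values), len(kernel) // 2
    return [
        sum(kernel[j] * values[(i + j - half) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def central_difference(values, dx):
    """First derivative via the stencil (f[i+1] - f[i-1]) / (2 * dx)."""
    kernel = [-1.0 / (2.0 * dx), 0.0, 1.0 / (2.0 * dx)]
    return periodic_correlate(values, kernel)


# Differentiate sin(x) on a periodic grid and compare with cos(x).
n = 64
dx = 2.0 * math.pi / n
f = [math.sin(i * dx) for i in range(n)]
df = central_difference(f, dx)
max_error = max(abs(df[i] - math.cos(i * dx)) for i in range(n))
print(f"max error vs. cos(x): {max_error:.2e}")
```

Writing the stencil as filter weights is what lets the same operation be handed to a hardware-accelerated convolution routine, with wider kernels giving higher-order schemes.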

This research effort demonstrates that graph-based TensorFlow, in combination with new types of ML special-purpose hardware, can be used as a programming paradigm to solve partial differential equations representing multiphysics flows. The latter is achieved by augmenting the Navier-Stokes equations with physical models to account for chemical reactions, heat transfer, and density changes to enable, for example, simulations of cloud formation and wildfires.

It’s worth noting that this framework is the first open-source computational fluid dynamics (CFD) framework for high-performance, large-scale simulations to fully leverage the cloud accelerators that have become common (and become a commodity) with the advancement of machine learning (ML) in recent years. While our work focuses on using TPU accelerators, the code can be easily adjusted for other accelerators, such as GPU clusters.

This framework demonstrates a way to greatly reduce the cost and turn-around time associated with running large-scale scientific CFD simulations and enables even greater iteration speed in fields such as climate and weather research. Since the framework is implemented using TensorFlow, an ML language, it also enables ready integration with ML methods and allows the exploration of ML approaches on CFD problems. With the general accessibility of TPU and GPU hardware, this approach lowers the barrier for researchers to contribute to our understanding of large-scale turbulent systems.

Framework validation and homogeneous isotropic turbulence

Beyond demonstrating the performance and the scaling capabilities, it is also critical to validate the correctness of this framework to ensure that when it is used for CFD problems, we get reasonable results. For this purpose, researchers typically use idealized benchmark problems during CFD solver development, many of which we adopted in our work (more details in the paper).

One such benchmark for turbulence analysis is homogeneous isotropic turbulence (HIT), which is a canonical and well studied flow in which the statistical properties, such as kinetic energy, are invariant under translations and rotations of the coordinate axes. By pushing the resolution to the limits of the current state of the art, we were able to perform direct numerical simulations with more than eight billion degrees of freedom — equivalent to a three-dimensional mesh with 2,048 grid points along each of the three directions. We used 512 TPU-v4 cores, distributing the computation of the grid points along the x, y, and z axes to a distribution of [2,2,128] cores, respectively, optimized for the performance on TPU. The wall clock time per timestep was around 425 milliseconds and the flow was simulated for a total of 400,000 timesteps. 50 TB data, which includes the velocity and density fields, is stored for 400 timesteps (every 1,000th step). To our knowledge, this is one of the largest turbulent flow simulations of its kind conducted to date.
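
The figures quoted for this run can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; it assumes the grid is split evenly across cores (ignoring any halo or ghost cells) and that every timestep takes the quoted 425 milliseconds, so the wall-clock total is an estimate rather than a reported value.

```python
# Grid: 2,048 points along each of the three directions.
points_per_axis = 2048
total_points = points_per_axis ** 3

# Core layout along x, y, and z as stated in the text.
core_layout = (2, 2, 128)
total_cores = core_layout[0] * core_layout[1] * core_layout[2]

# Grid points per core along each axis, assuming an even split.
points_per_core = tuple(points_per_axis // c for c in core_layout)

# Estimated wall-clock time: 425 ms per step for 400,000 steps.
ms_per_step = 425
num_steps = 400_000
total_hours = ms_per_step * num_steps / 1000 / 3600

# Stored output: every 1,000th step, 50 TB in total.
num_snapshots = num_steps // 1000
tb_per_snapshot = 50 / num_snapshots

print(f"grid points:       {total_points:,}")
print(f"cores:             {total_cores}")
print(f"points per core:   {points_per_core}")
print(f"wall-clock time:   about {total_hours:.1f} hours")
print(f"data per snapshot: {tb_per_snapshot * 1000:.0f} GB")
```

The product of the core layout matches the 512 cores mentioned, and the cube of 2,048 is consistent with "more than eight billion" degrees of freedom.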

Due to the complex, chaotic nature of the turbulent flow field, which extends across several orders of magnitude in scale, simulating the system in high resolution is necessary. Because we employ a fine-resolution grid with eight billion points, we are able to accurately resolve the field.

Contours of x-component of velocity along the z midplane. The high resolution of the simulation is critical to accurately represent the turbulent field.

The turbulent kinetic energy and dissipation rates are two statistical quantities commonly used to analyze a turbulent flow. The temporal decay of these properties in a turbulent field without additional energy injection is due to viscous dissipation and the decay asymptotes follow the expected analytical power law. This is in agreement with the theoretical asymptotes and observations reported in the literature and thus, validates our framework.

Solid line: Temporal evolution of turbulent kinetic energy (k). Dashed line: Analytical power laws for decaying homogeneous isotropic turbulence (n=1.3) (l: eddy turnover time).
Solid line: Temporal evolution of dissipation rate (ε). Dashed line: Analytical power laws for decaying homogeneous isotropic turbulence (n=1.3).
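
The power-law decay described above can be sketched numerically. The example assumes the standard relations for decaying turbulence without energy injection: kinetic energy k(t) proportional to t^(-n), and dissipation rate equal to the negative time derivative of k, so that it decays as t^(-(n+1)). The exponent n = 1.3 comes from the captions; the reference values k0 and t0 are hypothetical placeholders.

```python
N_EXPONENT = 1.3  # decay exponent quoted in the figure captions


def kinetic_energy(t, k0=1.0, t0=1.0, n=N_EXPONENT):
    """Power-law decay k(t) = k0 * (t / t0) ** (-n); k0 and t0 are placeholders."""
    return k0 * (t / t0) ** (-n)


def dissipation_rate(t, k0=1.0, t0=1.0, n=N_EXPONENT):
    """Analytical -dk/dt = (n * k0 / t0) * (t / t0) ** (-(n + 1))."""
    return (n * k0 / t0) * (t / t0) ** (-(n + 1))


# Check the analytical dissipation rate against a finite-difference
# derivative of the kinetic energy at an arbitrary time.
t, dt = 5.0, 1e-4
numerical = -(kinetic_energy(t + dt) - kinetic_energy(t - dt)) / (2 * dt)
relative_error = abs(numerical - dissipation_rate(t)) / dissipation_rate(t)

print(f"k({t}) = {kinetic_energy(t):.4f}")
print(f"epsilon({t}) = {dissipation_rate(t):.4f}")
print(f"relative error of finite-difference check: {relative_error:.1e}")
```

On a log-log plot both quantities appear as straight lines, with slopes of -n and -(n + 1) respectively, which is the form the dashed reference lines take.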

The energy spectrum of a turbulent flow represents the energy content across wavenumber, where the wavenumber k is proportional to the inverse wavelength λ (i.e., k ∝ 1/λ). Generally, the spectrum can be qualitatively divided into three ranges: source range, inertial range and viscous dissipative range (from left to right on the wavenumber axis, below). The lowest wavenumbers in the source range correspond to the largest turbulent eddies, which have the most energy content. These large eddies transfer energy to turbulence in the intermediate wavenumbers (inertial range), which is statistically isotropic (i.e., essentially uniform in all directions). The smallest eddies, corresponding to the largest wavenumbers, are dissipated into thermal energy by the viscosity of the fluid. By virtue of the fine grid having 2,048 points in each of the three spatial directions, we are able to resolve the flow field up to the length scale at which viscous dissipation takes place. This direct numerical simulation approach is the most accurate as it does not require any closure model to approximate the energy cascade below the grid size.

Spectrum of turbulent kinetic energy at different time instances. The spectrum is normalized by the instantaneous integral length (l) and the turbulent kinetic energy (k).

A new era for turbulent flows research

More recently, we extended this framework to predict wildfires and atmospheric flows, which is relevant for climate-risk assessment. Apart from enabling high-fidelity simulations of complex turbulent flows, this simulation framework also provides capabilities for scientific machine learning (SciML) — for example, downsampling from a fine to a coarse grid (model reduction) or building models that run at lower resolution while still capturing the correct dynamic behaviors. It could also provide avenues for further scientific discovery, such as building ML-based models to better parameterize microphysics of turbulent flows, including physical relationships between temperature, pressure, vapor fraction, etc., and could improve upon various control tasks, e.g., to reduce the energy consumption of buildings or find more efficient propeller shapes. While attractive, a main bottleneck in SciML has been the availability of data for training. To explore this, we have been working with groups at Stanford and Kaggle to make the data from our high-resolution HIT simulation available through a community-hosted web-platform, BLASTNet, to provide broad access to high-fidelity data to the research community via a network-of-datasets approach. We hope that the availability of these emerging high-fidelity simulation tools in conjunction with community-driven datasets will lead to significant advances in various areas of fluid mechanics.

Acknowledgements

We would like to thank Qing Wang, Yi-Fan Chen, and John Anderson for consulting and advice, and Tyler Russell and Carla Bromberg for program management.

How Industries Are Meeting Consumer Expectations With Speech AI

Thanks to rapid technological advances, consumers have become accustomed to an unprecedented level of convenience and efficiency.

Smartphones make it easier than ever to search for a product and have it delivered right to the front door. Video chat technology lets friends and family on different continents connect with ease. With voice command tools, AI assistants can play songs, initiate phone calls or recommend the best Italian food in a 10-mile radius. AI algorithms can even predict which show users may want to watch next or suggest an article they may want to read before making a purchase.

It’s no surprise, then, that customers expect fast and personalized interactions with companies. According to a Salesforce research report, 83% of consumers expect immediate engagement when they contact a company, while 73% expect companies to understand their unique needs and expectations. Nearly 60% of all customers want to avoid customer service altogether, preferring to resolve issues with self-service features.

Meeting such high consumer expectations places a massive burden on companies in every industry, including on their staff and technological needs — but speech AI can help.

Speech AI can understand and converse in natural language, creating opportunities for seamless, multilingual customer interactions while supplementing employee capabilities. It can power self-serve banking in the financial services industry, enable food kiosk avatars in restaurants, transcribe clinical notes in healthcare facilities or streamline bill payments for utility companies — helping businesses across industries deliver personalized customer experiences.

Speech AI for Banking and Payments

Most people now use both digital and traditional channels to access banking services, creating a demand for omnichannel, personalized customer support. However, higher demand for support coupled with a high agent churn rate has left many financial institutions struggling to keep up with the service and support needs of their customers.

Common consumer frustrations include difficulty with complex digital processes, a lack of helpful and readily available information, insufficient self-service options, long call wait times and communication difficulties with support agents.

According to a recent NVIDIA survey, the top AI use cases for financial service institutions are natural language processing (NLP) and large language models (LLMs). These models automate customer service interactions and process large bodies of unstructured financial data to provide AI-driven insights that support all lines of business across financial institutions — from risk management and fraud detection to algorithmic trading and customer service.

By providing speech-equipped self-service options and supporting customer service agents with AI-powered virtual assistants, banks can improve customer experiences while controlling costs. AI voice assistants can be trained on finance-specific vocabulary and rephrasing techniques to confirm understanding of a user’s request before offering answers.

Kore.ai, a conversational AI software company, trained its BankAssist solution on 400-plus retail banking use cases for interactive voice response, web, mobile, SMS and social media channels. Customers can use a voice assistant to transfer funds, pay bills, report lost cards, dispute charges, reset passwords and more.

Kore.ai’s agent voice assistant also helps live agents provide personalized suggestions so they can resolve issues faster. The solution has been shown to improve live agent efficiency by cutting customer handling time by 40% with a return on investment of $2.30 per voice session.

With such trends, expect financial institutions to accelerate the deployment of speech AI to streamline customer support and reduce wait times, offer more self-service options, transcribe calls to speed loan processing and automate compliance, extract insights from spoken content and boost the overall productivity and speed of operations.

Speech AI for Telecommunications    

Heavy investments in 5G infrastructure and cut-throat competition to monetize and achieve profitable returns on new networks mean that maintaining customer satisfaction and brand loyalty is paramount in the telco industry.

According to an NVIDIA survey of 400-plus industry professionals, the top AI use cases in the telecom industry involve optimizing network operations and improving customer experiences. Seventy-three percent of respondents reported increased revenue from AI.

By using speech AI technologies to power chatbots, call-routing, self-service features and recommender systems, telcos can enhance and personalize customer engagements.

KT, a South Korean mobile operator with over 22 million users, has built GiGA Genie, an intelligent voice assistant that’s been trained to understand and use the Korean language using LLMs. It has already conversed with over 8 million users.

By understanding voice commands, the GiGA Genie AI speaker can support people with tasks like turning on smart TVs or lights, sending text messages or providing real-time traffic updates.

KT has also strengthened its AI-powered Customer Contact Center with transformer-based speech AI models that can independently handle over 100,000 calls per day. A generative AI component of the system autonomously responds to customers with suggested resolutions or transfers them to human agents for more nuanced questions and solutions.

Telecommunications companies are expected to lean into speech AI to build more customer self-service capabilities, optimize network performance and enhance overall customer satisfaction.

Speech AI for Quick-Service Restaurants

The food service industry is expected to reach $997 billion in sales in 2023, and its workforce is projected to grow by 500,000 openings. Meanwhile, elevated demand for drive-thru, curbside pickup and home delivery suggests a permanent shift in consumer dining preferences. This shift creates the challenge of hiring, training and retaining staff in an industry with notoriously high turnover rates — all while meeting consumer expectations for fast and fresh service.

Drive-thru order assistants and in-store food kiosks equipped with speech AI can help ease the burden. For example, speech-equipped avatars can help automate the ordering process by offering menu recommendations, suggesting promotions, customizing options or passing food orders directly to the kitchen for preparation.

HuEx, a Toronto-based startup and member of NVIDIA Inception, has designed a multilingual automated order assistant to enhance drive-thru operations. Known as AIDA, the AI assistant receives and responds to orders at the drive-thru speaker box while simultaneously transcribing voice orders into text for food-prep staff.

AIDA understands 300,000-plus product combinations with 90% accuracy, from common requests such as “coffee with milk” to less common requests such as “coffee with butter.” It can even understand different accents and dialects to ensure a seamless ordering experience for a diverse population of consumers.

Speech AI streamlines the order process by speeding fulfillment, reducing miscommunication and minimizing customer wait times. Early movers will also begin to use speech AI to extract customer insights from voice interactions to inform menu options, make upsell recommendations and improve overall operational efficiency while reducing costs.

Speech AI for Healthcare

In the post-pandemic era, the digitization of healthcare is continuing to accelerate. Telemedicine and computer vision support remote patient monitoring, voice-activated clinical systems help patients check in and receive zero-touch care, and speech recognition technology supports clinical documentation responsibilities. Per IDC, 36% of survey respondents indicated that they had deployed digital assistants for patient healthcare.

Automated speech recognition and NLP models can now capture, recognize, understand and summarize key details in medical settings. At the Conference for Machine Intelligence in Medical Imaging, NVIDIA researchers showcased a state-of-the-art pretrained architecture with speech-to-text functionality to extract clinical entities from doctor-patient conversations. The model identifies clinical words — including symptoms, medication names, diagnoses and recommended treatments — and automatically updates medical records.

This technology can ease the burden of manual note-taking and has the potential to accelerate insurance and billing processes while also creating consultation recaps for caregivers. Relieved of administrative tasks, physicians can focus on patient care to deliver superior experiences.

Artisight, an AI platform for healthcare, uses speech recognition to power zero-touch check-ins and speech synthesis to notify patients in the waiting room when the doctor is available. Over 1,200 patients per day use Artisight kiosks, which help streamline registration processes, improve patient experiences, eliminate data entry errors with automation and boost staff productivity.

As healthcare moves toward a smart hospital model, expect to see speech AI play a bigger role in supporting medical professionals and powering low-touch experiences for patients. This may include risk factor prediction and diagnosis through clinical note analysis, translation services for multilingual care centers, medical dictation and transcription and automation of other administrative tasks.

Speech AI for Energy

Faced with increasing demand for clean energy, high operating costs and a workforce retiring in greater numbers, energy and utility companies are looking for ways to do more with less.

To drive new efficiencies, prepare for the future of energy and meet ever-rising customer expectations, utilities can use speech AI. Voice-based customer service can enable customers to report outages, inquire about billing and receive support on other issues without agent intervention. Speech AI can streamline meter reading, support field technicians with voice notes and voice commands to access work orders and enable utilities to analyze customer preferences with NLP.

Minerva CQ, an AI assistant designed specifically for retail energy use cases, supports customer service agents by transcribing conversations into text in real time. Text is fed into Minerva CQ’s AI models, which analyze customer sentiment, intent, propensity and more.

By dynamically listening, the AI assistant populates an agent’s screen with dialogue suggestions, behavioral cues, personalized offers and sentiment analysis. A knowledge-surfacing feature pulls up a customer’s energy usage history and suggests decarbonization options — arming agents with the information needed to help customers make informed decisions about their energy consumption.

With the AI assistant providing consistent, simple explanations on energy sources, tariff plans, billing changes and optimal spending, customer service agents can effortlessly guide customers to the most ideal energy plan. After deploying Minerva CQ, one utility provider reported a 44% reduction in call handling time, a 12.5% increase in first-contact resolution and average savings of $2.67 per call.

Speech AI is expected to continue to help utility providers reduce training costs, remove friction from customer service interactions and equip field technicians with voice-activated tools to boost productivity and improve safety — all while enhancing customer satisfaction.

Speech and Translation AI for the Public Sector

Because public service programs are often underfunded and understaffed, citizens seeking vital services and information are at times left waiting and frustrated. To address this challenge, some federal- and state-level agencies are turning to speech AI to achieve more timely service delivery.

The Federal Emergency Management Agency uses automated speech recognition systems to manage emergency hotlines, analyze distress signals and direct resources efficiently. The U.S. Social Security Administration uses an interactive voice response system and virtual assistants to respond to inquiries about social security benefits and application processes and to provide general information.

The Department of Veterans Affairs has appointed a director of AI to oversee the integration of the technology into its healthcare systems. The VA uses speech recognition technology to power note-taking during telehealth appointments. It has also developed an advanced automated speech transcription engine to help score neuropsychological tests for analysis of cognitive decline in older patients.

Additional opportunities for speech AI in the public sector include real-time language translation services for citizen interactions, public events or visiting diplomats. Public agencies that handle a large volume of calls can benefit from multilingual voice-based interfaces to allow citizens to access information, make inquiries or request services in different languages.

Speech and translation AI can also automate document processing by converting multilingual audio recordings or spoken content into translated text to streamline compliance processes, improve data accuracy and enhance administrative task efficiency. Speech AI additionally has the potential to expand access to services for people with visual or mobility impairments.

Speech AI for Automotive 

From vehicle sales to service scheduling, speech AI can bring numerous benefits to automakers, dealerships, drivers and passengers alike.

Before visiting a dealership in person, more than half of vehicle shoppers begin their search online, then make the first contact with a phone call to collect information. Speech AI chatbots trained on vehicle manuals can answer questions on technological capabilities, navigation, safety, warranty, maintenance costs and more. AI chatbots can also schedule test drives, answer pricing questions and inform shoppers of which models are in stock. This enables automotive manufacturers to differentiate their dealership networks through intelligent and automated engagements with customers.

Manufacturers are building advanced speech AI into vehicles and apps to improve driving experiences, safety and service. Onboard AI assistants can execute natural language voice commands for navigation, infotainment, general vehicle diagnostics and querying user manuals. Without the need to operate physical controls or touch screens, drivers can keep their hands on the wheel and eyes on the road.

Speech AI can help maximize vehicle up-time for commercial fleets. AI trained on technical service bulletins and software update cadences lets technicians provide more accurate quotes for repairs, identify key information before putting the car on a lift and swiftly supply vehicle repair updates to commercial and small business customers.

With insights from driver voice commands and bug reports, manufacturers can also improve vehicle design and operating software. As self-driving cars become more advanced, expect speech AI to play a critical role in how drivers operate vehicles, troubleshoot issues, call for assistance and schedule maintenance.

Speech AI — From Smart Spaces to Entertainment

Speech AI has the potential to impact nearly every industry.

In Smart Cities, speech AI can be used to handle distress calls and provide emergency responders with crucial information. In Mexico City, the United Nations Office on Drugs and Crime is developing a speech AI program to analyze 911 calls to prevent gender violence. By analyzing distress calls, AI can identify keywords, signals and patterns to help prevent domestic violence against women. Speech AI can also be used to deliver multilingual services in public spaces and improve access to transit for people who are visually impaired.

In higher education and research, speech AI can automatically transcribe lectures and research interviews, providing students with detailed notes and saving researchers the time spent compiling qualitative data. Speech AI also facilitates the translation of educational content to various languages, increasing its accessibility.

AI translation powered by LLMs is making it easier to consume entertainment and streaming content online in any language. Netflix, for example, is using AI to automatically translate subtitles into multiple languages. Meanwhile, startup Papercup is using AI to automate video content dubbing to reach global audiences in their local languages.

Transforming Product and Service Offerings With Speech AI

In the modern consumer landscape, it’s imperative that companies provide convenient, personalized customer experiences. Businesses can use NLP and the translation capabilities of speech AI to transform the way they operate and interact with customers in real time on a global scale.

Companies across industries are using speech AI to deliver rapid, multilingual customer service responses, self-service features and information and automation tools to empower employees to provide higher-value experiences.

To help enterprises in every industry realize the benefits of speech, translation and conversational AI, NVIDIA offers a suite of technologies.

NVIDIA Riva, a GPU-accelerated multilingual speech and translation AI software development kit, powers fully customizable real-time conversational AI pipelines for automatic speech recognition, text-to-speech and neural machine translation applications.

NVIDIA Tokkio, built on the NVIDIA Omniverse Avatar Cloud Engine, offers cloud-native services to create virtual assistants and digital humans that can serve as AI customer service agents.

These tools enable developers to quickly deploy high-accuracy applications with the real-time response speed needed for superior employee and customer experiences.

Join the free Speech AI Day on Sept. 20 to hear from renowned speech and translation AI leaders about groundbreaking research, real-world applications and open-source contributions.

Optimize equipment performance with historical data, Ray, and Amazon SageMaker

Efficient control policies enable industrial companies to increase their profitability by maximizing productivity while reducing unscheduled downtime and energy consumption. Finding optimal control policies is a complex task because physical systems, such as chemical reactors and wind turbines, are often hard to model and because drift in process dynamics can cause performance to deteriorate over time. Offline reinforcement learning is a control strategy that allows industrial companies to build control policies entirely from historical data without the need for an explicit process model. This approach does not require interaction with the process directly in an exploration stage, which removes one of the barriers for the adoption of reinforcement learning in safety-critical applications. In this post, we will build an end-to-end solution to find optimal control policies using only historical data on Amazon SageMaker using Ray’s RLlib library. To learn more about reinforcement learning, see Use Reinforcement Learning with Amazon SageMaker.

Use cases

Industrial control involves the management of complex systems, such as manufacturing lines, energy grids, and chemical plants, to ensure efficient and reliable operation. Many traditional control strategies are based on predefined rules and models, which often require manual optimization. It is standard practice in some industries to monitor performance and adjust the control policy when, for example, equipment starts to degrade or environmental conditions change. Retuning can take weeks and may require injecting external excitations in the system to record its response in a trial-and-error approach.

Reinforcement learning has emerged as a new paradigm in process control to learn optimal control policies through interacting with the environment. This process requires breaking down data into three categories: 1) measurements available from the physical system, 2) the set of actions that can be taken upon the system, and 3) a numerical metric (reward) of equipment performance. A policy is trained to find the action, at a given observation, that is likely to produce the highest future rewards.

In offline reinforcement learning, one can train a policy on historical data before deploying it into production. The algorithm trained in this blog post is called “Conservative Q Learning” (CQL). CQL contains an “actor” model and a “critic” model and is designed to conservatively predict its own performance after taking a recommended action. In this post, the process is demonstrated with an illustrative cart-pole control problem. The goal is to train an agent to balance a pole on a cart while simultaneously moving the cart towards a designated goal location. The training procedure uses the offline data, allowing the agent to learn from preexisting information. This cart-pole case study demonstrates the training process and its effectiveness in potential real-world applications.

Solution overview

The solution presented in this post automates the deployment of an end-to-end workflow for offline reinforcement learning with historical data. The following diagram describes the architecture used in this workflow. Measurement data is produced at the edge by a piece of industrial equipment (here simulated by an AWS Lambda function). The data is put into an Amazon Kinesis Data Firehose, which stores it in Amazon Simple Storage Service (Amazon S3). Amazon S3 is a durable, performant, and low-cost storage solution that allows you to serve large volumes of data to a machine learning training process.

AWS Glue catalogs the data and makes it queryable using Amazon Athena. Athena transforms the measurement data into a form that a reinforcement learning algorithm can ingest and then unloads it back into Amazon S3. Amazon SageMaker loads this data into a training job and produces a trained model. SageMaker then serves that model in a SageMaker endpoint. The industrial equipment can then query that endpoint to receive action recommendations.

Figure 1: Architecture diagram showing the end-to-end reinforcement learning workflow.

In this post, we will break down the workflow in the following steps:

  1. Formulate the problem. Decide which actions can be taken, which measurements the recommendations should be based on, and how to score numerically how well each action performed.
  2. Prepare the data. Transform the measurements table into a format the machine learning algorithm can consume.
  3. Train the algorithm on that data.
  4. Select the best training run based on training metrics.
  5. Deploy the model to a SageMaker endpoint.
  6. Evaluate the performance of the model in production.

Prerequisites

To complete this walkthrough, you need to have an AWS account and a command line interface with AWS SAM installed. Follow these steps to deploy the AWS SAM template to run this workflow and generate training data:

  1. Download the code repository with the command:
    git clone https://github.com/aws-samples/sagemaker-offline-reinforcement-learning-ray-cql

  2. Change directory to the repo:
    cd sagemaker-offline-reinforcement-learning-ray-cql

  3. Build the repo:
    sam build --use-container

  4. Deploy the repo:
    sam deploy --guided --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND

  5. Use the following commands to call a bash script, which generates mock data using an AWS Lambda function.
    1. sudo yum install jq
    2. cd utils
    3. sh generate_mock_data.sh

Solution walkthrough

Formulate problem

Our system in this blog post is a cart with a pole balanced on top. The system performs well when the pole is upright, and the cart position is close to the goal position. In the prerequisite step, we generated historical data from this system.

The following table shows historical data gathered from the system.

Cart position   Cart velocity   Pole angle   Pole angular velocity   Goal position   External force   Reward   Time
0.53            -0.79           -0.08        0.16                    0.50            -0.04            11.5     5:37:54 PM
0.51            -0.82           -0.07        0.17                    0.50            -0.04            11.9     5:37:55 PM
0.50            -0.84           -0.07        0.18                    0.50            -0.03            12.2     5:37:56 PM
0.48            -0.85           -0.07        0.18                    0.50            -0.03            10.5     5:37:57 PM
0.46            -0.87           -0.06        0.19                    0.50            -0.03            10.3     5:37:58 PM

You can query historical system information using Amazon Athena with the following query:

SELECT *
FROM "<AWS CloudFormation Stack Name>_glue_db"."measurements_table"
ORDER BY episode_id, epoch_time ASC
limit 10;

The state of this system is defined by the cart position, cart velocity, pole angle, pole angular velocity, and goal position. The action taken at each time step is the external force applied to the cart. The simulated environment outputs a reward value that is higher when the cart is closer to the goal position and the pole is more upright.

Prepare data

To present the system information to the reinforcement learning model, transform it into JSON objects with keys that categorize values into the state (also called observation), action, and reward categories. Store these objects in Amazon S3. Here’s an example of JSON objects produced from time steps in the previous table.

{"obs":[[0.53,-0.79,-0.08,0.16,0.5]], "action":[[-0.04]], "reward":[11.5], "next_obs":[[0.51,-0.82,-0.07,0.17,0.5]]}
{"obs":[[0.51,-0.82,-0.07,0.17,0.5]], "action":[[-0.04]], "reward":[11.9], "next_obs":[[0.50,-0.84,-0.07,0.18,0.5]]}
{"obs":[[0.50,-0.84,-0.07,0.18,0.5]], "action":[[-0.03]], "reward":[12.2], "next_obs":[[0.48,-0.85,-0.07,0.18,0.5]]}

The AWS CloudFormation stack contains an output called AthenaQueryToCreateJsonFormatedData. Run this query in Amazon Athena to perform the transformation and store the JSON objects in Amazon S3. The reinforcement learning algorithm uses the structure of these JSON objects to understand which values to base recommendations on and the outcome of taking actions in the historical data.
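The stack's Athena query does the real transformation. As a rough illustration of the same idea, the Python sketch below pairs consecutive measurements into transitions. The function name rows_to_transitions and the dictionary keys are assumptions made for this example (the real table's column names may differ), and the rows are assumed to belong to one episode and to be sorted by time.

```python
import json

# Hypothetical column names for illustration; the real table's names may differ.
STATE_KEYS = ["cart_position", "cart_velocity", "pole_angle",
              "pole_angular_velocity", "goal_position"]


def rows_to_transitions(rows):
    """Pair each time step with the next one to form obs/action/reward/next_obs."""
    transitions = []
    for current, following in zip(rows, rows[1:]):
        transitions.append({
            "obs": [[current[k] for k in STATE_KEYS]],
            "action": [[current["external_force"]]],
            "reward": [current["reward"]],
            "next_obs": [[following[k] for k in STATE_KEYS]],
        })
    return transitions


# The first two rows of the historical data table shown earlier.
rows = [
    {"cart_position": 0.53, "cart_velocity": -0.79, "pole_angle": -0.08,
     "pole_angular_velocity": 0.16, "goal_position": 0.50,
     "external_force": -0.04, "reward": 11.5},
    {"cart_position": 0.51, "cart_velocity": -0.82, "pole_angle": -0.07,
     "pole_angular_velocity": 0.17, "goal_position": 0.50,
     "external_force": -0.04, "reward": 11.9},
]

for transition in rows_to_transitions(rows):
    print(json.dumps(transition))  # one JSON object per line
```

The last row of an episode has no successor, so N rows yield N-1 transitions in this sketch.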

Train agent

Now we can start a training job to produce a trained action recommendation model. Amazon SageMaker lets you quickly launch multiple training jobs to see how various configurations affect the resulting trained model. Call the Lambda function named TuningJobLauncherFunction to start a hyperparameter tuning job that experiments with four different sets of hyperparameters when training the algorithm.

Select best training run

To find which of the training jobs produced the best model, examine the loss curves produced during training. CQL’s critic model estimates the actor’s performance (called a Q value) after taking a recommended action. Part of the critic’s loss function is the temporal difference error, a metric that measures the accuracy of the critic’s Q values. Look for training runs with a high mean Q value and a low temporal difference error. The paper A Workflow for Offline Model-Free Robotic Reinforcement Learning details how to select the best training run. The code repository has a file, /utils/investigate_training.py, that creates a Plotly HTML figure describing the latest training job. Run this file and use the output to pick the best training run.

We can use the mean Q value to predict the performance of the trained model. The Q values are trained to conservatively predict the sum of discounted future reward values. For long-running processes, we can convert this number to an exponentially weighted average by multiplying the Q value by (1 - discount rate). The best training run in this set achieved a mean Q value of 539. Our discount rate is 0.99, so the model is predicting at least 539 × (1 - 0.99) = 5.39 average reward per time step. You can compare this value to historical system performance to gauge whether the new model will outperform the historical control policy. In this experiment, the historical data’s average reward per time step was 4.3, so the CQL model is predicting 25 percent better performance than the system achieved historically.
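The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the helper name q_to_avg_reward is made up for illustration, and the three input numbers are the ones quoted in the text.

```python
def q_to_avg_reward(mean_q, discount_rate):
    """Convert a discounted-sum Q value to an exponentially weighted
    average reward per time step."""
    return mean_q * (1.0 - discount_rate)


mean_q = 539.0         # best run's mean Q value, from the text
discount_rate = 0.99   # from the text
historical_avg = 4.3   # historical average reward per time step, from the text

predicted = q_to_avg_reward(mean_q, discount_rate)
improvement = predicted / historical_avg - 1.0

print(f"predicted average reward per step: {predicted:.2f}")
print(f"predicted improvement over history: {improvement:.0%}")
```

Because CQL is designed to be conservative, the predicted value is best read as a lower-bound estimate rather than an exact forecast.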

Deploy model

Amazon SageMaker endpoints let you serve machine learning models in several different ways to meet a variety of use cases. In this post, we’ll use the serverless endpoint type so that our endpoint automatically scales with demand, and we only pay for compute usage when the endpoint is generating an inference. To deploy a serverless endpoint, include a ProductionVariantServerlessConfig in the production variant of the SageMaker endpoint configuration. The following code snippet shows how the serverless endpoint in this example is deployed using the Amazon SageMaker software development kit for Python. Find the sample code used to deploy the model at sagemaker-offline-reinforcement-learning-ray-cql.

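from sagemaker.serverless import ServerlessInferenceConfig

# Deploy the trained model behind a serverless endpoint.
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=200
    ),
    <…>
)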

The trained model files are located at the S3 model artifacts for each training run. To deploy the machine learning model, locate the model files of the best training run, and call the Lambda function named “ModelDeployerFunction” with an event that contains this model data. The Lambda function will launch a SageMaker serverless endpoint to serve the trained model. Sample event to use when calling the “ModelDeployerFunction”:

{
    "DescribeTrainingJob": {
        "ModelArtifacts": {
            "S3ModelArtifacts": "s3://your-bucket/training/my-training-job/output/model.tar.gz"
        }
    }
}
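If you prefer to build this event programmatically, a minimal sketch could look like the following. The helper build_deployer_event is hypothetical and not part of the repository; it only reproduces the shape of the sample event above.

```python
import json


def build_deployer_event(s3_model_artifacts):
    """Build an event shaped like the sample event for ModelDeployerFunction."""
    if not s3_model_artifacts.startswith("s3://"):
        raise ValueError("expected an S3 URI such as s3://bucket/key")
    return {
        "DescribeTrainingJob": {
            "ModelArtifacts": {"S3ModelArtifacts": s3_model_artifacts}
        }
    }


# Placeholder URI copied from the sample event; replace with your own.
event = build_deployer_event(
    "s3://your-bucket/training/my-training-job/output/model.tar.gz"
)
payload = json.dumps(event)
print(payload)
```

The resulting payload string could then be passed to the Lambda Invoke API (for example, through the AWS console test feature or an SDK client), using the deployed function's actual name from your stack.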

Evaluate trained model performance

It’s time to see how our trained model is doing in production! To check the performance of the new model, call the Lambda function named “RunPhysicsSimulationFunction” with the SageMaker endpoint name in the event. This will run the simulation using the actions recommended by the endpoint. Sample event to use when calling the RunPhysicsSimulatorFunction:

{"random_action_fraction": 0.0, "inference_endpoint_name": "sagemaker-endpoint-name"}

Use the following Athena query to compare the performance of the trained model with historical system performance.

WITH 
    sum_reward_by_episode AS (
        SELECT SUM(reward) as sum_reward, m_temp.action_source
        FROM "<AWS CloudFormation Stack Name>_glue_db"."measurements_table" m_temp
        GROUP BY m_temp.episode_id, m_temp.action_source
        )

SELECT sre.action_source, AVG(sre.sum_reward) as avg_total_reward_per_episode
FROM  sum_reward_by_episode sre
GROUP BY sre.action_source
ORDER BY avg_total_reward_per_episode DESC

Here is an example results table. We see the trained model achieved 2.5x more reward than the historical data! Additionally, the true performance of the model was 2x better than the conservative performance prediction.

Action source    Average reward per time step
trained_model    10.8
historic_data    4.3
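To check the aggregation logic of the Athena query outside Athena, it can be mirrored in plain Python. The sketch below is an illustration only: the function name is borrowed from the query's output column, and the input rows are made up, not results from this post.

```python
from collections import defaultdict


def avg_total_reward_per_episode(rows):
    """Sum reward per (episode, action source), then average those
    sums per action source, highest average first."""
    episode_sums = defaultdict(float)
    for row in rows:
        episode_sums[(row["episode_id"], row["action_source"])] += row["reward"]

    per_source = defaultdict(list)
    for (_, source), total in episode_sums.items():
        per_source[source].append(total)

    averages = {s: sum(v) / len(v) for s, v in per_source.items()}
    # Mirrors ORDER BY avg_total_reward_per_episode DESC.
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Made-up rows for illustration only.
rows = [
    {"episode_id": 1, "action_source": "historic_data", "reward": 2.0},
    {"episode_id": 1, "action_source": "historic_data", "reward": 4.0},
    {"episode_id": 2, "action_source": "historic_data", "reward": 6.0},
    {"episode_id": 3, "action_source": "trained_model", "reward": 10.0},
    {"episode_id": 3, "action_source": "trained_model", "reward": 12.0},
]
print(avg_total_reward_per_episode(rows))
```

The two-stage grouping matters: averaging raw rewards directly would weight long episodes more heavily than short ones.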

The following animations show the difference between a sample episode from the training data and an episode where the trained model was used to pick which action to take. In the animations, the blue box is the cart, the blue line is the pole, and the green rectangle is the goal location. The red arrow shows the force applied to the cart at each time step. The red arrow in the training data jumps back and forth quite a bit because the data was generated using 50 percent expert actions and 50 percent random actions. The trained model learned a control policy that moves the cart quickly to the goal position, while maintaining stability, entirely from observing nonexpert demonstrations.

Clean up

To delete resources used in this workflow, navigate to the resources section of the AWS CloudFormation stack and delete the S3 buckets and IAM roles. Then delete the CloudFormation stack itself.

Conclusion

Offline reinforcement learning can help industrial companies automate the search for optimal policies without compromising safety by using historical data. To implement this approach in your operations, start by identifying the measurements that make up a state-determined system, the actions you can control, and metrics that indicate desired performance. Then, access this GitHub repository for the implementation of an automatic end-to-end solution using Ray and Amazon SageMaker.

This post just scratches the surface of what you can do with Amazon SageMaker RL. Give it a try, and please send us feedback, either in the Amazon SageMaker discussion forum or through your usual AWS contacts.


About the Authors

Walt Mayfield is a Solutions Architect at AWS and helps energy companies operate more safely and efficiently. Before joining AWS, Walt worked as an Operations Engineer for Hilcorp Energy Company. He likes to garden and fly fish in his spare time.

Felipe Lopez is a Senior Solutions Architect at AWS with a concentration in Oil & Gas Production Operations. Prior to joining AWS, Felipe worked with GE Digital and Schlumberger, where he focused on modeling and optimization products for industrial applications.

Yingwei Yu is an Applied Scientist at Generative AI Incubator, AWS. He has experience working with several organizations across industries on various proofs of concept in machine learning, including natural language processing, time series analysis, and predictive maintenance. In his spare time, he enjoys swimming, painting, hiking, and spending time with family and friends.

Haozhu Wang is a research scientist in Amazon Bedrock focusing on building Amazon’s Titan foundation models. Previously he worked in Amazon ML Solutions Lab as a co-lead of the Reinforcement Learning Vertical and helped customers build advanced ML solutions with the latest research on reinforcement learning, natural language processing, and graph learning. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.

Enable pod-based GPU metrics in Amazon CloudWatch

In February 2022, Amazon Web Services added support for NVIDIA GPU metrics in Amazon CloudWatch, making it possible to push metrics from the Amazon CloudWatch Agent to Amazon CloudWatch and monitor your code for optimal GPU utilization. Since then, this feature has been integrated into many of our managed Amazon Machine Images (AMIs), such as the Deep Learning AMI and the AWS ParallelCluster AMI. To obtain instance-level metrics of GPU utilization, you can use Packer or the Amazon ImageBuilder to bootstrap your own custom AMI and use it in various managed service offerings like AWS Batch, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS). However, for many container-based service offerings and workloads, it’s ideal to capture utilization metrics on the container, pod, or namespace level.

This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.

Solution overview

To demonstrate container-based GPU metrics, we create an EKS cluster with g5.2xlarge instances; however, this will work with any supported NVIDIA accelerated instance family.

We deploy the NVIDIA GPU operator to enable use of GPU resources and the NVIDIA DCGM Exporter to enable GPU metrics collection. Then we explore two architectures. The first one connects the metrics from NVIDIA DCGM Exporter to CloudWatch via a CloudWatch agent, as shown in the following diagram.

GPU Monitoring Architecture with CloudWatch

The second architecture (see the following diagram) connects the metrics from DCGM Exporter to Prometheus, then we use a Grafana dashboard to visualize those metrics.

GPU Monitoring Architecture with Grafana

Prerequisites

To simplify reproducing the entire stack from this post, we use a container that has all the required tooling (aws cli, eksctl, helm, etc.) already installed. In order to clone the container project from GitHub, you will need git. To build and run the container, you will need Docker. To deploy the architecture, you will need AWS credentials. To enable access to Kubernetes services using port-forwarding, you will also need kubectl.

These prerequisites can be installed on your local machine, on an EC2 instance with NICE DCV, or on AWS Cloud9. In this post, we use a c5.2xlarge Cloud9 instance with a 40 GB local storage volume. When using Cloud9, disable AWS managed temporary credentials by visiting Cloud9 -> Preferences -> AWS Settings, as shown in the screenshot below.

Build and run the aws-do-eks container

Open a terminal shell in your preferred environment and run the following commands:

git clone https://github.com/aws-samples/aws-do-eks
cd aws-do-eks
./build.sh
./run.sh
./exec.sh

The result is as follows:

root@e5ecb162812f:/eks#

You now have a shell in a container environment that has all the tools needed to complete the tasks below. We will refer to it as “aws-do-eks shell”. You will be running the commands in the following sections in this shell, unless specifically instructed otherwise.

Create an EKS cluster with a node group

This group includes a GPU instance family of your choice; in this example, we use the g5.2xlarge instance type.

The aws-do-eks project comes with a collection of cluster configurations. You can set your desired cluster configuration with a single configuration change.

  1. In the container shell, run ./env-config.sh and then set CONF=conf/eksctl/yaml/eks-gpu-g5.yaml
  2. To verify the cluster configuration, run ./eks-config.sh

You should see the following cluster manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-g5
  version: "1.25"
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
managedNodeGroups:
  - name: sys
    instanceType: m5.xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
  - name: g5
    instanceType: g5.2xlarge
    instancePrefix: g5-2xl
    privateNetworking: true
    efaEnabled: false
    minSize: 0
    desiredCapacity: 1
    maxSize: 10
    volumeSize: 80
    iam:
      withAddonPolicies:
        cloudWatch: true
iam:
  withOIDC: true
  1. To create the cluster, run the following command in the container
./eks-create.sh

The output is as follows:

root@e5ecb162812f:/eks# ./eks-create.sh 
/eks/impl/eksctl/yaml /eks

./eks-create.sh

Mon May 22 20:50:59 UTC 2023
Creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml ...

eksctl create cluster -f /eks/conf/eksctl/yaml/eks-gpu-g5.yaml

2023-05-22 20:50:59 [ℹ]  eksctl version 0.133.0
2023-05-22 20:50:59 [ℹ]  using region us-east-1
2023-05-22 20:50:59 [ℹ]  subnets for us-east-1a - public:192.168.0.0/19 private:192.168.128.0/19
2023-05-22 20:50:59 [ℹ]  subnets for us-east-1b - public:192.168.32.0/19 private:192.168.160.0/19
2023-05-22 20:50:59 [ℹ]  subnets for us-east-1c - public:192.168.64.0/19 private:192.168.192.0/19
2023-05-22 20:50:59 [ℹ]  subnets for us-east-1d - public:192.168.96.0/19 private:192.168.224.0/19
2023-05-22 20:50:59 [ℹ]  nodegroup "sys" will use "" [AmazonLinux2/1.25]
2023-05-22 20:50:59 [ℹ]  nodegroup "g5" will use "" [AmazonLinux2/1.25]
2023-05-22 20:50:59 [ℹ]  using Kubernetes version 1.25
2023-05-22 20:50:59 [ℹ]  creating EKS cluster "do-eks-yaml-g5" in "us-east-1" region with managed nodes
2023-05-22 20:50:59 [ℹ]  2 nodegroups (g5, sys) were included (based on the include/exclude rules)
2023-05-22 20:50:59 [ℹ]  will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2023-05-22 20:50:59 [ℹ]  will create a CloudFormation stack for cluster itself and 2 managed nodegroup stack(s)
2023-05-22 20:50:59 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=do-eks-yaml-g5'
2023-05-22 20:50:59 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "do-eks-yaml-g5" in "us-east-1"
2023-05-22 20:50:59 [ℹ]  CloudWatch logging will not be enabled for cluster "do-eks-yaml-g5" in "us-east-1"
2023-05-22 20:50:59 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-east-1 --cluster=do-eks-yaml-g5'
2023-05-22 20:50:59 [ℹ]  
2 sequential tasks: { create cluster control plane "do-eks-yaml-g5", 
    2 sequential sub-tasks: { 
        4 sequential sub-tasks: { 
            wait for control plane to become ready,
            associate IAM OIDC provider,
            2 sequential sub-tasks: { 
                create IAM role for serviceaccount "kube-system/aws-node",
                create serviceaccount "kube-system/aws-node",
            },
            restart daemonset "kube-system/aws-node",
        },
        2 parallel sub-tasks: { 
            create managed nodegroup "sys",
            create managed nodegroup "g5",
        },
    } 
}
2023-05-22 20:50:59 [ℹ]  building cluster stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:51:00 [ℹ]  deploying stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:51:30 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:52:00 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:53:01 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:54:01 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:55:01 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:56:02 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:57:02 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:58:02 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:59:02 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:00:03 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:01:03 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:02:03 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:03:04 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:05:07 [ℹ]  building iamserviceaccount stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:10 [ℹ]  deploying stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:10 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:40 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:40 [ℹ]  serviceaccount "kube-system/aws-node" already exists
2023-05-22 21:05:41 [ℹ]  updated serviceaccount "kube-system/aws-node"
2023-05-22 21:05:41 [ℹ]  daemonset "kube-system/aws-node" restarted
2023-05-22 21:05:41 [ℹ]  building managed nodegroup stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:41 [ℹ]  building managed nodegroup stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:05:42 [ℹ]  deploying stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:42 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:42 [ℹ]  deploying stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:05:42 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:06:12 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:06:12 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:06:55 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:07:11 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:08:29 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:08:45 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:09:52 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:09:53 [ℹ]  waiting for the control plane to become ready
2023-05-22 21:09:53 [✔]  saved kubeconfig as "/root/.kube/config"
2023-05-22 21:09:53 [ℹ]  1 task: { install Nvidia device plugin }
W0522 21:09:54.155837    1668 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2023-05-22 21:09:54 [ℹ]  created "kube-system:DaemonSet.apps/nvidia-device-plugin-daemonset"
2023-05-22 21:09:54 [ℹ]  as you are using the EKS-Optimized Accelerated AMI with a GPU-enabled instance type, the Nvidia Kubernetes device plugin was automatically installed.
        to skip installing it, use --install-nvidia-plugin=false.
2023-05-22 21:09:54 [✔]  all EKS cluster resources for "do-eks-yaml-g5" have been created
2023-05-22 21:09:54 [ℹ]  nodegroup "sys" has 1 node(s)
2023-05-22 21:09:54 [ℹ]  node "ip-192-168-18-137.ec2.internal" is ready
2023-05-22 21:09:54 [ℹ]  waiting for at least 1 node(s) to become ready in "sys"
2023-05-22 21:09:54 [ℹ]  nodegroup "sys" has 1 node(s)
2023-05-22 21:09:54 [ℹ]  node "ip-192-168-18-137.ec2.internal" is ready
2023-05-22 21:09:55 [ℹ]  kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2023-05-22 21:09:55 [✔]  EKS cluster "do-eks-yaml-g5" in "us-east-1" region is ready

Mon May 22 21:09:55 UTC 2023
Done creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml

/eks
  1. To verify that your cluster is created successfully, run the following command
kubectl get nodes -L node.kubernetes.io/instance-type

The output is similar to the following:

NAME                              STATUS   ROLES    AGE   VERSION               INSTANCE_TYPE
ip-192-168-18-137.ec2.internal    Ready    <none>   47m   v1.25.9-eks-0a21954   m5.xlarge
ip-192-168-214-241.ec2.internal   Ready    <none>   46m   v1.25.9-eks-0a21954   g5.2xlarge

In this example, we have one m5.xlarge and one g5.2xlarge instance in our cluster; therefore, we see two nodes listed in the preceding output.

During the cluster creation process, the NVIDIA device plugin is installed automatically. You need to remove it after the cluster is created because we use the NVIDIA GPU Operator instead.

  1. Delete the plugin with the following command
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

We get the following output:

daemonset.apps "nvidia-device-plugin-daemonset" deleted
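
If you want to script the node check from earlier in this section, the tabular output of kubectl get nodes -L node.kubernetes.io/instance-type can be summarized per instance type. The following is a minimal sketch, not part of the aws-do-eks project: the function name count_nodes_by_type is hypothetical, and it assumes the column layout shown above (STATUS in the second column, instance type in the last).

```shell
# Count Ready nodes per instance type.
# Reads the table printed by
#   kubectl get nodes -L node.kubernetes.io/instance-type
# on stdin. Assumes STATUS is column 2 and the instance type is the last column.
count_nodes_by_type() {
  awk 'NR > 1 && $2 == "Ready" { n[$NF]++ }
       END { for (t in n) print t, n[t] }' | LC_ALL=C sort
}
```

A possible invocation is kubectl get nodes -L node.kubernetes.io/instance-type | count_nodes_by_type, which for the cluster in this post should list one m5.xlarge and one g5.2xlarge node.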

Install the NVIDIA Helm repo

Install the NVIDIA Helm repo with the following command:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

Deploy the DCGM exporter with the NVIDIA GPU Operator

To deploy the DCGM exporter, complete the following steps:

  1. Prepare the DCGM exporter GPU metrics configuration
curl https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv > dcgm-metrics.csv

You have the option to edit the dcgm-metrics.csv file. You can add or remove any metrics as needed.
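
If you prefer to trim the file from the command line rather than in an editor, a small filter can keep only the metrics you name. This is a minimal sketch under stated assumptions: the function name filter_dcgm_csv is hypothetical, and it assumes each metric line starts with the DCGM field name followed by a comma. Comment lines and all other lines are dropped.

```shell
# Keep only the CSV lines whose first field is one of the given metric names.
# Reads the CSV on stdin; metric names are passed as arguments.
# The body runs in a subshell so the IFS change does not leak to the caller.
filter_dcgm_csv() (
  IFS='|'
  # "$*" joins the arguments with the first character of IFS,
  # producing an alternation such as NAME1|NAME2.
  grep -E "^($*),"
)
```

For example, filter_dcgm_csv DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_USED < dcgm-metrics.csv writes the two matching lines to standard output. Redirect the result to a temporary file and rename it to dcgm-metrics.csv so that the file name used in the following steps stays the same.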

  1. Create the gpu-operator namespace and DCGM exporter ConfigMap
kubectl create namespace gpu-operator && \
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

The output is as follows:

namespace/gpu-operator created
configmap/metrics-config created
  1. Apply the GPU operator to the EKS cluster
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator \
--set dcgmExporter.config.name=metrics-config \
--set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
--set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv \
--set toolkit.enabled=false

The output is as follows:

NAME: gpu-operator-1684795140
LAST DEPLOYED: Day Month Date HH:mm:ss YYYY
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
  1. Confirm that the DCGM exporter pod is running
kubectl -n gpu-operator get pods | grep dcgm

The output is as follows:

nvidia-dcgm-exporter-lkmfr       1/1     Running    0   1m

If you inspect the logs, you should see the “Starting webserver” message:

kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1)

The output is as follows:

Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
time="2023-05-22T22:40:08Z" level=info msg="Starting dcgm-exporter"
time="2023-05-22T22:40:08Z" level=info msg="DCGM successfully initialized!"
time="2023-05-22T22:40:08Z" level=info msg="Collecting DCP Metrics"
time="2023-05-22T22:40:08Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-05-22T22:40:08Z" level=info msg="Initializing system entities of type: GPU"
time="2023-05-22T22:40:09Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2023-05-22T22:40:09Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2023-05-22T22:40:09Z" level=info msg="Initializing system entities of type: NvLink"
time="2023-05-22T22:40:09Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2023-05-22T22:40:09Z" level=info msg="Kubernetes metrics collection enabled!"
time="2023-05-22T22:40:09Z" level=info msg="Pipeline starting"
time="2023-05-22T22:40:09Z" level=info msg="Starting webserver"

NVIDIA DCGM Exporter exposes a Prometheus metrics endpoint, which can be ingested by the CloudWatch agent. To see the endpoint, use the following command:

kubectl -n gpu-operator get services | grep dcgm

We get the following output:

nvidia-dcgm-exporter    ClusterIP   10.100.183.207   <none>   9400/TCP   10m
  1. To generate some GPU utilization, we deploy a pod that runs the gpu-burn binary
kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/gpu-metrics/gpu-burn-deployment.yaml

The output is as follows:

deployment.apps/gpu-burn created

This deployment uses a single GPU to produce a continuous pattern of 100% utilization for 20 seconds followed by 0% utilization for 20 seconds.

  1. To make sure the endpoint works, you can run a temporary container that uses curl to read the content of http://nvidia-dcgm-exporter:9400/metrics
kubectl -n gpu-operator run -it --rm curl --restart='Never' --image=curlimages/curl --command -- curl http://nvidia-dcgm-exporter:9400/metrics

We get the following output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 1455
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 6250
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 65
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 299.437000
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 15782796862
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 100
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 38
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 2230
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 20501
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.808369
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.315787
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES gauge
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 3985328
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES gauge
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 21715174
pod "curl" deleted
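
The raw endpoint output is verbose, so it can help to reduce it to one metric and the pod it belongs to. The following sketch does that with awk; the function name dcgm_metric_by_pod is hypothetical, and it assumes the Prometheus text format shown above, where each sample line starts with the metric name, carries a pod="..." label, and ends with the value.

```shell
# Print "<pod> <value>" for every sample of one metric.
# Reads Prometheus text-format output on stdin; $1 is the metric name.
dcgm_metric_by_pod() {
  awk -v m="$1" '
    # Only sample lines start with "<metric>{"; HELP/TYPE comments are skipped.
    index($0, m "{") == 1 {
      pod = "-"
      if (match($0, /pod="[^"]*"/)) {
        # Drop the leading pod=" (5 characters) and the closing quote.
        pod = substr($0, RSTART + 5, RLENGTH - 6)
      }
      # The sample value is the last whitespace-separated field.
      print pod, $NF
    }'
}
```

Piping the curl command from the previous step into dcgm_metric_by_pod DCGM_FI_DEV_GPU_UTIL should print the gpu-burn pod name followed by its current GPU utilization.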

Configure and deploy the CloudWatch agent

To configure and deploy the CloudWatch agent, complete the following steps:

  1. Download the YAML file and edit it
curl -O https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/k8s/1.3.15/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-eks.yaml

The file contains a cwagent configmap and a prometheus configmap. For this post, we edit both.

  1. Edit the prometheus-eks.yaml file

Open the prometheus-eks.yaml file in your favorite editor and replace the cwagentconfig.json section with the following content:

apiVersion: v1
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["Service"],
                  "label_matcher": ".*dcgm.*",
                  "dimensions": [["Service","Namespace","ClusterName","job","pod"]],
                  "metric_selectors": [
                    "^DCGM_FI_DEV_GPU_UTIL$",
                    "^DCGM_FI_DEV_DEC_UTIL$",
                    "^DCGM_FI_DEV_ENC_UTIL$",
                    "^DCGM_FI_DEV_MEM_CLOCK$",
                    "^DCGM_FI_DEV_MEM_COPY_UTIL$",
                    "^DCGM_FI_DEV_POWER_USAGE$",
                    "^DCGM_FI_DEV_ROW_REMAP_FAILURE$",
                    "^DCGM_FI_DEV_SM_CLOCK$",
                    "^DCGM_FI_DEV_XID_ERRORS$",
                    "^DCGM_FI_PROF_DRAM_ACTIVE$",
                    "^DCGM_FI_PROF_GR_ENGINE_ACTIVE$",
                    "^DCGM_FI_PROF_PCIE_RX_BYTES$",
                    "^DCGM_FI_PROF_PCIE_TX_BYTES$",
                    "^DCGM_FI_PROF_PIPE_TENSOR_ACTIVE$"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }
  1. In the prometheus config section, append the following job definition for the DCGM exporter
    - job_name: 'kubernetes-pod-dcgm-exporter'
      sample_limit: 10000
      metrics_path: /api/v1/metrics/prometheus
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: '^DCGM.*$'
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: ${1}:9400
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: Namespace
      - source_labels: [__meta_kubernetes_pod]
        action: replace
        target_label: pod
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_container_name
        target_label: container_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_controller_name
        target_label: pod_controller_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_controller_kind
        target_label: pod_controller_kind
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_phase
        target_label: pod_phase
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: NodeName
  1. Save the file and apply the cwagent-dcgm configuration to your cluster
kubectl apply -f ./prometheus-eks.yaml

We get the following output:

namespace/amazon-cloudwatch created
configmap/prometheus-cwagentconfig created
configmap/prometheus-config created
serviceaccount/cwagent-prometheus created
clusterrole.rbac.authorization.k8s.io/cwagent-prometheus-role created
clusterrolebinding.rbac.authorization.k8s.io/cwagent-prometheus-role-binding created
deployment.apps/cwagent-prometheus created
  1. Confirm that the CloudWatch agent pod is running
kubectl -n amazon-cloudwatch get pods

We get the following output:

NAME                                  READY   STATUS    RESTARTS   AGE
cwagent-prometheus-7dfd69cc46-s4cx7   1/1     Running   0          15m
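
The __address__ relabel rule in the scrape job above keeps the discovered pod address, drops any port that came with it, and appends port 9400, the port on which the DCGM exporter service listens. The following sketch reproduces that rewrite with sed so you can try it on sample addresses. The function name rewrite_scrape_address is hypothetical, and sed only approximates the Prometheus regex engine: the digit class is written as [0-9], and the anchors that Prometheus applies implicitly are spelled out.

```shell
# Rewrite "<host>" or "<host>:<port>" to "<host>:9400",
# mirroring the __address__ relabel rule of the scrape job.
rewrite_scrape_address() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?$/\1:9400/'
}
```

For example, rewrite_scrape_address 192.168.214.10:8080 and rewrite_scrape_address 192.168.214.10 both yield 192.168.214.10:9400.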

Visualize metrics on the CloudWatch console

To visualize the metrics in CloudWatch, complete the following steps:

  1. On the CloudWatch console, under Metrics in the navigation pane, choose All metrics
  2. In the Custom namespaces section, choose the new entry for ContainerInsights/Prometheus

For more information about the ContainerInsights/Prometheus namespace, refer to Scraping additional Prometheus sources and importing those metrics.

CloudWatch - ContainerInsights/Prometheus

  1. Drill down to the metric names and choose DCGM_FI_DEV_GPU_UTIL
  2. On the Graphed metrics tab, set Period to 5 seconds

CloudWatch - Period Setting

  1. Set the refresh interval to 10 seconds

You will see the metrics collected from the DCGM exporter, which show the gpu-burn pattern switching on and off every 20 seconds.

CloudWatch - gpuburn pattern

On the Browse tab, you can see the data, including the pod name for each metric.

CloudWatch - pod name for metric

The EKS API metadata has been combined with the DCGM metrics data, resulting in the provided pod-based GPU metrics.

This concludes the first approach of exporting DCGM metrics to CloudWatch via the CloudWatch agent.

In the next section, we configure the second architecture, which exports the DCGM metrics to Prometheus, and we visualize them with Grafana.

Use Prometheus and Grafana to visualize GPU metrics from DCGM

Complete the following steps:

  1. Add the Prometheus community helm chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

This chart deploys both Prometheus and Grafana. We need to make some edits to the chart before running the install command.

  1. Save the chart configuration values to a file in /tmp
helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values
  1. Edit the chart configuration file

Edit the saved file (/tmp/kube-prometheus-stack.values) and set the following option by looking for the setting name and setting the value:

prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
  1. Add the following scrape job to the additionalScrapeConfigs section
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
  1. Deploy the Prometheus stack with the updated values
helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values /tmp/kube-prometheus-stack.values

We get the following output:

NAME: kube-prometheus-stack-1684965548
LAST DEPLOYED: Wed May 24 21:59:14 2023
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1684965548"

Visit https://github.com/prometheus-operator/kube-prometheus
 for instructions on how to create & configure Alertmanager 
and Prometheus instances using the Operator.
  1. Confirm that the Prometheus pods are running
kubectl get pods -n prometheus

We get the following output:

NAME                                                              READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-1684-alertmanager-0            2/2     Running   0          6m55s
kube-prometheus-stack-1684-operator-6c87649878-j7v55              1/1     Running   0          6m58s
kube-prometheus-stack-1684965548-grafana-dcd7b4c96-bzm8p          3/3     Running   0          6m58s
kube-prometheus-stack-1684965548-kube-state-metrics-7d856dptlj5   1/1     Running   0          6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-2fbl5   1/1     Running   0          6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-m7zmv   1/1     Running   0          6m58s
prometheus-kube-prometheus-stack-1684-prometheus-0                2/2     Running   0          6m55s

Prometheus and Grafana pods are in the Running state.

Next, we validate that DCGM metrics are flowing into Prometheus.

  1. Port-forward the Prometheus UI

There are different ways to expose the Prometheus UI running in EKS to requests originating outside of the cluster. We will use kubectl port-forwarding. So far, we have been executing commands inside the aws-do-eks container. To access the Prometheus service running in the cluster, we create a tunnel from the host where the aws-do-eks container is running. To do this, run the following command outside of the container, in a new terminal shell on the host. We will refer to this as the “host shell”.

kubectl -n prometheus port-forward svc/$(kubectl -n prometheus get svc | grep prometheus | grep -v alertmanager | grep -v operator | grep -v grafana | grep -v metrics | grep -v exporter | grep -v operated | cut -d ' ' -f 1) 8080:9090 &

While the port-forwarding process is running, we are able to access the Prometheus UI from the host as described below.
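
The chain of grep -v filters in the port-forward command selects the one service that fronts Prometheus itself. If you find the chain hard to read, the same selection can be sketched as a single awk filter. The function name pick_prometheus_svc is hypothetical, and unlike the grep chain, which matches whole lines, this version matches only the service name in the first column.

```shell
# Print the name of the first service whose name contains "prometheus"
# but none of the excluded component names.
# Reads the table printed by "kubectl -n prometheus get svc" on stdin.
pick_prometheus_svc() {
  awk 'NR > 1 &&
       $1 ~ /prometheus/ &&
       $1 !~ /alertmanager|operator|grafana|metrics|exporter|operated/ {
         print $1
         exit
       }'
}
```

One way to use it is kubectl -n prometheus get svc | pick_prometheus_svc, substituting the printed name after svc/ in the port-forward command.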

  1. Open the Prometheus UI
    • If you are using Cloud9, please navigate to Preview->Preview Running Application to open the Prometheus UI in a tab inside the Cloud9 IDE, then click the icon in the upper-right corner of the tab to pop out in a new window.
    • If you are on your local host or connected to an EC2 instance via remote desktop, open a browser and visit the URL http://localhost:8080.

Prometheus - DCGM metrics

  1. Enter DCGM to see the DCGM metrics that are flowing into Prometheus
  2. Select DCGM_FI_DEV_GPU_UTIL, choose Execute, and then navigate to the Graph tab to see the expected GPU utilization pattern

Prometheus - gpuburn pattern
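The same metric can also be read without the UI, through the Prometheus HTTP API on the forwarded port. The following is a minimal sketch, assuming the port-forward is still running on localhost:8080 and that curl is installed on the host. The helper names and the PROM_URL variable are hypothetical, and the sketch assumes the metric carries a pod label.

```shell
# Build a PromQL expression that averages a DCGM metric per pod.
# The metric name defaults to DCGM_FI_DEV_GPU_UTIL.
gpu_util_by_pod_query() {
  printf 'avg by (pod) (%s)\n' "${1:-DCGM_FI_DEV_GPU_UTIL}"
}

# Run an instant query against the Prometheus HTTP API.
# PROM_URL is an assumed variable; it defaults to the forwarded port.
query_prometheus() {
  curl -s --get "${PROM_URL:-http://localhost:8080}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Show the expression that would be sent.
gpu_util_by_pod_query

# With the port-forward running, the query itself would be issued as:
#   query_prometheus "$(gpu_util_by_pod_query)"
```

The response is a JSON document with one entry per pod that is currently reporting the metric.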

  1. Stop the Prometheus port-forwarding process

Run the following command line in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')
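As an alternative to searching the process table, the PID of the background process could be recorded when it is started and used later to stop it. The sketch below illustrates the idea with two hypothetical helpers, start_forward and stop_forward, and a hypothetical PID file path. A sleep process stands in for kubectl so that the example runs without a cluster.

```shell
# Start a command in the background and record its PID in a file.
start_forward() {
  pid_file=$1
  shift
  "$@" >/dev/null 2>&1 &
  echo "$!" > "$pid_file"
}

# Stop the recorded process with the default SIGTERM and remove the PID file.
stop_forward() {
  pid_file=$1
  [ -f "$pid_file" ] || return 0
  pid=$(cat "$pid_file")
  kill "$pid" 2>/dev/null || true
  # Reap the process if it is a child of this shell; ignore the error otherwise.
  wait "$pid" 2>/dev/null || true
  rm -f "$pid_file"
}

# Demonstration with a stand-in process instead of kubectl:
start_forward /tmp/port-forward-demo.pid sleep 30
stop_forward /tmp/port-forward-demo.pid

# With the real command it would look like this (service name elided):
#   start_forward /tmp/prometheus-forward.pid kubectl -n prometheus port-forward svc/... 8080:9090
#   stop_forward /tmp/prometheus-forward.pid
```

Sending SIGTERM rather than SIGKILL gives the process a chance to close its connections before it exits.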

Now we can visualize the DCGM metrics via Grafana Dashboard.

  1. Retrieve the password to log in to the Grafana UI

kubectl -n prometheus get secret $(kubectl -n prometheus get secrets | grep grafana | cut -d ' ' -f 1) -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

  1. Port-forward the Grafana service

Run the following command line in your host shell:

kubectl port-forward -n prometheus svc/$(kubectl -n prometheus get svc | grep grafana | cut -d ' ' -f 1) 8080:80 &

  1. Log in to the Grafana UI

Access the Grafana UI login screen the same way as you accessed the Prometheus UI earlier. If you are using Cloud9, select Preview->Preview Running Application, then pop out in a new window. If you are using your local host or an EC2 instance with remote desktop, visit the URL http://localhost:8080. Log in with the user name admin and the password you retrieved earlier.

Grafana - login

  1. In the navigation pane, choose Dashboards

Grafana - dashboards

  1. Choose New and Import

Grafana - load by id from grafana.com

We are going to import the default DCGM Grafana dashboard described in NVIDIA DCGM Exporter Dashboard.

  1. In the field import via grafana.com, enter 12239 and choose Load
  2. Choose Prometheus as the data source
  3. Choose Import

Grafana - import dashboard

You will see a dashboard similar to the one in the following screenshot.

Grafana - dashboard

To demonstrate that these metrics are pod-based, we are going to modify the GPU Utilization pane in this dashboard.

  1. Choose the pane and the options menu (three dots)
  2. Expand the Options section and edit the Legend field
  3. Replace the value there with Pod {{pod}}, then choose Save

Grafana - pod-based metric

The legend now shows the gpu-burn pod name associated with the displayed GPU utilization.

  1. Stop port-forwarding the Grafana UI service

Run the following in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')

In this post, we demonstrated using open-source Prometheus and Grafana deployed to the EKS cluster. If desired, this deployment can be substituted with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Clean up

To clean up the resources you created, run the following script from the aws-do-eks container shell:

./eks-delete.sh

Conclusion

In this post, we utilized NVIDIA DCGM Exporter to collect GPU metrics and visualize them with either CloudWatch or Prometheus and Grafana. We invite you to use the architectures demonstrated here to enable GPU utilization monitoring with NVIDIA DCGM in your own AWS environment.

About the authors

Amr Ragab is a former Principal Solutions Architect, EC2 Accelerated Computing at AWS. He is devoted to helping customers run computational workloads at scale. In his spare time, he likes traveling and finding new ways to integrate technology into daily life.

Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He’s a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open-source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Keita Watanabe is a Senior Solutions Architect of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry’s best cloud-based Self-managed Machine Learning solutions. His background is in Machine Learning research and development. Prior to joining AWS, Keita was working in the e-commerce industry. Keita holds a Ph.D. in Science from the University of Tokyo.
