Introducing Amazon Kendra tabular search for HTML Documents
Amazon Kendra is an intelligent search service powered by machine learning (ML). Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.
Amazon Kendra users can now quickly find the information they need from tables on a webpage (HTML tables) using Amazon Kendra tabular search. Tables contain useful information in a structured format that can be easily interpreted by making visual associations between row and column headers. With Amazon Kendra tabular search, you can now get specific information from a cell or from certain rows and columns relevant to your query, as well as a preview of the table.
In this post, we provide an example of how to use Amazon Kendra tabular search.
Tabular search in Amazon Kendra
Let’s say you have a webpage in HTML format that contains a table with inflation rates and annual changes in the US from 2012–2021, as shown in the following screenshot.
When you search for “Inflation rate in US”, Amazon Kendra presents the top three rows in the preview and up to five columns, as shown in the following screenshot. You can then see if this article has the relevant details that you’re looking for and decide to either use this information or open the link to get additional details. Amazon Kendra tabular search can also handle merged rows.
Let’s do another search and get specific information from the table by asking “What was the annual change of inflation rate in 2017?”. As shown in the following screenshot, Amazon Kendra tabular search highlights the specific cell that contains the answer to your question.
Now let’s search for “Which year had the top inflation rate?” Amazon Kendra searches the table, sorts the results, and gives you the year that had the highest inflation rate.
Amazon Kendra can also find the range of column information that you’re looking for. For example, let’s search for “Inflation rate from 2012 to 2014.” Amazon Kendra displays the rows and columns between 2012–2014 in the preview.
Get started with Amazon Kendra tabular search
Amazon Kendra tabular search is turned on by default, and no special configuration is required to enable it. For newly added documents, Amazon Kendra tabular search works by default. For existing HTML pages that contain tables, you can either update the document and sync (if you only have a few documents) or reach out to AWS Support.
To test tabular search on your internal or external webpage, complete the following steps (a minimal AWS SDK for Python sketch follows the list):
- Create an index.
- Add data sources by using the web crawler or downloading the HTML page and uploading it to an Amazon Simple Storage Service (Amazon S3) bucket.
- Go to the Search Indexed Content tab and test it out.
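The following is a minimal sketch of these steps using the AWS SDK for Python (Boto3). The index name, role ARNs, and seed URL are placeholders; substitute values from your own account.

```python
import boto3

kendra = boto3.client("kendra")

# 1. Create an index (the role ARN is a placeholder for a role that Amazon Kendra can assume).
index = kendra.create_index(
    Name="tabular-search-demo",
    RoleArn="arn:aws:iam::111122223333:role/KendraIndexRole",
)
index_id = index["Id"]
# In practice, wait for the index to become ACTIVE before adding data sources.

# 2. Add a web crawler data source pointing at the HTML pages that contain tables.
data_source = kendra.create_data_source(
    IndexId=index_id,
    Name="html-pages",
    Type="WEBCRAWLER",
    RoleArn="arn:aws:iam::111122223333:role/KendraDataSourceRole",
    Configuration={
        "WebCrawlerConfiguration": {
            "Urls": {"SeedUrlConfiguration": {"SeedUrls": ["https://example.com/inflation"]}}
        }
    },
)

# 3. Start a sync, then test queries on the Search indexed content tab or via the Query API.
kendra.start_data_source_sync_job(Id=data_source["Id"], IndexId=index_id)
```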
Limitations and considerations
Keep the following in mind when using this feature:
- In this release, Amazon Kendra only supports HTML tables defined within the standard table tag. This doesn’t include nested tables or other forms of tables.
- Amazon Kendra can search through tables with up to 30 columns, 60 rows, and 500 total table cells. If a table exceeds any of these limits, Amazon Kendra will not search within that table.
- Amazon Kendra doesn’t display tabular search results if the confidence score of the query result for the column and row is very low. You can look at the confidence score within ScoreAttributes in the QueryResultItem object returned by the Query API (see the snippet following this list).
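As a sketch, you might inspect the confidence of each result like the following; the index ID is a placeholder.

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.query(IndexId="<your-index-id>", QueryText="Inflation rate in US")

for item in response["ResultItems"]:
    # ScoreAttributes.ScoreConfidence is one of VERY_HIGH, HIGH, MEDIUM, LOW, or NOT_AVAILABLE.
    confidence = item.get("ScoreAttributes", {}).get("ScoreConfidence")
    print(item["Type"], confidence)
```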
Conclusion
With Amazon Kendra tabular search for HTML, you can now search across both unstructured data from various data sources and structured data in the form of tables. This further enhances the user experience: you can get factual responses to your natural language queries from tables as well as from free text. The table preview with Amazon Kendra’s suggested answers allows you to quickly assess whether an HTML document’s table contains the relevant information you’re looking for, thereby saving time.
Amazon Kendra tabular search is available in the following AWS regions during launch: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Sydney), Asia Pacific (Singapore), Canada (Central) and AWS GovCloud (US-West).
To learn more about Amazon Kendra, visit the Amazon Kendra product page.
About the authors
Vikas Shah is an Enterprise Solutions Architect at Amazon Web Services. He is a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His areas of interest are ML, IoT, robotics and storage. In his spare time, Vikas enjoys building robots, hiking, and traveling.
Enterprise administrative controls, simple sign-up, and expanded programming language support for Amazon CodeWhisperer
Amazon CodeWhisperer is a machine learning (ML)-powered service that helps improve developer productivity by generating code recommendations based on developers’ prior code and comments. Today, we are excited to announce that AWS administrators can now enable CodeWhisperer for their organization with single sign-on (SSO) authentication. Administrators can easily integrate CodeWhisperer with their existing workforce identity solutions, provide access to users and groups, and configure organization-wide settings. Additionally, individual users who don’t have AWS accounts can now use CodeWhisperer using their personal email with AWS Builder ID. The sign-up process takes only a few minutes and enables developers to start using CodeWhisperer immediately without any waitlist. We’re also expanding programming language support for CodeWhisperer. In addition to Python, Java, and JavaScript, developers can now use CodeWhisperer to accelerate development of their C# and TypeScript projects.
In this post, we discuss enterprise administrative controls, the new AWS Builder ID sign-up for CodeWhisperer, and support for new programming languages.
Enable CodeWhisperer for your organization
CodeWhisperer is now available on the AWS Management Console. Any user with an AWS administrator role can enable CodeWhisperer, add and remove users, and centrally manage settings for your organization via the console.
As a prerequisite, your AWS administrators have to set up SSO via AWS IAM Identity Center (successor to AWS Single Sign-On), if not already enabled for your organization. IAM Identity Center enables you to use your organization’s SSO to access AWS services by integrating your existing workforce identity solution with AWS. After SSO authentication is set up, your administrators can enable CodeWhisperer and assign access to users and groups, as shown in the following screenshot.
In addition to managing users, AWS administrators can also configure settings for the reference tracker and data sharing. The CodeWhisperer reference tracker detects whether a code recommendation might be similar to particular CodeWhisperer training data and can provide those references to you. CodeWhisperer learns, in part, from open-source projects. Sometimes, a suggestion it’s giving you may be similar to a specific piece of training data. The reference tracker setting enables administrators to decide whether CodeWhisperer is allowed to offer suggestions in such cases. When allowed, CodeWhisperer will also provide references, so that you can learn more about where the training data comes from. AWS administrators can also opt out of data sharing for the purpose of CodeWhisperer service improvement on behalf of your organization (see AI services opt-out policies). Once configured by the administrator, the settings are applied across your organization.
Developers who were given access can start using CodeWhisperer in their preferred IDE by simply logging in using their SSO login credentials. CodeWhisperer is available as part of the AWS Toolkit extensions for major IDEs, including JetBrains, Visual Studio Code, and AWS Cloud9.
In your preferred IDE, choose the SSO login option and follow the prompts to get authenticated and start getting recommendations from CodeWhisperer, as shown in the following screenshots.
Sign up within minutes using your personal email
If you’re an individual developer who doesn’t have access to an AWS account, you can use your personal email to sign up and enable CodeWhisperer in your preferred IDE. The sign-up process takes only a few minutes.
We’re introducing a new method of authentication with AWS Builder ID. AWS Builder ID is a new form of authentication that allows you to sign up securely with just your personal email and a password. After you create an AWS Builder account, simply log in and enable CodeWhisperer for your IDE, as shown in the following screenshot. For more information, see AWS Builder ID docs.
Build apps faster with TypeScript and C# programming languages
Keeping up with multiple programming languages, frameworks, and software libraries is an arduous task even for the most experienced developers. Looking up the correct programming syntax and searching the web for code snippets to complete programming tasks takes a significant amount of time, especially if you consider the cost of distractions.
CodeWhisperer provides ready-to-use real-time recommendations in your IDE to help you finish your coding tasks faster. Today, we’re expanding our support to include TypeScript and C# programming languages, in addition to Python, Java, and JavaScript.
CodeWhisperer understands your intent and provides recommendations based on the most commonly used best practices for a programming language. The following example shows how CodeWhisperer can generate the entire function in TypeScript to render JSON to a table.
CodeWhisperer also makes it easy for developers to use AWS services by providing code recommendations for AWS application programming interfaces (APIs) across the most popular services, including Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, and Amazon Simple Storage Service (Amazon S3). We also offer a reference tracker with our recommendations that provides valuable information about the similarity of the recommendation to particular CodeWhisperer training data. Furthermore, we have implemented techniques to detect and filter biased code that might be unfair. The following example shows how CodeWhisperer can generate an entire function based on prompts provided in C#.
Get started with CodeWhisperer
During the preview period, CodeWhisperer is available to all developers across the world for free. To access the service in preview, you can enable it for your organization using the console, or you can use the AWS Builder ID to get started as an individual developer. For more information about the service, visit Amazon CodeWhisperer.
About the Authors
Bharadwaj Tanikella is a Senior Product Manager for Amazon CodeWhisperer. He has a background in machine learning, both as a developer and a product manager. In his spare time he loves to bike, read non-fiction, and learn new languages.
Ankur Desai is a Principal Product Manager within the AWS AI Services team.
Amazon SURE program hosts three Amazon Days
Students had a chance to gain industry exposure and interact with Amazon researchers.
Optimize hyperparameters with Amazon SageMaker Automatic Model Tuning
Machine learning (ML) models are taking the world by storm. Their performance relies on using the right training data and choosing the right model and algorithm. But it doesn’t end here. Typically, algorithms defer some design decisions to the ML practitioner to adapt for their specific data and task. These deferred design decisions manifest themselves as hyperparameters.
What does that name mean? The result of ML training, the model, can be largely seen as a collection of parameters that are learned during training. Therefore, the parameters that are used to configure the ML training process are then called hyperparameters—parameters describing the creation of parameters. At any rate, they are of very practical use, such as the number of epochs to train, the learning rate, the max depth of a decision tree, and so forth. And we pay much attention to them because they have a major impact on the ultimate performance of your model.
Just like turning a knob on a radio receiver to find the right frequency, each hyperparameter should be carefully tuned to optimize performance. Searching the hyperparameter space for the optimal values is referred to as hyperparameter tuning or hyperparameter optimization (HPO), and should result in a model that gives accurate predictions.
In this post, we set up and run our first HPO job using Amazon SageMaker Automatic Model Tuning (AMT). We learn about the methods available to explore the results, and create some insightful visualizations of our HPO trials and the exploration of the hyperparameter space!
Amazon SageMaker Automatic Model Tuning
As an ML practitioner using SageMaker AMT, you can focus on the following:
- Providing a training job
- Defining the right objective metric matching your task
- Scoping the hyperparameter search space
SageMaker AMT takes care of the rest, and you don’t need to think about the infrastructure, orchestrating training jobs, and improving hyperparameter selection.
Let’s start by using SageMaker AMT for our first simple HPO job, to train and tune an XGBoost algorithm. We want your AMT journey to be hands-on and practical, so we have shared the example in the following GitHub repository. This post covers the 1_tuning_of_builtin_xgboost.ipynb notebook.
In an upcoming post, we’ll extend the notion of just finding the best hyperparameters and include learning about the search space and to what hyperparameter ranges a model is sensitive. We’ll also show how to turn a one-shot tuning activity into a multi-step conversation with the ML practitioner, to learn together. Stay tuned (pun intended)!
Prerequisites
This post is for anyone interested in learning about HPO and doesn’t require prior knowledge of the topic. Basic familiarity with ML concepts and Python programming is helpful though. For the best learning experience, we highly recommend following along by running each step in the notebook in parallel to reading this post. And at the end of the notebook, you also get to try out an interactive visualization that makes the tuning results come alive.
Solution overview
We’re going to build an end-to-end setup to run our first HPO job using SageMaker AMT. When our tuning job is complete, we look at some of the methods available to explore the results, both via the AWS Management Console and programmatically via the AWS SDKs and APIs.
First, we familiarize ourselves with the environment and SageMaker Training by running a standalone training job, without any tuning for now. We use the XGBoost algorithm, one of many algorithms provided as a SageMaker built-in algorithm (no training script required!).
We see how SageMaker Training operates in the following ways:
- Starts and stops an instance
- Provisions the necessary container
- Copies the training and validation data onto the instance
- Runs the training
- Collects metrics and logs
- Collects and stores the trained model
Then we move to AMT and run an HPO job:
- We set up and launch our tuning job with AMT
- We dive into the methods available to extract detailed performance metrics and metadata for each training job, which enables us to learn more about the optimal values in our hyperparameter space
- We show you how to view the results of the trials
- We provide you with tools to visualize data in a series of charts that reveal valuable insights into our hyperparameter space
Train a SageMaker built-in XGBoost algorithm
It all starts with training a model. In doing so, we get a sense of how SageMaker Training works.
We want to take advantage of the speed and ease of use offered by the SageMaker built-in algorithms. All we need are a few steps to get started with training:
- Prepare and load the data – We download and prepare our dataset as input for XGBoost and upload it to our Amazon Simple Storage Service (Amazon S3) bucket.
- Select our built-in algorithm’s image URI – SageMaker uses this URI to fetch our training container, which in our case contains a ready-to-go XGBoost training script. Several algorithm versions are supported.
- Define the hyperparameters – SageMaker provides an interface to define the hyperparameters for our built-in algorithm. These are the same hyperparameters as used by the open-source version.
- Construct the estimator – We define the training parameters such as instance type and number of instances.
- Call the fit() function – We start our training job.
The following diagram shows how these steps work together.
Provide the data
To run ML training, we need to provide data. We provide our training and validation data to SageMaker via Amazon S3.
In our example, for simplicity, we use the SageMaker default bucket to store our data. But feel free to customize the following values to your preference:
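As a sketch, the defaults might look like the following; the prefix name is just an example.

```python
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()   # SageMaker default bucket; replace with your own if preferred
prefix = "amt-visualize-demo"       # S3 key prefix used throughout this example
```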
In the notebook, we use a public dataset and store the data locally in the data directory. We then upload our training and validation data to Amazon S3. Later, we also define pointers to these locations to pass them to SageMaker Training.
In this post, we concentrate on introducing HPO. For illustration, we use a specific dataset and task, so that we can obtain measurements of objective metrics that we then use to optimize the selection of hyperparameters. However, for the overall post neither the data nor the task matters. To present you with a complete picture, let us briefly describe what we do: we train an XGBoost model that should classify handwritten digits from the Optical Recognition of Handwritten Digits Data Set [1] via Scikit-learn. XGBoost is an excellent algorithm for structured data and can even be applied to the Digits dataset. The values are 8×8 images, as in the following example showing a 0, a 5, and a 4.
Select the XGBoost image URI
After choosing our built-in algorithm (XGBoost), we must retrieve the image URI and pass this to SageMaker to load onto our training instance. For this step, we review the available versions. Here we’ve decided to use version 1.5.1, the latest version of the algorithm at the time of writing. Depending on the task, ML practitioners may write their own training script that, for example, includes data preparation steps. But this isn’t necessary in our case.
If you want to write your own training script, then stay tuned, we’ve got you covered in our next post! We’ll show you how to run SageMaker Training jobs with your own custom training scripts.
For now, we need the correct image URI by specifying the algorithm, AWS Region, and version number:
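A minimal sketch of that lookup with the SageMaker Python SDK follows; the version string is an example and should match the version you chose.

```python
import sagemaker

region = sagemaker.Session().boto_region_name

# Retrieve the URI of the built-in XGBoost training container for this Region and version.
xgboost_image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=region, version="1.5-1"
)
```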
That’s it. Now we have a reference to the XGBoost algorithm.
Define the hyperparameters
Now we define our hyperparameters. These values configure how our model will be trained, and eventually influence how the model performs against the objective metric we’re measuring against, such as accuracy in our case. Note that nothing about the following block of code is specific to SageMaker. We’re actually using the open-source version of XGBoost, just provided by and optimized for SageMaker.
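The values below are illustrative rather than the exact ones from the notebook, but they show the shape of the hyperparameter dictionary passed to the built-in XGBoost algorithm.

```python
# Starting hyperparameters for the built-in XGBoost algorithm (illustrative values).
hyperparameters = {
    "num_class": 10,               # ten digit classes, 0-9
    "objective": "multi:softmax",  # multi-class classification objective
    "num_round": 200,              # number of boosting rounds
    "eta": 0.2,                    # learning rate
    "max_depth": 5,
    "alpha": 0.1,                  # L1 regularization
    "min_child_weight": 1,
}
```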
Although each of these hyperparameters is configurable and adjustable, the objective multi:softmax is determined by our dataset and the type of problem we’re solving for. In our case, the Digits dataset contains multiple labels (an observation of a handwritten digit could be 0 or 1, 2, 3, 4, 5, 6, 7, 8, or 9), meaning it is a multi-class classification problem.
For more information about the other hyperparameters, refer to XGBoost Hyperparameters.
Construct the estimator
We configure the training on an estimator object, which is a high-level interface for SageMaker Training.
Next, we define the number of instances to train on, the instance type (CPU-based or GPU-based), and the size of the attached storage:
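Here is a hedged sketch of the estimator configuration; the instance type and storage size are example choices and may differ from the notebook.

```python
import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri=xgboost_image_uri,          # built-in XGBoost container from the previous step
    role=role,
    instance_count=1,                     # single training instance
    instance_type="ml.m5.xlarge",         # CPU-based instance (example choice)
    volume_size=5,                        # attached storage in GB
    output_path=f"s3://{bucket}/{prefix}/output",
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker.Session(),
)
```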
We now have the infrastructure configuration that we need to get started. SageMaker Training will take care of the rest.
Call the fit() function
Remember the data we uploaded to Amazon S3 earlier? Now we create references to it:
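A sketch of those references, assuming the CSV files were uploaded under the prefix defined earlier:

```python
from sagemaker.inputs import TrainingInput

# Pointers to the training and validation CSV files in Amazon S3 (paths are illustrative).
train_input = TrainingInput(f"s3://{bucket}/{prefix}/data/train.csv", content_type="text/csv")
validation_input = TrainingInput(f"s3://{bucket}/{prefix}/data/valid.csv", content_type="text/csv")
```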
A call to fit() launches our training. We pass in the references to the training data we just created to point SageMaker Training to our training and validation data:
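In code, that call looks roughly like the following; train and validation are the channel names expected by the built-in XGBoost algorithm.

```python
# Launch the training job with the input channels defined above.
estimator.fit({"train": train_input, "validation": validation_input})
```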
Note that to run HPO later on, we don’t actually need to call fit() here. We just need the estimator object later on for HPO, and could just jump to creating our HPO job. But because we want to learn about SageMaker Training and see how to run a single training job, we call it here and review the output.
After the training starts, we start to see the output below the cells, as shown in the following screenshot. The output is available in Amazon CloudWatch as well as in this notebook.
The black text is log output from SageMaker itself, showing the steps involved in training orchestration, such as starting the instance and loading the training image. The blue text is output directly from the training instance itself. We can observe the process of loading and parsing the training data, and visually see the training progress and the improvement in the objective metric directly from the training script running on the instance.
Also note that at the end of the job output, the training duration in seconds and billable seconds are shown.
Finally, we see that SageMaker uploads our trained model to the S3 output path defined on the estimator object. The model is ready to be deployed for inference.
In a future post, we’ll create our own training container and define our training metrics to emit. You’ll see how SageMaker is agnostic of what container you pass it for training. This is very handy for when you want to get started quickly with a built-in algorithm, but then later decide to pass your own custom training script!
Inspect current and previous training jobs
So far, we have worked from our notebook with our code and submitted training jobs to SageMaker. Let’s switch perspectives and leave the notebook for a moment to check out what this looks like on the SageMaker console.
SageMaker keeps a historic record of training jobs it ran, their configurations such as hyperparameters, algorithms, data input, the billable time, and the results. In the list in the preceding screenshot, you see the most recent training jobs filtered for XGBoost. The highlighted training job is the job we just trained in the notebook, whose output you saw earlier. Let’s dive into this individual training job to get more information.
The following screenshot shows the console view of our training job.
We can review the information we received as cell output to our fit()
function in the individual training job within the SageMaker console, along with the parameters and metadata we defined in our estimator.
Recall the log output from the training instance we saw earlier. We can access the logs of our training job here too, by scrolling to the Monitor section and choosing View logs.
This shows us the instance logs inside CloudWatch.
Also remember the hyperparameters we specified in our notebook for the training job. We see them here in the same UI of the training job as well.
In fact, the details and metadata we specified earlier for our training job and estimator can be found on this page on the SageMaker console. We have a helpful record of the settings used for the training, such as what training container was used and the locations of the training and validation datasets.
You might be asking at this point, why exactly is this relevant for hyperparameter optimization? It’s because you can search, inspect, and dive deeper into those HPO trials that we’re interested in. Maybe the ones with the best results, or the ones that show interesting behavior. We’ll leave it to you what you define as “interesting.” It gives us a common interface for inspecting our training jobs, and you can use it with SageMaker Search.
Although SageMaker AMT orchestrates the HPO jobs, the HPO trials are all launched as individual SageMaker Training jobs and can be accessed as such.
With training covered, let’s get tuning!
Train and tune a SageMaker built-in XGBoost algorithm
To tune our XGBoost model, we’re going to reuse our existing hyperparameters and define ranges of values we want to explore for them. Think of this as extending the borders of exploration within our hyperparameter search space. Our tuning job will sample from the search space and run training jobs for new combinations of values. The following code shows how to specify the hyperparameter ranges that SageMaker AMT should sample from:
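Here is a sketch of such a definition; the boundaries are illustrative and may differ from the notebook.

```python
from sagemaker.tuner import ContinuousParameter, IntegerParameter

# Ranges that AMT is allowed to sample from (example boundaries).
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.01, 0.5),
    "alpha": ContinuousParameter(0.01, 1.0),
    "min_child_weight": ContinuousParameter(1, 10),
    "max_depth": IntegerParameter(1, 10),
}
```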
The ranges for an individual hyperparameter are specified by their type, like ContinuousParameter. For more information and tips on choosing these parameter ranges, refer to Tune an XGBoost Model.
We haven’t run any experiments yet, so we don’t know the ranges of good values for our hyperparameters. Therefore, we start with an educated guess, using our knowledge of algorithms and our documentation of the hyperparameters for the built-in algorithms. This defines a starting point to define the search space.
Then we run a tuning job sampling from hyperparameters in the defined ranges. As a result, we can see which hyperparameter ranges yield good results. With this knowledge, we can refine the search space’s boundaries by narrowing or widening which hyperparameter ranges to use. We demonstrate how to learn from the trials in the next and final section, where we investigate and visualize the results.
In our next post, we’ll continue our journey and dive deeper. In addition, we’ll learn that there are several strategies that we can use to explore our search space. We’ll run subsequent HPO jobs to find even more performant values for our hyperparameters, while comparing these different strategies. We’ll also see how to run a warm start with SageMaker AMT to use the knowledge gained from previously explored search spaces in our exploration beyond those initial boundaries.
For this post, we focus on how to analyze and visualize the results of a single HPO job using the Bayesian search strategy, which is likely to be a good starting point.
If you follow along in the linked notebook, note that we pass the same estimator that we used for our single, built-in XGBoost training job. This estimator object acts as a template for new training jobs that AMT creates. AMT will then vary the hyperparameters inside the ranges we defined.
By specifying that we want to maximize our objective metric, validation:accuracy, we’re telling SageMaker AMT to look for this metric in the training instance logs and pick hyperparameter values that it believes will maximize the accuracy metric on our validation data. We picked an appropriate objective metric for XGBoost from our documentation.
Additionally, we can take advantage of parallelization with max_parallel_jobs. This can be a powerful tool, especially for strategies whose trials are selected independently, without considering (learning from) the outcomes of previous trials. We’ll explore these other strategies and parameters further in our next post. For this post, we use Bayesian, which is an excellent default strategy.
We also define max_jobs to define how many trials to run in total. Feel free to deviate from our example and use a smaller number to save money.
We once again call fit(), the same way as when we launched a single training job earlier in the post. But this time on the tuner object, not the estimator object. This kicks off the tuning job, and in turn AMT starts training jobs.
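Putting the pieces above together, a sketch of the tuner setup and launch might look like the following; the job counts mirror the example discussed in this post (50 trials, 3 in parallel).

```python
from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=estimator,                          # acts as a template for the trial jobs
    objective_metric_name="validation:accuracy",  # metric AMT reads from the training logs
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",                          # the strategy used in this post
    max_jobs=50,                                  # total number of trials
    max_parallel_jobs=3,                          # trials run concurrently
)

# Kick off the tuning job; AMT launches the individual training jobs.
tuner.fit({"train": train_input, "validation": validation_input})
```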
The following diagram expands on our previous architecture by including HPO with SageMaker AMT.
We see that our HPO job has been submitted. Depending on the number of trials, defined by n_jobs, and the level of parallelization, this may take some time. For our example, it may take up to 30 minutes for 50 trials with only a parallelization level of 3.
When this tuning job is finished, let’s explore the information available to us on the SageMaker console.
Investigate AMT jobs on the console
Let’s find our tuning job on the SageMaker console by choosing Training in the navigation pane and then Hyperparameter tuning jobs. This gives us a list of our AMT jobs, as shown in the following screenshot. Here we locate our bayesian-221102-2053 tuning job and find that it’s now complete.
Let’s have a closer look at the results of this HPO job.
We have explored extracting the results programmatically in the notebook: first via the SageMaker Python SDK, a higher-level open-source Python library that provides a dedicated API to SageMaker, and then through Boto3, which provides lower-level APIs to SageMaker and other AWS services.
Using the SageMaker Python SDK, we can obtain the results of our HPO job:
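For example, a sketch with the SageMaker Python SDK:

```python
# Collect the trial results into a pandas DataFrame, sorted by the objective metric.
trials_df = tuner.analytics().dataframe()
trials_df.sort_values("FinalObjectiveValue", ascending=False).head()
```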
This allowed us to analyze the results of each of our trials in a Pandas DataFrame, as seen in the following screenshot.
Now let’s switch perspectives again and see what the results look like on the SageMaker console. Then we’ll look at our custom visualizations.
On the same page, choosing our bayesian-221102-2053 tuning job provides us with a list of trials that were run for our tuning job. Each HPO trial here is a SageMaker Training job. Recall earlier when we trained our single XGBoost model and investigated the training job in the SageMaker console. We can do the same thing for our trials here.
As we investigate our trials, we see that bayesian-221102-2053-048-b59ec7b4 created the best performing model, with a validation accuracy of approximately 89.815%. Let’s explore what hyperparameters led to this performance by choosing the Best training job tab.
We can see a detailed view of the best hyperparameters evaluated.
We can immediately see what hyperparameter values led to this superior performance. However, we want to know more. Can you guess what? We see that alpha takes on an approximate value of 0.052456 and, likewise, eta is set to 0.433495. This tells us that these values worked well, but it tells us little about the hyperparameter space itself. For example, we might wonder whether 0.433495 for eta was the highest value tested, or whether there’s room for growth and model improvement by selecting higher values.
For that, we need to zoom out, and take a much wider view to see how other values for our hyperparameters performed. One way to look at a lot of data at once is to plot our hyperparameter values from our HPO trials on a chart. That way we see how these values performed relatively. In the next section, we pull this data from SageMaker and visualize it.
Visualize our trials
The SageMaker SDK provides us with the data for our exploration, and the notebooks give you a peek into that. But there are many ways to utilize and visualize it. In this post, we share a sample using the Altair statistical visualization library, which we use to produce a more visual overview of our trials. These are found in the amtviz package, which we are providing as part of the sample.
The power of these visualizations becomes immediately apparent when plotting our trials’ validation accuracy (y-axis) over time (x-axis). The following chart on the left shows validation accuracy over time. We can clearly see the model performance improving as we run more trials over time. This is a direct and expected outcome of running HPO with a Bayesian strategy. In our next post, we see how this compares to other strategies and observe that this doesn’t need to be the case for all strategies.
After reviewing the overall progress over time, now let’s look at our hyperparameter space.
The following charts show validation accuracy on the y-axis, with each chart showing max_depth, alpha, eta, and min_child_weight on the x-axis, respectively. We’ve plotted our entire HPO job into each chart. Each point is a single trial, and each chart contains all 50 trials, but separated for each hyperparameter. This means that our best performing trial, #48, is represented by exactly one blue dot in each of these charts (which we have highlighted for you in the following figure). We can visually compare its performance within the context of all other 49 trials. So, let’s look closely.
Fascinating! We see immediately which regions of our defined ranges in our hyperparameter space are most performant! Thinking back to our eta value, it’s clear now that sampling values closer to 0 yielded worse performance, whereas moving closer to our border, 0.5, yields better results. The reverse appears to be true for alpha, and max_depth appears to have a more limited set of preferred values. Looking at max_depth, you can also see how using a Bayesian strategy instructs SageMaker AMT to more frequently sample values it learned worked well in the past.
Looking at our eta value, we might wonder whether it’s worth exploring more to the right, perhaps beyond 0.45? Does it continue to trail off to lower accuracy, or do we need more data here? This wondering is part of the purpose of running our first HPO job. It provides us with insights into which areas of the hyperparameter space we should explore further.
If you’re keen to know more, and are as excited as we are by this introduction to the topic, then stay tuned for our next post, where we’ll talk more about the different HPO strategies, compare them against each other, and practice training with our own Python script.
Clean up
To avoid incurring unwanted costs when you’re done experimenting with HPO, you must remove all files in your S3 bucket with the prefix amt-visualize-demo and also shut down Studio resources.
Run the following code in your notebook to remove all S3 files from this post.
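A minimal sketch of that cleanup, assuming bucket and prefix are defined as earlier in the notebook:

```python
import boto3

s3 = boto3.resource("s3")

# Delete every object under the amt-visualize-demo prefix in the bucket used for this post.
s3.Bucket(bucket).objects.filter(Prefix="amt-visualize-demo").delete()
```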
If you wish to keep the datasets or the model artifacts, you may modify the prefix in the code to amt-visualize-demo/data to only delete the data, or to amt-visualize-demo/output to only delete the model artifacts.
Conclusion
In this post, we trained and tuned a model using the SageMaker built-in version of the XGBoost algorithm. By using HPO with SageMaker AMT, we learned about the hyperparameters that work well for this particular algorithm and dataset.
We saw several ways to review the outcomes of our hyperparameter tuning job. Starting with extracting the hyperparameters of the best trial, we also learned how to gain a deeper understanding of how our trials had progressed over time and what hyperparameter values are impactful.
Using the SageMaker console, we also saw how to dive deeper into individual training runs and review their logs.
We then zoomed out to view all our trials together, and review their performance in relation to other trials and hyperparameters.
We learned that based on the observations from each trial, we were able to navigate the hyperparameter space to see that tiny changes to our hyperparameter values can have a huge impact on our model performance. With SageMaker AMT, we can run hyperparameter optimization to find good hyperparameter values efficiently and maximize model performance.
In the future, we’ll look into different HPO strategies offered by SageMaker AMT and how to use our own custom training code. Let us know in the comments if you have a question or want to suggest an area that we should cover in upcoming posts.
Until then, we wish you and your models happy learning and tuning!
References
- Amazon SageMaker Automatic Model Tuning
- Perform Automatic Model Tuning with SageMaker
- How XGBoost Works
- SageMaker API reference to describe the hyperparameter job
Citations:
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
About the authors
Andrew Ellul is a Solutions Architect with Amazon Web Services. He works with small and medium-sized businesses in Germany. Outside of work, Andrew enjoys exploring nature on foot or by bike.
Elina Lesyk is a Solutions Architect located in Munich. Her focus is on enterprise customers from the Financial Services Industry. In her free time, Elina likes learning guitar theory in Spanish to cross-learn and going for a run.
Mariano Kamp is a Principal Solutions Architect with Amazon Web Services. He works with financial services customers in Germany on machine learning. In his spare time, Mariano enjoys hiking with his wife.
How JPMorgan Chase & Co. uses AWS DeepRacer events to drive global cloud adoption
This is a guest post by Stephen Carrad, Vice President at JP Morgan Chase & Co.
JPMorgan Chase & Co. started its cloud journey four years ago, building the integrations required to deploy cloud-native applications into the cloud in a resilient and secure manner. In the first year, three applications tentatively dipped their toes into the cloud, and today, we have an ambitious cloud-first agenda.
Operating in the cloud requires a change in culture and a fundamental reeducation towards a new normal. An on-premises server is like your car: you own it, power it, maintain it, and upgrade it. In the cloud, a server is like a rideshare: you press a few buttons, the car appears, you use it for a certain time, and when you’ve finished with it you walk away and someone else uses it. To adapt to a cloud first agenda, our engineers are learning a new operating model, new tools, and new processes.
JPMorgan Chase’s AWS DeepRacer learning program was born in Chicago in 2019. A child of the Chicago Innovation team, it’s designed to upskill our employees in an enjoyable way by allowing them to compete internally with their local peer groups, globally against other cities, and externally against other firms, universities, and individuals. We started with physical tracks in Chicago and London, and now have tracks in most of our 20+ technology centers around the globe and several racers participating in the DeepRacer Championship Cup at AWS re:Invent.
It started small, but immediately provided value and, more importantly, entertainment to the participants of the program. People who had never used the AWS Management Console before logged on and learned how it worked, played with AWS DeepRacer, and started to write code and learn about reinforcement learning. They also started to collaborate with one another—someone would have an idea to reduce costs or provide visualization of the log analysis, and other people would partner with them to build new tools. It grew beyond teaching people about AWS products and machine learning to people across the world collaborating, building tools, and creating quizzes. We also have the JPMorgan Chase International Speedway, developed in Tampa, where we host our companywide annual finals.
Our AWS DeepRacer learning program now runs in 20 cities and 3,500 people have participated over the past two years. They have gained knowledge of the AWS console, Python, Amazon SageMaker, Jupyter notebooks, and reinforcement learning. Our biggest success is watching people change roles due to their participation.
We recently introduced the AWS DeepRacer Driving License, so hiring managers can see that applicants have attained a recognized standard. It includes a training curriculum that people can follow, enabling them to be both knowledgeable and competitive. They also need to attain a certain lap time to prove they have been able to apply the knowledge they have gained.
JPMorgan Chase is now a cloud-first organization. With the excitement and interest in the Driving License, application teams have started to look towards the cloud and have found they are more likely to have technologists on their team with AWS skills. These individuals have then been able to apply their new skills in their day-to-day work.
In 2021, more than 80,000 participants from over 150 countries participated in AWS DeepRacer. As a testament to the work our employees have done with AWS DeepRacer, seven of the 40 racers in AWS’s global championships were JPMorgan Chase technologists. When the dust had settled, our employees topped the podium with first, second, and seventh place finishes. This was a huge achievement against some excellent competitors, and I apologize to anyone sitting near us in the arena at AWS re:Invent for all the shouting and screams of excitement.
This year’s entry to the AWS Championship finals can be achieved by racing on either virtual or physical tracks. We’re looking to get our tracks out and invite our competitors to come and learn, share ideas, enjoy pizza and practice on our tracks. We have also open-sourced two tools that we have created:
- DeepRacer on the Spot – This tool placed third in our Annual Hackathon in Houston. It allows teams to train models on Amazon Elastic Compute Cloud (Amazon EC2) instances using Spot pricing, which can be up to 90% cheaper than training on the console.
- Guru – Developed by one of our participants in London, this log analysis tool provides visualization of what the car is doing on the track at any point and how it is being rewarded.
Racing this year is going to be particularly interesting as we continue to expand our presence with top racers. Yousef, Roger, and Tyler will be trying to knock Sairam off the podium, and a couple of groups of MDs are forming their own teams—look out for Managing Directions! I would say that my money is on our graduate talent, but that might be career limiting. We look forward to collaborating with our fellow racers on the tools we are releasing and invite you to race on our tracks.
AWS DeepRacer is at the forefront of making us a cloud-ready organization. To learn more about how you can drive collaboration and ML learning like JPMorgan Chase with AWS DeepRacer, join my session on Wednesday, November 30th at 2:30 PM.
About the author
Stephen Carrad is a DevOps Manager at JPMorgan Chase. He also leads the JPMorgan Chase DeepRacer Learning Program to grow his team building skills and support the firm’s widespread public cloud adoption. Outside of work, Stephen enjoys trying to keep up with his teenage children whilst skiing or cycling and coaching his local under-16 rugby team.
Apply fine-grained data access controls with AWS Lake Formation and Amazon EMR from Amazon SageMaker Studio
Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to perform every step of the ML workflow, from preparing data to building, training, tuning, and deploying models. Studio comes with built-in integration with Amazon EMR so that data scientists can interactively prepare data at petabyte scale using open-source frameworks such as Apache Spark, Hive, and Presto right from within Studio notebooks. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism. We’re excited to announce that Studio now supports applying this fine-grained data access control with Lake Formation when accessing data through Amazon EMR.
Until now, when you ran multiple data processing jobs on an EMR cluster, all the jobs used the same AWS Identity and Access Management (IAM) role for accessing data—namely, the cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile. Therefore, to run jobs that needed access to different data sources such as different Amazon Simple Storage Service (Amazon S3) buckets, you had to configure the EC2 instance profile with policies that allowed access to the union of all such data sources. Additionally, for enabling groups of users with differential access to data, you had to create multiple separate clusters, one for each group, resulting in operational overheads. Separately, jobs submitted to Amazon EMR from Studio notebooks were unable to apply fine-grained data access control with Lake Formation.
Starting with the release of Amazon EMR 6.9, when you connect to EMR clusters from Studio notebooks, you can visually browse and choose an IAM role on the fly called the runtime IAM role. Subsequently, all your Apache Spark, Apache Hive, or Presto jobs created from Studio notebooks will access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with Lake Formation, you can enforce table-level and column-level access using policies attached to the runtime role.
With this new capability, multiple Studio users can connect to the same EMR cluster, each using a runtime IAM role scoped with permissions matching their individual level of access to data. Their user sessions are also completely isolated from one another on the shared cluster. With this ability to control fine-grained access to data on the same shared cluster, you can simplify provisioning of EMR clusters, thereby reducing operational overhead and saving costs.
In this post, we demonstrate how to use a Studio notebook to connect to an EMR cluster using runtime roles. We provide a sample Studio Lifecycle Configuration that can help configure the EMR runtime roles that a Studio user profile has access to. Additionally, we manage data access in a data lake via Lake Formation by enforcing row-level and column-level permissions to the EMR runtime roles.
Solution overview
We demonstrate this solution with an end-to-end use case using a sample dataset, the TPC data model. This data represents transaction data for products and includes information such as customer demographics, inventory, web sales, and promotions. To demonstrate fine-grained data access permissions, we consider the following two users:
- David, a data scientist on the marketing team. He is tasked with building a model on customer segmentation, and is only permitted to access non-sensitive customer data.
- Tina, a data scientist on the sales team. She is tasked with building the sales forecast model, and needs access to sales data for the particular region. She is also helping the product team with innovation, and therefore needs access to product data as well.
The architecture is implemented as follows:
- Lake Formation manages the data lake, and the raw data is available in S3 buckets
- Amazon EMR is used to query the data from the data lake and perform data preparation using Spark
- IAM roles are used to manage data access using Lake Formation
- Studio is used as the single visual interface to interactively query and prepare the data
The following diagram illustrates this architecture.
The following sections walk through the steps required to enable runtime IAM roles for Amazon EMR integration with an existing Studio domain. You can use the provided AWS CloudFormation stack in the Deploy the solution section below to set up the architectural components for this solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account
- An IAM user with administrator access
Set up Amazon EMR with runtime roles
The EMR cluster should be created with IAM runtime roles enabled. For more details on using runtime roles with Amazon EMR, see Configure runtime roles for Amazon EMR steps. Associating runtime roles with EMR clusters is supported in Amazon EMR 6.9. Make sure the following configuration is in place:
- The EMR runtime role’s trust policy should allow the EMR EC2 instance profile to assume the role
- The EMR EC2 instance profile role should be able to assume the EMR runtime roles
- The EMR cluster should be created with encryption in transit
You can optionally choose to pass the SourceIdentity (the Studio user profile name) for monitoring the user resource access. Follow the steps outlined in Monitoring user resource access from Amazon SageMaker Studio to enable SourceIdentity for your Studio domain.
Finally, refer to Prepare Data using Amazon EMR for detailed setup and networking instructions on integrating Studio with EMR clusters.
Create bootstrap action for the cluster
You need to run a bootstrap action on the cluster to ensure Studio notebook’s connectivity with EMR through runtime roles. Complete the following steps:
- Download the bootstrap script from s3://emr-data-access-control-<region>/customer-bootstrap-actions/gcsc/replace-rpms.sh, replacing <region> with your region.
- Download the RPM file from s3://emr-data-access-control-<region>/customer-bootstrap-actions/gcsc/emr-secret-agent-1.18.0-SNAPSHOT20221121212949.noarch.rpm.
- Upload both files to an S3 bucket in your account and region.
- When creating your EMR cluster, include the following bootstrap action:
--bootstrap-actions "Path=<S3-URI-of-the-bootstrap-script>,Args=[<S3-URI-of-the-RPM-file>]"
Update the Studio execution role
Your Studio user’s execution role needs to be updated to allow the GetClusterSessionCredentials API action. Add the following policy to the Studio execution role, replacing the resource with the cluster ARNs you wish to allow your users to connect to:
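The following is a hedged sketch of such a policy attached inline with Boto3; the role name and cluster ARN are placeholders for your own values.

```python
import json
import boto3

iam = boto3.client("iam")

# Allow the Studio execution role to connect to a specific EMR cluster with a runtime role.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "elasticmapreduce:GetClusterSessionCredentials",
            "Resource": "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/j-EXAMPLE",
        }
    ],
}

iam.put_role_policy(
    RoleName="StudioUserExecutionRole",            # placeholder Studio execution role name
    PolicyName="AllowEmrRuntimeRoleConnection",
    PolicyDocument=json.dumps(policy_document),
)
```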
You can also use conditions to control which EMR execution roles can be used by the Studio execution role.
Alternatively, you can attach a policy such as the following, which restricts access to clusters based on resource tags. This allows for tag-based access control, and you can reuse the same policy statements across user roles instead of explicitly adding cluster ARNs.
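As a sketch, the statement could use a resource tag condition instead of explicit ARNs; the tag key and value here are hypothetical.

```python
# Tag-based variant: allow connections to any cluster carrying a specific tag.
tag_based_statement = {
    "Effect": "Allow",
    "Action": "elasticmapreduce:GetClusterSessionCredentials",
    "Resource": "*",
    "Condition": {
        "StringEquals": {"elasticmapreduce:ResourceTag/studio-access": "allowed"}
    },
}
```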
Set up role configurations through Studio LCC
By default, the Studio UI uses the Studio execution role to connect to the EMR cluster. If your user can access multiple roles, they can update the EMR cluster connection commands with the role ARN they want to pass as a runtime role. For a better user experience, you can set up a configuration file on the user’s home directory on Amazon Elastic File System (Amazon EFS), which automatically informs the Studio UI of the roles that are available to connect for the user. You can also automate this process through Studio Lifecycle Configurations. We provide the following sample Lifecycle Configuration script to configure the roles:
Connect to clusters from the Studio UI
After the role and Lifecycle Configuration scripts are set up, you can launch the Studio UI and connect to the clusters when you create a new notebook using any of the following kernels:
- DataScience – Python 3 kernel
- DataScience 2.0 – Python 3 kernel
- DataScience 3.0 – Python 3 kernel
- SparkAnalytics 1.0 – SparkMagic and PySpark kernels
- SparkAnalytics 2.0 – SparkMagic and PySpark kernels
- SparkMagic – PySpark kernel
Note: The Studio UI for connecting to EMR clusters using runtime roles works only on JupyterLab version 3. See Jupyter versioning for details on upgrading to JupyterLab 3.
Deploy the solution
To test out the solution end to end, we provide a CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:
- An S3 bucket for the data lake.
- An EMR cluster with EMR runtime roles enabled.
- IAM roles for accessing the data in the data lake, with fine-grained permissions:
  - Marketing-data-access-role
  - Sales-data-access-role
  - Electronics-data-access-role
- A Studio domain and two user profiles. The Studio execution roles for the users allow the users to assume their corresponding EMR runtime roles.
- A Lifecycle Configuration to enable the selection of the role to use for the EMR connection.
- A Lake Formation database populated with the TPC data.
- Networking resources required for the setup, such as VPC, subnets, and security groups.
To deploy the solution, complete the following steps:
- Choose Launch Stack to launch the CloudFormation stack:
- Enter a stack name and provide the following parameters:
  - An idle timeout for the EMR cluster (to avoid paying for the cluster when it’s not being used).
  - An S3 URI with the EMR encryption key. You can follow the steps in the EMR documentation here to generate a key and zip file specific to your region. If you are deploying in US East (N. Virginia), remember to use CN=*.ec2.internal, as specified in the documentation here. Make sure to upload the zip file to an S3 bucket in the same region as your CloudFormation stack deployment.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
Once the stack is created, allow Amazon EMR to query Lake Formation by updating the External Data Filtering settings on Lake Formation. Follow the instructions provided in the Lake Formation guide here, choose Amazon EMR for Session tag values, and enter your AWS account ID under AWS account IDs.
Test role-based data access
With the infrastructure in place, you’re ready to test out the fine-grained data access for the two Studio users. To recap, the user David should only be able to access non-sensitive customer data. Tina can access data in two tables: sales and product information. Let’s test each user profile.
David’s user profile
To test your data access with David’s user profile, complete the following steps:
- Log in to the AWS console.
- From the created Studio domain, launch Studio from the user profile david-non-sensitive-customer.
- In the Studio UI, start a notebook with any of the supported kernels, e.g., the SparkMagic image with the PySpark kernel.
The cluster is pre-created in the account.
- Connect to the cluster by choosing Cluster in your notebook and choosing the cluster <StackName>-emr-cluster. In the role selector pop-up, choose the <StackName>-marketing-data-access-role.
- Choose Connect.
This will automatically create a notebook cell with magic commands to connect to the cluster. Wait for the cell to execute and the connection to be established before proceeding with the remaining steps.
Now let’s query the marketing table from the notebook.
- In a new cell, enter the following query and run the cell:
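For example, a query along these lines (the database and table names are illustrative; use the ones created by the stack):

```python
# Read from the customer table; Lake Formation column-level filtering is applied automatically.
spark.sql("SELECT * FROM tpc.customer LIMIT 10").show()
```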
After the cell runs successfully, you can view the first 10 records in the table. Note that you can’t view the customers’ names, because the user only has permission to read non-sensitive data, enforced through column-level filtering.
Let’s test to make sure David can’t read any sensitive customer data.
- In a new cell, run the following query:
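For example (again with illustrative names), explicitly requesting a protected column:

```python
# Request name columns that David is not permitted to read.
spark.sql("SELECT c_first_name, c_last_name FROM tpc.customer LIMIT 10").show()
```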
This cell should throw an Access Denied error.
Tina’s user profile
Tina’s Studio execution role allows her to access the Lake Formation database using two EMR execution roles. This is achieved by listing the role ARNs in a configuration file in Tina’s file directory. These roles can be set using Studio Lifecycle Configurations to persist the roles across app restarts. To test Tina’s access, complete the following steps:
- Launch Studio from the user profile tina-sales-electronics.
It’s a good practice to close any previous Studio sessions on your browser when switching user profiles. There can only be one active Studio user session at a time.
- In the Studio UI, start a notebook with any of the supported kernels, e.g., SparkMagic image with the PySpark kernel.
- Connect to the cluster by choosing Cluster in your notebook and choosing the cluster <StackName>-emr-cluster.
- Choose Connect.
Because Tina’s profile is set up with multiple EMR roles, you’re prompted with a UI drop-down that allows you to connect using multiple roles.
The Studio execution role is also available in the dropdown, because clusters connect using the user’s execution role by default. You can directly provide Lake Formation access to the user’s execution role as well.
Connecting will automatically create a notebook cell with magic commands to connect to the cluster, using the chosen role.
Now let’s query the sales table from the notebook.
- In a new cell, enter the following query and run the cell:
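For example (table name illustrative):

```python
# Read from the sales table that Tina's runtime role is permitted to access.
spark.sql("SELECT * FROM tpc.store_sales LIMIT 10").show()
```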
After the cell runs successfully, you can view the first 10 records in the table.
Now let’s try accessing the product table.
- Choose Cluster again, and choose the cluster.
- In the role prompt pop-up, choose the role <StackName>-electronics-data-access-role and connect to the cluster.
- After you’re connected successfully to the cluster with the electronics data access role, create a new cell and run the following query:
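For example (table name illustrative):

```python
# Read from the product table using the electronics data access role.
spark.sql("SELECT * FROM tpc.item LIMIT 10").show()
```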
This cell should complete successfully, and you can view the first 10 records in the products table.
With a single Studio user profile, you have now successfully assumed multiple roles, and queried data in Lake Formation using multiple roles, without the need for restarting the notebooks or creating additional clusters. Now that you’re able to access the data using appropriate roles, you can interactively explore the data, visualize the data, and prepare data for training. You also used different user profiles to provide your users in different teams access to a specific table or columns and rows, without the need for additional clusters.
Clean up
When you’re finished experimenting with this solution, clean up your resources:
- Shut down the Studio apps for the user profiles. See Shut Down and Update SageMaker Studio and Studio Apps for instructions. Ensure that all apps are deleted before deleting the stack.
The EMR cluster is automatically deleted after the idle timeout period elapses.
- Delete the EFS volume created for the domain. You can find the EFS volume attached to the domain by using a DescribeDomain API call (see the example after this list).
- Make sure to empty the S3 buckets created by this stack.
- Delete the stack from the AWS CloudFormation console.
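As an example of the EFS lookup mentioned in the steps above, here is a minimal sketch using Boto3; the domain ID is a placeholder:
import boto3

sagemaker = boto3.client("sagemaker")
# The response includes the EFS file system attached to the Studio domain
response = sagemaker.describe_domain(DomainId="d-xxxxxxxxxxxx")  # placeholder domain ID
print(response["HomeEfsFileSystemId"])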
Conclusion
This post showed you how you can use runtime roles to connect Studio with Amazon EMR to apply fine-grained data access control with Lake Formation. We also demonstrated how multiple Studio users can connect to the same EMR cluster, each using a runtime IAM role scoped with permissions matching their individual level of access to data. We detailed the steps required to manually set up the integration, and provided a CloudFormation template to set up the infrastructure end to end. This feature is available in the following AWS Regions: Europe (Paris), US East (N. Virginia and Ohio), and US West (Oregon); the CloudFormation template deploys in US East (N. Virginia and Ohio) and US West (Oregon).
To learn more about using EMR with SageMaker Studio, visit Prepare Data using Amazon EMR. We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!
About the authors
Durga Sury is an ML Solutions Architect on the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 3 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and hikes with her four-year-old husky.
Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include technology strategy, data analytics, and data science. In his spare time, he enjoys playing tennis, binge-watching TV shows, and playing the tabla.
Maira Ladeira Tanke is an ML Specialist Solutions Architect at AWS. With a background in data science, she has 9 years of experience architecting and building ML applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through emerging technologies and innovative solutions. In her free time, Maira enjoys traveling and spending time with her family someplace warm.
Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using machine learning. In his free time, he likes photographing the amazing geology of the American Southwest.
Jun Lyu is a Software Engineer on the SageMaker Notebooks team. He has a Master’s degree in engineering from Duke University. He has been working for Amazon since 2015 and has contributed to AWS services like Amazon Machine Learning, Amazon SageMaker Notebooks, and Amazon SageMaker Studio. In his spare time, he enjoys spending time with his family, reading, cooking, and playing video games.
AWS Cloud technology for near-real-time cardiac anomaly detection using data from wearable devices
Cardiovascular diseases (CVDs) are the number one cause of death globally: more people die each year from CVDs than from any other cause.
The COVID-19 pandemic pushed organizations to change how they deliver healthcare, reducing staff contact with sick people and the overall pressure on the healthcare system. Cloud technology enables organizations to deliver telehealth solutions that monitor and detect conditions that can put patient health at risk.
In this post, we present an AWS architecture that processes live electrocardiogram (ECG) feeds from common wearable devices, analyzes the data, and provides near-real-time information via a web dashboard. If a potential critical condition is detected, it sends real-time alerts to subscribed individuals.
Solution overview
The architecture is divided into six layers:
- Data ingestion
- Live ECG stream storage
- ECG data processing
- Historic ECG pathology archive
- Live alerts
- Visualization dashboard
The following diagram shows the high-level architecture.
In the following sections, we discuss each layer in more detail.
Data ingestion
The data ingestion layer uses AWS IoT Core as the connection point between the external remote sensors and the AWS Cloud architecture, which is capable of storing, transforming, analyzing, and showing insights from the acquired live feeds from remote wearable devices.
When data from the remote wearable devices reaches AWS IoT Core, it can be routed onward using an AWS IoT rule and its associated actions.
In the proposed architecture, we use one rule and one action. The rule extracts data from the raw stream using a simple SQL statement, as outlined by the following AWS IoT Core rule definition SQL code.
SELECT device_id, ecg, ppg, bpm, timestamp() as timestamp FROM 'dt/sensor/#'
The action writes the extracted data from the rule into an Amazon Timestream database.
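For illustration, the rule and its Timestream action could be created programmatically along the following lines; the rule name, database, table, and role ARN are assumptions, not values from this architecture:
import boto3

iot = boto3.client("iot")
iot.create_topic_rule(
    ruleName="EcgToTimestream",  # hypothetical rule name
    topicRulePayload={
        "sql": "SELECT device_id, ecg, ppg, bpm, timestamp() as timestamp FROM 'dt/sensor/#'",
        "actions": [
            {
                "timestream": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-timestream-role",  # placeholder role
                    "databaseName": "ecg_db",   # assumed database name
                    "tableName": "ecg_table",   # assumed table name
                    "dimensions": [{"name": "device_id", "value": "${device_id}"}],
                }
            }
        ],
    },
)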
For more information on how to implement workloads using AWS IoT Core, refer to Implementing time-critical cloud-to-device IoT message patterns on AWS IoT Core.
Live ECG stream storage
Live data arriving from connected ECG sensors is immediately stored in Timestream, which is purposely designed to store time series data.
From Timestream, data is periodically extracted in strides and subsequently processed by AWS Lambda to generate spectrograms and by Amazon Rekognition to perform ECG spectrogram classification.
You can create and manage a Timestream database via the AWS Management Console, from the AWS Command Line Interface (AWS CLI), or via API calls.
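For example, a minimal sketch of creating the database and table with Boto3; the names and retention settings are assumptions:
import boto3

timestream_write = boto3.client("timestream-write")
timestream_write.create_database(DatabaseName="ecg_db")  # assumed name
timestream_write.create_table(
    DatabaseName="ecg_db",
    TableName="ecg_table",  # assumed name
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,
        "MagneticStoreRetentionPeriodInDays": 7,
    },
)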
On the Timestream console, you can observe and monitor various database metrics, as shown in the following screenshot.
In addition, you can run various queries against a given database.
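For instance, here is a sketch of querying recent records with the Timestream query API, again assuming the database and table names used above:
import boto3

timestream_query = boto3.client("timestream-query")
result = timestream_query.query(
    QueryString='SELECT * FROM "ecg_db"."ecg_table" WHERE time > ago(15m) ORDER BY time DESC LIMIT 10'
)
print(result["Rows"])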
ECG data processing
The processing layer is composed of Amazon EventBridge, Lambda, and Amazon Rekognition.
The core of the detection is the ability to create spectrograms from a time series stride and classify them with Amazon Rekognition Custom Labels. The custom model is trained on an archive of spectrograms generated from time series strides of ECG data from patients affected by various pathologies. At runtime, Lambda transforms the incoming live ECG data stream into spectrograms, which the model then classifies.
EventBridge event details
With EventBridge, it’s possible to create event-driven applications at scale across AWS.
In the case of the ECG near-real-time analysis, EventBridge is used to create an event (SpectrogramPeriodicGeneration) that periodically triggers a Lambda function to generate spectrograms from the raw ECG data and send a request to Amazon Rekognition to analyze the spectrograms for signs of anomalies.
The following screenshot shows the configuration details of the SpectrogramPeriodicGeneration event.
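The following is a minimal sketch of such a scheduled rule and Lambda target; the schedule expression, function name, and ARN are assumptions:
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:GenerateSpectrogramsFromTimeSeries"  # placeholder

# Scheduled rule that fires periodically (the actual schedule may differ)
events.put_rule(Name="SpectrogramPeriodicGeneration", ScheduleExpression="rate(1 minute)")
events.put_targets(
    Rule="SpectrogramPeriodicGeneration",
    Targets=[{"Id": "generate-spectrograms", "Arn": function_arn}],
)
# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName="GenerateSpectrogramsFromTimeSeries",
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
)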
Lambda function details
The Lambda function GenerateSpectrogramsFromTimeSeries, written entirely in Python, acts as the orchestrator for the different steps needed to classify an ECG spectrogram. It’s a crucial piece of the processing layer that detects whether an incoming ECG signal presents signs of possible anomalies.
The Lambda function has three main purposes:
- Fetch a 1-minute stride from the live ECG stream
- Generate spectrograms from it
- Initiate an Amazon Rekognition job to perform classification of the generated spectrograms
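The following is a minimal sketch of those three steps, assuming the Timestream names used earlier, a 250 Hz sampling rate, and a model ARN supplied through an environment variable (scipy and Pillow would need to be packaged as a Lambda layer):
import io
import os

import boto3
import numpy as np
from PIL import Image
from scipy import signal

timestream_query = boto3.client("timestream-query")
rekognition = boto3.client("rekognition")


def handler(event, context):
    # 1. Fetch a 1-minute stride of ECG samples from the live stream in Timestream.
    result = timestream_query.query(
        QueryString=(
            'SELECT measure_value::double AS ecg FROM "ecg_db"."ecg_table" '
            "WHERE measure_name = 'ecg' AND time > ago(1m) ORDER BY time"
        )
    )
    samples = np.array([float(row["Data"][0]["ScalarValue"]) for row in result["Rows"]])

    # 2. Generate a spectrogram from the stride and encode it as a PNG image.
    _, _, sxx = signal.spectrogram(samples, fs=250)  # assumed sampling rate
    pixels = (255 * sxx / (sxx.max() + 1e-9)).astype(np.uint8)
    buffer = io.BytesIO()
    Image.fromarray(pixels).save(buffer, format="PNG")

    # 3. Ask the Rekognition Custom Labels model to classify the spectrogram.
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=os.environ["PROJECT_VERSION_ARN"],  # hypothetical configuration
        Image={"Bytes": buffer.getvalue()},
        MinConfidence=70,
    )
    return response["CustomLabels"]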
Amazon Rekognition details
The ECG analysis to detect if anomalies are present is based on the classification of spectrograms generated from 1-minute-long ECG trace strides.
To accomplish this classification job, we use Rekognition Custom Labels to train a model capable of identifying different cardiac pathologies found in spectrograms generated from ECG traces of people with various cardiac conditions.
To start using Rekognition Custom Labels, we need to specify the locations of the datasets, which contain the data that Amazon Rekognition uses for labeling, training, and validation.
Looking inside the defined datasets, it’s possible to see more details that Amazon Rekognition has extracted from the given Amazon Simple Storage Service (Amazon S3) bucket.
From this page, we can see the labels that Amazon Rekognition has automatically generated by matching the folder names present in the S3 bucket.
In addition, Amazon Rekognition provides a preview of the labeled images.
The following screenshot shows the details of the S3 bucket used by Amazon Rekognition.
After you have defined a dataset, you can use Rekognition Custom Labels to train on your data, and deploy the model for inference afterwards.
The Rekognition Custom Labels project pages provide details about each available project and a tree representation of all the models that have been created.
Moreover, the project pages show the status of the available models and their performances.
You can choose the models on the Rekognition Custom Labels console to see more details of each model, as shown in the following screenshot.
Further details about the model are available on the Model details tab.
For further assessment of model performance, choose View test results. The following screenshot shows an example of test results from our model.
Historic ECG pathology archive
The pathology archive layer receives raw time series ECG data, generates spectrograms, and stores those in a separate bucket that you can use to further train your Rekognition Custom Labels model.
Visualization dashboard
The live visualization dashboard, responsible for showing real-time ECGs, PPG traces, and live BPM, is implemented via Amazon Managed Grafana.
Amazon Managed Grafana is a fully managed service that is developed together with Grafana Labs and based on open-source Grafana. Enhanced with enterprise capabilities, Amazon Managed Grafana makes it easy for you to visualize and analyze your operational data at scale.
On the Amazon Managed Grafana console, you can create workspaces, which are logically isolated Grafana servers where you can create Grafana dashboards. The following screenshot shows a list of our available workspaces.
You can also set up the following on the Workspaces page:
- Users
- User groups
- Data sources
- Notification channels
The following screenshot shows the details of our workspace and its users.
In the Data sources section, we can review and set up all the source feeds that populate the Grafana dashboard.
In the following screenshot, we have three sources configured:
You can choose Configure in Grafana for a given data source to configure it directly in Amazon Managed Grafana.
You’re asked to authenticate within Grafana. For this post, we use AWS IAM Identity Center (successor to AWS Single Sign-On).
After you log in, you’re redirected to the Grafana home page. From here, you can view your saved dashboards. As shown in the following screenshot, we can access our Heart Health Monitoring dashboard.
You can also choose the gear icon in the navigation pane and perform various configuration tasks on the following:
- Data sources
- Users
- User groups
- Statistics
- Plugins
- Preferences
For example, if we choose Data Sources, we can add sources that will feed the Grafana dashboards.
The following screenshot shows the configuration panel for Timestream.
If we navigate to the Heart Health Monitoring dashboard from the Grafana home page, we can review the widgets and information included within the dashboard.
Conclusion
With services like AWS IoT Core, Lambda, Amazon SNS, and Grafana, you can build a serverless solution with an event-driven architecture capable of ingesting, processing, and monitoring data streams in near-real time from a variety of devices, including common wearable devices.
In this post, we explored one way to ingest, process, and monitor live ECG data generated from a synthetic wearable device in order to provide insights to help determine if anomalies might be present in the ECG data stream.
To learn more about how AWS is accelerating innovation in healthcare, visit AWS for Health.
About the Author
Benedetto Carollo is a Senior Solution Architect for medical imaging and healthcare at Amazon Web Services in Europe, Middle East, and Africa. His work focuses on helping medical imaging and healthcare customers solve business problems by leveraging technology. Benedetto has over 15 years of experience in technology and medical imaging and has worked for companies like Canon Medical Research and Vital Images. Benedetto received his summa cum laude MSc in Software Engineering from the University of Palermo, Italy.
How a NeurIPS workshop is increasing women’s visibility in AI
Three questions with Sergül Aydöre, a senior applied scientist at Amazon and general chair of this year’s Women in Machine Learning workshop.
Identifying landmarks with Amazon Rekognition Custom Labels
Amazon Rekognition is a computer vision service that makes it simple to add image and video analysis to your applications using proven, highly scalable, deep learning technology that does not require machine learning (ML) expertise. With Amazon Rekognition, you can identify objects, people, text, scenes, and activities in images and videos and detect inappropriate content. Amazon Rekognition also provides highly accurate facial analysis and facial search capabilities that you can use to detect, analyze, and compare faces for a wide variety of use cases.
Amazon Rekognition Custom Labels is a feature of Amazon Rekognition that makes it simple to build your own specialized ML-based image analysis capabilities to detect unique objects and scenes integral to your specific use case.
Some common use cases of Rekognition Custom Labels include finding your logo in social media posts, identifying your products on store shelves, classifying machine parts in an assembly line, distinguishing between healthy and infected plants, and more.
Amazon Rekognition Labels supports popular landmarks like the Brooklyn Bridge, Colosseum, Eiffel Tower, Machu Picchu, Taj Mahal, and more. If you have other landmarks or buildings not yet supported by Amazon Rekognition, you can still use Amazon Rekognition Custom Labels.
In this post, we demonstrate using Rekognition Custom Labels to detect the Amazon Spheres building in Seattle.
With Rekognition Custom Labels, AWS takes care of the heavy lifting for you. Rekognition Custom Labels builds off the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred images or less) that are specific to your use case via our straightforward console. Amazon Rekognition can begin training in just a few clicks. After Amazon Rekognition begins training from your image set, it can produce a custom image analysis model for you within a few minutes or hours. Behind the scenes, Rekognition Custom Labels automatically loads and inspects the training data, selects suitable ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Rekognition Custom Labels API and integrate it into your applications.
Solution overview
For our example, we use the Amazon Spheres building in Seattle. We train a model using Rekognition Custom Labels; whenever similar images are provided, the algorithm should identify them as Amazon Spheres instead of Dome, Architecture, Glass building, or other labels.
Let’s first show an example of using the label detection feature of Amazon Rekognition, where we feed the image of Amazon Spheres without any custom training. We use the Amazon Rekognition console to open the label detection demo and upload our photo.
After the image is uploaded and analyzed, we see labels with their confidence scores under Results. In this case, Dome was detected with a confidence score of 99.2%, Architecture with 99.2%, Building with 99.2%, Metropolis with 79.4%, and so on.
We want to use custom labeling to produce a computer vision model that can label the image Amazon Spheres.
In the following sections, we walk you through preparing your dataset, creating a Rekognition Custom Labels project, training the model, evaluating the results, and testing it with additional images.
Prerequisites
Before starting with the steps, there are quotas for Rekognition Custom Labels that you need to be aware of. If you want to change the limits, you can request a service limit increase.
Create your dataset
If this is your first time using Rekognition Custom Labels, you’ll be prompted to create an Amazon Simple Storage Service (Amazon S3) bucket to store your dataset.
For this blog demonstration, we have used images of the Amazon Spheres, which we captured while we visited Seattle, WA. Feel free to use your own images as per your need.
Copy your dataset to the newly created bucket, which stores your images inside their respective prefixes.
Create a project
To create your Rekognition Custom Labels project, complete the following steps:
- On the Rekognition Custom Labels console, choose Create a project.
- For Project name, enter a name.
- Choose Create project.
Now we specify the configuration and path of your training and test dataset.
- Choose Create dataset.
You can start with a project that has a single dataset, or a project that has separate training and test datasets. If you start with a single dataset, Rekognition Custom Labels splits your dataset during training to create a training dataset (80%) and a test dataset (20%) for your project.
Additionally, you can create training and test datasets for a project by importing images from one of the following locations:
- An S3 bucket
- A local computer
- Amazon SageMaker Ground Truth
- An existing dataset
For this post, we use our own custom dataset of Amazon Spheres.
- Select Start with a single dataset.
- Select Import images from S3 bucket.
- For S3 URI, enter the path to your S3 bucket.
- If you want Rekognition Custom Labels to automatically label the images for you based on the folder names in your S3 bucket, select Automatically assign image-level labels to images based on the folder name.
- Choose Create dataset.
A page opens that shows you the images with their labels. If you see any errors in the labels, refer to Debugging datasets.
Train the model
After you have reviewed your dataset, you can now train the model.
- Choose Train model.
- For Choose project, enter the ARN for your project if it’s not already listed.
- Choose Train model.
In the Models section of the project page, you can check the current status in the Model status column while training is in progress. Training typically takes 30 minutes to 24 hours to complete, depending on factors such as the number of images and labels in the training set and the ML algorithms used to train your model.
When the model training is complete, you can see the model status as TRAINING_COMPLETED. If the training fails, refer to Debugging a failed model training.
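You can also check the status programmatically; the following is a sketch using Boto3, where the project ARN is a placeholder:
import boto3

rekognition = boto3.client("rekognition")
response = rekognition.describe_project_versions(
    ProjectArn="arn:aws:rekognition:us-east-1:123456789012:project/amazon-spheres/1234567890123"  # placeholder
)
for version in response["ProjectVersionDescriptions"]:
    print(version["ProjectVersionArn"], version["Status"])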
Evaluate the model
Open the model details page. The Evaluation tab shows metrics for each label, and the average metric for the entire test dataset.
The Rekognition Custom Labels console provides metrics such as precision, recall, and F1 score, both as a summary of the training results and for each individual label.
You can view the results of your trained model for individual images, as shown in the following screenshot.
Test the model
Now that we’ve viewed the evaluation results, we’re ready to start the model and analyze new images.
You can start the model on the Use model tab on the Rekognition Custom Labels console, or by using the StartProjectVersion operation via the AWS Command Line Interface (AWS CLI) or Python SDK.
When the model is running, we can analyze the new images using the DetectCustomLabels API. The result from DetectCustomLabels is a prediction that the image contains specific objects, scenes, or concepts. See the following code:
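The following is a minimal sketch of calling DetectCustomLabels with Boto3, assuming the model has already been started and using a placeholder model ARN, bucket, and image name:
import boto3

rekognition = boto3.client("rekognition")
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/amazon-spheres/version/1/1234567890123",  # placeholder
    Image={"S3Object": {"Bucket": "my-dataset-bucket", "Name": "test/amazon-spheres.jpg"}},  # placeholder bucket and key
    MinConfidence=50,
)
for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])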
In the output, you can see the label with its confidence score.
As you can see from the result, with just a few simple clicks, you can use Rekognition Custom Labels to achieve accurate labeling outcomes. You can use this for a multitude of image use cases, such as identifying custom labels for food products, pets, machine parts, and more.
Clean up
To clean up the resources you created as part of this post and avoid any potential recurring costs, complete the following steps:
- On the Use model tab, stop the model.
Alternatively, you can stop the model using the StopProjectVersion operation via the AWS CLI or Python SDK. Wait until the model is in the Stopped state before continuing to the next steps.
- Delete the model.
- Delete the project.
- Delete the dataset.
- Empty the S3 bucket contents and delete the bucket.
Conclusion
In this post, we showed how to use Rekognition Custom Labels to detect building images.
You can get started with your custom image datasets, and with a few simple clicks on the Rekognition Custom Labels console, you can train your model and detect objects in images. Rekognition Custom Labels can automatically load and inspect the data, select the right ML algorithms, train a model, and provide model performance metrics. You can review detailed performance metrics such as precision, recall, F1 scores, and confidence scores.
Popular buildings like the Empire State Building in New York City, the Taj Mahal in India, and many others across the world are now pre-labeled and ready to use for intelligence in your applications. But if you have other landmarks currently not yet supported by Amazon Rekognition Labels, look no further and try out Amazon Rekognition Custom Labels.
For more information about using custom labels, see What Is Amazon Rekognition Custom Labels? Also, visit our GitHub repo for an end-to-end workflow of Amazon Rekognition custom brand detection.
About the Authors:
Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible by leveraging Data & AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.
Bunny Kaushik is a Solutions Architect at AWS. He is passionate about building AI/ML solutions on AWS and helping customers innovate on the AWS platform. Outside of work, he enjoys hiking, climbing, and swimming.