Increase ML model performance and reduce training time using Amazon SageMaker built-in algorithms with pre-trained models

Model training forms the core of any machine learning (ML) project, and having a trained ML model is essential to adding intelligence to a modern application. A performant model is the output of a rigorous and diligent data science methodology. Not implementing a proper model training process can lead to high infrastructure and personnel costs, because model training underlies the experimental phase of the ML process, which by nature tends to be highly iterative.

Generally speaking, training a model from scratch is time-consuming and compute intensive. When the training data is small, we can’t expect to train a very performant model. A better alternative is to fine-tune a pretrained model on the target dataset. For certain use cases, Amazon SageMaker provides high-quality pretrained models that were trained on very large datasets. Fine-tuning these models takes a fraction of the training time compared to training a model from scratch.

To validate this assertion, we ran a study using built-in algorithms with pretrained models. We also compared two types of pretrained models within Amazon SageMaker Studio, Type 1 (legacy) and Type 2 (latest), against a model trained from scratch using the Defect Detection Network (DDN) with regard to training time and infrastructure cost. To demonstrate the training process, we used the defect detection dataset from the post Visual inspection automation using Amazon SageMaker JumpStart. This post showcases the results of the study. We also provide a Studio notebook, which you can modify to run the experiments using your own dataset and an algorithm or model of your choosing.

Model training in Studio

SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment.

There are many ways to train ML models using SageMaker, such as Spark MLlib or custom Python code with TensorFlow, PyTorch, or Apache MXNet, with Amazon SageMaker Debugger available to profile and debug training jobs. You can also bring your own custom algorithm or choose an algorithm from AWS Marketplace.

Furthermore, SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and ML practitioners get started on training and deploying ML models quickly.

You can use built-in algorithms for either classification or regression problems, or for a variety of unsupervised learning tasks. Other built-in algorithms include text analysis and image processing. You can train a model from scratch using a built-in algorithm for a specific use case. For a full list of available built-in algorithms, see Common Information About Built-in Algorithms.

Some built-in algorithms also include pre-trained models for popular problem types, which you can use through the SageMaker SDK as well as Studio. These pre-trained models can greatly reduce the training time as well as infrastructure cost for common use cases such as semantic segmentation, object detection, text summarization, and question answering. For a complete list of pre-trained models, see Models.

For choosing the best model, SageMaker automatic model tuning, also known as hyperparameter tuning or hyperparameter optimization (HPO), can be very useful because it finds the best version of a model by running a slew of training jobs on your dataset using the algorithm and hyperparameters that you specify. Depending on the number of hyperparameters and the size of the search space, finding the best model can require thousands or even tens of thousands of training runs. Automatic model tuning provides a built-in HPO algorithm that removes the undifferentiated heavy lifting required to build your own HPO algorithm. Automatic model tuning provides the option of parallelizing model runs in order to reduce the time and cost of finding the best fit.

After the automatic model tuning has completed multiple runs for a set of hyperparameters, it chooses the hyperparameter values that result in the model with the best performance, as measured by the loss function specific to the model.

Training and validation loss is just one of the metrics needed to pick the best model for the use case. With so many options, it’s not always easy to make the right choice, and picking the best model boils down to the training time, cost of infrastructure, complexity, and quality of the resulting model, among other factors. There are other extraneous costs such as platform and personnel costs that we don’t take into account for this study.

In the subsequent sections, we discuss the study design and the results.

Dataset

We train an object detector on the NEU-DET steel surface defect dataset. This dataset contains 1,800 images and 4,189 bounding boxes in total. The types of defects in the dataset are as follows:

  • Crazing (class: Cr, label: 0)
  • Inclusion (class: In, label: 1)
  • Pitted surface (class: PS, label: 2)
  • Patches (class: Pa, label: 3)
  • Rolled-in scale (class: RS, label: 4)
  • Scratches (class: Sc, label: 5)

For more details about the dataset, refer to Visual inspection automation using Amazon SageMaker JumpStart.

Models

We introduced the Defect Detection Network in the post Visual inspection automation using Amazon SageMaker JumpStart. We trained this model from scratch with the default hyperparameters, so we could have a benchmark to evaluate the rest of the models.

For object detection use cases, SageMaker provides a set of built-in object detection models in two generations, referred to in this post as Type 1 (legacy) and Type 2 (latest).

Aside from training a model from scratch, we used these models to evaluate four approaches that typically reflect an ML model training process. The output of each approach is a trained ML model. In approaches 1 and 3, a set of fixed hyperparameters is provided to train a single model, whereas in approaches 2 and 4, SageMaker produces the best model and the set of hyperparameters that led to the best fit.

  1. Fine-tune Type 1 (legacy) model – We fine-tune the model with a ResNet backbone, which is pre-trained on ImageNet, using default hyperparameters and no hyperparameter optimization (HPO).
  2. Fine-tune Type 1 (legacy) with HPO – Now we run HPO to find better hyperparameters that lead to a better model. For a list of all the parameters you can tune, refer to Tune an Object Detection Model. In this notebook, we only tune the learning rate, momentum, and weight decay, using automatic model tuning to run the HPO and providing a range for each of the three hyperparameters (see the sketch after this list). Automatic model tuning monitors the training logs and parses the objective metric. For object detection, we use mean average precision (mAP) on the validation dataset as our metric.
  3. Fine-tune Type 2 (latest) model – For the Type 2 (latest) object detection model, we follow the instructions in Fine-tune a Model and Deploy to a SageMaker Endpoint and use standard SageMaker APIs. You can find all fine-tunable Type 2 (latest) object detection models in the Built-in Algorithms with pre-trained Model table by filtering on FineTunable?=True. Currently, there are nine fine-tunable object detection models. We use the one with a VGG backbone pretrained on the VOC dataset. We fine-tune using a set of static hyperparameters.
  4. Fine-tune Type 2 (latest) model with HPO – We provide a range for the ADAM learning rate; the rest of the hyperparameters stay default. Also, note that the Type 2 (latest) model training reports Val_CrossEntropy loss and Val_SmoothL1 loss instead of mAP on the validation dataset. Because we can only specify one evaluation metric for automatic model tuning, we choose to minimize Val_CrossEntropy.
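As a rough sketch of how approach 2 could be wired up with the SageMaker Python SDK, the following configures automatic model tuning over the three hyperparameters; the S3 paths, epoch count, batch size, and tuning budget are illustrative assumptions rather than the exact settings used in the study.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Container for the Type 1 (legacy) built-in object detection algorithm
container = sagemaker.image_uris.retrieve("object-detection", session.boto_region_name, version="latest")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://your-bucket/od-output/",   # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    base_network="resnet-50",        # ResNet backbone pre-trained on ImageNet
    use_pretrained_model=1,
    num_classes=6,                   # six defect classes
    num_training_samples=1152,       # size of the training split
    mini_batch_size=16,              # illustrative
    epochs=100,                      # illustrative
)

# Search ranges for the three tuned hyperparameters (illustrative bounds)
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-4, 1e-1),
    "momentum": ContinuousParameter(0.0, 0.99),
    "weight_decay": ContinuousParameter(0.0, 0.99),
}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:mAP",   # maximize mAP on the validation channel
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=10,            # illustrative tuning budget
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://your-bucket/train/", "validation": "s3://your-bucket/validation/"})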

For details on the hyperparameters, you can go through the Studio notebook.

Metrics

Next, we compare the results from the approaches based on important metrics and the infrastructure cost:

  • Loss function difference across models – All the different algorithms define the same loss functions for the object detection task: cross-entropy and smooth L1 loss. However, we use them differently:

    • The Type 1 (legacy) object detection algorithm has defined mAP on the validation data, and we use it as the metric to find a training job that maximizes mAP.
    • The Type 2 (latest) object detection algorithm, however, doesn’t define mAP. Instead, it defines Val_SmoothL1 loss and Val_CrossEntropy loss on the validation data. During model training with HPO, we need to specify one metric for automatic model tuning to monitor and parse. Therefore, we use Val_CrossEntropy loss as the metric and find the training job that minimizes it.
  • Validation metric (mAP) – We use the mAP on the validation dataset as our metric, where average precision (AP) summarizes the precision-recall curve for each class and mAP averages the AP over classes. mAP is the standard evaluation metric used in the COCO challenge for object detection tasks. For more information about the applicability of mAP for object detection, refer to mAP (mean Average Precision) for Object Detection. Because there is a difference in loss function between Type 1 and Type 2 models, we manually calculate the mAP for each type of model on the test dataset. We accomplish this by deploying the models behind a SageMaker endpoint and calling the model endpoint to score on the subset of the dataset. The results are then compared against the ground truth to calculate the mAP for each model type.
  • Training instance runtime cost – For simplicity, we only report the infrastructure cost incurred for each of the four approaches highlighted in the previous section. The cost is reported in dollars and calculated based on the runtime of the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances.

Notebook

The Studio notebook is available on GitHub.

Results

The steel surface dataset has a total of 1,800 images in six categories. As discussed in the previous section, because Type 1 (legacy) and Type 2 (latest) models optimize different objective metrics to find the best model, we first perform a train/test split on the dataset. In the final phase of the study, we run inference on the test dataset so that we can compare the four approaches using the same metric (mAP).

The test set contains 20% of the original dataset, sampled randomly. The remaining 80% is used for the model training phase, which requires both a training and a validation dataset. Therefore, we do a further 80/20 split on that portion, using 80% for training and 20% for validation. See the following table; a minimal code sketch of this two-stage split follows it.

Data Number of Samples Percentage of Original Dataset
Full 1,800 100
Train 1,152 64
Validation 288 16
Test 360 20
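As a rough illustration of this two-stage split, the following scikit-learn sketch reproduces the proportions in the table; the synthetic file names and fixed random seed are assumptions.

from sklearn.model_selection import train_test_split

# Placeholder list standing in for the 1,800 annotated images
image_paths = [f"images/{i:04d}.jpg" for i in range(1800)]

# First split: hold out 20% of the full dataset as the test set
trainval_paths, test_paths = train_test_split(image_paths, test_size=0.20, random_state=42)

# Second split: 80% train / 20% validation of the remaining data
train_paths, val_paths = train_test_split(trainval_paths, test_size=0.20, random_state=42)

print(len(train_paths), len(val_paths), len(test_paths))  # 1152 288 360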

The output of each of the four approaches was a trained ML model. We plot the results from each of the four approaches alongside the bounding boxes from ground truth as well as the DDN model. The following plot also shows the confidence score for the class prediction.

Each prediction includes a confidence score, given as a percentage, which reflects the probability that the object of interest was detected correctly by the algorithm. The mAP itself is computed at different IoU (Intersection over Union) thresholds.

For the purpose of generating the mAP score against the test dataset, we deployed each model behind its own SageMaker real-time endpoint. Each inferencing test produced a mAP score.
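A minimal sketch of that scoring step is shown below; the endpoint name, image key, and response-parsing comment are assumptions that depend on the specific model container.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def score_image(endpoint_name, image_path):
    """Send one test image to a deployed object detection endpoint and return its detections."""
    with open(image_path, "rb") as f:
        payload = f.read()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,          # placeholder endpoint name
        ContentType="application/x-image",
        Body=payload,
    )
    # The Type 1 (legacy) built-in container returns JSON with a 'prediction' list of
    # [class_index, confidence, xmin, ymin, xmax, ymax] entries; other containers may differ.
    return json.loads(response["Body"].read())

detections = score_image("od-type1-endpoint", "images/0001.jpg")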

A larger mAP score implies a higher accuracy of the model test results. Clearly, the Type 2 (latest) models outperform the Type 1 (legacy) models in terms of accuracy, with or without using HPO. Type 2 with HPO has a slight edge (mAP 0.375) over the one without HPO (mAP 0.371).

We also measured the cost of training for each of the four approaches. We used P3 instance types, specifically ml.p3.2xlarge instances, for each approach. Each ml.p3.2xlarge instance costs $3.06 per hour. Both the inference test mAP score and the cost of training are summarized in the following chart for comparison.

For simplicity, we did a cost comparison on the runtime of the training instances only.
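The runtime cost per approach can be derived from the training job metadata, roughly as follows; the job name is a placeholder and the hourly rate is the on-demand price quoted above.

import boto3

sm = boto3.client("sagemaker")
HOURLY_RATE_USD = 3.06  # ml.p3.2xlarge on-demand price used in this comparison

def training_cost(job_name):
    """Estimate the training instance cost of a single SageMaker training job."""
    desc = sm.describe_training_job(TrainingJobName=job_name)
    billable_seconds = desc.get("BillableTimeInSeconds", desc["TrainingTimeInSeconds"])
    instance_count = desc["ResourceConfig"]["InstanceCount"]
    return billable_seconds / 3600 * HOURLY_RATE_USD * instance_count

print(training_cost("type1-legacy-od-training-job"))  # hypothetical job name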

For a more granular estimate of the total cost incurred, including the cost of Studio notebooks as well as the real-time endpoints used for inferencing, refer to the AWS Pricing Calculator for SageMaker.

The results indicate considerable gains in accuracy when moving from the Type 1 (legacy) to the Type 2 (latest) model. The mAP score went up from 0.067 to 0.371 without HPO and from 0.226 to 0.375 with HPO, respectively. The Type 2 model also took longer to train on the same instance type, implying that the accuracy gains also came with higher infrastructure cost. However, all of these approaches outperformed the DDN model (introduced in Visual inspection automation using Amazon SageMaker JumpStart) on all metrics. Training the Type 1 (legacy) model took 34 minutes, the Type 2 (latest) model took 1 hour, and the DDN model took over 8 hours. This indicates that fine-tuning a pre-trained model is much more efficient than training a model from scratch.

We also found that HPO (SageMaker automatic model tuning) is extremely effective, especially for models with large hyperparameter search spaces, yielding a roughly 4x improvement in mAP score for the Type 1 (legacy) model. We obtained much better accuracy when tuning three hyperparameters (learning rate, momentum, and weight decay) for the Type 1 (legacy) model than when tuning only one hyperparameter (the ADAM learning rate) for the Type 2 (latest) model. This is because the relatively larger search space leaves more room for improvement for the Type 1 (legacy) model. However, we need to trade off model performance against infrastructure cost and training time when running HPO.

Conclusion

In this post, we walked through the many ML model training options available with SageMaker and focused specifically on SageMaker built-in algorithms and pre-trained models. We introduced Type 1 (legacy) and Type 2 (latest) models. The built-in SageMaker object detection models discussed in this post were pre-trained on large-scale datasets: the ImageNet dataset includes 14,197,122 images for 21,841 categories, and the PASCAL VOC dataset includes 11,530 images for 20 categories. The pre-trained models have learned rich and diverse low-level features, can efficiently transfer that knowledge to fine-tuned models, and can focus on learning high-level semantic features for the target dataset. You can find all built-in algorithms and fine-tunable pre-trained models in the Built-in Algorithms with pre-trained Model Table and choose one for your use case. The use cases span from text summarization and question answering to computer vision and regression or classification.

In the beginning, we asserted that fine-tuning a SageMaker pre-trained model takes a fraction of the training time required to train a model from scratch. We trained a DDN model from scratch and introduced two types of SageMaker built-in algorithms with pretrained models: Type 1 (legacy) and Type 2 (latest). We further showcased four approaches, two of which used SageMaker automatic model tuning, and finally arrived at the most performant model. When considering both training time and runtime cost, all SageMaker built-in algorithms outperformed the DDN model, thereby validating our assertion.

Although both Type 1 (legacy) and Type 2 (latest) outperformed training the DDN model from scratch, visual and numerical comparison confirmed that the Type 2 (latest) model and the Type 2 (latest) model with HPO outperform the Type 1 (legacy) models. HPO had a big impact on accuracy for Type 1 models; however, it yielded only modest gains for Type 2 models, due to a constricted hyperparameter search space.

In summary, for certain use cases, fine-tuning a pretrained model is both more efficient and more performant. We suggest taking advantage of the SageMaker built-in pretrained models and fine-tuning them on your target datasets. To get started, you need a Studio environment. For more information, refer to the Studio Development Guide and make sure to enable SageMaker projects and JumpStart. When your Studio setup is complete, navigate to the Studio Launcher to find the full list of JumpStart solutions and models. To recreate or modify the experiment in this post, choose the “Product Defect Detection” solution, which comes prepackaged with the notebook used for the experiment, as shown in the following video. After you launch the solution, you can access the mentioned work in the notebook titled visual_object_detection.ipynb.


About the authors

Vedant Jain is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value out of the Machine Learning ecosystem at AWS. Prior to joining AWS, Vedant has held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, using Science to lead a meaningful life & exploring delicious vegetarian cuisine from around the world.

Tao Sun is an Applied Scientist in Amazon Search. He obtained his Ph.D. in Computer Science from University of Massachusetts, Amherst. His research interests lie in deep reinforcement learning and probabilistic modeling. In the past, Tao worked for AWS Sagemaker Reinforcement Learning team and contributed to RL research and applications. Tao is now working on Page Template Optimization at Amazon Search.

Read More

InformedIQ automates verifications for Origence’s auto lending using machine learning

This post was co-written with Robert Berger and Adine Deford from InformedIQ.

InformedIQ is the leader in AI-based software used by the nation’s largest financial institutions to automate loan processing verifications and consumer credit applications in real time per the lenders’ policies. They improve regulatory compliance, reduce cost, and increase accuracy by decreasing human error rates that are caused by the repetitive nature of tasks. Informed partnered with Origence (the nation’s leading lending technology solutions and services provider for 1,130 credit unions serving over 64 million members) to power Origence’s document process automation functionality for indirect lending to automatically identify documents and validate financing policies, creating a better credit union and dealer experience for their network of over 15,000 dealers. To date, $110 billion in auto loans have originated with Informed’s automation, which is 8% of all US auto loans. Six of the top 10 consumer lenders trust Informed’s technology.

In this post, we learn about the challenges faced and how machine learning (ML) solved the problems.

Problem statement

Manual loan verification document processing is time-consuming. The verification includes consumer stipulations like proof of residence, identity, insurance, and income. It can be prone to human error due to the repetitive nature of tasks.

With ML and automation, Informed can provide a software solution that is available 24/7, over holidays and weekends. The solution works accurately without conscious or unconscious bias to calculate and clear stipulations in under 30 seconds, vs. an average of 7 days for loan verifications, with 99% accuracy.

Solution overview

Informed uses a wide range of AWS offerings and capabilities, including Amazon SageMaker and Amazon Textract in their ML stack to power Origence’s document process automation functionality. The solution automatically extracts data and classifies documents (for example, driver’s license, paystub, W2 form, or bank statement), providing the required fields for the consumer verifications used to determine if the lender will grant the loan. Through accurate income calculations and validation of applicant data, loan documents, and documented classification, loans are processed faster and more accurately, with reduced human errors and fraud risk, and added operational efficiency. This helps in creating a better consumer, credit union, and dealer experience.

To classify and extract the information needed for validation in accordance with a set of configurable funding rules, Informed uses a series of proprietary rules and heuristics, text-based neural networks, image-based deep neural networks, and other statistical models, along with Amazon Textract OCR via the DetectDocumentText API. The Informed API model can be broken down into five functional steps, as shown in the following diagram: image processing, classification, image feature computations, extractions, and stipulation verification rules, before determining the decision.

Given a sequence of pages for various document types (bank statement, driver’s license, paystub, SSI award letter, and so on), the image processing step performs the necessary image enhancements for each page and invokes multiple APIs, including Amazon Textract OCR for image to text conversion. The rest of the processing steps use the OCR text obtained from image processing and the image for each page.
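As a rough illustration of the OCR step, the DetectDocumentText API can be called as follows; the bucket and document names are placeholders.

import boto3

textract = boto3.client("textract")

# Run OCR on a single scanned page stored in Amazon S3 (placeholder bucket/key)
response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "example-loan-docs", "Name": "paystub-page-1.png"}}
)

# Collect the detected lines of text for downstream classification and extraction
lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
print("\n".join(lines))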

Main advantages

Informed provides solutions to the auto lending industry that reduce manual processes, support compliance and quality, mitigate risk, and deliver significant cost savings to their customers. Let’s dive into two main advantages of the solution.

Automation at scale with efficiency

The adoption of AWS Cloud technologies and capabilities has helped Informed address a wider range of document types and onboard new partners. Informed has developed integrated, AI/ML-enabled solutions, and continuously strives for innovation to better serve clients.

Almost the entirety of the Informed SaaS service is hosted and enabled by AWS services. Informed is able to offload the undifferentiated heavy lifting for scalable infrastructure and focus on their business objectives. Their architecture includes load balancers, Amazon API Gateway, Amazon Elastic Container Service (Amazon ECS) containers, serverless AWS Lambda, Amazon DynamoDB, and Amazon Relational Database Service (Amazon RDS), in addition to ML technologies like Amazon Textract and SageMaker.

Reducing cost in document extraction

Informed uses new features from Amazon Textract to improve the accuracy of data extraction from documents such as bank statements and paystubs. Amazon Textract is an AI/ML service that automatically extracts text, handwriting, and other forms of metadata from scanned documents, forms, and tables in ways that make further ML processing more efficient and accurate. Informed uses the Amazon Textract OCR and AnalyzeDocument APIs for both tables and forms as part of the verification process. Informed’s artificial intelligence modeling engine performs complex calculations, ensuring accuracy, identifying omissions, and combating fraud. With AWS, they continue to advance the accuracy and speed of the solution, helping lenders become more efficient by lowering loan processing costs and reducing the time to process and fund. With a 99% accuracy rate for field prediction, dealers and credit unions can now focus less on collecting and validating data and more on developing strong customer relationships.
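For the form and table portion of that workflow, the AnalyzeDocument API can be used along these lines; the document location and feature selection are assumptions for the sketch.

import boto3

textract = boto3.client("textract")

# Extract structured form fields and tables from a bank statement page (placeholder bucket/key)
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "example-loan-docs", "Name": "bank-statement-page-1.png"}},
    FeatureTypes=["FORMS", "TABLES"],
)

# Count the key-value pairs and table cells that were returned
kv_blocks = [b for b in response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET"]
cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
print(f"{len(kv_blocks)} key/value blocks, {len(cells)} table cells")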

“Partnering with Informed.IQ to integrate their leading AI-based technology allows us to advance our lending systems’ capabilities and performance, further streamlining the overall loan process for our credit unions and their members”

– Brian Hendricks, Chief Product Officer at Origence.

Conclusion

Informed is constantly improving the accuracy, efficiency, and breadth of their automated loan document verifications. This solution can benefit any lending document verification process like personal and student loans, HELOCs, and powersports. The adoption of AWS Cloud technologies and capabilities has helped Informed address the growing complexity of the lending process and improve the dealer and customer experience. With AWS, the company continues to add enhancements that help lenders become more efficient, lower loan processing costs, and provide serverless computing.

Now that you have learned about how ML and automation can solve the loan document verification process, you can get started using Amazon Textract. You can also try out intelligent document processing workshops. Visit Automated data processing from documents to learn more about reference architectures, code samples, industry use cases, blog posts, and more.


About the authors

Robert Berger is the Chief Architect at InformedIQ. He is leading the transformation of the InformedIQ SaaS into a full Serverless Microservice architecture leveraging AWS Cloud, DevOps and Data Oriented Programming. He has been a principal or founder in several other start-ups, including InterNex, MetroFi, UltraDevices, Runa, Mist Systems and Omnyway.

Adine Deford is the VP of Marketing at Informed.IQ. She has more than 25 years of technology marketing experience serving industry leaders, world class marketing agencies and technology start-ups.

Jessica Oliveira is an Account Manager at AWS who provides guidance and support to SMB customers in Northern California. She is passionate about building strategic collaborations to help ensure her customer’s success. Outside of work, she enjoys traveling, learning about different languages and cultures, and spending time with her family.

Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies. She brings a breadth of expertise in Data Analytics and Machine Learning. Prior to joining AWS she was architecting data solutions in financial industries. She is very interested in Amazon Future Engineer program enabling middle-school, high-school kids see the art of the possible in STEM. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.

Read More

Fall Into October With 25 New Games Streaming on GeForce NOW

Cooler weather, the changing colors of the leaves, the needless addition of pumpkin spice to just about everything, and discount Halloween candy are just some things to look forward to in the fall.

GeForce NOW members can add one more thing to the list — 25 games joining the cloud gaming library in October, including day-and-date releases like A Plague Tale: Requiem, Victoria 3 and others.

Let’s start off the cooler months with the six games streaming on GeForce NOW today.

Arriving in October

There’s a heap of gaming goodness in store for GeForce NOW members this month.

A tale continues when A Plague Tale: Requiem releases Tuesday, Oct. 18, enhanced with ray-traced effects for RTX 3080 and Priority members.

After escaping their devastated homeland in the critically acclaimed A Plague Tale: Innocence, siblings Amicia and Hugo venture south of 14th-century France to new regions and vibrant cities. But when Hugo’s powers reawaken, death and destruction return in a flood of devouring rats. Forced to flee once more, the siblings place their hopes in a prophesied island that may hold the key to saving Hugo.

The new adventure begins soon — streaming to even Macs and mobile devices with the power of the cloud — so make sure to add the game to your wishlist to start playing when it’s released.

On top of that, check out the rest of the games coming this month:

  • Asterigos: Curse of the Stars (New release on Steam, Oct. 11)
  • Kamiwaza: Way of the Thief (New release on Steam, Oct. 11)
  • Ozymandias: Bronze Age Empire Sim (New release on Steam, Oct. 11)
  • LEGO Bricktales (New release on Steam, Oct. 12)
  • PC Building Simulator 2 (New release on Epic Games Store, Oct. 12)
  • The Last Oricru (New release on Steam, Oct. 13)
  • Scorn (New release on Steam and Epic Games Store, Oct. 14)
  • A Plague Tale: Requiem (New release on Steam and Epic Games Store, Oct. 18)
  • Warhammer 40,000: Shootas, Blood & Teef (New release on Steam, Oct. 20)
  • FAITH: The Unholy Trinity (New release on Steam, Oct. 21)
  • Victoria 3 (New release on Steam, Oct. 25)
  • The Unliving (New release on Steam, Oct. 31)
  • Commandos 3 – HD Remaster (Steam and Epic Games Store)
  • Draw Slasher (Steam)
  • Guild Wars: Game of the Year (Steam)
  • Guild Wars: Trilogy (Steam)
  • Labyrinthine (Steam)
  • Volcanoids (Steam)
  • Monster Outbreak (Steam and Epic Games Store)

Gotta Go Fast

The great thing about GFN Thursday is that there are new games every week, so there’s no need to wait until Halloween to treat yourself to great gaming. Six games arrive today, including the new release of Dakar Desert Rally with support for NVIDIA DLSS technology.

Dakar Desert Rally on GeForce NOW
Honestly, don’t even bother going to the car wash. You’ll just get it dirty again.

Dakar Desert Rally captures the speed and excitement of Amaury Sport Organisation’s largest rally race, with a wide variety of licensed vehicles from the world’s top makers. An in-game dynamic weather system means racers will need to overcome the elements as well as the competition to win. Unique challenges and fierce, online multiplayer races are available for all members, whether an off-road simulation diehard or a casual racing fan.

This week also brings the latest season of Ubisoft’s Roller Champions. “Dragon’s Way” includes new maps, effects, cosmetics, emotes, gear and other seasonal goodies to bring out gamers’ inner beasts.

Here’s the full list of new games coming to the cloud this week:

  • Marauders (New release on Steam)
  • Dakar Desert Rally (New release on Steam)
  • Lord of Rigel (New release on Steam)
  • Priest Simulator (New release on Steam)
  • Barotrauma (Steam)
  • Black Desert Online – North America and Europe (Pearl Abyss Launcher)

Pssst – Wake Up, September Ended

Don’t sleep on these extra 13 titles that came to the cloud on top of the 22 games announced in September.

For some frightful fun as we enter Spooky Season, let us know what game still haunts your dreams on Twitter or in the comments below.

The post Fall Into October With 25 New Games Streaming on GeForce NOW appeared first on NVIDIA Blog.

Read More

Low-Rank Optimal Transport: Approximation, Statistics and Debiasing

The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in (Scetbon et al., 2021) holds several promises in that…Apple Machine Learning Research

Prevent account takeover at login with the new Account Takeover Insights model in Amazon Fraud Detector

Digital is the new normal, and there’s no going back. Every year, consumers visit, on average, 191 websites or services requiring a user name and password, and the digital footprint is expected to grow exponentially. So much exposure naturally brings added risks like account takeover (ATO).

Each year, bad actors compromise billions of accounts through stolen credentials, phishing, social engineering, and multiple forms of ATO. To put it into perspective: account takeover fraud increased by 90% to an estimated $11.4 billion in 2021 compared with 2020. Beyond the financial impact, ATOs damage the customer experience, threaten brand loyalty and reputation, and strain fraud teams as they manage chargebacks and customer claims.

Many companies, even those with sophisticated fraud teams, use rules-based solutions to detect compromised accounts because they’re simple to create. To bolster their defenses and reduce friction for legitimate users, businesses are increasingly investing in AI and machine learning (ML) to detect account takeovers.

AWS can help you improve your fraud mitigation with solutions like Amazon Fraud Detector. This fully managed AI service allows you to identify potentially fraudulent online activities by enabling you to train custom ML fraud detection models without ML expertise.

This post discusses how to create a real-time detector endpoint using the new Account Takeover Insights (ATI) model in Amazon Fraud Detector.

Overview of solution

Amazon Fraud Detector relies on specific models with tailored algorithms, enrichments, and feature transformations to detect fraudulent events across multiple use cases. The newly launched ATI model is a low-latency fraud detection ML model designed to detect potentially compromised accounts and ATO fraud. The ATI model detects up to four times more ATO fraud than traditional rules-based account takeover solutions while minimizing the level of friction for legitimate users.

The ATI model is trained using a dataset containing your business’s historical login events. Event labels are optional for model training because the ATI model uses an innovative approach to unsupervised learning. The model differentiates events generated by the actual account owner (legit events) from those generated by bad actors (anomalous events).

Amazon Fraud Detector derives the user’s past behavior by continuously aggregating the data provided. Examples of user behavior include the number of times the user signed in from a specific IP address. With these additional enrichments and aggregates, Amazon Fraud Detector can generate strong model performance from a small set of inputs from your login events.

For a real-time prediction, you call the GetEventPrediction API after a user presents valid login credentials to quantify the risk of ATO. In response, you receive a model score from 0 to 1000, where 0 indicates low fraud risk and 1000 indicates high fraud risk, and an outcome based on a set of business rules you define. You can then take the appropriate action on your end: approve the login, deny the login, or challenge the user by enforcing an additional identity verification.

You can also use the ATI model to asynchronously evaluate account logins and take action based on the outcome, such as adding the account to an investigation queue so a human reviewer can determine if further action should be taken.

The following steps outline the process of training an ATI model and publishing a detector endpoint to generate fraud predictions:

  • Prepare and validate the data.
  • Define the entity, event and event variables, and event label (optional).
  • Upload event data.
  • Initiate model training.
  • Evaluate the model.
  • Create a detector endpoint and define business rules.
  • Get real-time predictions.

Prerequisites

Before getting started, complete the following prerequisite steps:

Prepare and validate the data

Amazon Fraud Detector requires that you provide your user account login data in a CSV file encoded in the UTF-8 format. For the ATI, you must provide certain event metadata and event variables in the header line of your CSV file.

The required event metadata is as follows:

  • EVENT_ID – A unique identifier for the login event.
  • ENTITY_TYPE – The entity that performs the login event, such as a merchant or a customer.
  • ENTITY_ID – An identifier for the entity performing the login event.
  • EVENT_TIMESTAMP – The timestamp when the login event occurred. The timestamp format must be in ISO 8601 standard in UTC.
  • EVENT_LABEL (optional) – A label that classifies the event as fraudulent or legitimate. You can use any labels, such as fraud, legit, 1, or 0.

Event metadata must be in uppercase letters. Labels aren’t required for login events. However, we recommend including EVENT_LABEL metadata and providing labels for your login events if available. If you provide labels, Amazon Fraud Detector uses them to automatically calculate an Account Takeover Discovery Rate and display it in the model performance metrics.

The ATI model has both required and optional variables. Event variable names must be in lowercase letters.

The following table summarizes the mandatory variables.

Category Variable type Description
IP address IP_ADDRESS The IP address used in the login event
Browser and device USERAGENT The browser, device, and OS used in the login event
Valid credentials VALIDCRED Indicates if the credentials that were used for login are valid

The following table summarizes the optional variables.

Category Type Description
Browser and device FINGERPRINT The unique identifier for a browser or device fingerprint
Session ID SESSION_ID The identifier for an authentication session
Label EVENT_LABEL A label that classifies the event as fraudulent or legitimate (such as fraud, legit, 1, or 0)
Timestamp LABEL_TIMESTAMP The timestamp when the label was last updated; this is required if EVENT_LABEL is provided

You can provide additional variables. However, Amazon Fraud Detector won’t include these variables for training an ATI model.
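A minimal sketch of assembling such a CSV with pandas is shown below; the sample values and file name are purely illustrative.

import pandas as pd

# Two illustrative login events for one entity (all values are made up)
events = pd.DataFrame([
    {
        "EVENT_ID": "login-0001",
        "ENTITY_TYPE": "customer",
        "ENTITY_ID": "user-12345",
        "EVENT_TIMESTAMP": "2022-06-01T08:15:00Z",
        "EVENT_LABEL": "0",                            # optional: 0 = legit, 1 = fraud
        "LABEL_TIMESTAMP": "2022-06-01T08:15:00Z",     # required if EVENT_LABEL is provided
        "ip_address": "192.0.2.10",
        "useragent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "validcred": "true",
    },
    {
        "EVENT_ID": "login-0002",
        "ENTITY_TYPE": "customer",
        "ENTITY_ID": "user-12345",
        "EVENT_TIMESTAMP": "2022-06-02T21:40:00Z",
        "EVENT_LABEL": "1",
        "LABEL_TIMESTAMP": "2022-06-03T10:00:00Z",
        "ip_address": "198.51.100.7",
        "useragent": "curl/7.79.1",
        "validcred": "true",
    },
])

events.to_csv("ati_login_events.csv", index=False, encoding="utf-8")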

Dataset preparation

As you start to prepare your login data, you must meet the following requirements:

  • Provide at least 1,500 entities (individual user accounts), each with at least two associated login events
  • Your dataset must cover at least 30 days of login events

The following configurations are optional:

  • Your dataset can include examples of unsuccessful login events
  • You can optionally label these unsuccessful logins as fraudulent or legitimate
  • You can prepare historical data with login events spanning more than 6 months and include 100,000 entities

We provide a sample dataset for testing purposes that you can use to get started.

Data validation

Before creating your ATI model, Amazon Fraud Detector checks if the metadata and variables you included in your dataset for training the model meet the size and format requirements. For more information, see Dataset validation. If the dataset doesn’t pass validation, a model isn’t created. For details on common dataset errors, see Common event dataset errors.

Define the entity, event type, and event variables

In this section, we walk through the steps to create an entity, event type, and event variables. Optionally, you can also define event labels.

Define the entity

The entity defines who is performing the event. To create an entity, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Entities.
  • Choose Create.
  • Enter an entity name and optional description.
  • Choose Create entity.

Define the event and event variables

An event is a business activity evaluated for fraud risk; this event is performed by the entity we just created. The event type defines the structure for an event sent to Amazon Fraud Detector, including variables of the event, the entity performing the event, and, if available, the labels that classify the event.

To create an event, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Events.
  • Choose Create.
  • For Name, enter a name for your event type.
  • For Entity, choose the entity created in the previous step.

Define the event variables

For event variables, complete the following steps:

  • In the Create IAM role section, enter the specific bucket name where you uploaded your training data.
    The name of the S3 bucket must be the name where you uploaded your dataset. Otherwise, you get an access denied exception error.
  • Choose Create role.

  • For Data location, enter the path to your training data (the S3 URI you copied during the prerequisite steps), and choose Upload.

Amazon Fraud Detector extracts the headers from your training dataset and creates a variable for each header. Make sure to assign the variable to the correct variable type. As part of the model training process, Amazon Fraud Detector uses the variable type associated with the variable to perform variable enrichment and feature engineering. For more details about variable types, see Variable types.

Define event labels (optional)

Labels are used to categorize individual events as either fraud or legitimate. Event labels are optional for model training because the ATI model uses an innovative approach to unsupervised learning. The model differentiates events generated by the actual account owner (legit events) from those generated by abusive actors (anomalous events). We recommend you include EVENT_LABEL metadata and provide labels for your login events if available. If you provide labels, Amazon Fraud Detector uses them to automatically calculate an Account Takeover Discovery Rate and display it in the model performance metrics.

To define the labels, complete the following steps:

  • Define two labels (for this post, 1 and 0).
  • Choose Create event type.
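If you prefer to script these definitions rather than use the console, the equivalent boto3 calls look roughly like the following; the names are placeholders and the variable defaults are assumptions.

import boto3

frauddetector = boto3.client("frauddetector")

# Entity that performs the login event
frauddetector.put_entity_type(name="customer", description="Account owner logging in")

# Optional labels used to classify events
frauddetector.put_label(name="1", description="fraud")
frauddetector.put_label(name="0", description="legit")

# Event variables expected by the ATI model (defaults are placeholders)
for name, variable_type in [("ip_address", "IP_ADDRESS"), ("useragent", "USERAGENT"), ("validcred", "VALIDCRED")]:
    frauddetector.create_variable(
        name=name,
        variableType=variable_type,
        dataType="STRING",
        dataSource="EVENT",
        defaultValue="unknown",
    )

# Event type tying the entity, variables, and labels together
frauddetector.put_event_type(
    name="account_login",                 # placeholder event type name
    eventVariables=["ip_address", "useragent", "validcred"],
    labels=["1", "0"],
    entityTypes=["customer"],
    eventIngestion="ENABLED",
)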

Upload event data

In this section, we walk through the steps to upload event data to the service for model training.

ATI models are trained on a dataset stored internally in Amazon Fraud Detector. By storing event data in Amazon Fraud Detector, you can train models that use auto-computed variables to improve performance, simplify model retraining, and update fraud labels to close the machine learning feedback loop. See Stored events for more information on storing your event dataset with Amazon Fraud Detector.

After you define your event, navigate to the Stored events tab. On the Stored events tab, you can see information about your dataset, such as the number of events stored and the total size of the dataset in MB. Because you just created this event type, there are no stored events yet. On this page, you can turn event ingestion on or off. When event ingestion is on, you can upload historical event data to Amazon Fraud Detector and automatically store event data from predictions in real time.

The easiest way to store historical data is by uploading a CSV file and importing the events. Alternatively, you can stream the data into Amazon Fraud Detector using the SendEvent API (see our GitHub repository for sample notebooks). To import the event from a CSV file, complete the following steps:

  • Under Import events data, choose New import.
    You likely need to create a new IAM role. The import events feature requires both read and write access to Amazon S3.

  • Create a new IAM role and provide the S3 buckets for input and output files.
    The IAM role you create grants Amazon Fraud Detector access to these buckets to read input files and store output files. If you don’t plan to store output files in a separate bucket, enter the same bucket name for both.
  • Choose Create role.

  • Enter the location of the CSV file that contains your event data. This should be the S3 URI you copied earlier.
  • Choose Start to start importing the events.

The import time varies based on the number of events you’re importing. For a dataset with 20,000 events, the process takes around 12 minutes, and after you refresh the page, the status changes to Completed. If the status changes to Error, choose the job name to show why the import failed.
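Alternatively, events can be streamed in with the SendEvent API mentioned earlier, roughly as follows; the event type name and values mirror the earlier placeholder definitions.

import boto3

frauddetector = boto3.client("frauddetector")

frauddetector.send_event(
    eventId="login-0003",
    eventTypeName="account_login",              # placeholder event type
    eventTimestamp="2022-06-05T14:22:00Z",
    eventVariables={
        "ip_address": "203.0.113.25",
        "useragent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_4)",
        "validcred": "true",
    },
    entities=[{"entityType": "customer", "entityId": "user-12345"}],
    assignedLabel="0",                          # optional
    labelTimestamp="2022-06-05T14:22:00Z",      # required when assignedLabel is set
)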

Initiate model training

After successfully importing the events, you have all the pieces to initiate model training. To train a model, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Models.
  • Choose Add model and select Create model.
  • For Model name, enter the desired name for your model.
  • For Model type, select Account Takeover Insights.
  • For Event type, choose the event type you created earlier.

  • Under Historical event data, you can specify the date range of events to train the model if needed.
  • Choose Next.

  • For this post, you configure training by identifying the variables used as inputs to the model.
  • After evaluating the variables, choose Next.

It’s a best practice to include all the available variables, even if you’re unsure about their value to the model. After the model is trained, Amazon Fraud Detector provides a ranked list of each variable’s impact on the model performance, so you can know whether to include that variable in future model training. If labels are provided, Amazon Fraud Detector uses them to evaluate and display model performance in terms of the model’s discovery rate.

If labels aren’t provided, Amazon Fraud Detector uses negative sampling to provide examples of anomalous login attempts that help the model distinguish between legitimate and fraudulent activities. This produces precise risk scores that improve the model’s ability to capture incorrectly flagged legitimate activities.

After reviewing the model configured in the first two steps, choose Create and train the model.

You can see the model in training status in the console page. Creating and training the model takes approximately 45 minutes to complete. When the model has stopped training, you can check model performance by choosing the model version.
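If you script the training instead of using the console, the corresponding boto3 calls look approximately like the following; the model ID, time window, and schema fields are placeholders and should be checked against the Amazon Fraud Detector documentation.

import boto3

frauddetector = boto3.client("frauddetector")

frauddetector.create_model(
    modelId="ato_login_model",                  # placeholder model ID
    modelType="ACCOUNT_TAKEOVER_INSIGHTS",
    eventTypeName="account_login",
)

frauddetector.create_model_version(
    modelId="ato_login_model",
    modelType="ACCOUNT_TAKEOVER_INSIGHTS",
    trainingDataSource="INGESTED_EVENTS",       # train on the stored events
    trainingDataSchema={
        "modelVariables": ["ip_address", "useragent", "validcred"],
        "labelSchema": {"labelMapper": {"FRAUD": ["1"], "LEGIT": ["0"]}},
    },
    ingestedEventsDetail={
        "ingestedEventsTimeWindow": {
            "startTime": "2022-05-01T00:00:00Z",
            "endTime": "2022-06-30T23:59:59Z",
        }
    },
)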

Evaluate model performance and deploy the model

In this section, we walk through the steps to review and evaluate the model performance.

Amazon Fraud Detector validates model performance using 15% of your data that wasn’t used to train the model and provides performance metrics. You need to consider these metrics and your business objectives to define a threshold that aligns with your business model. For further details on the metrics and how to determine thresholds, see Model performance metrics.

ATI is an anomaly detection model rather than a classification model; therefore, the evaluation metrics differ from classification models. When your ATI model has finished training, you can see the Anomaly Separation Index (ASI), a holistic measure of the model’s ability to identify high-risk anomalous logins. An ASI of 75% or more is considered good, 90% or more is considered high, and below 75% is considered poor.

To assist in choosing the right balance, Amazon Fraud Detector provides the following metrics to evaluate ATI model performance:

  • Anomaly Separation Index (ASI) – Summarizes the overall ability of the model to separate anomalous activities from the expected behavior of users. A model with no separability power will have the lowest possible ASI score of 0.5. In contrast, the model with a high separability power will have the highest possible ASI score of 1.0.
  • Challenge Rate (CR) – The score threshold indicates the percentage of login events the model would recommend challenging in the form of a one-time password, multi-factor authentication, identity verification, investigation, and so on.
  • Anomaly Discovery Rate (ADR) – Quantifies the percentage of anomalies the model can detect at the selected score threshold. A lower score threshold increases the percentage of anomalies captured by the model. Still, it would also require challenging a more significant percentage of login events, leading to higher customer friction.
  • ATO Discovery Rate (ATODR) – Quantifies the percentage of account compromise events that the model can detect at the selected score threshold. This metric is only available if 50 or more entities with at least one labeled ATO event are present in the ingested dataset.

In the following example, we have an ASI of 0.96 (high), which indicates a high ability to separate anomalous activities from the normal behavior of users. By writing a rule using a model score threshold of 500, you challenge or create friction on 6% of all login activities, catching 96% of anomalous activities.

Another important metric is the model variable importance. Variable importance gives you an understanding of how the different variables relate to the model performance. You can have two types of variables: raw and aggregate variables. Raw variables are the ones that were defined based on the dataset, whereas aggregate variables are a combination of multiple variables that are enriched and have an aggregated importance value.

For more information about variable importance, see Model variable importance.

A variable (raw or aggregate) with a much higher importance value relative to the rest could indicate that the model might be overfitting, whereas variables with relatively low values could just be noise.

After reviewing the model performance and deciding what model score thresholds align with your business model, you can deploy the model version. For that, on the Actions menu, choose Deploy model version. With the model deployed, we create a detector endpoint and perform real-time prediction.

Create a detector endpoint and define business rules

Amazon Fraud Detector uses detector endpoints to generate fraud predictions. A detector contains detection logic, such as trained models and business rules, for a specific event you want to evaluate for fraud. Detection logic uses rules to tell Amazon Fraud Detector how to interpret the data associated with the model.

To create a detector, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Detectors.
  • Choose Create detector.
  • For Detector name, enter a name.
  • Optionally, describe your detector.
  • For Event type, choose the same event type as the model created earlier.
  • Choose Next.

  • On the Add model (optional) page, choose Add model.

  • To add a model, choose the model you trained and published during the model training steps and choose the active version.
  • Choose Add model.

As part of the next step, you create the business rules that define an outcome. A rule is a condition that tells Amazon Fraud Detector how to interpret variable values during a fraud prediction. A rule consists of one or more variables, a logic expression, and one or more outcomes. An outcome is the result of a fraud prediction and is returned if the rule matches during an evaluation.

  • Define decline_rule as $your_model_name_insightscore >= 950 with outcome deny_login.
  • Define friction_rule as $your_model_name_insightscore >= 855 and $your_model_name_insightscore < 950 with outcome challenge_login.
  • Define approve_rule as $your_model_name_insightscore < 855 with outcome approve_login.

Outcomes are strings returned in the GetEventPrediction API response. You can use outcomes to trigger events in calling applications and downstream systems, or simply to identify which events are likely to be fraudulent or legitimate.
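Scripted, the rule creation could look roughly like this; the detector ID and model score variable name are placeholders, and the outcomes are assumed to have been created already.

import boto3

frauddetector = boto3.client("frauddetector")

rules = [
    ("decline_rule", "$your_model_name_insightscore >= 950", ["deny_login"]),
    ("friction_rule", "$your_model_name_insightscore >= 855 and $your_model_name_insightscore < 950", ["challenge_login"]),
    ("approve_rule", "$your_model_name_insightscore < 855", ["approve_login"]),
]

for rule_id, expression, outcomes in rules:
    frauddetector.create_rule(
        ruleId=rule_id,
        detectorId="account_takeover_detector",   # placeholder detector ID
        expression=expression,
        language="DETECTORPL",
        outcomes=outcomes,
    )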

  • On the Add Rules page, choose Next after you finish adding all your rules.

  • In the Configure rule execution section, choose the mode for your rules engine.
    The Amazon Fraud Detector rules engine has two modes: first matched or all matched. First matched mode is for sequential rule runs, returning the outcome for the first condition met. The other mode is all matched, which evaluates all rules and returns outcomes from all the matching rules. In this example, we use the first matched mode for our detector.

After this process, you’re ready to create your detector and run some tests.

  • To run a test, go to your newly created detector and choose the detector version you want to use.
  • Provide the variable values as requested and choose Run test.

As a result of the test, you receive the risk score and the outcome based on your business rules.

You can also search past predictions by going to the left panel and choosing Search past predictions. The prediction is based on each variable’s contribution to the overall likelihood of a fraudulent event. The following screenshot is an example of a past prediction showing the input variables and how they influenced the fraud prediction score.

Get real-time predictions

To get real-time predictions and integrate Amazon Fraud Detector into your workflow, we need to publish the detector endpoint. Complete the following steps:

  • Go to the newly created detector and choose the detector version, which will be version 1.
  • On the Actions menu, choose Publish.

You can perform real-time predictions with the published detector by calling the GetEventPrediction API. The following is sample Python code for calling the GetEventPrediction API:

import boto3

fraudDetector = boto3.client('frauddetector')

# Score a single event against the published detector; pass the variables defined
# for your event type (for the ATI model in this post: ip_address, useragent, validcred)
fraudDetector.get_event_prediction(
    detectorId='sample_detector',
    eventId='802454d3-f7d8-482d-97e8-c4b6db9a0428',
    eventTypeName='sample_transaction',
    eventTimestamp='2021-01-13T23:18:21Z',
    entities=[{'entityType': 'customer', 'entityId': '12345'}],
    eventVariables={
        'email_address': 'johndoe@exampledomain.com',
        'ip_address': '1.2.3.4'
    }
)

Conclusion

Amazon Fraud Detector relies on specific models with tailored algorithms, enrichments, and feature transformations to detect fraudulent events across multiple use cases. In this post, you learned how to ingest data, train and deploy a model, write business rules, and publish a detector to generate real-time fraud predictions on potentially compromised accounts.

Visit Amazon Fraud Detector to learn more, or check out our GitHub repo for code samples, notebooks, and synthetic datasets.


About the authors

Marcel Pividal is a Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for Fintechs, Payment Providers, Pharma, and government agencies. His current areas of focus are Risk Management, Fraud Prevention, and Identity Verification.

Mike Ames is a data scientist turned identity verification solution specialist with extensive experience developing machine learning and AI solutions to protect organizations from fraud, waste, and abuse. In his spare time, you can find him hiking, mountain biking, or playing frisbee with his dog Max.

Read More

Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services

Content moderation is the process of screening and monitoring user-generated content online. To provide a safe environment for both users and brands, platforms must moderate content to ensure that it falls within preestablished guidelines of acceptable behavior that are specific to the platform and its audience.

When a platform moderates content, acceptable user-generated content (UGC) can be created and shared with other users. Inappropriate, toxic, or banned behaviors can be prevented, blocked in real time, or removed after the fact, depending on the content moderation tools and procedures the platform has in place.

You can use Amazon Rekognition Content Moderation to detect content that is inappropriate, unwanted, or offensive, to create a safer user experience, provide brand safety assurances to advertisers, and comply with local and global regulations.

In this post, we discuss the key elements needed to evaluate the performance of a content moderation service in terms of various accuracy metrics, and provide an example using the Amazon Rekognition Content Moderation APIs.

What to evaluate

When evaluating a content moderation service, we recommend the following steps.

Before you can evaluate the performance of the API on your use cases, you need to prepare a representative test dataset. The following are some high-level guidelines:

  • Collection – Take a large enough random sample (images or videos) of the data you eventually want to run through Amazon Rekognition. For example, if you plan to moderate user-uploaded images, you can take a week’s worth of user images for the test. We recommend choosing a set that has enough images without getting too large to process (such as 1,000–10,000 images), although larger sets are better.
  • Definition – Use your application’s content guidelines to decide which types of unsafe content you’re interested in detecting from the Amazon Rekognition moderation concepts taxonomy. For example, you may be interested in detecting all types of explicit nudity and graphic violence or gore.
  • Annotation – Now you need a human-generated ground truth for your test set using the chosen labels, so that you can compare machine predictions against them. This means that each image is annotated for the presence or absence of your chosen concepts. To annotate your image data, you can use Amazon SageMaker Ground Truth (GT) to manage image annotation. You can refer to GT for image labeling, consolidating annotations, and processing annotation output.

Get predictions on your test dataset with Amazon Rekognition

Next, you want to get predictions on your test dataset.

The first step is to decide on a minimum confidence score (a threshold value, such as 50%) at which you want to measure results. Our default threshold is set to 50, which offers a good balance between retrieving large amounts of unsafe content without incurring too many false predictions on safe content. However, your platform may have different business needs, so you should customize this confidence threshold as needed. You can use the MinConfidence parameter in your API requests to balance detection of content (recall) vs the accuracy of detection (precision). If you reduce MinConfidence, you are likely to detect most of the inappropriate content, but are also likely to pick up content that is not actually inappropriate. If you increase MinConfidence you are likely to ensure that all your detected content is truly inappropriate but some content may not be tagged. We suggest experimenting with a few MinConfidence values on your dataset and quantitatively select the best value for your data domain.

Next, run each sample (image or video) of your test set through the Amazon Rekognition moderation API (DetectModerationLabels).
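For illustration, the following is a minimal sketch (not taken from the notebook) of running a set of test images through DetectModerationLabels at a few candidate MinConfidence values; the bucket name, image keys, and helper function are placeholders you would replace with your own.

import boto3

rekognition = boto3.client("rekognition")

def get_predictions(bucket, image_keys, min_confidence):
    # Return {image_key: [detected moderation label names]} at the given threshold
    predictions = {}
    for key in image_keys:
        response = rekognition.detect_moderation_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MinConfidence=min_confidence,
        )
        predictions[key] = [label["Name"] for label in response["ModerationLabels"]]
    return predictions

# Placeholder values; compare the resulting metrics for each threshold downstream
image_keys = ["test-set/image_0001.jpg", "test-set/image_0002.jpg"]
for threshold in (50, 70, 80):
    predictions = get_predictions("<your-test-bucket>", image_keys, threshold)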

Measure model accuracy on images

You can assess the accuracy of a model by comparing human-generated ground truth annotations with the model predictions. You repeat this comparison for every image independently and then aggregate over the whole test set:

  • Per-image results – A model prediction is defined as the pair {label_name, confidence_score} (where the confidence score >= the threshold you selected earlier). For each image, a prediction is considered correct when it matches the ground truth (GT). A prediction is one of the following options:

    • True Positive (TP): both prediction and GT are “unsafe”
    • True Negative (TN): both prediction and GT are “safe”
    • False Positive (FP): the prediction says “unsafe”, but the GT is “safe”
    • False Negative (FN): the prediction is “safe”, but the GT is “unsafe”
  • Aggregated results over all images – Next, you can aggregate these predictions into dataset-level results:

    • False positive rate (FPR) – This is the percentage of safe images in the test set that are wrongly flagged by the model as containing unsafe content: FPR = FP / (FP + TN).
    • False negative rate (FNR) – This is the percentage of unsafe images in the test set that are missed by the model: FNR = FN / (FN + TP).
    • True positive rate (TPR) – Also called recall, this computes the percentage of unsafe content (ground truth) that is correctly discovered or predicted by the model: TPR = TP / (TP + FN) = 1 – FNR.
    • Precision – This computes the percentage of correct unsafe predictions relative to the total number of unsafe predictions made: Precision = TP / (TP + FP).

Let’s explore an example. Assume that your test set contains 10,000 images: 9,950 safe and 50 unsafe. The model correctly predicts 9,800 of the 9,950 safe images as safe and 45 of the 50 unsafe images as unsafe, which gives the following counts and metrics (also verified in the short code sketch after this list):

  • TP = 45
  • TN = 9800
  • FP = 9950 – 9800 = 150
  • FN = 50 – 45 = 5
  • FPR = 150 / (9800 + 150) = 0.015 = 1.5%
  • FNR = 5 / (5 + 45) = 0.1 = 10%
  • TPR/Recall = 45 / (45 + 5) = 0.9 = 90%
  • Precision = 45 / (45 + 150) = 0.23 = 23%
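As a quick sanity check, you can reproduce these metrics in a few lines of Python; this is a minimal sketch using the counts from the example above.

# Counts from the example above
tp, tn, fp, fn = 45, 9800, 150, 5

fpr = fp / (fp + tn)        # 0.015 -> 1.5%
fnr = fn / (fn + tp)        # 0.10  -> 10%
recall = tp / (tp + fn)     # 0.90  -> 90%
precision = tp / (tp + fp)  # ~0.23 -> 23%
print(f"FPR={fpr:.1%}, FNR={fnr:.1%}, Recall={recall:.1%}, Precision={precision:.1%}")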

Measure model accuracy on videos

If you want to evaluate the performance on videos, a few additional steps are necessary:

  1. Sample a subset of frames from each video. We suggest sampling uniformly at a rate of 0.3–1 frames per second (fps). For example, if a video is encoded at 24 fps and you want to sample one frame every 3 seconds (about 0.33 fps), you need to select one out of every 72 frames.
  2. Run these sampled frames through Amazon Rekognition content moderation. You can either use our video API, which already samples frames for you (at a rate of 3 fps), or use the image API, in which case you want to sample more sparsely. We recommend the latter option, given the redundancy of information in videos (consecutive frames are very similar).
  3. Compute the per-frame results as explained in the previous section (per-image results).
  4. Aggregate results over the whole test set. Here you have two options, depending on the type of outcome that matters for your business:
    1. Frame-level results – This considers all the sampled frames as independent images and aggregates the results exactly as explained earlier for images (FPR, FNR, recall, precision). If some videos are considerably longer than others, they will contribute more frames to the total count, making the comparison unbalanced. In that case, we suggest changing the initial sampling strategy to a fixed number of frames per video. For example, you could uniformly sample 50–100 frames per video (assuming videos are at least 2–3 minutes long).
    2. Video-level results – For some use cases, it doesn’t matter whether the model is capable of correctly predicting 50% or 99% of the frames in a video. Even a single wrong unsafe prediction on a single frame could trigger a downstream human evaluation, and only videos with 100% correct predictions are truly considered correct. If this is your use case, we suggest you compute FPR/FNR/TPR over the frames of each video and categorize each video as follows (a short code sketch of this per-video aggregation follows at the end of this section):

      • Total FP = 0 and total FN = 0 (aggregated over all the frames of the video) – Perfect predictions
      • Total FP > 0 – False positive (FP)
      • Total FN > 0 – False negative (FN)

After you have computed these for each video independently, you can then compute all the metrics we introduced earlier:

  • The percentage of videos that are wrongly flagged (FP) or missed (FN)
  • Precision and recall
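To make that per-video aggregation concrete, here is a minimal sketch; it assumes you have already accumulated, for each video, the total false positive and false negative counts across its sampled frames (the dictionary below is a placeholder). The source doesn’t specify how to treat a video with both false positives and false negatives, so this sketch counts it in both categories.

# Placeholder: per-video totals of false positives and false negatives,
# accumulated over that video's sampled frames
frame_totals = {
    "video_1": {"fp": 0, "fn": 0},
    "video_2": {"fp": 3, "fn": 0},
    "video_3": {"fp": 0, "fn": 1},
}

num_videos = len(frame_totals)
fp_videos = sum(1 for c in frame_totals.values() if c["fp"] > 0)
fn_videos = sum(1 for c in frame_totals.values() if c["fn"] > 0)
perfect_videos = sum(1 for c in frame_totals.values() if c["fp"] == 0 and c["fn"] == 0)

print(f"Wrongly flagged (FP): {fp_videos / num_videos:.1%}")
print(f"Missed (FN): {fn_videos / num_videos:.1%}")
print(f"Perfect predictions: {perfect_videos / num_videos:.1%}")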

Measure performance against goals

Finally, you need to interpret these results in the context of your goals and capabilities.

First, consider your business needs in regards to the following:

  • Data – Learn about your data (daily volume, type of data, and so on) and the distribution of your unsafe vs. safe content. For example, is it balanced (50/50), skewed (10/90), or very skewed (1/99, meaning that only 1% is unsafe)? Understanding this distribution can help you define your actual metric goals. For example, the amount of safe content is often an order of magnitude larger than the amount of unsafe content (very skewed), making this almost an anomaly detection problem. Within this scenario, the number of false positives may outnumber the number of true positives, and you can use your data information (distribution skewness, volume of data, and so on) to decide the FPR you can work with.
  • Metric goals – What are the most critical aspects of your business? Lowering the FPR often comes at the cost of a higher FNR (and vice versa), and it’s important to find the right balance that works for you. If you can’t miss any unsafe content, you likely want close to 0% FNR (100% recall). However, this will incur the largest number of false positives, and you need to decide the target (maximum) FPR you can work with, based on your post-prediction pipeline. You may want to allow some level of false negatives to find a better balance and lower your FPR: for example, accepting a 5% FNR instead of 0% could reduce the FPR from 2% to 0.5%, considerably reducing the amount of flagged content.

Next, ask yourself what mechanisms you will use to parse the flagged images. Even though the API may not provide 0% FPR and FNR, it can still bring huge savings and scale (for example, by only flagging 3% of your images, you have already filtered out 97% of your content). When you pair the API with a downstream mechanism, like a human workforce that reviews the flagged content, you can easily reach your goals (for example, 0.5% flagged content). Note that this pairing is considerably cheaper than having a human review 100% of your content.

When you have decided on your downstream mechanisms, we suggest you evaluate the throughput that you can support. For example, if you have a workforce that can only verify 2% of your daily content, then your target from the content moderation API is a flag rate (the share of all content flagged, that is, true positives plus false positives divided by total content) of 2%.

Finally, if obtaining ground truth annotations is too hard or too expensive (for example, your volume of data is too large), we suggest annotating only the images flagged by the API. Although this doesn’t allow you to evaluate the FNR (because your annotated data doesn’t contain any of the content the API missed), you can still measure precision and, if the vast majority of your content is safe, approximate the FPR.

In the following section, we provide a solution for image moderation evaluation. You can take a similar approach for video moderation evaluation.

Solution overview

The following diagram illustrates the various AWS services you can use to evaluate the performance of Amazon Rekognition content moderation on your test dataset.

The content moderation evaluation has the following steps:

  1. Upload your evaluation dataset into Amazon Simple Storage Service (Amazon S3).
  2. Use Ground Truth to assign ground truth moderation labels.
  3. Generate the predicted moderation labels by running the Amazon Rekognition pre-trained moderation API at a few threshold values (for example, 70%, 75%, and 80%).
  4. Assess the performance for each threshold by computing true positives, true negatives, false positives, and false negatives. Determine the optimum threshold value for your use case.
  5. Optionally, you can tailor the size of the workforce based on true and false positives, and use Amazon Augmented AI (Amazon A2I) to automatically send all flagged content to your designated workforce for a manual review.

The following sections provide the code snippets for steps 2, 3, and 4. For complete end-to-end source code, refer to the provided Jupyter notebook.

Prerequisites

Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.
  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
    cd SageMaker
    git clone https://github.com/aws-samples/amazon-rekognition-code-samples.git

  5. Open the notebook for this post: content-moderation-evaluation/Evaluating-Amazon-Rekognition-Content-Moderation-Service.ipynb.
  6. Upload your evaluation dataset to Amazon Simple Storage Service (Amazon S3).

We will now go through steps 2 through 4 in the Jupyter notebook.

Use Ground Truth to assign moderation labels

To assign labels in Ground Truth, complete the following steps:

  1. Create a manifest input file for your Ground Truth job and upload it to Amazon S3 (a sketch of the manifest format follows after this list).
  2. Create the labeling configuration, which contains all moderation labels that are needed for the Ground Truth labeling job. To check the limit for the number of label categories you can use, refer to Label Category Quotas. In the following code snippet, we use five labels (refer to the hierarchical taxonomy used in Amazon Rekognition for more details) plus one label (Safe_Content) that marks content as safe:
    # Customize CLASS_LIST to include the labels used to classify your sample data (up to 10 labels).
    # Using labels that match the content moderation taxonomy makes it easier to compare
    # the Ground Truth annotations with the moderation API predictions.
    import json

    CLASS_LIST = ["<label_1>", "<label_2>", "<label_3>", "<label_4>", "<label_5>", "Safe_Content"]
    print("Label space is {}".format(CLASS_LIST))

    json_body = {"labels": [{"label": label} for label in CLASS_LIST]}
    with open("class_labels.json", "w") as f:
        json.dump(json_body, f)

    # BUCKET, EXP_NAME, and the s3 client are defined earlier in the notebook
    s3.upload_file("class_labels.json", BUCKET, EXP_NAME + "/class_labels.json")

  3. Create a custom worker task template to provide the Ground Truth workforce with labeling instructions and upload it to Amazon S3.
    The Ground Truth labeling job is defined as an image classification (multi-label) task. Refer to the source code for instructions on customizing the instruction template.
  4. Decide which workforce you want to use to complete the Ground Truth job. You have two options (refer to the source code for details):
    1. Use a private workforce in your own organization to label the evaluation dataset.
    2. Use a public workforce to label the evaluation dataset.
  5. Create and submit a Ground Truth labeling job. You can also adjust the following code to configure the labeling job parameters to meet your specific business requirements. Refer to the source code for complete instructions on creating and configuring the Ground Truth job.
    human_task_config = {
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": acs_arn,
        },
        "PreHumanTaskLambdaArn": prehuman_arn,
        "MaxConcurrentTaskCount": 200,  # 200 images will be sent at a time to the workteam.
        "NumberOfHumanWorkersPerDataObject": 3,  # 3 separate workers will be required to label each image.
        "TaskAvailabilityLifetimeInSeconds": 21600,  # Your workteam has 6 hours to complete all pending tasks.
        "TaskDescription": task_description,
        "TaskKeywords": task_keywords,
        "TaskTimeLimitInSeconds": 180,  # Each image must be labeled within 3 minutes.
        "TaskTitle": task_title,
        "UiConfig": {
            "UiTemplateS3Uri": "s3://{}/{}/instructions.template".format(BUCKET, EXP_NAME),
        },
    }
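As referenced in step 1, the Ground Truth input manifest is a JSON Lines file with one object per image. The following minimal sketch (with a placeholder bucket and prefix) shows one way to generate it:

import json

# Placeholder S3 URIs pointing to the images in your evaluation dataset
image_uris = [
    "s3://<your-bucket>/evaluation-dataset/image_0001.jpg",
    "s3://<your-bucket>/evaluation-dataset/image_0002.jpg",
]

# Each line of the manifest is a JSON object with a source-ref key
with open("input.manifest", "w") as f:
    for uri in image_uris:
        f.write(json.dumps({"source-ref": uri}) + "\n")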

After the job is submitted, you should see output similar to the following:

Labeling job name is: ground-truth-cm-1662738403

Wait for the labeling job on the evaluation dataset to complete successfully, then continue to the next step.

Use the Amazon Rekognition moderation API to generate predicted moderation labels

The following code snippet shows how to use the Amazon Rekognition moderation API to generate moderation labels:

import boto3

client = boto3.client('rekognition')

def moderate_image(photo, bucket):
    # Returns the number of moderation labels detected for an image stored in Amazon S3
    response = client.detect_moderation_labels(Image={'S3Object': {'Bucket': bucket, 'Name': photo}})
    return len(response['ModerationLabels'])

Assess the performance

You first retrieved ground truth moderation labels from the Ground Truth labeling job results for the evaluation dataset, then you ran the Amazon Rekognition moderation API to get predicted moderation labels for the same dataset. Because this is a binary classification problem (safe vs. unsafe content), you can compare the two label sets to count the true positives, true negatives, false positives, and false negatives (treating unsafe content as the positive class), and then calculate the corresponding evaluation metrics:

FPR = FP / (FP + TN)
FNR = FN / (FN + TP)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
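The TP, TN, FP, and FN counts used in these formulas can be derived by comparing the two label sets image by image. The following is a minimal sketch, assuming each image's ground truth and prediction have been reduced to a binary unsafe flag (the placeholder lists stand in for your real results):

# Placeholder lists: 1 = unsafe, 0 = safe, one entry per image in the test set
ground_truth = [1, 0, 0, 1, 0]
predicted    = [1, 0, 1, 0, 0]

TP = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 1 and p == 1)
TN = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 0 and p == 0)
FP = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 0 and p == 1)
FN = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 1 and p == 0)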

Conclusion

This post discussed the key elements needed to evaluate the performance of a content moderation service in terms of various accuracy metrics. However, accuracy is only one of the many dimensions that you need to evaluate when choosing a particular content moderation service. It’s critical that you also consider other factors, such as the service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing. To learn more about content moderation in Amazon Rekognition, visit Amazon Rekognition Content Moderation.


About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Davide Modolo is an Applied Science Manager at AWS AI Labs. He has a PhD in computer vision from the University of Edinburgh (UK) and is passionate about developing new scientific solutions for real-world customer problems. Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.

Jian Wu is a Senior Enterprise Solutions Architect at AWS. He’s been with AWS for 6 years working with customers of all sizes. He is passionate about helping customers to innovate faster via the adoption of the Cloud and AI/ML. Prior to joining AWS, Jian spent 10+ years focusing on software development, system implementation and infrastructure management. Aside from work, he enjoys staying active and spending time with his family.

Read More

Researchers Use AI to Help Earbud Users Mute Background Noise

Thanks to earbuds, people can take calls anywhere, while doing anything. The problem: those on the other end of the call can hear all the background noise, too, whether it’s the roommate’s vacuum cleaner or neighboring conversations at a café.

Now, work by a trio of graduate students at the University of Washington, who spent the pandemic cooped up together in a noisy apartment, lets those on the other end of the call hear just the speaker — rather than all the surrounding sounds.

Users found that the system, dubbed “ClearBuds” — presented last month at the ACM International Conference on Mobile Systems, Applications and Services — suppressed background noise much better than a commercially available alternative.

AI Podcast host Noah Kravitz caught up with the team at ClearBuds to discuss the unlikely pandemic-time origin story behind a technology that promises to make calls clearer and easier, wherever we go.

You Might Also Like

Listen Up: How Audio Analytic Is Teaching Machines to Listen

Audio Analytic has been using machine learning that enables a vast array of devices to make sense of the world of sound. Dr. Chris Mitchell, CEO and founder of Audio Analytic, discusses the challenges and the fun involved in teaching machines to listen.

A Podcast With Teeth: How Overjet Brings AI to Dentists’ Offices

Overjet, a member of the NVIDIA Inception program for startups, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of Overjet, talks about how her company improves patient care with AI-powered technology that analyzes and annotates X-rays for dentists and insurance providers.

Sing It, Sister! Maya Ackerman on LyricStudio, an AI-Based Writing Assistant

Maya Ackerman is the CEO of WaveAI, a Silicon Valley startup using AI and machine learning to, as the company motto puts it, “unlock new heights of human creative expression.” She discusses WaveAI’s LyricStudio software, an AI-based lyric and poetry writing assistant.

Subscribe to the AI Podcast: Now Available on Amazon Music

The AI Podcast is now available through Amazon Music.

In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

The post Researchers Use AI to Help Earbud Users Mute Background Noise appeared first on NVIDIA Blog.

Read More

Meet the Omnivore: Ph.D. Student Lets Anyone Bring Simulated Bots to Life With NVIDIA Omniverse Extension

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse to accelerate their 3D workflows and create virtual worlds.

Yizhou Zhao

When not engrossed in his studies toward a Ph.D. in statistics, conducting data-driven research on AI and robotics, or enjoying his favorite hobby of sailing, Yizhou Zhao is winning contests for developers who use NVIDIA Omniverse — a platform for connecting and building custom 3D pipelines and metaverse applications.

The fifth-year doctoral candidate at the University of California, Los Angeles recently received first place in the inaugural #ExtendOmniverse contest, where developers were invited to create their own Omniverse extension for a chance to win an NVIDIA RTX GPU.

Omniverse extensions are core building blocks that let anyone create and extend functions of Omniverse apps using the popular Python programming language.

Zhao’s winning entry, called “IndoorKit,” allows users to easily load and record robotics simulation tasks in indoor scenes. It sets up robotics manipulation tasks by automatically populating scenes with the indoor environment, the bot and other objects with just a few clicks.

“Typically, it’s hard to deploy a robotics task in simulation without a lot of skills in scene building, layout sampling and robot control,” Zhao said. “By bringing assets into Omniverse’s powerful user interface using the Universal Scene Description framework, my extension achieves instant scene setup and accurate control of the robot.”

Within “IndoorKit,” users can simply click “add object,” “add house,” “load scene,” “record scene” and other buttons to manipulate aspects of the environment and dive right into robotics simulation.

With Universal Scene Description (USD), an open-source, extensible file framework, Zhao seamlessly brought 3D models into his environments using Omniverse Connectors for Autodesk Maya and Blender software.

The “IndoorKit” extension also relies on assets from the NVIDIA Isaac Sim robotics simulation platform and Omniverse’s built-in PhysX capabilities for accurate, articulated manipulation of the bots.

In addition, “IndoorKit” can randomize a scene’s lighting, room materials and more. One scene Zhao built with the extension is highlighted in the feature video above.

Omniverse for Robotics 

The “IndoorKit” extension bridges Omniverse and robotics research in simulation.

A view of Zhao’s “IndoorKit” extension

“I don’t see how accurate robot control was performed prior to Omniverse,” Zhao said. He gives four main reasons why Omniverse was the ideal platform on which to build this extension:

First, Python’s popularity means many developers can build extensions with it to unlock machine learning and deep learning research for a broader audience, he said.

Second, using NVIDIA RTX GPUs with Omniverse greatly accelerates robot control and training.

Third, Omniverse’s ray-tracing technology enables real-time, photorealistic rendering of his scenes. This saves 90% of the time Zhao used to spend on experiment setup and simulation, he said.

And fourth, Omniverse’s real-time advanced physics simulation engine, PhysX, supports an extensive range of features — including liquid, particle and soft-body simulation — which “land on the frontier of robotics studies,” according to Zhao.

“The future of art, engineering and research is in the spirit of connecting everything: modeling, animation and simulation,” he said. “And Omniverse brings it all together.”

Join In on the Creation

Creators and developers across the world can download NVIDIA Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Discover how to build an Omniverse extension in less than 10 minutes.

For a deeper dive into developing on Omniverse, watch the on-demand NVIDIA GTC session, “How to Build Extensions and Apps for Virtual Worlds With NVIDIA Omniverse.”

Find additional documentation and tutorials in the Omniverse Resource Center, which details how developers like Zhao can build custom USD-based applications and extensions for the platform.

To discover more free tools, training and a community for developers, join the NVIDIA Developer Program.

Follow NVIDIA Omniverse on Instagram, Medium, Twitter and YouTube for additional resources and inspiration. Check out the Omniverse forums, and join our Discord server and Twitch channel to chat with the community.

The post Meet the Omnivore: Ph.D. Student Lets Anyone Bring Simulated Bots to Life With NVIDIA Omniverse Extension appeared first on NVIDIA Blog.

Read More