Building a custom Angular application for labeling jobs with Amazon SageMaker Ground Truth

As a data scientist attempting to solve a problem using supervised learning, you usually need a high-quality labeled dataset before you start building your model. Amazon SageMaker Ground Truth makes building datasets for a wide range of tasks, like text classification and object detection, easier and more accessible to everyone.

Ground Truth also helps you build datasets for custom user-defined tasks that let you annotate anything. This capability is powered by the following:

  • Custom AWS Lambda functions that can be triggered between labeling steps. This allows you to have custom logic pre-labeling like filtering examples or augmenting them with metadata using other services like Amazon Translate or Amazon Rekognition, and post-labeling logic for label consolidation or quality control.
  • Custom web templates that let you build unique user interfaces using HTML and JavaScript that integrate with Ground Truth workflows. These templates are easy to build with Crowd HTML Elements, which are a set of common UI elements used for text, video, and audio labeling jobs that you can arrange like blocks in your custom template.
  • Availability of a large set of skilled and specialized workforces in the AWS Marketplace and in Amazon Mechanical Turk if you need to augment your private teams of subject matter experts. Vetted partners in the AWS Marketplace cover numerous languages as well as specific skills in video and image annotations that fit different industry needs (like medical labeling).
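As a minimal sketch of the first bullet, the pre-labeling and post-labeling hooks might look like the following TypeScript. The field names `dataObject`, `taskInput`, and `isHumanAnnotationRequired` follow the pre-annotation Lambda contract as I understand it from the Ground Truth documentation. Everything else is an assumption made for illustration: `translateToEnglish` is a hypothetical stub standing in for a call to a service such as Amazon Translate, `translatedDesc` is simply the key read by the template later in this post, and the majority vote is one simple consolidation strategy rather than the one Ground Truth prescribes.

```typescript
// Request that Ground Truth sends to a pre-annotation Lambda (text input case).
interface PreAnnotationEvent {
  version: string;
  labelingJobArn: string;
  dataObject: { source?: string; "source-ref"?: string };
}

// Response: taskInput becomes `task.input` inside the Liquid template.
interface PreAnnotationResponse {
  taskInput: { source: string; translatedDesc: string };
  isHumanAnnotationRequired: "true" | "false";
}

// Hypothetical stub: a real function would call a translation service here.
function translateToEnglish(text: string): string {
  const glossary: Record<string, string> = {
    "10牛ステーキ-20パッケージ-ブランドX": "10 beef steak - 20 packages - brand X",
  };
  const hit = glossary[text];
  return hit !== undefined ? hit : text;
}

// Pre-labeling logic: filter out empty descriptions, augment the rest with a translation.
function preAnnotationHandler(request: PreAnnotationEvent): PreAnnotationResponse {
  const raw = request.dataObject.source;
  const source = (raw !== undefined ? raw : "").trim();
  return {
    taskInput: { source, translatedDesc: translateToEnglish(source) },
    isHumanAnnotationRequired: source.length > 0 ? "true" : "false",
  };
}

// Post-labeling logic: majority vote over the category codes that several
// workers submitted for the same invoice. Ties go to the code that reached
// the winning count first; an empty list yields undefined.
function consolidateByMajority(codes: string[]): string | undefined {
  const counts = new Map<string, number>();
  let best: string | undefined;
  let bestCount = 0;
  for (const code of codes) {
    const previous = counts.get(code);
    const n = (previous !== undefined ? previous : 0) + 1;
    counts.set(code, n);
    if (n > bestCount) {
      best = code;
      bestCount = n;
    }
  }
  return best;
}
```

Keeping both hooks as plain functions of their input makes them easy to unit test before wiring them into Lambda.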

For complex labeling tasks, such as complex taxonomy classification, extreme multi-class classifications, or autonomous driving labeling tasks, you may need to build a more complex front-end application for your labeling workforce. Front-end frameworks like Angular are helpful in these cases because they bring useful design patterns like model-view-controller (MVC), which makes your codebase more robust and maintainable for a larger team composed of UX/UI designers and software developers.

This post walks you through using Angular and Angular Elements to create fully customizable solutions that work nicely with Ground Truth. This walkthrough assumes that you’re familiar with running a custom labeling job with Ground Truth and Crowd HTML Elements. For more information, see Build a custom data labeling workflow with Amazon SageMaker Ground Truth.

The approach described in this post also works with Amazon Augmented AI (Amazon A2I), which makes it easy to build the workflows required for human review of machine learning predictions. This is possible because Amazon A2I uses Crowd HTML Elements to create custom worker templates. For more information, see Create Custom Worker Templates.

Building a custom UI for complex taxonomy classification

If you manage large supply chains and interact with different types of suppliers, like global food restaurants or automotive manufacturers, you likely receive invoices in different formats and languages. To keep track of your operations and drive financial efficiencies, you need teams behind the scenes to map invoices and receipts to large categories of products and organize them in hierarchical taxonomies.

The following diagram illustrates a hierarchical taxonomy of computer components.

The following diagram illustrates a hierarchical taxonomy of types of food.

Hierarchical taxonomies can have thousands of categories at their leaf level. Such examples can include web directories (the Yahoo! Directory or the Open Directory Project), library classification schemes (Dewey Decimal or Library of Congress), or the classification schemes used in natural science, legal, or medical applications.
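To make the structure concrete, here is a small sketch of how such a taxonomy could be represented and flattened to its leaf categories in TypeScript. The `TaxonomyNode` shape, the category codes, and the sample food tree are all hypothetical and are not taken from the application in this post.

```typescript
// One node of a hierarchical taxonomy; leaves have no children.
interface TaxonomyNode {
  code: string;
  label: string;
  children?: TaxonomyNode[];
}

// A leaf category together with the labels on the path from the root.
interface LeafCategory {
  code: string;
  path: string[];
}

// Hypothetical miniature food taxonomy, for illustration only.
const foodTaxonomy: TaxonomyNode = {
  code: "FOOD",
  label: "Food",
  children: [
    {
      code: "MEAT",
      label: "Meats",
      children: [
        { code: "MEAT-BEEF", label: "Beef" },
        { code: "MEAT-POULTRY", label: "Poultry" },
      ],
    },
    {
      code: "DAIRY",
      label: "Dairy",
      children: [{ code: "DAIRY-CHEESE", label: "Cheese" }],
    },
  ],
};

// Depth-first walk that returns every leaf with its full path of labels.
function collectLeaves(node: TaxonomyNode, ancestors: string[] = []): LeafCategory[] {
  const path = ancestors.concat(node.label);
  if (node.children === undefined || node.children.length === 0) {
    return [{ code: node.code, path }];
  }
  const leaves: LeafCategory[] = [];
  for (const child of node.children) {
    for (const leaf of collectLeaves(child, path)) {
      leaves.push(leaf);
    }
  }
  return leaves;
}

const foodLeaves = collectLeaves(foodTaxonomy);
```

In a labeling UI, the leaf list is what workers ultimately choose from, while the path can drive tabs or breadcrumbs for the higher levels.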

What if a natural language processing (NLP) model could help you automatically tag every invoice to the proper category? What if text labeling tools could extract categories from invoices?

Even if accurate classification over large sets of closely related classes is inherently difficult, it all starts with constructing a high-quality dataset in the most cost-efficient manner.

Taxonomy labeling with Angular Elements

For the following use case, you are one of the biggest fast food chains operating and sourcing materials across the world. To build a dataset for your NLP model, you came up with a single-page web app based on UX research that helps your workforce read an invoice description and select the corresponding category in the taxonomy. See the following screenshot.

This implementation makes use of Angular Material tabs and a filter box that makes navigating the categories easy. It also displays an English translation of your invoice description so that the workers can label invoices from across the world. Moreover, because it’s built on a framework like Angular, you can improve it down the line with more elements, such as drop-downs for the higher levels of the taxonomy or dynamic content like images or videos based on third-party APIs.
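The filter box behavior can be sketched as a plain function, shown below under stated assumptions. The `CODE` field mirrors the `event.detail.CODE` value used later in this post; the `NAME` field and the sample rows are hypothetical. The actual application may rely on Angular Material's own table filtering instead, so treat this only as an illustration of the logic.

```typescript
// A row of the category table; NAME is an assumed display field.
interface CategoryRow {
  CODE: string;
  NAME: string;
}

// Case-insensitive substring match on either the code or the name.
// An empty query returns a copy of all rows.
function filterCategories(rows: CategoryRow[], query: string): CategoryRow[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") {
    return rows.slice();
  }
  return rows.filter(
    (row) =>
      row.NAME.toLowerCase().indexOf(needle) !== -1 ||
      row.CODE.toLowerCase().indexOf(needle) !== -1
  );
}

// Hypothetical sample rows for the Meats tab.
const meatRows: CategoryRow[] = [
  { CODE: "M-001", NAME: "Beef steak" },
  { CODE: "M-002", NAME: "Ground beef" },
  { CODE: "M-003", NAME: "Chicken breast" },
];

const beefRows = filterCategories(meatRows, "  BEEF ");
```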

For more information about this application, see the GitHub repo.

The application is built using Angular Elements, which creates Angular components packaged as custom elements (also called web components), a web standard for defining new HTML elements in a framework-agnostic way. This enables you to integrate smoothly with Crowd HTML Elements later on.

Angular Elements inputs and outputs

In this use case, your Angular component expects two inputs: an invoice description and an invoice translation. These are passed to it as tag attributes on <ng-home> (the root element of the application). The values are then captured by the @Input() decorators defined in the Angular controller in src/app/home.ts. See the following code:

<ng-home source='10牛ステーキ-20パッケージ-ブランドX' translation='10 beef steak - 20 packages - brand X' id="home">loading</ng-home> 

export class Home implements OnInit {

  @Input() invoice = '';
  @Input() translation = '';
  // ... rest of the component omitted
}

The values are rendered using interpolation bindings in the placeholders {{ invoice }} and {{ translation }} in the Angular view in src/app/home.html. See the following code:

<!-- Invoice Description -->
<div class="card" >
    <div class="card-header">
        <h3>Invoice Description</h3>
        <p id="step1">
        <span>Invoice Description: <br />
        <b>{{ invoice }}</b></span>
        </p>
        <p style='font-size: small; color: gray;' id="step2">
        <span>English Translation: <br /> {{ translation }}</span>
        </p>
    </div>
</div>

The following screenshot shows the Meats tab on the Food Categories page.

When you choose a category and choose Submit, the Angular component should also broadcast a JavaScript event containing the category ID to its parent DOM element. This is achieved using the @Output() decorator in the Angular controller in src/app/home.ts. See the following code:

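<button mat-button color="primary" (click)="onSubmit()" id="submitButton">Submit</button>

    <tr mat-row *matRowDef="let row; columns: displayedColumns;"
        (click)="selectRow(row)" [ngClass]="{ 'highlight': row === selectedRow }"></tr>

@Output('rowselected') rowselected = new EventEmitter<any>();

// Called when the user clicks a row in the table ("selecting" a category)
selectRow(row) {
      this.selectedRow = row;
}

// Called when the user clicks the Submit button.
// The body is not shown in this excerpt; presumably it emits the selected row,
// which the listener in the template later reads as event.detail.
onSubmit() {
      this.rowselected.emit(this.selectedRow);
}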

Angular integration with Crowd HTML Elements

Communication between Angular Elements and Crowd HTML Elements happens through the input attributes and output events described in the preceding section.

Following the steps described in Build a custom data labeling workflow with Amazon SageMaker Ground Truth, you can adapt how to pass the text to annotate and how to catch the broadcasted event from Angular Elements to create your custom template.
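The hand-off can be illustrated without a browser. The following sketch replaces the DOM with two small stand-ins so the flow from the `rowselected` event to the hidden `annotations` field can be exercised in isolation. `RowSelectedSource`, `HiddenFormFields`, and `wireAnnotation` are hypothetical names introduced here; only the event name and the `CODE` field come from this post.

```typescript
// Payload carried in event.detail; only CODE is referenced in this post.
interface RowSelectedDetail {
  CODE: string;
}

type RowSelectedListener = (detail: RowSelectedDetail) => void;

// Minimal stand-in for the <ng-home> custom element dispatching 'rowselected'.
class RowSelectedSource {
  private listeners: RowSelectedListener[] = [];

  addListener(listener: RowSelectedListener): void {
    this.listeners.push(listener);
  }

  emit(detail: RowSelectedDetail): void {
    for (const listener of this.listeners) {
      listener(detail);
    }
  }
}

// Stand-in for the hidden <input> fields inside <crowd-form>.
interface HiddenFormFields {
  annotations: string;
}

// Copies the selected category code into the form field whenever the
// component announces a selection.
function wireAnnotation(source: RowSelectedSource, fields: HiddenFormFields): void {
  source.addListener((detail) => {
    fields.annotations = detail.CODE;
  });
}

const ngHomeStandIn = new RowSelectedSource();
const hiddenFields: HiddenFormFields = { annotations: "" };
wireAnnotation(ngHomeStandIn, hiddenFields);
ngHomeStandIn.emit({ CODE: "MEAT-BEEF" });
```

In the real template, the browser's `addEventListener` and `document.getElementById(...).value` play the roles of `addListener` and `fields.annotations`.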

The following code shows the full Liquid HTML template to use when you create your job. This file should also be the index.html root file of the Angular app under the src/ folder. (Make sure to use the index.html file under the dist folder, which has the minified .js files injected into it, with the right Amazon Simple Storage Service (Amazon S3) path to host your app.)

<!doctype html>
<html lang="en">
    <script src=""></script>

    <crowd-form style="display: none;">
        <input name="annotations" id="annotations" type="hidden">
        <input name="timeElapsed" id="timeElapsed" type="hidden">
         <!-- Prevent crowd-form from creating its own button -->
        <crowd-button form-action="submit" style="display: none;"></crowd-button>
    </crowd-form>

    <div class="mat-app-background basic-container">
      <!-- Dev Mode to test the Angular Element -->
      <!-- <ng-home source='10牛ステーキ-20パッケージ-ブランドX' translation='10 beef steak - 20 packages - brand X' id="home">loading</ng-home> -->
      <ng-home source='{{ task.input.source }}' translation='{{ task.input.translatedDesc }}'>loading</ng-home>
    </div>

    <script src="<your-s3-bucket-angular-app>/runtime-es2015.js" type="module"></script>
    <script src="<your-s3-bucket-angular-app>/runtime-es5.js" nomodule defer></script>
    <script src="<your-s3-bucket-angular-app>/polyfills-es5.js" nomodule defer></script>
    <script src="<your-s3-bucket-angular-app>/polyfills-es2015.js" type="module"></script>
    <script src="<your-s3-bucket-angular-app>/styles-es2015.js" type="module"></script>
    <script src="<your-s3-bucket-angular-app>/styles-es5.js" nomodule defer></script>
    <script src="<your-s3-bucket-angular-app>/vendor-es2015.js" type="module"></script>
    <script src="<your-s3-bucket-angular-app>/vendor-es5.js" nomodule defer></script>
    <script src="<your-s3-bucket-angular-app>/main-es2015.js" type="module"></script>
    <script src="<your-s3-bucket-angular-app>/main-es5.js" nomodule defer></script>


    <script>
  document.addEventListener("DOMContentLoaded", function(event) {
    // Counter
    var enterDate = new Date();
    function secondsSinceEnter() {
      return (new Date() - enterDate) / 1000;
    }

    // GT Form Submitting
    document.querySelector('ng-home').addEventListener('rowselected', (event) => {
      // alert(event.detail.CODE);
      document.getElementById('annotations').value = event.detail.CODE;
      document.getElementById('timeElapsed').value = secondsSinceEnter();
    });
  });
    </script>


    <style>
  .body {
    background-color: #fafafa;
  }

  .header {
    background: #673ab7;
      color: #fff;
      padding: 0 16px;
      margin: 20px 20px 0px 20px;
      padding: 20px;
  }

  .cards {
    display: grid;
    grid-template-columns: 30% auto;
    grid-auto-rows: auto;
    grid-gap: 1rem;
    margin: 20px 20px 0px 20px;
  }

  .card {
    box-shadow: 0 2px 1px -1px rgba(0,0,0,.2), 0 1px 1px 0 rgba(0,0,0,.14), 0 1px 3px 0 rgba(0,0,0,.12);
    transition: box-shadow 280ms cubic-bezier(.4,0,.2,1);
    display: block;
    position: relative;
    padding: 16px;
    border-radius: 4px;
    /* margin: 20px 0px 0px 20px; */
    border: 2px solid #e7e7e7;
    border-radius: 4px;
  }

  .highlight-step {
    background-color: #2515424a;
    margin: 0px -15px 0px -15px;
    padding: 15px;
  }
    </style>
</html>

Creating the template

To create the preceding template, complete the following steps:

  1. Add the crowd-html-elements.js script at the top of the template so you can use Crowd HTML Elements:
    <script src=""></script>

  2. Inject the text to annotate and the associated metadata coming from the pre-processing Lambda function into the user interface using the Liquid templating language directly in the root element <ng-home>:
    <ng-home source='{{ task.input.source }}' translation='{{ task.input.translated }}' id="home">loading</ng-home>

  3. Use the <crowd-form /> element, which submits the annotations to Ground Truth. The element is hidden because the submission happens in the background. See the following code:
    <crowd-form style="display: none;">
            <input name="annotations" id="annotations" type="hidden">
            <input name="timeElapsed" id="timeElapsed" type="hidden">
             <!-- Prevent crowd-form from creating its own button -->
            <crowd-button form-action="submit" style="display: none;"></crowd-button>
    </crowd-form>

  4. Instead of using Crowd HTML Elements to submit the annotation, include a small script to integrate the Angular Element with <crowd-form />:
    document.addEventListener("DOMContentLoaded", function(event) {
        var enterDate = new Date();
        function secondsSinceEnter() {
          return (new Date() - enterDate) / 1000;
        }
        document.querySelector('ng-home').addEventListener('rowselected', (event) => {
          document.getElementById('annotations').value = event.detail.CODE;
          document.getElementById('timeElapsed').value = secondsSinceEnter();
        });
    });

For this use case, I’m also keeping a counter to monitor the time it takes a worker to complete the annotation.
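The counter itself is small enough to sketch on its own. The version below is an illustrative TypeScript variant, not the code from the template: `startTimer` is a hypothetical helper that takes an injectable clock so it can be tested deterministically, and it works on numeric timestamps because subtracting two `Date` objects directly does not type-check in TypeScript.

```typescript
// Returns a function that reports the seconds elapsed since startTimer was called.
// The clock is injectable; by default it reads the system time in milliseconds.
function startTimer(now: () => number = () => Date.now()): () => number {
  const enteredAt = now();
  return () => (now() - enteredAt) / 1000;
}

// Usage with a fake clock: the worker "enters" at t = 1000 ms
// and the reading is taken at t = 3500 ms.
let fakeNowMs = 1000;
const secondsSinceEnter = startTimer(() => fakeNowMs);
fakeNowMs = 3500;
const elapsedSeconds = secondsSinceEnter();
```

In the template, the timer would be started inside the DOMContentLoaded handler and read when the `rowselected` event fires.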

The following diagram illustrates the data flow between each element.


This post showed how to build a custom labeling UI with Angular and Ground Truth. The solution can handle communication between the different scopes in the custom template provided in the labeling job creation. The ability to use a custom front-end framework like Angular enables you to easily create modern web applications that serve your exact needs when tapping into public, private, or vendor labeling workforces.

For more information about hierarchical taxonomies in Ground Truth, see Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth.

If you have any comments or questions about this post, please use the comments section. Happy labeling!

About the Authors

Yassine Landa is a Data Scientist at AWS. He holds an undergraduate degree in Math and Physics, and master’s degrees from French universities in Computer Science and Data Science, Web Intelligence, and Environment Engineering. He is passionate about building machine learning and artificial intelligence products for customers, and has won multiple awards for machine learning products he has built with tech startups and as a startup founder.

Research shows news articles following suicide prevention best practices get more engagement on Facebook

A new study published today in the journal Proceedings of the National Academy of Sciences examines how news stories on Facebook adhere to suicide reporting guidelines and their engagement on the platform. Led by CDC researchers with support of Facebook researchers, the report is part of the CDC’s work to understand the impact of safe suicide-reporting on social media.

Key findings include the following:

  • More than half (60%) of the most-shared news articles about suicide did not include any protective information, such as a suicide prevention helpline or public health resources.
  • The majority of articles included harmful elements that go against suicide reporting guidelines, such as explicitly reporting the name of the person who died (60%), featuring the word “suicide” prominently in the headline (59%), and publicizing details about the location (55%) or method (50%).
  • When news articles followed more of the suicide prevention guidelines, they got more engagement on Facebook. Each additional guideline followed is associated with a 19% increase in the odds of an article being reshared.

Leading suicide prevention organizations and health authorities such as the World Health Organization (WHO) and the U.S. Centers for Disease Control and Prevention (CDC) have created guidelines for news organizations to cover suicide more responsibly. Their goal is to reduce sensationalism around suicide and prevent exposure to content that may trigger vulnerable people, such as a description of the method used or the location where it took place. The guidelines also recommend including resources for people in crisis, such as suicide helplines. See the full report here.

The post Research shows news articles following suicide prevention best practices get more engagement on Facebook appeared first on Facebook Research.

Hardhats and AI: Startup Navigates 3D Aerial Images for Inspections

Childhood buddies from back in South Africa, Nicholas Pilkington, Jono Millin and Mike Winn went off together to a nearby college, teamed up on a handful of startups and kept a pact: work on drones once a week.

That dedication is paying off. Their drone startup, based in San Francisco, is picking up interest worldwide and has landed $35 million in Series D funding.

It all catalyzed in 2014, when the friends were accepted into the AngelPad accelerator program in Silicon Valley. They founded DroneDeploy there, enabling contractors to capture photos, maps, videos and high-fidelity panoramic images for remote inspections of job sites.

“We had this a-ha moment: Almost any industry can benefit from aerial imagery, so we set out to build the best drone software out there and make it easy for everyone,” said Pilkington, co-founder and CTO at DroneDeploy.

DroneDeploy’s AI software platform — it’s the navigational brains and eyes — is operating in more than 200 countries and handling more than 1 million flights a year.

Nailing Down Applications

DroneDeploy’s software has been adopted in construction, agriculture, forestry, search and rescue, inspection, conservation and mining.

In construction, DroneDeploy is used by one-quarter of the world’s 400 largest building contractors and six of the top 10 oil and gas companies, according to the company.

DroneDeploy was one of three startups that recently presented at an NVIDIA Inception Connect event held by Japanese insurer Sompo Holdings. For good reason: Startups are helping insurance and reinsurance firms become more competitive by analyzing portfolio risks with AI.

The NVIDIA Inception program nurtures startups with access to GPU guidance, Deep Learning Institute courses, networking and marketing opportunities.

Navigating Drone Software

DroneDeploy offers features like fast setup of autonomous flights, photogrammetry to take physical measurements and APIs for drone data.

In addition to supporting industry-leading drones and hardware, DroneDeploy operates an app ecosystem for partners to build apps using its drone data platform. John Deere, for example, offers an app for customers to upload aerial drone maps of their fields to their John Deere account so that they can plan flights based on the field data.

Split-second photogrammetry and 360-degree images provided by DroneDeploy’s algorithms running on NVIDIA GPUs in the cloud help provide pioneering mapping and visibility.

AI on Safety, Cost and Time

Drones used in high places instead of people can aid in safety. The U.S. Occupational Safety and Health Administration last year reported that 22 people were killed in roofing-related accidents in the U.S.

Inspecting roofs and solar panels with drone technology can improve that safety record. It can also save on cost: The traditional alternative to having people on rooftops to perform these inspections is using helicopters.

Customers of the DroneDeploy platform can follow a quickly created map to carry out a sequence of inspections with guidance from cameras fed into image recognition algorithms.

Using drones, customers can speed up inspections by 80 percent, according to the company.  

“In areas like oil, gas and energy, it’s about zero-downtime inspections of facilities for operations and safety, which is a huge value driver for these customers,” said Pilkington.

The post Hardhats and AI: Startup Navigates 3D Aerial Images for Inspections appeared first on The Official NVIDIA Blog.

Google at ACL 2020

Posted by Cat Armato and Emily Knapp, Program Managers

This week, the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, takes place online.

As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2020, Google will showcase the latest research in the field with over 30 publications, and the organization of and participation in a variety of workshops and tutorials.

If you’re registered for ACL 2020, we hope that you’ll visit the Google virtual booth to learn more about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about the Google research being presented at ACL 2020 below (Google affiliations bolded).

Diversity & Inclusion (D&I) Chair: Vinodkumar Prabhakaran
Accessibility Chair: Sushant Kafle
Local Sponsorship Chair: Kristina Toutanova
Virtual Infrastructure Committee: Yi Luan
Area Chairs: Anders Søgaard, Ankur Parikh, Annie Louis, Bhuvana Ramabhadran, Christo Kirov, Daniel Cer, Dipanjan Das, Diyi Yang, Emily Pitler, Eunsol Choi, George Foster, Idan Szpektor, Jacob Eisenstein, Jason Baldridge, Jun Suzuki, Kenton Lee, Luheng He, Marius Pasca, Ming-Wei Chang, Sebastian Gehrmann, Shashi Narayan, Slav Petrov, Vinodkumar Prabhakaran, Waleed Ammar, William Cohen

Long Papers
Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage
Ashish V. Thapliyal, Radu Soricut

Automatic Detection of Generated Text is Easiest when Humans are Fooled
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, Douglas Eck

On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, Fei Sha

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
Xuanli He, Gholamreza Haffari, Mohammad Norouzi

GoEmotions: A Dataset of Fine-Grained Emotions
Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, Sujith Ravi

TaPas: Weakly Supervised Table Parsing via Pre-training (see blog post)
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Eisenschlos

Toxicity Detection: Does Context Really Matter?
John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos

(Re)construing Meaning in NLP
Sean Trott, Tiago Timponi Torrent, Nancy Chang, Nathan Schneider

Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models
Dan Iter, Kelvin Guu, Larry Lansing, Dan Jurafsky

Probabilistic Assumptions Matter: Improved Models for Distantly-Supervised Document-Level Question Answering
Hao Cheng, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

AdvAug: Robust Adversarial Augmentation for Neural Machine Translation
Yong Cheng, Lu Jiang, Wolfgang Macherey, Jacob Eisenstein

Named Entity Recognition as Dependency Parsing
Juntao Yu, Bernd Bohnet, Massimo Poesio

Cross-modal Coherence Modeling for Caption Generation
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone

Representation Learning for Information Extraction from Form-like Documents (see blog post)
Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork

Low-Dimensional Hyperbolic Knowledge Graph Embeddings
Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, Christopher Ré

What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber, Benjamin Börschinger

Learning a Multi-Domain Curriculum for Neural Machine Translation (see blog post)
Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, Zarana Parekh

Translationese as a Language in “Multilingual” NMT
Parker Riley, Isaac Caswell, Markus Freitag, David Grangier

Mapping Natural Language Instructions to Mobile UI Action Sequences
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge

BLEURT: Learning Robust Metrics for Text Generation (see blog post)
Thibault Sellam, Dipanjan Das, Ankur Parikh

Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing
Alane Suhr, Ming-Wei Chang, Peter Shaw, Kenton Lee

Frugal Paradigm Completion
Alexander Erdmann, Tom Kenter, Markus Becker, Christian Schallhart

Short Papers
Reverse Engineering Configurations of Neural Text Generation Models
Yi Tay, Dara Bahri, Che Zheng, Clifford Brunk, Donald Metzler, Andrew Tomkins

Syntactic Data Augmentation Increases Robustness to Inference Heuristics
Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, Tal Linzen

Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Aditya Siddhant, Ankur Bapna, Yuan Cao, Orhan Firat, Mia Chen, Sneha Kudugunta, Naveen Arivazhagan, Yonghui Wu

Social Biases in NLP Models as Barriers for Persons with Disabilities
Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, Stephen Denuyl

Toward Better Storylines with Sentence-Level Language Models
Daphne Ippolito, David Grangier, Douglas Eck, Chris Callison-Burch

TACL Papers
TYDI QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (see blog post)
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, Jennimaria Palomaki

Phonotactic Complexity and Its Trade-offs
Tiago Pimentel, Brian Roark, Ryan Cotterell

Multilingual Universal Sentence Encoder for Semantic Retrieval (see blog post)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

Workshops
IWPT – The 16th International Conference on Parsing Technologies
Yuji Matsumoto, Stephan Oepen, Kenji Sagae, Anders Søgaard, Weiwei Sun and Reut Tsarfaty

ALVR – Workshop on Advances in Language and Vision Research
Xin Wang, Jesse Thomason, Ronghang Hu, Xinlei Chen, Peter Anderson, Qi Wu, Asli Celikyilmaz, Jason Baldridge and William Yang Wang

WNGT – The 4th Workshop on Neural Generation and Translation
Alexandra Birch, Graham Neubig, Andrew Finch, Hiroaki Hayashi, Kenneth Heafield, Ioannis Konstas, Yusuke Oda and Xian Li

NLPMC – NLP for Medical Conversations
Parminder Bhatia, Chaitanya Shivade, Mona Diab, Byron Wallace, Rashmi Gangadharaiah, Nan Du, Izhak Shafran and Steven Lin

AutoSimTrans – The 1st Workshop on Automatic Simultaneous Translation
Hua Wu, Colin Cherry, James Cross, Liang Huang, Zhongjun He, Mark Liberman and Yang Liu

Tutorials
Interpretability and Analysis in Neural NLP (Cutting-edge)
Yonatan Belinkov, Sebastian Gehrmann, Ellie Pavlick

Commonsense Reasoning for Natural Language Processing (Introductory)
Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, Dan Roth

You Can’t Touch This: Deep Clean System Flags Potentially Contaminated Surfaces

Amid the continued spread of coronavirus, extra care is being taken by just about everyone to wash hands and wipe down surfaces, from countertops to groceries.

To spotlight potentially contaminated surfaces, hobbyist Nick Bild has come up with Deep Clean, a stereo camera system that flags objects that have been touched in a room.

The device can be used by cleaning crews at hospitals and assisted living facilities or anyone who’d like to know what areas need special attention when trying to prevent disease transmission.

Deep Clean uses an NVIDIA Jetson AGX Xavier developer kit as the main processing unit to map out a room, detecting where different objects lie within it. Jetson helps pinpoint the exact location (x,y-coordinates) and depth (z-coordinate) of each object.

When an object overlaps with a person’s hand, which is identified by an open-source body keypoint detection system called OpenPose, those coordinates are stored in the system’s memory. To maintain users’ privacy, only the coordinates are stored, not the images.

Then, the coordinates are used to automatically annotate an image of the unoccupied room, displaying what has been touched and thus potentially contaminated.

Nick the Bild-er: Equipped with the Right Tools

When news broke in early March that COVID-19 was spreading in the U.S., Bild knew he had to take action.

“I’m not a medical doctor. I’m not a biologist. So, I thought, what can I do as a software developer slash hardware hacker to help?” said Bild.

Juggling a software engineering job by day, as well as two kids at home in Orlando, Florida, Bild faced the challenges of finding the time and resources to get this machine built. He knew getting his hands on a 3D camera would be expensive, which is why he turned to Jetson, an edge AI platform he found to be simultaneously affordable and powerful.

Deep Clean’s stereo camera system. Image courtesy of Nick Bild.

“It’s really a good general-purpose tool that hits the sweet spot of low price and good performance,” said Bild. “You can do a lot of different types of tasks — classify images, sounds, pretty much whatever kind of AI inference you need to do.”

Within a week and a half, Bild had made a 3D camera of his own, which he further developed into the prototype for Deep Clean.

Looking ahead, Bild hopes to improve the device to detect sources of potential contamination beyond human touch, such as cough or sneeze droplets.

Technology to Help the Community

Deep Clean isn’t Bild’s first instance of helping the community through his technological pursuits. He’s developed seven different projects since he began using NVIDIA products when the first Jetson Nano was released in March 2019.

One of these projects, a pair of AI-enabled glasses, won NVIDIA’s Jetson Community Project of the Month Award for allowing people to switch devices such as a lamp or stereo on and off simply by looking at them and waving. The shAIdes are especially helpful for those with limited mobility.

Bild calls himself a “prototyper,” as he creates a variety of smart, useful devices like Deep Clean in hopes that someday one will be made available for wide commercial use.

A fast learner who’s committed to making a difference, Bild is always exploring how to make a device better and looking for what to embark upon as his next project.

Anyone can get started on a Jetson project. Learn how on the Jetson developers page.

The post You Can’t Touch This: Deep Clean System Flags Potentially Contaminated Surfaces appeared first on The Official NVIDIA Blog.

Stanford AI Lab Papers and Talks at ACL 2020

The 58th annual meeting of the Association for Computational Linguistics is being hosted virtually this week. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Authors: Weixin Liang, James Zou, Zhou Yu


Keywords: dialog, automatic dialog evaluation, user experience

Contextual Embeddings: When Are They Worth It?

Authors: Simran Arora, Avner May, Jian Zhang, Christopher Ré


Links: Paper | Video

Keywords: contextual embeddings, pretraining, benefits of context

Enabling Language Models to Fill in the Blanks

Authors: Chris Donahue, Mina Lee, Percy Liang


Links: Paper | Blog Post | Video

Keywords: natural language generation, infilling, fill in the blanks, language models

ExpBERT: Representation Engineering with Natural Language Explanations

Authors: Shikhar Murty, Pang Wei Koh, Percy Liang


Links: Paper | Video

Keywords: language explanations, bert, relation extraction, language supervision

Finding Universal Grammatical Relations in Multilingual BERT

Authors: Ethan A. Chi, John Hewitt, Christopher D. Manning


Links: Paper | Blog Post

Keywords: analysis, syntax, multilinguality

Is Your Classifier Actually Biased? Measuring Fairness under Uncertainty with Bernstein Bounds

Authors: Kawin Ethayarajh


Links: Paper

Keywords: fairness, bias, equal opportunity, ethics, uncertainty

Low-Dimensional Hyperbolic Knowledge Graph Embeddings

Authors: Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, Christopher Ré


Links: Paper | Video

Keywords: knowledge graphs, hyperbolic embeddings, link prediction

Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports

Authors: Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, Curtis P. Langlotz


Links: Paper

Keywords: nlp, text summarization, reinforcement learning, medicine, radiology report

Orthogonal Relation Transforms with Graph Context Modeling for Knowledge Graph Embedding

Authors: Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He and Bowen Zhou


Links: Paper | Video

Keywords: orthogonal transforms, knowledge graph embedding

Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models

Authors: Dan Iter, Kelvin Guu, Larry Lansing, Dan Jurafsky


Links: Paper

Keywords: discourse coherence, language model pretraining

Robust Encodings: A Framework for Combating Adversarial Typos

Authors: Erik Jones, Robin Jia, Aditi Raghunathan, Percy Liang


Links: Paper

Keywords: nlp, robustness, adversarial robustness, typos, safe ml

SenseBERT: Driving Some Sense into BERT

Authors: Yoav Levine, Barak Lenz, Or Dagan, Ori Ram, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, Yoav Shoham


Links: Paper | Blog Post

Keywords: language models, semantics

Shaping Visual Representations with Language for Few-shot Classification

Authors: Jesse Mu, Percy Liang, Noah Goodman


Links: Paper

Keywords: grounding, language supervision, vision, few-shot learning, meta-learning, transfer

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Authors: Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning


Links: Paper

Keywords: natural language processing, multilingual, data-driven, neural networks

Theoretical Limitations of Self-Attention in Neural Sequence Models

Authors: Michael Hahn


Links: Paper

Keywords: theory, transformers, formal languages

Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking

Authors: Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica S. Lam


Links: Paper

Keywords: dialogue state tracking, multiwoz, zero-shot, data programming, pretraining

We look forward to seeing you at ACL 2020!

CATER: A diagnostic dataset for Compositional Actions and Temporal Reasoning

Introducing CATER: A diagnostic dataset for video understanding that, by design, requires temporal reasoning to be solved.

While deep features have revolutionized static image analysis, deep video descriptors have struggled to outperform classic hand-crafted descriptors. Though recent works have shown improvements through spatio-temporal architectures, simpler frame-based architectures still routinely appear among top performers in video challenge benchmarks. This raises the natural question: are videos trivially understandable by simply aggregating the predictions over a sampled set of frames?

At some level, the answer must be no. Reasoning about high-level cognitive concepts such as intentions, goals, and causal relations requires reasoning over long-term temporal structure and order. For example, consider the cup-and-ball parlor game shown next. In these games, an operator puts a target object (ball) under one of multiple container objects (cups), and moves them about, possibly revealing the target at various times and recursively containing cups within other cups. The task at the end is to tell which of the cups is covering the ball. Even in its simplest instantiation, one can expect any human or computer system that solves this task to require the ability to model the state of the world over long temporal horizons, reason about occlusion, understand the spatiotemporal implications of containment, etc.

Humans (and apparently even cats!) are able to solve long temporal reasoning tasks like locating the ball as the cups are shuffled. Can we design similarly hard spatiotemporal reasoning tasks for computers?

Given such intricate requirements for spatiotemporal reasoning, why don’t spatiotemporal models readily outperform their frame-based counterparts? We posit that this is due to limitations of existing video benchmarks. Even though datasets have evolved significantly in the past few years, tasks have remained highly correlated to the scene and object context. In this work, we take an alternate approach to developing a video understanding dataset. Inspired by the CLEVR dataset, and the adversarial parlor games described above, we introduce CATER, a diagnostic dataset for Compositional Actions and TEmporal Reasoning in dynamic tabletop scenes. We define three tasks on the dataset, each with an increasingly higher level of complexity, but set up as classification problems in order to be comparable to existing benchmarks for easy transfer of existing models and approaches. Next, we introduce the dataset, associated tasks, and evaluate state of the art video models, showing that they struggle on CATER’s temporal reasoning tasks.

The CATER Dataset

Sample videos from our proposed CATER dataset, shown in three settings: 2 objects move at a time; all objects move; the camera also moves.

The CLEVR dataset was designed to evaluate the visual reasoning ability of question-answering models. Given a complex scene, the task was to answer corresponding automatically generated questions. Figure taken from the original paper.

Our CATER dataset builds upon the CLEVR dataset, which was originally proposed for question-answering based visual reasoning tasks (an example is shown on the left). CATER inherits and extends the set of object shapes, sizes, colors and materials present in CLEVR. Specifically, we add two new object shapes: inverted cones and a special object called a ‘snitch’, which we created to look like three intertwined toruses in metallic gold color (see if you can find it in the above videos!). In addition, we define four atomic actions: ‘rotate’, ‘pick-place’, ‘slide’ and ‘contain’; a subset of which is afforded by each object. While the first three are self-explanatory, the final action, ‘contain’, is a special operation, only afforded by the cones. In it, a cone is pick-placed on top of another object, which may be a sphere, a snitch or even a smaller cone. This also allows for recursive containment, as a cone can contain a smaller cone that contains another object. Once a cone ‘contains’ an object, it is constrained to only ‘slide’ actions and effectively slides all objects contained within the cone. This holds until the top-most cone is ‘pick-place’d to another location, effectively ending the containment for that top-most cone. Given these objects and actions, the videos are generated by first spawning a random number of objects with random parameters at random locations, and then adding a series of actions for all or a subset of the objects. As the samples from CATER above show, we can control the complexity of the data by limiting the number of objects, the number of moving objects, and the motion of the camera.
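
The containment rule is easier to see in a few lines of code. The following is a minimal sketch and not the CATER generator itself: the `SceneObject` class, its fields, and the 2D positions are all hypothetical names chosen for illustration. It only models the rule that sliding a cone drags along everything it recursively contains, and that pick-placing the top-most cone ends its containment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[float, float]


@dataclass
class SceneObject:
    """Toy stand-in for a CATER object (hypothetical, for illustration only)."""
    name: str
    shape: str                                 # e.g. "cone", "sphere", "snitch"
    position: Position
    contained: Optional["SceneObject"] = None  # object directly under this cone

    def contain(self, other: "SceneObject") -> None:
        """Pick-place this cone on top of `other`; only cones afford this."""
        if self.shape != "cone":
            raise ValueError("only cones afford the 'contain' action")
        self.position = other.position
        self.contained = other

    def slide(self, new_position: Position) -> None:
        """Slide this object, dragging everything it recursively contains."""
        obj = self
        while obj is not None:
            obj.position = new_position
            obj = obj.contained

    def pick_place(self, new_position: Position) -> None:
        """Lift this cone to a new location, which ends its containment."""
        self.contained = None
        self.position = new_position


snitch = SceneObject("snitch", "snitch", (0.0, 0.0))
small_cone = SceneObject("small_cone", "cone", (1.0, 1.0))
large_cone = SceneObject("large_cone", "cone", (2.0, 2.0))

small_cone.contain(snitch)          # a cone now covers the snitch
large_cone.contain(small_cone)      # recursive containment
large_cone.slide((2.0, -1.0))       # the hidden snitch moves with the cones
large_cone.pick_place((-2.0, 0.0))  # only the top-most cone leaves

print(snitch.position)  # (2.0, -1.0)
```

After these four actions the snitch sits at a location where it was never directly visible, which is the situation the snitch localization task below is built around.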

Tasks on CATER

We define three tasks with increasing temporal reasoning complexity on CATER, as illustrated in the figure above.

  • The first and perhaps the simplest task is Atomic Action Recognition. This task requires the models to classify what individual actions happen in a given video, out of a total of 14 action classes we define in CATER. This is set up as a multi-label classification problem, since multiple actions can happen in a video, and the model needs to predict all of them. The predictions are evaluated using mean Average Precision (mAP), which is a standard metric used in similar problem settings.
  • The second task on CATER is Compositional Action Recognition. Here, the models need to reason about the ordering of different actions in the video (i.e., X happens “before”/“after”/“during” Y), and predict all the combinations that are active in the given video, out of the 301 unique combinations possible using the 14 atomic actions. Similar to Task 1, it is evaluated using mAP.
  • Finally, the flagship task on CATER is Snitch Localization, which is directly inspired by the cup-and-ball shell game described earlier. The task now is to predict the location of the snitch at the end of the video. While it may seem trivial to localize it from the last frame, it may not always be possible to do that due to occlusions and recursive containments. For simplicity, we pose this as a classification problem by quantizing the 6 × 6 grid on the floor into 36 cells, and evaluate the performance using a standard accuracy metric.
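
Two of the numbers above can be reproduced with a short sketch. The first function shows one way to arrive at 301 classes for Task 2, under the assumption that "X after Y" is the same class as "Y before X" and that "during" is symmetric; this deduplication rule is our reading of the task, not code from the CATER release. The second function quantizes a floor position into one of the 36 cells for Task 3; the floor extent of [-3, 3] on each axis and the row-major cell numbering are assumptions made for illustration.

```python
from itertools import product

RELATIONS = ("before", "during", "after")


def canonical(a, relation, b):
    """Map equivalent compositions onto a single representative."""
    if relation == "after":       # "a after b" is the same as "b before a"
        return (b, "before", a)
    if relation == "during":      # assumed symmetric: order does not matter
        return (min(a, b), "during", max(a, b))
    return (a, "before", b)


def count_compositions(num_actions=14):
    """Count unique (action, relation, action) classes after deduplication."""
    actions = range(num_actions)
    unique = {canonical(a, r, b) for a, r, b in product(actions, RELATIONS, actions)}
    return len(unique)


def grid_cell(x, y, half_extent=3.0, cells_per_side=6):
    """Quantize a floor position into a class index in [0, cells_per_side**2)."""
    cell_size = 2 * half_extent / cells_per_side
    col = int((x + half_extent) // cell_size)
    row = int((y + half_extent) // cell_size)
    # Clamp so that points on the far edge fall into the last cell.
    col = max(0, min(col, cells_per_side - 1))
    row = max(0, min(row, cells_per_side - 1))
    return row * cells_per_side + col


print(count_compositions())   # 301
print(grid_cell(-3.0, -3.0))  # 0
print(grid_cell(2.9, 2.9))    # 35
```

The count breaks down as 14 × 14 = 196 ordered "before" pairs plus 105 unordered "during" pairs, which matches the 301 classes quoted above.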

How do video models fare on CATER?

We consider state-of-the-art video architectures of two kinds: those based on 2D convolution over frames (left) and those based on 3D, or spatio-temporal, convolutions over video clips (right).

We evaluate multiple state of the art video recognition models on all tasks defined on CATER. Specifically, we focus on three broad model architectures:

  • Frame-level models: These models are similar to image classification models, and are applied to a sampled set of frames from the video. The specific architecture we use is Inception-V2, based on the popular TSN paper, which obtained strong performance on multiple video understanding tasks.
  • Spatio-temporal models: These models extend the idea of 2D convolution (over the height-width of an image) to a 3D convolution (over the height-width-time of a video clip), as shown in the figure above. These models are better capable of capturing the temporal dynamics in videos. Specifically, we experiment with the ResNet-3D architecture, along with its non-local extension, from the Non-Local Neural Networks paper, which also obtains strong performance on video recognition tasks.
  • Tracker: For task 3, we also experiment with a tracking based solution, where we initialize a state of the art tracker to a box around the initial ground truth position of the snitch on the first frame. At the last frame, we take the center point of the tracked box in the image plane, and project it to the 3D world plane using a precomputed homography between the image and world planes. The projected 3D point is then converted to the quantized grid position and evaluated for the top-1 accuracy.
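
The projection step in the tracker baseline can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the homography values are made up to exercise the arithmetic (a real homography would be computed from the camera setup). It takes the center of the tracked box in the image plane and maps it to floor coordinates with a 3 × 3 homography in homogeneous coordinates.

```python
def box_center(box):
    """Center (u, v) of a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def project_to_floor(H, u, v):
    """Apply a 3x3 homography H (nested lists) to the image point (u, v)."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w  # divide out the homogeneous coordinate


# Made-up homography, used only to check the arithmetic.
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, -1.0],
     [0.0, 0.0, 2.0]]

u, v = box_center((0.0, 2.0, 2.0, 4.0))  # center is (1.0, 3.0)
print(project_to_floor(H, u, v))         # (1.5, 2.5)
```

The resulting floor point would then be quantized to a grid cell and compared against the ground-truth cell for top-1 accuracy.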

Since most video models are designed to operate on short clips of the whole video, we experiment with two approaches to aggregate the predictions over the whole video:

  • Average pooling (Avg): This is the typical approach used on most benchmarks, where we average the last layer predictions (logits) for each frame/clip of the video.
  • LSTM: We train a 2-layer LSTM over the features from clips in the video. While LSTMs haven’t shown strong improvements on standard video classification benchmarks, we experiment with them here given the temporal nature of our tasks.
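
A small sketch of the average pooling scheme helps show why it can struggle on order-dependent tasks. The code below is a plain-Python illustration with hypothetical function names, not the baseline implementation: it averages per-clip logits and takes the highest-scoring class. Because an average ignores the order of its inputs, shuffling the clips leaves the prediction unchanged, which is the limitation a sequence model such as an LSTM is meant to address.

```python
def average_pool(clip_logits):
    """Average per-class logits over all clips of one video."""
    num_clips = len(clip_logits)
    return [sum(per_class) / num_clips for per_class in zip(*clip_logits)]


def predict(clip_logits):
    """Return the index of the class with the highest pooled logit."""
    pooled = average_pool(clip_logits)
    return max(range(len(pooled)), key=pooled.__getitem__)


# Toy logits for a video with two clips and two classes.
clips = [[1.0, 0.0],
         [0.0, 3.0]]

print(average_pool(clips))                         # [0.5, 1.5]
print(predict(clips))                              # 1
print(predict(clips) == predict(clips[::-1]))      # True: clip order is ignored
```
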
Model                        Top-1 (Avg)   Top-1 (LSTM)
Random                       2.8           2.8
Frame-level RGB model        14.1          25.6
Spatio-temporal RGB model    57.4          60.2
Tracker                      33.9          33.9

Comparing models’ performance on Task 3. Tasks 1 and 2 are presented in the paper.
Tracker failure case

We observed that most models are easily able to solve Task 1, since that only requires short-term temporal reasoning to recognize individual actions. Tasks 2 and 3, on the other hand, pose more of a challenge to video models, given their temporal nature. While full results are provided in the paper, we highlight some key results on the snitch localization task (Task 3) in the attached table. First, we note that the 2D, or frame-level, models perform quite poorly on this task. This is unlike the observation on other standard video datasets, where the frame-level models are quite competitive with spatio-temporal models. Second, we note that a state of the art tracking method (Zhu et al., ECCV’18) is only able to solve about a third of the videos, showing that low-level trackers may not solve this task either (see, for example, the attached video of a tracker failure case). Third, the performance of LSTM aggregation was always better than average pooling, since an LSTM is better suited to capture the temporal nature of the tasks. And finally, we note that the best performance is still fairly low on this task, suggesting scope for future work.

Diagnostics using CATER

CATER allows for diagnosing model performance by splitting the test set based on various parameters. For example, the left graph shows models’ performance with respect to whether the snitch is visible at the end or not, and the right one shows the performance with respect to the frame at which the snitch last moves (out of 300 total frames in each video).

Having close control over the dataset generation process enables us to perform diagnostics impossible with any previous dataset. While full analysis over multiple control parameters (like camera motion, number of objects, etc.) is provided in the paper, we highlight two ablations in the figure above. First, we evaluate the performance of our models with different aggregation schemes, over the cases when the snitch was visible at the end of the video, and when it was contained. As expected, the performance drops when it’s contained, with the tracking based model dropping the most, suggesting that the tracker is not effective at handling the containments and occlusions. In the second ablation, we evaluate the performance as a function of the last time in the video the snitch moves. We observe that the performance of all models consistently drops as the snitch keeps moving throughout the video. Interestingly, the tracker is the most resilient approach here, since it attempts to explicitly keep track of the snitch’s position until the end, unlike the other models that rely on aggregating belief over the snitch’s location over all clips of the video. The following video provides some qualitative examples of the videos where models perform well (easy cases) and the ones where they do not (hard cases).
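
This kind of diagnostic amounts to grouping test videos by a generation parameter and computing accuracy per group. The sketch below assumes a hypothetical record format (`label`, `predicted`, and a metadata field such as `snitch_visible`); the records are toy values invented to exercise the function and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Top-1 accuracy per value of the metadata field `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["predicted"] == record["label"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy records only; labels are grid-cell indices in [0, 36).
records = [
    {"label": 3, "predicted": 3, "snitch_visible": True},
    {"label": 7, "predicted": 7, "snitch_visible": True},
    {"label": 5, "predicted": 5, "snitch_visible": False},
    {"label": 9, "predicted": 2, "snitch_visible": False},
]

print(accuracy_by_group(records, "snitch_visible"))  # {True: 1.0, False: 0.5}
```

The same function could group by any other field recorded at generation time, such as the frame at which the snitch last moved, after binning it into ranges.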

Analysis of which videos are easiest for models and which are the hardest. We find that videos where the snitch suddenly moves in the end are the hardest, corroborating the diagnostic analysis seen earlier.


To conclude, in this work we introduced and used CATER to analyze several leading network designs on hard spatiotemporal reasoning tasks. We found most models struggle on CATER, especially on the snitch localization task which requires long term reasoning. Such temporal reasoning challenges are common in the real world, and solving those would be the cornerstone of the next improvements in machine video understanding. That said, CATER is, by no means, a complete solution to the video understanding problem. Like any other synthetic or simulated dataset, it should be considered in addition to real world benchmarks. One of our findings is that while high-level semantic tasks such as activity recognition may be addressable with current architectures given a richly labeled dataset, “mid-level” tasks such as tracking still pose tremendous challenges, particularly under long-term occlusions and containment. We believe addressing such challenges will enable broader temporal reasoning tasks that capture intentions, goals, and causal behavior.

Additionally, note that CATER is also not limited to video classification. Recent papers have also used CATER for other related tasks like video reconstruction and for learning object permanence. We have released a pre-generated version of CATER, along with full metadata, which can be used to define arbitrarily fine-grained spatio-temporal reasoning tasks. Additionally, we have released the code to generate CATER as well as the baselines discussed above, which can be used to reproduce our experiments or generate a custom variant of CATER for other tasks.

Interested in more details?

Check out the links to the paper, complete codebase with pre-trained models, talk and project webpage below.

Heads Up, Down Under: Sydney Suburb Enhances Livability with Traffic Analytics

With a new university campus nearby and an airport under construction, the city of Liverpool, Australia, 27 kilometers southwest of Sydney, is growing fast.

More than 30,000 people are expected to make a daily commute to its central business district. Liverpool needed to know the possible impact to traffic flow and movement of pedestrians, cyclists and vehicles.

The city already hosts closed-circuit televisions to monitor safety and security. Each CCTV captures lots of video and data that, due to stringent privacy regulations, is mainly combed through after an incident has been reported.

The challenge before the city was to turn this massive dataset into information that could help it run more efficiently, handle an influx of commuters and keep the place liveable for residents — without compromising anyone’s privacy.

To achieve this goal, the city has partnered with the Digital Living Lab of the University of Wollongong. Part of Wollongong’s SMART Infrastructure Facility, the DLL has developed what it calls the Versatile Intelligent Video Analytics platform. VIVA, for short, unlocks data so that owners of CCTV networks can access real-time, privacy-compliant data to make better informed decisions.

VIVA is designed to convert existing infrastructure into edge-computing devices embedded with the latest AI. The platform’s state-of-the-art deep learning algorithms are developed at DLL on the NVIDIA Metropolis platform. Their video analytics deep-learning models are trained using transfer learning to adapt to use cases, optimized via NVIDIA TensorRT software and deployed on NVIDIA Jetson edge AI computers.

“We designed VIVA to process video feeds as close as possible to the source, which is the camera,” said Johan Barthelemy, lecturer at the SMART Infrastructure Facility of the University of Wollongong. “Once a frame has been analyzed using a deep neural network, the outcome is transmitted and the current frame is discarded.”

Disposing of frames maintains privacy as no images are transmitted. It also reduces the bandwidth needed.
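
The analyze, transmit, discard pattern Barthelemy describes can be sketched in a few lines. This is not VIVA's code: `detect`, `transmit`, and the frame representation are hypothetical placeholders, and a real deployment would run a neural network on camera frames. The point is only that the per-frame summary leaves the device while the frame itself is never stored or sent.

```python
from collections import Counter


def summarize_frame(frame, detect):
    """Run the detector and keep only per-class counts, not pixels."""
    labels = detect(frame)  # e.g. ["pedestrian", "cyclist", ...]
    return dict(Counter(labels))


def process_stream(frames, detect, transmit):
    """Analyze each frame, send the summary, and let the frame be discarded."""
    for frame in frames:
        transmit(summarize_frame(frame, detect))
        # The frame is not referenced after this point, so no image data
        # is retained or transmitted.


# Stand-ins for a camera feed and a detector, for illustration only.
fake_frames = [
    {"objects": ["pedestrian", "pedestrian", "cyclist"]},
    {"objects": ["car"]},
]
sent = []
process_stream(fake_frames, detect=lambda f: f["objects"], transmit=sent.append)

print(sent)  # [{'pedestrian': 2, 'cyclist': 1}, {'car': 1}]
```

Sending small count dictionaries instead of images is also what keeps the bandwidth requirement low.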

Beyond city streets like in Liverpool, VIVA has been adapted for a wide variety of applications, such as identifying and tracking wildlife; detecting culvert blockage for stormwater management and flash flood early warnings; and tracking of people using thermal cameras to understand people’s mobility behavior during heat waves. It can also distinguish between firefighters searching a building and other building occupants, helping identify those who may need help to evacuate.

Making Sense of Traffic Patterns

The research collaboration between SMART, Liverpool’s city council and its industry partners is intended to improve the efficiency, effectiveness and accessibility of a range of government services and facilities.

For pedestrians, the project aims to understand where they’re going, their preferred routes and which areas are congested. For cyclists, it’s about the routes they use and ways to improve bicycle usage. For vehicles, understanding movement and traffic patterns, where they stop, and where they park are key.

Understanding mobility within a city formerly required a fleet of costly and fixed sensors, according to Barthelemy. Different models were needed to count specific types of traffic, and manual processes were used to understand how different types of traffic interacted with each other.

Using computer vision on the NVIDIA Jetson TX2 at the edge, the VIVA platform can count the different types of traffic and capture their trajectory and speed. Data is gathered using the city’s existing CCTV network, eliminating the need to invest in additional sensors.

Patterns of movements and points of congestion are identified and predicted to help improve street and footpath layout and connectivity, traffic management and guided pathways. The data has been invaluable in helping Liverpool plan for the urban design and traffic management of its central business district.

Machine Learning Application Built Using NVIDIA Technologies

SMART trained the machine learning applications on its VIVA platform for Liverpool on four workstations powered by a variety of NVIDIA TITAN GPUs, as well as six workstations equipped with NVIDIA RTX GPUs to generate synthetic data and run experiments.

In addition to using open databases such as OpenImage, COCO and Pascal VOC for training, DLL created synthetic data via an in-house application based on the Unity Engine. Synthetic data allows the project to learn from numerous scenarios that might not otherwise be present at any given time, like rainstorms or masses of cyclists.

“This synthetic data generation allowed us to generate 35,000-plus images per scenario of interest under different weather, time of day and lighting conditions,” said Barthelemy. “The synthetic data generation uses ray tracing to improve the realism of the generated images.”

Inferencing is done with NVIDIA Jetson Nano, NVIDIA Jetson TX2 and NVIDIA Jetson Xavier NX, depending on the use case and processing required.

The post Heads Up, Down Under: Sydney Suburb Enhances Livability with Traffic Analytics appeared first on The Official NVIDIA Blog.
