Mitigating Unfair Bias in ML Models with the MinDiff Framework

Posted by Flavien Prost, Senior Software Engineer, and Alex Beutel, Staff Research Scientist, Google Research

The responsible research and development of machine learning (ML) can play a pivotal role in helping to solve a wide variety of societal challenges. At Google, our research reflects our AI Principles, from helping to protect patients from medication errors and improving flood forecasting models, to presenting methods that tackle unfair bias in products, such as Google Translate, and providing resources for other researchers to do the same.

One broad category for applying ML responsibly is the task of classification — systems that sort data into labeled categories. At Google, such models are used throughout our products to enforce policies, ranging from the detection of hate speech to age-appropriate content filtering. While these classifiers serve vital functions, it is also essential that they are built in ways that minimize unfair biases for users.

Today, we are announcing the release of MinDiff, a new regularization technique available in the TF Model Remediation library for effectively and efficiently mitigating unfair biases when training ML models. In this post, we discuss the research behind this technique and explain how it addresses the practical constraints and requirements we’ve observed when incorporating it in Google’s products.

Unfair Biases in Classifiers
To illustrate how MinDiff can be used, consider an example of a product policy classifier that is tasked with identifying and removing text comments that could be considered toxic. One challenge is to make sure that the classifier is not unfairly biased against submissions from a particular group of users, which could result in incorrect removal of content from these groups.

The academic community has laid a solid theoretical foundation for ML fairness, offering a breadth of perspectives on what unfair bias means and on the tensions between different frameworks for evaluating fairness. One of the most common metrics is equality of opportunity, which, in our example, means measuring and seeking to minimize the difference in false positive rate (FPR) across groups. In the example above, this means that the classifier should not be more likely to incorrectly remove safe comments from one group than another. Similarly, the classifier’s false negative rate should be equal between groups. That is, the classifier should not miss toxic comments against one group more than it does for another.

When the end goal is to improve products, it’s important to be able to scale unfair bias mitigation to many models. However, this poses a number of challenges:

  • Sparse demographic data: The original work on equality of opportunity proposed a post-processing approach to the problem, which consisted of assigning each user group a different classifier threshold at serving time to offset biases of the model. However, in practice this is often not possible for many reasons, such as privacy policies. For example, demographics are often collected by users self-identifying and opting in, but while some users will choose to do this, others may choose to opt out or delete data. Even for in-process solutions (i.e., methods that change how a model is trained), one needs to assume that most data will not have associated demographics, and thus needs to make efficient use of the few examples for which demographics are known.
  • Ease of Use: In order for any technique to be adopted broadly, it should be easy to incorporate into existing model architectures, and not be highly sensitive to hyperparameters. While an early approach to incorporating ML fairness principles into applications utilized adversarial learning, we found that it too frequently caused models to degenerate during training, which made it difficult for product teams to iterate and made new product teams wary.
  • Quality: The method for removing unfair biases should also reduce the overall classification performance (e.g., accuracy) as little as possible. Because any decrease in accuracy caused by the mitigation approach could result in the moderation model allowing more toxic comments, striking the right balance is crucial.

MinDiff Framework
We iteratively developed the MinDiff framework over the past few years to meet these design requirements. Because demographic information is so rarely known, we utilize in-process approaches in which the model’s training objective is augmented with an objective specifically focused on removing biases. This new objective is then optimized over the small sample of data with known demographic information. To improve ease of use, we switched from adversarial training to a regularization framework, which penalizes statistical dependency between the model’s predictions and demographic information for non-harmful examples. This encourages the model to equalize error rates across groups, e.g., the rate at which non-harmful examples are misclassified as toxic.

There are several ways to encode this dependency between predictions and demographic information. Our initial MinDiff implementation minimized the correlation between the predictions and the demographic group, which essentially optimized for the average and variance of predictions to be equal across groups, even if the distributions still differ afterward. We have since improved MinDiff further by considering the maximum mean discrepancy (MMD) loss, which is closer to optimizing for the distribution of predictions to be independent of demographics. We have found that this approach is better able to both remove biases and maintain model accuracy.

MinDiff with MMD better closes the FPR gap with less decrease in accuracy (on an academic benchmark dataset).
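
To make the idea concrete, here is a minimal sketch of an MMD-style MinDiff penalty written with plain TensorFlow ops; the function and variable names (e.g., `preds_group_a`, `mindiff_weight`) are illustrative rather than the actual API, and in practice the TF Model Remediation library packages this pattern behind ready-made MinDiff losses and a Keras model wrapper.

```python
import tensorflow as tf

def gaussian_kernel(x, y, bandwidth=0.5):
    # Pairwise Gaussian (RBF) kernel values between two 1-D sets of predictions.
    diff = tf.expand_dims(x, 1) - tf.expand_dims(y, 0)
    return tf.exp(-tf.square(diff) / (2.0 * bandwidth ** 2))

def mmd_penalty(preds_group_a, preds_group_b, bandwidth=0.5):
    # Maximum mean discrepancy between the two groups' prediction distributions.
    k_aa = tf.reduce_mean(gaussian_kernel(preds_group_a, preds_group_a, bandwidth))
    k_bb = tf.reduce_mean(gaussian_kernel(preds_group_b, preds_group_b, bandwidth))
    k_ab = tf.reduce_mean(gaussian_kernel(preds_group_a, preds_group_b, bandwidth))
    return k_aa + k_bb - 2.0 * k_ab

def total_loss(task_loss, preds_group_a, preds_group_b, mindiff_weight=1.5):
    # Primary task loss plus a weighted MinDiff-style penalty computed only on
    # non-harmful examples from each demographic group (weight value is illustrative).
    return task_loss + mindiff_weight * mmd_penalty(preds_group_a, preds_group_b)
```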

To date we have launched modeling improvements across several classifiers at Google that moderate content quality. We went through multiple iterations to develop a robust, responsible, and scalable approach, solving research challenges and enabling broad adoption.

Gaps in classifier error rates are an important set of unfair biases to address, but not the only one that arises in ML applications. For ML researchers and practitioners, we hope this work can further advance research toward addressing even broader classes of unfair biases and the development of approaches that can be used in practical applications. In addition, we hope that the release of the MinDiff library and the associated demos and documentation, along with the tools and experience shared here, can help practitioners improve their models and products.

Acknowledgements
This research effort on ML Fairness in classification was jointly led with Jilin Chen, Shuo Chen, Ed H. Chi, Tulsee Doshi, and Hai Qian. Further, this work was pursued in collaboration with Jonathan Bischof, Qiuwen Chen, Pierre Kreitmann, and Christine Luu. The MinDiff infrastructure was also developed in collaboration with Nick Blumm, James Chen, Thomas Greenspan, Christina Greer, Lichan Hong, Manasi Joshi, Maciej Kula, Summer Misherghi, Dan Nanas, Sean O’Keefe, Mahesh Sathiamoorthy, Catherina Xu, and Zhe Zhao. (All names are listed in alphabetical order of last names.)

5 ways to celebrate TensorFlow’s 5th birthday

Five years ago, we open-sourced TensorFlow, our machine learning framework for research and production. Our goal was to expand access to state-of-the-art machine learning tools so anyone could use them.

Since then, TensorFlow has become the most popular machine learning library in the world, with over 160 million downloads. Seeing so many people use TensorFlow is an incredible and humbling experience, and we’re thankful for the thousands of people outside of Google who have contributed code, created educational content and organized developer events around the world to support TensorFlow and the growing machine learning community.

To celebrate five years of TensorFlow, we’d like to point out a few interactive demos you can try from your browser with a single click, as well as some tutorials that can help you create your own projects. If you’re new to TensorFlow, these are a great way to get a feel for what it can do. And if you like what you see and want to dive a bit deeper, check out the TensorFlow Blog.

Try out some interactive demos powered by machine learning

TensorFlow supports multiple programming languages and environments. Let’s start with a quick tour of JavaScript, and three interactive demos you can try with a click.

TensorFlow.js enables you to write and run machine learning models entirely in the browser. This is important for privacy-preserving applications (no data needs to be sent to a server) and for interactive machine learning programs.

One great example is this iris landmark-tracking program, which supports hands-free interfaces and assistive technologies; you can try the model yourself in your browser (be patient—it may take a few moments to load!).

Animated gif showing a woman tilting her head and the software tracking this by analyzing her iris.

Similar to eye-tracking, you can also use TensorFlow.js to track hand motions.

Animated gif showing a hand counting out numbers and the tracking software tracing this movement.

You only need a webcam for both of these demos, and no data leaves your machine.

Train your own model, no coding necessary

You can train your own model (with no coding required) using the Teachable Machine. It’s a fast, fun, and easy way to create a machine learning model right in your browser. For instance, you could teach a model to recognize images, or sounds that you record using your microphone.

Screenshot of the three project types you can create with Teachable Machine: image project, audio project, or pose project.

Go deeper with tutorials

TensorFlow includes a powerful Python library. To get started using it, here are some tutorials for beginners and experts alike. These tutorials (which contain complete, end-to-end code) span topics from machine learning fundamentals, to computer vision and machine translation—and even show you how to generate artwork with machine learning.

Image shows pink roses.

Image CC-BY by Virginia McMillan.

Bring TensorFlow to mobile apps 

TensorFlow Lite enables you to build machine learning-powered apps on mobile and small embedded devices. A group of engineering students in India used TensorFlow Lite to develop an Android app that provides local air quality information using a smartphone camera.

Photo shows a person holding out their smartphone against a landscape of green trees to analyze air quality.

You can go even smaller, too: TensorFlow Lite Micro lets you run machine learning models on microcontrollers (tiny computers that can fit in the palm of your hand).

Understand how to build responsibly

As billions of people around the world continue to use products and services with machine learning at their core, it’s become increasingly important to design and deploy these systems responsibly. TensorFlow includes a large set of tools and best practices for Responsible AI, including the What-If Tool which tests how machine learning models will work for different people in hypothetical situations.

And there’s much more you can do as well. TensorFlow includes a complete set of tools to power production ML systems, and even supports the latest research in quantum computing.

This is only the beginning, and we’re excited to see what the next five years bring. To learn more about TensorFlow, check out tensorflow.org, read the blog, follow us on social or subscribe to our YouTube Channel.

The Machine Learning Behind Hum to Search

Posted by Christian Frank, Google Research, Zürich

Melodies stuck in your head, often referred to as “earworms,” are a well-known and sometimes irritating phenomenon — once that earworm is there, it can be tough to get rid of it. Research has found that engaging with the original song, whether that’s listening to or singing it, will drive the earworm away. But what if you can’t quite recall the name of the song, and can only hum the melody?

Existing methods to match a hummed melody to its original polyphonic studio recording face several challenges. With lyrics, background vocals and instruments, the audio of a musical or studio recording can be quite different from a hummed tune. By mistake or design, when someone hums their interpretation of a song, often the pitch, key, tempo or rhythm may vary slightly or even significantly. That’s why so many existing approaches to query by humming match the hummed tune against a database of pre-existing melody-only or hummed versions of a song, instead of identifying the song directly. However, this type of approach often relies on a limited database that requires manual updates.

Launched in October, Hum to Search is a new fully machine-learned system within Google Search that allows a person to find a song using only a hummed rendition of it. In contrast to existing methods, this approach produces an embedding of a melody from a spectrogram of a song without generating an intermediate representation. This enables the model to match a hummed melody directly to the original (polyphonic) recordings without the need for a hummed or MIDI version of each track or for other complex hand-engineered logic to extract the melody. This approach greatly simplifies the database for Hum to Search, allowing it to constantly be refreshed with embeddings of original recordings from across the world — even the latest releases.

Background
Many existing music recognition systems convert an audio sample into a spectrogram before processing it, in order to find a good match. However, one challenge in recognizing a hummed melody is that a hummed tune often contains relatively little information, as illustrated by this hummed example of Bella Ciao. The difference between the hummed version and the same segment from the corresponding studio recording can be visualized using spectrograms, seen below:

Visualization of a hummed clip and a matching studio recording.

Given the image on the left, a model needs to locate the audio corresponding to the right-hand image from a collection of over 50M similar-looking images (corresponding to segments of studio recordings of other songs). To achieve this, the model has to learn to focus on the dominant melody, and ignore background vocals, instruments, and voice timbre, as well as differences stemming from background noise or room reverberations. To find by eye the dominant melody that might be used to match these two spectrograms, a person might look for similarities in the lines near the bottom of the above images.

Prior efforts to enable discovery of music, in particular in the context of recognizing recorded music being played in an environment such as a cafe or a club, demonstrated how machine learning might be applied to this problem. Now Playing, released to Pixel phones in 2017, uses an on-device deep neural network to recognize songs without the need for a server connection, and Sound Search further developed this technology to provide a server-based recognition service for faster and more accurate searching of over 100 million songs. The next challenge then was to leverage what was learned from these releases to recognize hummed or sung music from a similarly large library of songs.

Machine Learning Setup
The first step in developing Hum to Search was to modify the music-recognition models used in Now Playing and Sound Search to work with hummed recordings. In principle, many such retrieval systems (e.g., image recognition) work in a similar way. A neural network is trained with pairs of input (here pairs of hummed or sung audio with recorded audio) to produce embeddings for each input, which will later be used for matching to a hummed melody.

Training setup for the neural network

To enable humming recognition, the network should produce embeddings for which pairs of audio containing the same melody are close to each other, even if they have different instrumental accompaniment and singing voices. Pairs of audio containing different melodies should be far apart. In training, the network is provided such pairs of audio until it learns to produce embeddings with this property.

The trained model can then generate an embedding for a tune that is similar to the embedding of the song’s reference recording. Finding the correct song is then only a matter of searching for similar embeddings from a database of reference recordings computed from audio of popular music.
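
As an illustration of that retrieval step, the sketch below performs a brute-force nearest-neighbor search over precomputed reference embeddings using cosine similarity; the array and variable names are hypothetical, and a production system would use an approximate, large-scale search index rather than a full scan.

```python
import numpy as np

def find_best_matches(query_embedding, reference_embeddings, song_ids, top_k=5):
    """Return the top-k songs whose reference segments are closest to the query.

    reference_embeddings: (num_segments, dim) array precomputed from segments of
    original studio recordings; song_ids maps each segment to its song (hypothetical).
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    similarities = refs @ q                    # cosine similarity to every segment
    best = np.argsort(-similarities)[:top_k]   # indices of the most similar segments
    return [(song_ids[i], float(similarities[i])) for i in best]
```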

Training Data
Because training of the model required song pairs (recorded and sung), the first challenge was to obtain enough training data. Our initial dataset consisted of mostly sung music segments (very few of these contained humming). To make the model more robust, we augmented the audio during training, for example by varying the pitch or tempo of the sung input randomly. The resulting model worked well enough for people singing, but not for people humming or whistling.

To improve the model’s performance on hummed melodies we generated additional training data of simulated “hummed” melodies from the existing audio dataset using SPICE, a pitch extraction model developed by our wider team as part of the FreddieMeter project. SPICE extracts the pitch values from given audio, which we then use to generate a melody consisting of discrete audio tones. The very first version of this system transformed this original clip into these tones.

Generating hummed audio from sung audio

We later refined this approach by replacing the simple tone generator with a neural network that generates audio resembling an actual hummed or whistled tune. For example, the network generates this humming example or whistling example from the above sung clip.

As a final step, we augmented the training data by mixing and matching audio samples. For example, if we had similar clips from two different singers, we’d align those two clips with our preliminary models, and were therefore able to show the model an additional pair of audio clips representing the same melody.

Machine Learning Improvements
When training the Hum to Search model, we started with a triplet loss function. This loss has been shown to perform well across a variety of classification tasks, such as image recognition and recorded-music recognition. Given a pair of audio clips corresponding to the same melody (points R and P in embedding space, shown below), triplet loss ignores certain training examples derived from a different melody: those that are too ‘easy’, in that they are already far away from R and P (see point E), and those that are too hard, in that, given the model’s current state of learning, the audio ends up too close to R even though, according to our data, it represents a different melody (see point H).

Example audio segments visualized as points in embedding space
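
For reference, a minimal version of the standard triplet loss described above could look like the following; the margin value is an arbitrary example, and this plain hinge formulation only zeroes out easy negatives (point E), while skipping overly hard negatives (point H) would require an additional mining strategy.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-based triplet loss on squared L2 distances in embedding space.

    anchor, positive: embeddings of two clips of the same melody (R and P).
    negative: embedding of a clip from a different melody.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Easy negatives (point E) already satisfy the margin and contribute zero loss;
    # handling overly hard negatives (point H) needs a mining strategy on top of this.
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```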

We’ve found that we could improve the accuracy of the model by taking these additional training data (points H and E) into account, namely by formulating a general notion of model confidence across a batch of examples: How sure is the model that all the data it has seen can be classified correctly, or has it seen examples that do not fit its current understanding? Based on this notion of confidence, we added a loss that drives model confidence towards 100% across all areas of the embedding space, which led to improvements in our model’s precision and recall.

These changes, in particular the variations, augmentations and superpositions of the training data, enabled the neural network model deployed in Google Search to recognize sung or hummed melodies. The current system reaches a high level of accuracy on a song database that contains over half a million songs that we are continually updating. This song corpus still has room to grow to include more of the world’s many melodies.

Hum to Search in the Google App

To try the feature, open the latest version of the Google app, tap the mic icon and say “what’s this song?” or click the “Search a song” button, after which you can hum, sing, or whistle away! We hope that Hum to Search can help with that earworm of yours, or maybe just help you find and play back a song without having to type its name.

Acknowledgements
The work described here was authored by Alex Tudor, Duc Dung Nguyen, Matej Kastelic‎, Mihajlo Velimirović‎, Stefan Christoph, Mauricio Zuluaga, Christian Frank, Dominik Roblek, and Matt Sharifi. We would like to deeply thank Krishna Kumar, Satyajeet Salgar and Blaise Aguera y Arcas for their ongoing support, as well as all the Google teams we’ve collaborated with to build the full Hum to Search product.

We would also like to thank all our colleagues at Google who donated clips of themselves singing or humming and therefore laid a foundation for this work, as well as Nick Moukhine‎ for building the Google-internal singing donation app. Finally, special thanks to Meghan Danks and Krishna Kumar for their feedback on earlier versions of this post.

Bickey Russell finds inspiration from his native Bangladesh

Welcome to the latest edition of “My Path to Google,” where we talk to Googlers, interns and alumni about how they got to Google, what their roles are like and even some tips on how to prepare for interviews.

Having spent his childhood between London, Milan and Dhaka, Bangladesh, Bickey Russell began his career at Google in sales before pursuing his passion for developing technology to serve under-resourced communities. Today, he’s the founder and leader of Kormo Jobs. Guided by Google’s commitment to our AI Principles, Bickey and his team are helping job seekers across Bangladesh, Indonesia, and India find meaningful work. 

What’s your role at Google?

I founded the Kormo Jobs app and currently lead global product operations for it as well as some other new projects in the Next Billion Users initiative at Google.

I drive Kormo Jobs’ go-to-market approach. This involves things like working with employers to use Kormo Jobs to post openings on our platform and building up a community of job seekers who get value from Kormo Jobs as they look for work and grow their careers.

Students holding up pamphlets about Kormo

Participants at a vocational training institute in Jakarta learning about Kormo Jobs.

You’ve held a few different roles in multiple offices. How did you end up working on Kormo Jobs? 

I’m super passionate about the positive impact technology can have on society in countries like my native Bangladesh. Throughout my career at Google I have moved from business analysis to sales, partnerships management and leadership roles, and worked in London, Mountain View and currently, Singapore. Despite all that change, I have always been involved with initiatives to make Google products work better in Bangladesh—ranging from Maps to Bangla language capabilities. 

In 2016, I was fortunate to be able to collaborate with colleagues and pitch an app idea I had to Google’s internal innovation incubator, Area 120. We were hoping to use machine learning to build a better way to help people in Bangladesh get jobs in more blue-collar sectors. Our small team was fortunate to join the Area 120 program, and after just three years, our app became a Google product. Kormo Jobs is live in Bangladesh, India and Indonesia. 

And what were you up to before joining Google?

I grew up in London, Milan and Dhaka, spending middle school and high school in Dhaka before returning to London for university, where I did a degree in geography.

I worked in retail throughout my time in university. The highlight was probably selling band t-shirts in Camden Market! My first full-time job was working as a researcher, and then as a business analyst. 

Can you tell us about your decision to apply to Google?

I was fascinated by the Internet, and I wanted to join a fast-paced company that has an entrepreneurial and open working culture. Google’s vision was majorly inspiring and so attractive to me at the time, and it still is. I felt that if I could join a company like that, I could make an impact.

I applied via the Google careers page. The interview day was quite nerve-wracking, but actually a lot of fun. I remember talking a lot about my interest in cricket, plus my favorite websites and Google products. I was also asked to propose a plan on how we might develop the market for Google AdWords in the UK for a particular industry. That was a challenge, but I guess I did okay!

Bickey presenting on a large stage with a display of the Kormo Jobs app on a screen behind him.

Bickey presenting the Kormo Jobs app at a Google India event.

Can you tell us about the resources you used to prepare for your interview or role?

I didn’t know anyone who worked at Google at the time, but since I knew the job was to join the advertising business in the UK, I reached out and talked to a lot of my network in the advertising and media space to prepare. Plus, I used Search to do research!

Do you have any tips you’d like to share with aspiring Googlers?

I would say that aspiring Googlers should really think about why they are interested in the specific role they are applying for. I often interview candidates who are keen to work at Google but haven’t done enough preparation on why they would be a good fit for the role and team that they have applied to join.

Bickey working with an employer using Kormo Jobs.

What inspires you to log in every day?

Having been at the company a long time, I’ve seen firsthand countless times the impact technology can have on people and society at large.

I am inspired by the fact that Google’s AI Principles guide us to make socially beneficial AI systems—and that I get to work with an amazing team at Kormo Jobs to put this principle into practice every day. We invest in applying our tech capability to solving important problems—finding work, earning money, building a career—for people in places like my hometown of Dhaka.

Every day I get excited when I see that we’ve helped more people get a job than we did the day before.

Improving On-Device Speech Recognition with VoiceFilter-Lite

Posted by Quan Wang, Software Engineer, Google Research

Voice assistive technologies, which enable users to employ voice commands to interact with their devices, rely on accurate speech recognition to ensure responsiveness to a specific user. But in many real-world use cases, the input to such technologies often consists of overlapping speech, which poses great challenges to many speech recognition algorithms. In 2018, we published a VoiceFilter system, which leverages Google’s Voice Match to personalize interaction with assistive technology by allowing people to enroll their voices.


While the VoiceFilter approach is highly successful, achieving a better source-to-distortion ratio (SDR) than conventional approaches, efficient on-device streaming speech recognition requires addressing constraints such as model size, CPU and memory limitations, battery usage, and latency.

In “VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition”, we present an update to VoiceFilter for on-device use that can significantly improve speech recognition in overlapping speech by leveraging the enrolled voice of a selected speaker. Importantly, this model can be easily integrated with existing on-device speech recognition applications, allowing the user to access voice assistive features under extremely noisy conditions even if an internet connection is unavailable. Our experiments show that a 2.2MB VoiceFilter-Lite model provides a 25.1% improvement to the word error rate (WER) on overlapping speech.


Improving On-Device Speech Recognition
While the original VoiceFilter system was very successful at separating a target speaker’s speech signal from other overlapping sources, its model size, computational cost and latency are not feasible for speech recognition on mobile devices.

The new VoiceFilter-Lite system has been carefully designed to fit on-device applications. Instead of processing audio waveforms, VoiceFilter-Lite takes exactly the same input features as the speech recognition model (stacked log Mel-filterbanks), and directly enhances these features by filtering out components not belonging to the target speaker in real time. Together with several optimizations on network topologies, the number of runtime operations is drastically reduced. After quantizing the neural network with the TensorFlow Lite library, the model size is only 2.2 MB, which fits most on-device applications.

To train the VoiceFilter-Lite model, the filterbanks of the noisy speech are fed as input to the network together with an embedding vector that represents the identity of the target speaker (i.e., a d-vector). The network predicts a mask that is element-wise multiplied to the input to produce enhanced filterbanks. A loss function is defined to minimize the difference between the enhanced filterbanks and the filterbanks from the clean speech during training.

Model architecture of the VoiceFilter-Lite system.
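
The sketch below illustrates the masking idea described above in Keras; the layer types, sizes, and input shapes are made-up placeholders rather than the actual VoiceFilter-Lite architecture.

```python
import tensorflow as tf

# Illustrative sizes only; these are not the production shapes or layer choices.
num_frames, num_mel_bins, dvector_dim = 100, 128, 256

noisy_filterbanks = tf.keras.Input(shape=(num_frames, num_mel_bins))
speaker_dvector = tf.keras.Input(shape=(dvector_dim,))

# Broadcast the speaker embedding to every frame and concatenate with the features.
dvector_tiled = tf.keras.layers.RepeatVector(num_frames)(speaker_dvector)
x = tf.keras.layers.Concatenate(axis=-1)([noisy_filterbanks, dvector_tiled])
x = tf.keras.layers.LSTM(256, return_sequences=True)(x)

# Predict a per-frame, per-bin mask in [0, 1] and apply it to the noisy input.
mask = tf.keras.layers.Dense(num_mel_bins, activation="sigmoid")(x)
enhanced = tf.keras.layers.Multiply()([noisy_filterbanks, mask])

model = tf.keras.Model([noisy_filterbanks, speaker_dvector], enhanced)
# Training target: filterbanks computed from the clean, target-speaker-only speech.
model.compile(optimizer="adam", loss="mse")
```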

VoiceFilter-Lite is a plug-and-play model, which allows the application in which it’s implemented to easily bypass it if the speaker did not enroll their voice. This also means that the speech recognition model and the VoiceFilter-Lite model can be separately trained and updated, which largely reduces engineering complexity in the deployment process.

As a plug-and-play model, VoiceFilter-Lite can be easily bypassed if the speaker did not enroll their voice.

Addressing the Challenge of Over-Suppression
When speech separation models are used for improving speech recognition, two types of error could occur: under-suppression, when the model fails to filter out noisy components from the signal; and over-suppression, when the model fails to preserve useful signal, resulting in some words being dropped from the recognized text. Over-suppression is especially problematic since modern speech recognition models are usually already trained with extensively augmented data (such as room simulation and SpecAugment), and thus are more robust to under-suppression.

VoiceFilter-Lite addresses the over-suppression issue with two novel approaches. First, it uses an asymmetric loss during the training process, such that the model is less tolerant to over-suppression than under-suppression. Second, it predicts the type of noise at runtime, and adaptively adjusts the suppression strength according to this prediction.

VoiceFilter-Lite adaptively applies stronger suppression strength when overlapping speech is detected.
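
As a rough illustration of the first approach, an asymmetric loss can be built by weighting the two error directions differently; the formulation and weight below are simplified examples, not the exact loss used in VoiceFilter-Lite.

```python
import tensorflow as tf

def asymmetric_l2_loss(enhanced, clean, over_suppression_weight=10.0):
    """Penalize over-suppression more heavily than under-suppression.

    Over-suppression: enhanced features fall below the clean target (signal removed).
    Under-suppression: enhanced features stay above the clean target (noise kept).
    The weighting constant is a made-up example value.
    """
    diff = clean - enhanced
    over = tf.maximum(diff, 0.0)    # positive where the model removed too much signal
    under = tf.maximum(-diff, 0.0)  # positive where residual noise remains
    return tf.reduce_mean(over_suppression_weight * tf.square(over) + tf.square(under))
```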

With these two solutions, the VoiceFilter-Lite model retains great performance on streaming speech recognition in other scenarios, such as single-speaker speech under quiet or various noise conditions, while still providing a significant improvement on overlapping speech. From our experiments, we observed a 25.1% improvement in word error rate after the 2.2MB VoiceFilter-Lite model is applied to additive overlapping speech. For reverberant overlapping speech, a more challenging task that simulates far-field devices such as smart home speakers, we also observed a 14.7% improvement in word error rate with VoiceFilter-Lite.

Future Work
While VoiceFilter-Lite has shown great promise for various on-device speech applications, we are also exploring several other directions to make VoiceFilter-Lite more useful. First, our current model is trained and evaluated with English speech only. We are excited about adopting the same technology to improve speech recognition for more languages. Second, we would like to directly optimize the speech recognition loss during the training of VoiceFilter-Lite, which can potentially further improve speech recognition beyond overlapping speech.

Acknowledgements
The research described in this post represents joint efforts from multiple teams within Google. Contributors include Quan Wang, Ignacio Lopez Moreno, Mert Saglam, Kevin Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Philip Chao, Sinan Akay, John Han, Stephen Wu, Hannah Muckenhirn, Ye Jia, Zelin Wu, Yiteng Huang, Marily Nika, Jaclyn Konzelmann, Nino Tasca, and Alexander Gruenstein.

Announcing the Objectron Dataset

Posted by Adel Ahmadyan and Liangkai Zhang, Software Engineers, Google Research

The state of the art in machine learning (ML) has achieved exceptional accuracy on many computer vision tasks solely by training models on photos. Building upon these successes and advancing 3D object understanding has great potential to power a wider range of applications, such as augmented reality, robotics, autonomy, and image retrieval. For example, earlier this year we released MediaPipe Objectron, a set of real-time 3D object detection models designed for mobile devices, which were trained on a fully annotated, real-world 3D dataset, that can predict objects’ 3D bounding boxes.

Yet, understanding objects in 3D remains a challenging task due to the lack of large real-world datasets compared to 2D tasks (e.g., ImageNet, COCO, and Open Images). To empower the research community for continued advancement in 3D object understanding, there is a strong need for the release of object-centric video datasets, which capture more of the 3D structure of an object, while matching the data format used for many vision tasks (i.e., video or camera streams), to aid in the training and benchmarking of machine learning models.

Today, we are excited to release the Objectron dataset, a collection of short, object-centric video clips capturing a larger set of common objects from different angles. Each video clip is accompanied by AR session metadata that includes camera poses and sparse point-clouds. The data also contain manually annotated 3D bounding boxes for each object, which describe the object’s position, orientation, and dimensions. The dataset consists of 15K annotated video clips supplemented with over 4M annotated images collected from a geo-diverse sample (covering 10 countries across five continents).

Example videos in the Objectron dataset.

A 3D Object Detection Solution
Along with the dataset, we are also sharing a 3D object detection solution for four categories of objects — shoes, chairs, mugs, and cameras. These models are released in MediaPipe, Google’s open source framework for cross-platform customizable ML solutions for live and streaming media, which also powers ML solutions like on-device real-time hand, iris and body pose tracking.

Sample results of 3D object detection solution running on mobile.

In contrast to the previously released single-stage Objectron model, these newest versions utilize a two-stage architecture. The first stage employs the TensorFlow Object Detection model to find the 2D crop of the object. The second stage then uses the image crop to estimate the 3D bounding box while simultaneously computing the 2D crop of the object for the next frame, so that the object detector does not need to run every frame. The second stage 3D bounding box predictor runs at 83 FPS on Adreno 650 mobile GPU.

Diagram of a reference 3D object detection solution.

Evaluation Metric for 3D Object Detection
With ground truth annotations, we evaluate the performance of 3D object detection models using 3D intersection over union (IoU) similarity statistics, a commonly used metric for computer vision tasks, which measures how close the bounding boxes are to the ground truth.

We propose an algorithm for computing accurate 3D IoU values for general 3D-oriented boxes. First, we compute the intersection points between the faces of the two boxes using the Sutherland-Hodgman polygon clipping algorithm. This is similar to frustum culling, a technique used in computer graphics. The volume of the intersection is computed from the convex hull of all the clipped polygons. Finally, the IoU is computed from the volume of the intersection and the volume of the union of the two boxes. We are releasing the evaluation metrics source code along with the dataset.

Computing the 3D intersection over union using the polygon clipping algorithm. Left: Compute the intersection points of each face by clipping the polygon against the box. Right: Compute the volume of intersection by computing the convex hull of all intersection points (green).
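
Assuming the clipped intersection points have already been computed, the final volume and IoU step can be sketched with SciPy’s convex hull as follows; the function signature is illustrative and not part of the released evaluation code.

```python
import numpy as np
from scipy.spatial import ConvexHull

def iou_3d(box1_corners, box2_corners, intersection_points):
    """Final IoU step, assuming the Sutherland-Hodgman clipping has already been done.

    box1_corners, box2_corners: (8, 3) arrays of oriented-box corner coordinates.
    intersection_points: (N, 3) array of clipped polygon vertices (illustrative input).
    """
    vol1 = ConvexHull(box1_corners).volume
    vol2 = ConvexHull(box2_corners).volume
    if len(intersection_points) < 4:
        return 0.0  # too few points for a 3D volume (degenerate/coplanar cases need care)
    inter_vol = ConvexHull(intersection_points).volume
    return inter_vol / (vol1 + vol2 - inter_vol)
```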

Dataset Format
The technical details of the Objectron dataset, including usage and tutorials, are available on the dataset website. The dataset includes bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, and is stored in the objectron bucket on Google Cloud storage with the following assets:

  • The video sequences
  • The annotation labels (3D bounding boxes for objects)
  • AR metadata (such as camera poses, point clouds, and planar surfaces)
  • Processed dataset: shuffled version of the annotated frames, in tf.Example format for images and SequenceExample format for videos.
  • Supporting scripts to run evaluation based on the metric described above
  • Supporting scripts to load the data into TensorFlow, PyTorch, and Jax and to visualize the dataset, including “Hello World” examples

With the dataset, we are also open-sourcing a data pipeline to parse the dataset in the popular TensorFlow, PyTorch, and Jax frameworks. Example Colab notebooks are also provided.
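
A minimal example of reading such records with tf.data might look like the sketch below; the file paths and feature keys are assumptions for illustration, and the actual schema is documented on the dataset website and in the released loading scripts.

```python
import tensorflow as tf

# Hypothetical local copies of Objectron TFRecord shards; the real files live in the
# "objectron" Google Cloud Storage bucket, and the exact file names and feature keys
# are documented on the dataset website.
files = tf.data.Dataset.list_files("/tmp/objectron/chair/*.tfrecord")
dataset = tf.data.TFRecordDataset(files)

# Illustrative feature spec -- the key names below are assumptions, not the real schema.
feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "object/label": tf.io.VarLenFeature(tf.int64),
}

def parse(serialized_example):
    example = tf.io.parse_single_example(serialized_example, feature_spec)
    image = tf.io.decode_jpeg(example["image/encoded"], channels=3)
    image = tf.image.resize(image, (480, 640))  # uniform size so batching works
    return image, example["object/label"]

dataset = dataset.map(parse).batch(8)
```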

By releasing this Objectron dataset, we hope to enable the research community to push the limits of 3D object geometry understanding. We also hope to foster new research and applications, such as view synthesis, improved 3D representation, and unsupervised learning. Stay tuned for future activities and developments by joining our mailing list and visiting our github page.

Acknowledgements
The research described in this post was done by Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, Mogan Shieh, Ryan Hickman, Buck Bourdon, Alexander Kanaukou, Chuo-Ling Chang, Matthias Grundmann, ‎and Tom Funkhouser. We thank Aliaksandr Shyrokau, Sviatlana Mialik, Anna Eliseeva, and the annotation team for their high quality annotations. We also would like to thank Jonathan Huang and Vivek Rathod for their guidance on TensorFlow Object Detection API.

Halodoc uses AI to improve how doctors receive feedback

Due to Indonesia’s vast size and population, timely and reliable access to healthcare can sometimes be a challenge. Halodoc aims to change that with a mobile-first telemedicine platform that connects Indonesians to doctors and helps them arrange appointments, medicine deliveries and tests.

What’s distinctive about the Halodoc platform is that it draws on human-centered artificial intelligence: a promising new area of research that uses continuous human feedback to improve how AI systems work, and provides a better experience for the people who rely on those systems. 

With support from Google’s Late Stage Accelerator, a program that assists high-potential startups, we assembled a team of doctors, data scientists, engineers, product managers and researchers to determine how technology could support Indonesian doctors’ work. One particular approach the team identified was using AI to replicate the mentoring and feedback that junior doctors receive from more experienced colleagues in hospitals—a process that’s important to improving quality of care, but is hard to reproduce on a larger scale.  

We set out to create an easy way to provide feedback in virtual health, and worked with Google’s machine learning experts in the Late-Stage Accelerator to determine the best approach. With Google’s guidance, Halodoc’s engineers applied Natural Language Processing in Bahasa Indonesia to measure, rank, and provide insights that can inform doctors’ decisions across the country—using thousands of consultations to train their machine learning models. 

When doctors open the Halodoc app, they see information on how they performed based on their response time and quality index metrics, along with suggested actions on how they can improve their consultation quality. They also have the option of receiving further feedback and coaching from more senior doctors if needed.

Right now, more than five percent of Indonesians use Halodoc’s platform. As a result of applying AI principles to improve the quality of care that patients experience, our app ratings have increased from 4.5 to 4.8 stars in less than six months, while our overall doctor scores have improved by 64 percent.

Halodoc's app interface.

Halodoc’s telemedicine app enables doctors to deliver personalized feedback with assistance from ML-enabled insights that improve patient care.

From here, with Google’s help, we hope to continue simplifying Indonesia’s healthcare infrastructure and advance the application of AI in healthcare globally.

Background Features in Google Meet, powered by Web ML

Posted by Tingbo Hou and Tyler Mullen, Software Engineers, Google Research

Video conferencing is becoming ever more critical in people’s work and personal lives. Improving that experience with privacy enhancements or fun visual touches can help center our focus on the meeting itself. As part of this goal, we recently announced ways to blur and replace your background in Google Meet, which use machine learning (ML) to better highlight participants regardless of their surroundings. Whereas other solutions require installing additional software, Meet’s features are powered by cutting-edge web ML technologies built with MediaPipe that work directly in your browser — no extra steps necessary. One key goal in developing these features was to provide real-time, in-browser performance on almost all modern devices, which we accomplished by combining efficient on-device ML models, WebGL-based rendering, and web-based ML inference via XNNPACK and TFLite.

Background blur and background replacement, powered by MediaPipe on the web.

Overview of Our Web ML Solution
The new features in Meet are developed with MediaPipe, Google’s open source framework for cross-platform customizable ML solutions for live and streaming media, which also powers ML solutions like on-device real-time hand, iris and body pose tracking.

A core need for any on-device solution is to achieve high performance. To accomplish this, MediaPipe’s web pipeline leverages WebAssembly, a low-level binary code format designed specifically for web browsers that improves speed for compute-heavy tasks. At runtime, the browser converts WebAssembly instructions into native machine code that executes much faster than traditional JavaScript code. In addition, Chrome 84 recently introduced support for WebAssembly SIMD, which processes multiple data points with each instruction, resulting in a performance boost of more than 2x.

Our solution first processes each video frame by segmenting a user from their background (more about our segmentation model later in the post) utilizing ML inference to compute a low resolution mask. Optionally, we further refine the mask to align it with the image boundaries. The mask is then used to render the video output via WebGL2, with the background blurred or replaced.

WebML Pipeline: All compute-heavy operations are implemented in C++/OpenGL and run within the browser via WebAssembly.

In the current version, model inference is executed on the client’s CPU for low power consumption and widest device coverage. To achieve real-time performance, we designed efficient ML models with inference accelerated by the XNNPACK library, the first inference engine specifically designed for the novel WebAssembly SIMD specification. Accelerated by XNNPACK and SIMD, the segmentation model can run in real-time on the web.

Enabled by MediaPipe’s flexible configuration, the background blur/replace solution adapts its processing based on device capability. On high-end devices it runs the full pipeline to deliver the highest visual quality, whereas on low-end devices it continues to perform at speed by switching to compute-light ML models and bypassing the mask refinement.

Segmentation Model
On-device ML models need to be ultra lightweight for fast inference, low power consumption, and small download size. For models running in the browser, the input resolution greatly affects the number of floating-point operations (FLOPs) necessary to process each frame, and therefore needs to be small as well. We downsample the image to a smaller size before feeding it to the model. Recovering a segmentation mask as fine as possible from a low-resolution image adds to the challenges of model design.

The overall segmentation network has a symmetric structure with respect to encoding and decoding, while the decoder blocks (light green) also share a symmetric layer structure with the encoder blocks (light blue). Specifically, channel-wise attention with global average pooling is applied in both encoder and decoder blocks, which is friendly to efficient CPU inference.

Model architecture with MobileNetV3 encoder (light blue), and a symmetric decoder (light green).

We use a modified MobileNetV3-small as the encoder, which has been tuned by network architecture search for the best performance with low resource requirements. To reduce the model size by 50%, we exported our model to TFLite using float16 quantization, resulting in a slight loss in weight precision but with no noticeable effect on quality. The resulting model has 193K parameters and is only 400KB in size.
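
For readers curious about the export step, float16 post-training quantization with the TFLite converter generally looks like the following sketch; the placeholder model here merely stands in for the trained segmentation network.

```python
import tensorflow as tf

# Placeholder standing in for the trained segmentation network (not the real model).
segmentation_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                           input_shape=(96, 160, 3)),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

# Float16 post-training quantization roughly halves the stored weight size.
converter = tf.lite.TFLiteConverter.from_keras_model(segmentation_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("segmentation_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```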

Rendering Effects
Once segmentation is complete, we use OpenGL shaders for video processing and effect rendering, where the challenge is to render efficiently without introducing artifacts. In the refinement stage, we apply a joint bilateral filter to smooth the low resolution mask.

Rendering effects with artifacts reduced. Left: Joint bilateral filter smooths the segmentation mask. Middle: Separable filters remove halo artifacts in background blur. Right: Light wrapping in background replace.

The blur shader simulates a bokeh effect by adjusting the blur strength at each pixel proportionally to the segmentation mask values, similar to the circle-of-confusion (CoC) in optics. Pixels are weighted by their CoC radii, so that foreground pixels will not bleed into the background. We implemented separable filters for the weighted blur, instead of the popular Gaussian pyramid, as it removes halo artifacts surrounding the person. The blur is performed at a low resolution for efficiency, and blended with the input frame at the original resolution.

Background blur examples.
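
To illustrate the compositing idea in a simpler setting, the sketch below blends a blurred copy of the frame with the original, using the segmentation mask as a per-pixel alpha; unlike the shader-based pipeline described above, this is a plain CPU example with OpenCV and does not implement the CoC-weighted separable filters.

```python
import cv2
import numpy as np

def blur_background(frame, mask, kernel_size=21):
    """Alpha-blend the frame with a blurred copy, using the person mask as alpha.

    frame: HxWx3 uint8 image; mask: HxW float array in [0, 1], where 1 = person.
    """
    blurred = cv2.GaussianBlur(frame, (kernel_size, kernel_size), 0)
    alpha = mask[..., np.newaxis]  # expand to HxWx1 so it broadcasts over channels
    composite = alpha * frame.astype(np.float32) + (1.0 - alpha) * blurred.astype(np.float32)
    return composite.astype(np.uint8)
```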

For background replacement, we adopt a compositing technique, known as light wrapping, for blending segmented persons and customized background images. Light wrapping helps soften segmentation edges by allowing background light to spill over onto foreground elements, making the compositing more immersive. It also helps minimize halo artifacts when there is a large contrast between the foreground and the replaced background.

Background replacement examples.

Performance
To optimize the experience for different devices, we provide model variants at multiple input sizes (i.e., 256×144 and 160×96 in the current release), automatically selecting the best according to available hardware resources.

We evaluated the speed of model inference and the end-to-end pipeline on two common devices: MacBook Pro 2018 with 2.2 GHz 6-Core Intel Core i7, and Acer Chromebook 11 with Intel Celeron N3060. For 720p input, the MacBook Pro can run the higher-quality model at 120 FPS and the end-to-end pipeline at 70 FPS, while the Chromebook runs inference at 62 FPS with the lower-quality model and 33 FPS end-to-end.

Model     FLOPs   Device               Model Inference     Pipeline
256×144   64M     MacBook Pro 2018     8.3ms (120 FPS)     14.3ms (70 FPS)
160×96    27M     Acer Chromebook 11   16.1ms (62 FPS)     30ms (33 FPS)
Model inference speed and end-to-end pipeline latency on high-end (MacBook Pro) and low-end (Chromebook) laptops.

For quantitative evaluation of model accuracy, we adopt the popular metrics of intersection-over-union (IOU) and boundary F-measure. Both models achieve high quality, especially given how lightweight the networks are:

Model     IOU       Boundary F-measure
256×144   93.58%    0.9024
160×96    90.79%    0.8542
Evaluation of model accuracy, measured by IOU and boundary F-measure.

We also release the accompanying Model Card for our segmentation models, which details our fairness evaluations. Our evaluation data contains images from 17 geographical subregions of the globe, with annotations for skin tone and gender. Our analysis shows that the model is consistent in its performance across the various regions, skin tones, and genders, with only small deviations in IOU metrics.

Conclusion
We introduced a new in-browser ML solution for blurring and replacing your background in Google Meet. With this, ML models and OpenGL shaders can run efficiently on the web. The developed features achieve real-time performance with low power consumption, even on low-power devices.

Acknowledgments
Special thanks to the people who worked on this project, in particular Sebastian Jansson, Rikard Lundmark, Stephan Reiter, Fabian Bergmark, Ben Wagner, Stefan Holmer, Dan Gunnarson, Stéphane Hulaud and to all our team members who worked on the technology with us: Siargey Pisarchyk, Karthik Raveendran, Chris McClanahan, Marat Dukhan, Frank Barchard, Ming Guang Yong, Chuo-Ling Chang, Michael Hays, Camillo Lugaresi, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich, and Matthias Grundmann.

Experimenting with Automatic Video Creation From a Web Page

Posted by Peggy Chi, Senior Research Scientist, and Irfan Essa, Senior Staff Research Scientist, Google Research

At Google, we’re actively exploring how people can use creativity tools powered by machine learning and computational methods when producing multimedia content, from creating music and reframing videos, to drawing and more. One creative process in particular, video production, can especially benefit from such tools, as it requires a series of decisions about what content is best suited to a target audience, how to position the available assets within the field of view, and what temporal arrangement will yield the most compelling narrative. But what if one could leverage existing assets, such as a website, to get a jump-start on video creation? Businesses commonly host websites that contain rich visual representations of their services or products, all of which could be repurposed for other multimedia formats, such as videos, potentially giving those without extensive resources the ability to reach a broader audience.

In “Automatic Video Creation From a Web Page”, published at UIST 2020, we introduce URL2Video, a research prototype pipeline to automatically convert a web page into a short video, given temporal and visual constraints provided by the content owner. URL2Video extracts assets (text, images, or videos) and their design styles (including fonts, colors, graphical layouts, and hierarchy) from HTML sources and organizes the visual assets into a sequence of shots, while maintaining a look-and-feel similar to the source page. Given a user-specified aspect ratio and duration, it then renders the repurposed materials into a video that is ideal for product and service advertising.

URL2Video Overview
Assume a user provides an URL to a web page that illustrates their business. The URL2Video pipeline automatically selects key content from the page and decides the temporal and visual presentation of each asset, based on a set of heuristics derived from an interview study with designers who were familiar with web design and video ad creation. These designer-informed heuristics capture common video editing styles, including content hierarchy, constraining the amount of information in a shot and its time duration, providing consistent color and style for branding, and more. Using this information, the URL2Video pipeline parses a web page, analyzing the content and selecting visually salient text or images while preserving their design styles, which it organizes according to the video specifications provided by the user.

By extracting the structural content and design from the input web page, URL2Video makes automatic editing decisions to present key messages in a video. It considers the temporal (e.g., the duration in seconds) and spatial (e.g., the aspect ratio) constraints of the output video defined by users.

Webpage Analysis
Given a webpage URL, URL2Video extracts document object model (DOM) information and multimedia materials. For the purposes of our research prototype, we limited the domain to static web pages that contain salient assets and headings preserved in an HTML hierarchy that follows recent web design principles, which encourage the use of prominent elements, distinct sections, and an order of visual focus that guides readers in perceiving information. URL2Video identifies such visually-distinguishable elements as a candidate list of asset groups, each of which may contain a heading, a product image, detailed descriptions, and call-to-action buttons, and captures both the raw assets (text and multimedia files) and detailed design specifications (HTML tags, CSS styles, and rendered locations) for each element. It then ranks the asset groups by assigning each a priority score based on their visual appearance and annotations, including their HTML tags, rendered sizes, and ordering shown on the page. In this way, an asset group that occupies a larger area at the top of the page receives a higher score.

Constraints-Based Asset Selection
We consider two goals when composing a video: (1) each video shot should provide concise information, and (2) the visual design should be consistent with the source page. Based on these goals and the video constraints provided by the user, including the intended video duration (in seconds) and aspect ratio (commonly 16:9, 4:3, 1:1, etc.), URL2Video automatically selects and orders the asset groups to optimize the total priority score. To make the content concise, it presents only dominant elements from a page, such as a headline and a few multimedia assets. It constrains the duration of each visual element for viewers to perceive the content. In this way, a short video highlights the most salient information from the top of the page, and a longer video contains more campaigns or products.
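
A toy version of this constraint-based selection could look like the following; the dictionary fields and shot-length bounds are invented for illustration and are much simpler than URL2Video’s designer-informed heuristics.

```python
def select_asset_groups(asset_groups, video_duration_s, min_shot_s=2.0, max_shot_s=5.0):
    """Toy constraint-based selection; the real heuristics are far richer.

    Each asset group is a dict with illustrative fields such as "priority"
    (precomputed score) and "page_order" (position on the source page).
    """
    selected, remaining = [], video_duration_s
    # Greedily keep the highest-priority groups until the time budget runs out.
    for group in sorted(asset_groups, key=lambda g: g["priority"], reverse=True):
        if remaining < min_shot_s:
            break
        shot_length = min(max_shot_s, remaining)
        selected.append({**group, "shot_length_s": shot_length})
        remaining -= shot_length
    # Present the chosen shots in page order to preserve the content hierarchy.
    return sorted(selected, key=lambda g: g["page_order"])
```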

Scene Composition & Video Rendering
Given an ordered list of assets based on the DOM hierarchy, URL2Video follows the design heuristics obtained from interview studies to make decisions about both the temporal and spatial arrangement to present the assets in individual shots. It transfers the graphical layout of elements into the video’s aspect ratio, and applies the style choices including fonts and colors. To make a video more dynamic and engaging, it adjusts the presentation timing of assets. Finally, it renders the content into a video in the MPEG-4 container format.

User Control
The interface to the research prototype allows the user to review the design attributes in each video shot extracted from the source page, reorder the materials, change the detailed design, such as colors and fonts, and adjust the constraints to generate a new video.

In URL2Video’s authoring interface (left), users specify the input URL to a source page, size of the target page view, and the output video parameters. URL2Video analyzes the web page and extracts major visual components. It composes a series of scenes and visualizes the key frames as a storyboard. These components are rendered into an output video that satisfies the input temporal and spatial constraints. Users can playback the video, examine the design attributes (bottom-right), and make adjustments to generate video variation, such as reordering the scenes (top-right).

URL2Video Use Cases
We demonstrate the performance of the end-to-end URL2Video pipeline on a variety of existing web pages. Below we highlight an example result where URL2Video converts a page that embeds multiple short video clips into a 12-second output video. Note how the pipeline makes automatic editing decisions on font and color choices, timing, and content ordering in a video captured from the source page.

URL2Video identifies key content from our Google Search introduction page (top), including headings and video assets. It converts them into a video by considering the presentation flow, the source design and the output constraints (a 12-second landscape video; bottom).

The video below provides further demonstration:

To evaluate the automatically-generated videos, we conducted a user study with designers at Google. Our results show that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.

Next steps
While this current research focuses on the visual presentation, we are developing new techniques that support the audio track and a voiceover in video editing. Ultimately, we envision a future where creators focus on making high-level decisions and an ML model interactively suggests detailed temporal and graphical edits for final video creation on multiple platforms.

Acknowledgments
We give great thanks to our paper co-authors, Zheng Sun (Research) and Katrina Panovich (YouTube). We would also like to thank our colleagues who contributed to URL2Video (in alphabetical order of last name): Jordan Canedy, Brian Curless, Nathan Frey, Madison Le, Alireza Mahdian, Justin Parra, Emily Ryan, Mogan Shieh, Sandor Szego, and Weilong Yang. We are grateful for the support of our leadership, Tomas Izo, Rahul Sukthankar, and Jay Yagnik.

Estimating the Impact of Training Data with Reinforcement Learning

Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Cloud AI Team, Google Research

Recent work suggests that not all data samples are equally useful for training, particularly for deep neural networks (DNNs). Indeed, if a dataset contains low-quality or incorrectly labeled data, one can often improve performance by removing a significant portion of training samples. Moreover, in cases where there is a mismatch between the train and test datasets (e.g., due to differences in train and test location or time), one can also achieve higher performance by carefully restricting the training set to the samples most relevant for the test scenario. Because these scenarios are so common, accurately quantifying the value of training samples has great potential for improving model performance on real-world datasets.

Top: Examples of low-quality samples (noisy/crowd-sourced); Bottom: Examples of a train and test mismatch.

In addition to improving model performance, assigning a quality value to individual data can also enable new use cases. It can be used to suggest better practices for data collection, e.g., what kinds of additional data would benefit the most, and can be used to construct large-scale training datasets more efficiently, e.g., by web searching using the labels as keywords and filtering out less valuable data.

In “Data Valuation Using Reinforcement Learning”, accepted at ICML 2020, we address the challenge of quantifying the value of training data using a novel approach based on meta-learning. Our method integrates data valuation into the training procedure of a predictor model that learns to recognize samples that are more valuable for the given task, improving both predictor and data valuation performance. We have also launched four AI Hub Notebooks that exemplify the use cases of DVRL and are designed to be conveniently adapted to other tasks and datasets, covering domain adaptation, corrupted sample discovery and robust learning, transfer learning on image data, and data valuation.

Quantifying the Value of Data
Not all data are equal for a given ML model: some are more relevant to the task at hand or richer in informative content than others. So how does one evaluate the value of a single datum? At the granularity of a full dataset, it is straightforward; one can simply train a model on the entire dataset and use its performance on a test set as its value. However, estimating the value of a single datum is far more difficult, especially for complex models that rely on large-scale datasets, because it is computationally infeasible to re-train and re-evaluate a model on all possible subsets.
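
To see why the naive approach does not scale, here is a small sketch of brute-force leave-one-out valuation using scikit-learn; the dataset and model are toy stand-ins, and the point is simply that one retraining per training sample quickly becomes prohibitive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Brute-force leave-one-out valuation: retrain once per training point and
# record the change in test accuracy. With N training points this requires N
# retrainings, which is only feasible for tiny datasets and simple models.
X, y = make_classification(n_samples=220, n_features=10, random_state=0)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

base_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
loo_values = []
for i in range(len(X_tr)):
    keep = np.arange(len(X_tr)) != i
    acc = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep]).score(X_te, y_te)
    loo_values.append(base_acc - acc)   # positive: removing this point hurts accuracy
```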

To tackle this, researchers have explored permutation-based methods (e.g., influence functions) and game theory-based methods (e.g., data Shapley). However, even the best current methods are far from computationally feasible for large datasets and complex models, and their data valuation performance is limited. Concurrently, meta-learning-based adaptive weight assignment approaches have been developed to estimate weight values using a meta-objective. But rather than prioritizing learning from high-value data samples, their data value mapping is typically based on gradient descent learning or other heuristic approaches that alter the conventional training dynamics of the predictor model, which can result in performance changes that are unrelated to the value of individual data points.

Data Valuation Using Reinforcement Learning (DVRL)
To infer the data values, we propose a data value estimator (DVE) that estimates data values and selects the most valuable samples to train the predictor model. This selection operation is fundamentally non-differentiable and thus conventional gradient descent-based methods cannot be used. Instead, we propose to use reinforcement learning (RL) such that the supervision of the DVE is based on a reward that quantifies the predictor performance on a small (but clean) validation set. The reward guides the optimization of the policy towards the action of optimal data valuation, given the state and input samples. Here, we treat the predictor model learning and evaluation framework as the environment, a novel application scenario of RL-assisted machine learning.
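
To make the training loop concrete, the following is a minimal, simplified sketch of the DVRL idea for binary classification. The network sizes, the single inner training epoch, the moving-average baseline, and all variable names are illustrative assumptions rather than the published implementation; in the paper, the DVE also conditions on additional information and the predictor is trained more thoroughly between updates.

```python
import tensorflow as tf

def make_mlp(in_dim, out_dim):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(in_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(out_dim, activation="sigmoid"),
    ])

NUM_FEATURES = 10                                   # illustrative problem size
dve = make_mlp(NUM_FEATURES + 1, 1)                 # DVE sees features and label
predictor = make_mlp(NUM_FEATURES, 1)
predictor.compile(optimizer="adam", loss="binary_crossentropy")
dve_opt = tf.keras.optimizers.Adam(1e-3)
baseline = tf.Variable(0.0)                         # moving-average reward baseline

def dvrl_step(x, y, x_val, y_val):
    """One outer DVRL iteration (inputs are NumPy arrays; heavily simplified)."""
    xy = tf.concat([tf.cast(x, tf.float32),
                    tf.reshape(tf.cast(y, tf.float32), [-1, 1])], axis=1)

    # 1) Sample a per-example selection mask from the current data values.
    probs = tf.squeeze(dve(xy), axis=1)
    select = tf.cast(tf.random.uniform(tf.shape(probs)) < probs, tf.float32)

    # 2) Train the predictor on the selected samples (one inner epoch shown) and
    #    use its accuracy on the small clean validation set as the reward.
    predictor.fit(x, y.reshape(-1, 1), sample_weight=select.numpy(),
                  epochs=1, verbose=0)
    val_pred = tf.squeeze(predictor(x_val), axis=1)
    accuracy = float(tf.reduce_mean(tf.cast(
        tf.equal(tf.cast(val_pred > 0.5, tf.float32),
                 tf.cast(y_val, tf.float32)), tf.float32)))
    advantage = accuracy - float(baseline.numpy())

    # 3) REINFORCE update of the DVE: increase the likelihood of the sampled
    #    selection when the reward beats the running baseline.
    with tf.GradientTape() as tape:
        probs = tf.squeeze(dve(xy), axis=1)
        log_prob = tf.reduce_sum(select * tf.math.log(probs + 1e-8) +
                                 (1.0 - select) * tf.math.log(1.0 - probs + 1e-8))
        dve_loss = -advantage * log_prob
    grads = tape.gradient(dve_loss, dve.trainable_variables)
    dve_opt.apply_gradients(zip(grads, dve.trainable_variables))
    baseline.assign(0.9 * baseline + 0.1 * accuracy)
```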

Training with Data Value Estimation using Reinforcement Learning (DVRL). When training the data value estimator with an accuracy reward, the most valuable samples (denoted with green dots) are used more and more, whereas the least valuable samples (red dots) are used less frequently.

Results
We evaluate the data value estimation quality of DVRL on multiple types of datasets and use cases.


    • Model performance after removing high/low value samples
      Removing low value samples from the training dataset can improve predictor model performance, especially when the training dataset contains corrupted samples. Conversely, removing high value samples, especially if the dataset is small, decreases performance significantly. Overall, the performance after removing high/low value samples is a strong indicator of the quality of data valuation (a short sketch of this removal-based evaluation follows this list).
      Accuracy with the removal of most and least valuable samples, where 20% of the labels are noisy by design. By removing such noisy labels as the least valuable samples, a high-quality data valuation method achieves better accuracy. We demonstrate that DVRL outperforms other methods significantly from this perspective.

      DVRL shows the fastest performance degradation after removing the most important samples and the slowest performance degradation after removing the least important samples in most cases, underlining the superiority of DVRL in identifying noisy labels compared to competing methods (Leave-One-Out and Data Shapley).

    • Robust learning with noisy labels
      We consider how reliably DVRL can learn with noisy data in an end-to-end way, without removing the low-value samples. Ideally, noisy samples should receive low data values as DVRL converges, so that a high-performance model is returned.
      Robust learning with noisy labels. Test accuracy for ResNet-32 and WideResNet-28-10 on the CIFAR-10 and CIFAR-100 datasets with 40% uniform random label noise. DVRL outperforms other popular methods that are based on meta-learning.

      We show state-of-the-art results with DVRL in minimizing the impact of noisy labels. These also demonstrate that DVRL can scale to complex models and large-scale datasets.

    • Domain adaptation
      We consider the scenario where the training dataset comes from a substantially different distribution than the validation and testing datasets. Data valuation is expected to be beneficial here, by selecting the samples from the training dataset that best match the distribution of the validation dataset. We focus on three cases: (1) a training set of image search results (low-quality, web-scraped) applied to skin lesion classification using HAM 10000 data (high-quality medical); (2) an MNIST training set for a digit recognition task on USPS data (different visual domain); and (3) e-mail spam data used to detect spam in an SMS dataset (different task). DVRL yields significant improvements for domain adaptation by jointly optimizing the data valuator and the corresponding predictor model.
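
As referenced above, the removal-based evaluation can be reproduced in a few lines. The sketch below assumes per-sample value estimates are already available (e.g., from DVRL or any baseline) and uses a stand-in scikit-learn predictor; the fractions and model choice are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def removal_curve(values, X_tr, y_tr, X_te, y_te,
                  fractions=(0.0, 0.1, 0.2, 0.3), remove="low"):
    """Retrain after discarding a growing fraction of the lowest- (or highest-)
    valued training samples; `values` is any per-sample score, e.g. DVRL's
    estimates, and logistic regression is just a stand-in predictor."""
    order = np.argsort(values)                 # ascending: lowest value first
    if remove == "high":
        order = order[::-1]
    accuracies = []
    for frac in fractions:
        keep = order[int(frac * len(order)):]  # drop the first `frac` portion
        model = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
        accuracies.append(model.score(X_te, y_te))
    return accuracies
```

A high-quality valuation shows a flat (or rising) curve when low-value samples are removed and a steep drop when high-value samples are removed.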


Conclusions
We propose a novel meta-learning framework for data valuation that determines how likely each training sample is to be used in training the predictor model. Unlike previous works, our method integrates data valuation into the training procedure of the predictor model, allowing the predictor and the DVE to improve each other’s performance. We model this data value estimation task with a DNN trained through RL, using a reward obtained from a small validation set that represents the target task performance. DVRL provides a computationally efficient, high-quality ranking of training data that is useful for domain adaptation, corrupted sample discovery, and robust learning. We show that DVRL significantly outperforms alternative methods on diverse types of tasks and datasets.

Acknowledgements
We gratefully acknowledge the contributions of Tomas Pfister.
