Ask a Techspert: How do machine learning models explain themselves?

Editor’s Note: Do you ever feel like a fish out of water? Try being a tech novice and talking to an engineer at a place like Google. Ask a Techspert is a series on the Keyword asking Googler experts to explain complicated technology for the rest of us. This isn’t meant to be comprehensive, but just enough to make you sound smart at a dinner party. 

A few years ago, I learned that a translation from Finnish to English using Google Translate led to an unexpected outcome. The sentence “hän on lentäjä” became “he is a pilot” in English, even though “hän” is a gender-neutral word in Finnish. Why did Translate assume it was “he” as the default? 

As I started looking into it, I became aware that just like humans, machines are affected by society’s biases. The machine learning model for Translate relied on training data, which consisted of the input from hundreds of millions of already-translated examples from the web. “He” was more associated with some professions than “she” was, and vice versa. 

Now, Google provides options for both feminine and masculine translations when adapting gender-neutral words in several languages, and there’s a continued effort to roll it out more broadly. But it’s still a good example of how machine learning can reflect the biases we see all around us. Thankfully, there are teams at Google dedicated to finding human-centered solutions to making technology inclusive for everyone. I sat down with Been Kim, a Google researcher working on the People + AI Research (PAIR) team, who devotes her time to making sure artificial intelligence puts people, not machines, at its center, and helping others understand the full spectrum of human interaction with machine intelligence. We talked about how you make machine learning models easy to interpret and understand, and why it’s important for everybody to have a basic idea of how the technology works.

Been Kim

Why is this field of work so important?

Machine learning is such a powerful tool, and because of that, you want to make sure you’re using it responsibly. Let’s take an electric machine saw as an example. It’s a super powerful tool, but you need to learn how to use it in order not to cut your fingers. Once you learn, it’s so useful and efficient that you’ll never want to go back to using a hand saw. And the same goes for machine learning. We want to help you understand and use machine learning correctly, fairly and safely. 

Since machine learning is used in our everyday lives, it’s also important for everyone to understand how it impacts us. Whether you’re a coffee shop owner using machine learning to optimize the purchase of your beans based on seasonal trends, or a patient whose doctor relies on this technology to help diagnose a disease, it’s often crucial to understand why a machine learning model has produced the outcome it has. And for developers and decision-makers, it’s important to be able to explain or present a machine learning model to the people affected by it. This is what we call “interpretability.”

How do you make machine learning models easier to understand and interpret? 

There are many different ways to make an ML model easier to understand. One way is to make the model reflect how humans think from the start, and have the model “trained” to provide explanations along with predictions, meaning when it gives you an outcome, it also has to explain how it got there. 

Another way is to try and explain a model after it has been trained on data. This applies when the model has been built simply to turn inputs into outputs, optimizing for prediction without a clear “how” included. You can plug things into it and see what comes out, and that gives you some insight into how the model generally makes decisions, but you don’t necessarily know exactly how it interprets specific inputs in specific cases.

One way to try and explain models after they’ve been trained is using low-level features or high-level concepts. Let me give you an example of what this means. Imagine a system that classifies pictures: you give it a picture and it says, “This is a cat.” With low-level features, I ask the machine which pixels mattered for that prediction. It can point to one pixel or another, and we might see that the pixels in question show the cat’s whiskers. But we might also see a scattering of pixels that don’t appear meaningful to the human eye, or discover that the model has made the wrong interpretation. High-level concepts are more similar to the way humans communicate with one another. Instead of asking about pixels, I’d ask, “Did the whiskers matter for the prediction? Or the paws?” and again, the machine can show me what imagery led it to reach this conclusion. Based on the outcome, I can understand the model better. (Together with researchers from Stanford, we’ve published papers that go into further detail on this for those who are interested.)
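
To make the “low-level feature” idea above concrete, here is a minimal, hedged sketch of pixel-level attribution using plain gradient saliency with an off-the-shelf image classifier. The model choice and preprocessing are illustrative assumptions, and concept-based approaches such as Kim’s TCAV work go well beyond pixel attributions.

```python
# A minimal sketch of "which pixels mattered": gradient saliency with respect to
# the predicted class. The classifier and the lack of preprocessing are
# illustrative assumptions, not the method used by any particular Google product.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")  # any image classifier will do

def pixel_saliency(image_batch):
    """Returns |d(top-class score) / d(pixel)| for each image in the batch."""
    images = tf.convert_to_tensor(image_batch, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(images)
        logits = model(images, training=False)
        top_class = tf.argmax(logits, axis=-1)
        top_score = tf.gather(logits, top_class, axis=-1, batch_dims=1)
    grads = tape.gradient(top_score, images)           # same shape as the input images
    return np.max(np.abs(grads.numpy()), axis=-1)      # collapse color channels into one heatmap

# saliency = pixel_saliency(batch_of_224x224_images)   # bright pixels "mattered" most
```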

Can machines understand some things that we humans can’t? 

Yes! This is an area that I am very interested in myself. I am currently working on a way to showcase how technology can help humans learn new things. Machine learning technology is better at some things than we are; for example it can analyze and interpret data at a much larger scale than humans can. Leveraging this technology, I believe we can enlighten human scientists with knowledge they haven’t previously been aware of. 

What do you need to be careful of when you’re making conclusions based on machine learning models?

First of all, we have to be careful that human bias doesn’t come into play. Humans carry biases that we simply cannot help and are often unaware of, so if an explanation is left up to a human’s interpretation, and often it is, then we have a problem. Humans read what they want to read. Now, this doesn’t mean that you should remove humans from the loop. Humans communicate with machines, and vice versa. Machines need to communicate their outcomes in the form of clear statements backed by quantitative data, not statements that are vague and completely open to interpretation. If the latter happens, the machine hasn’t done a very good job, and the human isn’t able to provide good feedback to it. It could also be that the outcome simply lacks additional context only the human can provide, or that it would benefit from caveats, so that the human can make an informed judgement about the results of the model.

What are some of the main challenges of this work? 

Well, one of the challenges for computer scientists in this field is dealing with non-mathematical objectives, which are things you might want to optimize for but don’t have an equation for. You can’t always define what is good for humans using math. That requires us to test and evaluate methods with rigor, and to have a table full of different people to discuss the outcome. Another challenge has to do with complexity. Humans are so complex that we have a whole field of work – psychology – to study this. So in my work, we don’t just have computational challenges, but also complex humans that we have to consider. Value-based questions such as “what defines fairness?” are even harder. They require interdisciplinary collaboration, and a diverse group of people in the room to discuss each individual matter.

What’s the most exciting part? 

I think interpretability research and methods are making a huge impact. Machine learning technology is a powerful tool that will transform society as we know it, and helping others to use it safely is very rewarding. 

On a more personal note, I come from South Korea and grew up in circumstances where I feel I didn’t have too many opportunities. I was incredibly lucky to get a scholarship to MIT and come to the U.S. When I think about the people who haven’t had these opportunities to be educated in science or machine learning, and knowing that this machine learning technology can really help and be useful to them in their everyday lives if they use it safely, I feel really motivated to be working on democratizing this technology. There are many ways to do it, and interpretability is one of the ways I can contribute.

Duality — A New Approach to Reinforcement Learning

Posted by Ofir Nachum and Bo Dai, Research Scientists, Google Research

Reinforcement learning (RL) is an approach commonly used to train agents to make sequences of decisions that will be successful in complex environments, such as robotic navigation, where an agent controls the joint motors of a robot to seek a path to a target location, or game-playing, where the goal might be to solve a game level in minimal time. Many modern successful RL algorithms, such as Q-learning and actor-critic, propose to reduce the RL problem to a constraint-satisfaction problem, where a constraint exists for every possible “state” of the environment. For example, in vision-based robotic navigation, the “states” of the environment correspond to every possible camera input.

Despite how ubiquitous the constraint-satisfaction approach is in practice, this strategy is often difficult to reconcile with the complexity of real-world settings. In practical scenarios (like the robotic navigation example) the space of states is large, sometimes even uncountable, so how can one learn to satisfy the tremendous number of constraints associated with arbitrary input? Implementations of Q-learning and actor-critic often ignore these mathematical issues or obscure them through a series of rough approximations, which results in a stark divide between the practical implementations of these algorithms and their mathematical foundations.

In “Reinforcement Learning via Fenchel-Rockafellar Duality” we have developed a new approach to RL that enables algorithms that are both useful in practice and mathematically principled — that is to say, the proposed algorithms avoid the use of exceedingly rough approximations to translate their mathematical foundations to practical implementation. This approach is based on convex duality, which is a well-studied mathematical tool used to transform problems expressed in one form into equivalent problems in distinct forms that may be more computationally friendly. In our case, we develop specific ways to apply duality in RL to transform the traditional constraint-satisfaction mathematical form to an unconstrained, and thus more practical, mathematical problem.

A Duality-Based Solution
The duality-based approach begins by formulating the reinforcement learning problem as a mathematical objective along with a number of constraints, potentially infinite in number. Applying duality to this mathematical problem yields a different formulation of the same problem. Still, this dual formulation has the same format as the original problem — a single objective with a large number of constraints — although the specific objective and constraints are changed.
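
To make the “single objective with one constraint per state” shape concrete, the textbook linear-programming view of this problem, written over state-action visitation frequencies d(s,a), looks roughly like the following; this is a standard illustration, not necessarily the exact formulation used in the paper.

```latex
% Standard LP view of RL over visitation frequencies d(s,a):
\max_{d \ge 0} \;\; \sum_{s,a} d(s,a)\, r(s,a)
\qquad \text{subject to} \qquad
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s', a')\, d(s',a')
\quad \text{for every state } s.
```

Here μ0 is the initial-state distribution, P the transition function, and γ the discount factor; there is one constraint for every state, which is exactly what makes the problem hard at scale.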

The next step is key to the duality-based solution. We augment the dual objective with a convex regularizer, a method often used in optimization as a way to smooth a problem and make it easier to solve. The choice of the regularizer is crucial to the final step, in which we apply duality once again to yield another formulation of an equivalent problem. In our case, we use the f-divergence regularizer, which results in a final formulation that is now unconstrained. Although there exist other choices of convex regularizers, regularization via the f-divergence is uniquely desirable for yielding an unconstrained problem that is especially amenable to optimization in practical and real-world settings which require off-policy or offline learning.
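
Schematically, and for the policy-evaluation case only, the resulting unconstrained problem has roughly the following form, where f_* is the convex conjugate of the chosen f-divergence and d^D is the distribution of the off-policy data; the precise statement and derivation are in the paper.

```latex
% Schematic form of the final unconstrained problem (policy evaluation case).
\min_{\nu} \;\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi}\!\big[\nu(s_0, a_0)\big]
\;+\; \mathbb{E}_{(s,a,s') \sim d^D}\!\Big[ f_{*}\!\big( r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi}\big[\nu(s', a')\big] - \nu(s, a) \big) \Big].
```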

Notably in many cases, the applications of duality and regularization prescribed by the duality-based approach do not change the optimality of the original solution. In other words, although the form of the problem has changed, the solution has not. This way, the result obtained with the new formulation is the same result as for the original problem, albeit achieved in a much easier way.

Experimental Evaluation
As a test of our new approach, we implemented duality-based training on a navigational agent. The agent starts at one corner of a multi-room map and must navigate to the opposite corner. We compare our algorithm to an actor-critic approach. Although both of these algorithms are based on the same underlying mathematical problem, actor-critic uses a number of approximations due to the infeasibility of satisfying the large number of constraints. In contrast, our algorithm is more amenable to practical implementation as can be seen by comparing the performance of the two algorithms. In the figure below, we plot the average reward achieved by the learned agent against the number of iterations of training for each algorithm. The duality-based implementation achieves significantly higher reward compared to actor-critic.

A plot of the average reward achieved by an agent using the duality-based approach (blue) compared to an agent using standard actor-critic (orange). In addition to being more mathematically principled, our approach also yields better practical results.

Conclusion
In summary, we’ve shown that if one formulates the RL problem as a mathematical objective with constraints, then repeated applications of convex duality in conjunction with a cleverly chosen convex regularizer yield an equivalent problem without constraints. The resulting unconstrained problem is easy to implement in practice and applicable in a wide range of settings. We’ve already applied our general framework to agent behavior policy optimization as well as policy evaluation, and imitation learning. We’ve found that our algorithms are not only more mathematically principled than existing RL methods, but they also often yield better practical performance, showing the value of unifying mathematical principles with practical implementation.

Google at ACL 2020

Posted by Cat Armato and Emily Knapp, Program Managers

This week, the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, takes place online.

As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2020, Google will showcase the latest research in the field with over 30 publications, and the organization of and participation in a variety of workshops and tutorials.

If you’re registered for ACL 2020, we hope that you’ll visit the Google virtual booth to learn more about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about the Google research being presented at ACL 2020 below (Google affiliations bolded).

Committees
Diversity & Inclusion (D&I) Chair: Vinodkumar Prabhakaran
Accessibility Chair: Sushant Kafle
Local Sponsorship Chair: Kristina Toutanova
Virtual Infrastructure Committee: Yi Luan
Area Chairs: Anders Søgaard, Ankur Parikh, Annie Louis, Bhuvana Ramabhadran, Christo Kirov, Daniel Cer, Dipanjan Das, Diyi Yang, Emily Pitler, Eunsol Choi, George Foster, Idan Szpektor, Jacob Eisenstein, Jason Baldridge, Jun Suzuki, Kenton Lee, Luheng He, Marius Pasca, Ming-Wei Chang, Sebastian Gehrmann, Shashi Narayan, Slav Petrov, Vinodkumar Prabhakaran, Waleed Ammar, William Cohen

Long Papers
Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage
Ashish V. Thapliyal, Radu Soricut

Automatic Detection of Generated Text is Easiest when Humans are Fooled
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, Douglas Eck

On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, Fei Sha

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
Xuanli He, Gholamreza Haffari, Mohammad Norouzi

GoEmotions: A Dataset of Fine-Grained Emotions
Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, Sujith Ravi

TaPas: Weakly Supervised Table Parsing via Pre-training (see blog post)
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Eisenschlos

Toxicity Detection: Does Context Really Matter?
John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos

(Re)construing Meaning in NLP
Sean Trott, Tiago Timponi Torrent, Nancy Chang, Nathan Schneider

Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models
Dan Iter, Kelvin Guu, Larry Lansing, Dan Jurafsky

Probabilistic Assumptions Matter: Improved Models for Distantly-Supervised Document-Level Question Answering
Hao Cheng, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

AdvAug: Robust Adversarial Augmentation for Neural Machine Translation
Yong Cheng, Lu Jiang, Wolfgang Macherey, Jacob Eisenstein

Named Entity Recognition as Dependency Parsing
Juntao Yu, Bernd Bohnet, Massimo Poesio

Cross-modal Coherence Modeling for Caption Generation
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone

Representation Learning for Information Extraction from Form-like Documents (see blog post)
Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork

Low-Dimensional Hyperbolic Knowledge Graph Embeddings
Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, Christopher Ré

What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber, Benjamin Börschinger

Learning a Multi-Domain Curriculum for Neural Machine Translation (see blog post)
Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, Zarana Parekh

Translationese as a Language in “Multilingual” NMT
Parker Riley, Isaac Caswell, Markus Freitag, David Grangier

Mapping Natural Language Instructions to Mobile UI Action Sequences
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge

BLEURT: Learning Robust Metrics for Text Generation (see blog post)
Thibault Sellam, Dipanjan Das, Ankur Parikh

Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing
Alane Suhr, Ming-Wei Chang, Peter Shaw, Kenton Lee

Frugal Paradigm Completion
Alexander Erdmann, Tom Kenter, Markus Becker, Christian Schallhart

Short Papers
Reverse Engineering Configurations of Neural Text Generation Models
Yi Tay, Dara Bahri, Che Zheng, Clifford Brunk, Donald Metzler, Andrew Tomkins

Syntactic Data Augmentation Increases Robustness to Inference Heuristics
Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, Tal Linzen

Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Aditya Siddhant, Ankur Bapna, Yuan Cao, Orhan Firat, Mia Chen, Sneha Kudugunta, Naveen Arivazhagan, Yonghui Wu

Social Biases in NLP Models as Barriers for Persons with Disabilities
Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, Stephen Denuyl

Toward Better Storylines with Sentence-Level Language Models
Daphne Ippolito, David Grangier, Douglas Eck, Chris Callison-Burch

TACL Papers
TYDI QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (see blog post)
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, Jennimaria Palomaki

Phonotactic Complexity and Its Trade-offs
Tiago Pimentel, Brian Roark, Ryan Cotterell

Demos
Multilingual Universal Sentence Encoder for Semantic Retrieval (see blog post)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

Workshops
IWPT – The 16th International Conference on Parsing Technologies
Yuji Matsumoto, Stephan Oepen, Kenji Sagae, Anders Søgaard, Weiwei Sun and Reut Tsarfaty

ALVR – Workshop on Advances in Language and Vision Research
Xin Wang, Jesse Thomason, Ronghang Hu, Xinlei Chen, Peter Anderson, Qi Wu, Asli Celikyilmaz, Jason Baldridge and William Yang Wang

WNGT – The 4th Workshop on Neural Generation and Translation
Alexandra Birch, Graham Neubig, Andrew Finch, Hiroaki Hayashi, Kenneth Heafield, Ioannis Konstas, Yusuke Oda and Xian Li

NLPMC – NLP for Medical Conversations
Parminder Bhatia, Chaitanya Shivade, Mona Diab, Byron Wallace, Rashmi Gangadharaiah, Nan Du, Izhak Shafran and Steven Lin

AutoSimTrans – The 1st Workshop on Automatic Simultaneous Translation
Hua Wu, Colin Cherry, James Cross, Liang Huang, Zhongjun He, Mark Liberman and Yang Liu

Tutorials
Interpretability and Analysis in Neural NLP (cutting-edge)
Yonatan Belinkov, Sebastian Gehrmann, Ellie Pavlick

Commonsense Reasoning for Natural Language Processing (Introductory)
Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, Dan Roth

SmartReply for YouTube Creators

Posted by Rami Al-Rfou, Research Scientist, Google Research

It has been more than 4 years since SmartReply was launched, and since then, it has expanded to more users with the Gmail launch and Android Messages and to more devices with Android Wear. Developers now use SmartReply to respond to reviews within the Play Developer Console and can set up their own versions using APIs offered within MLKit and TFLite. With each launch there has been a unique challenge in modeling and serving that required customizing SmartReply for the task requirements.

We are now excited to share an updated SmartReply built for YouTube and implemented in YouTube Studio that helps creators engage more easily with their viewers. This model learns comment and reply representation through a computationally efficient dilated self-attention network, and represents the first cross-lingual and character byte-based SmartReply model. SmartReply for YouTube is currently available for English and Spanish creators, and this approach simplifies the process of extending the SmartReply feature to many more languages in the future.

YouTube creators receive a large volume of responses to their videos. Moreover, the community of creators and viewers on YouTube is diverse, as reflected by the creativity of their comments, discussions and videos. In comparison to emails, which tend to be long and dominated by formal language, YouTube comments reveal complex patterns of language switching, abbreviated words, slang, inconsistent usage of punctuation, and heavy utilization of emoji.

Deep Retrieval
The initial release of SmartReply for Inbox encoded input emails word-by-word with a recurrent neural network, and then decoded potential replies with yet another word-level recurrent neural network. Despite the expressivity of this approach, it was computationally expensive. Instead, we found that one can achieve the same ends by designing a system that searches through a predefined list of suggestions for the most appropriate response.

This retrieval system encoded the message and its suggestion independently. First, the text was preprocessed to extract words and short phrases. This preprocessing included, but was not limited to, language identification, tokenization, and normalization. Two neural networks then simultaneously and independently encoded the message and the suggestion. This factorization allowed one to pre-compute the suggestion encodings and then search through the set of suggestions using an efficient maximum inner product search data structure. This deep retrieval approach enabled us to expand SmartReply to Gmail and since then, it has been the foundation for several SmartReply systems including the current YouTube system.
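
As a rough, hedged sketch of what this factorization buys: the suggestion encodings can be computed once offline, so serving a message costs one encoder pass plus an inner-product search. The toy encoder below (a fixed random projection of byte counts) is only a stand-in for the trained neural encoders; its interface is the point.

```python
# Minimal sketch of factorized retrieval: encode suggestions offline, then score
# each incoming message against all of them with a single inner-product search.
import numpy as np

def encode_text(text, dim=64, seed=42):
    """Stand-in for a trained encoder: fixed random projection of byte counts."""
    counts = np.zeros(256)
    for b in text.encode("utf-8"):
        counts[b] += 1
    projection = np.random.default_rng(seed).normal(size=(256, dim))
    vec = counts @ projection
    return vec / (np.linalg.norm(vec) + 1e-8)

suggestions = ["Thanks!", "Great point", "I'll look into it", "Glad you liked it"]
suggestion_matrix = np.stack([encode_text(s) for s in suggestions])   # [N, d], precomputed offline

def top_replies(message, k=2):
    scores = suggestion_matrix @ encode_text(message)   # maximum inner product search
    return [suggestions[i] for i in np.argsort(-scores)[:k]]

print(top_replies("This video was super helpful"))
```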

Beyond Words
The previous SmartReply systems described above relied on word level preprocessing that is well tuned for a limited number of languages and narrow genres of writing. Such systems face significant challenges in the YouTube case, where a typical comment might include heterogeneous content, like emoji, ASCII art, language switching, etc. In light of this, and taking inspiration from our recent work on byte and character language modeling, we decided to encode the text without any preprocessing. This approach is supported by research demonstrating that a deep Transformer network is able to model words and phrases from the ground up just by feeding it text as a sequence of characters or bytes, with comparable quality to word-based models.

Although initial results were promising, especially for processing comments with emoji or typos, the inference speed was too slow for production due to the fact that character sequences are longer than word equivalents and the computational complexity of self-attention layers grows quadratically as a function of sequence length. We found that shrinking the sequence length by applying temporal reduction layers at each layer of the network, similar to the dilation technique applied in WaveNet, provides a good trade-off between computation and quality.

The figure below presents a dual encoder network that encodes both the comment and the reply to maximize the mutual information between their latent representations by training the network with a contrastive objective. The encoding starts with feeding the transformer a sequence of bytes after they have been embedded. The input for each subsequent layer will be reduced by dropping a percentage of characters at equal offsets. After applying several transformer layers the sequence length is greatly truncated, significantly reducing the computational complexity. This sequence compression scheme could be substituted by other operators such as average pooling, though we did not notice any gains from more sophisticated methods, and therefore, opted to use dilation for simplicity.

A dual encoder network that maximizes the mutual information between the comments and their replies through a contrastive objective. Each encoder is fed a sequence of bytes and is implemented as a computationally efficient dilated transformer network.
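
Below is a hedged Keras sketch of these ideas: byte inputs, self-attention blocks that drop positions at equal offsets between layers to shrink the sequence, and an in-batch contrastive objective over comment/reply pairs. Layer sizes, the reduction factor, padding/masking, and the temperature are illustrative simplifications rather than the production configuration.

```python
# Dual byte encoders with temporal reduction and an in-batch contrastive loss.
import tensorflow as tf

def byte_encoder(num_layers=3, dim=64, heads=4, reduce_every=2):
    bytes_in = tf.keras.Input(shape=(None,), dtype=tf.int32)       # raw UTF-8 byte ids, 0..255
    x = tf.keras.layers.Embedding(256, dim)(bytes_in)
    for _ in range(num_layers):
        attn = tf.keras.layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(x, x)
        x = tf.keras.layers.LayerNormalization()(x + attn)
        ff = tf.keras.layers.Dense(dim, activation="relu")(x)
        x = tf.keras.layers.LayerNormalization()(x + ff)
        x = x[:, ::reduce_every, :]        # temporal reduction: keep every other position
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
    return tf.keras.Model(bytes_in, tf.math.l2_normalize(pooled, axis=-1))

comment_encoder, reply_encoder = byte_encoder(), byte_encoder()

def contrastive_loss(comment_bytes, reply_bytes, temperature=0.05):
    """In-batch softmax: each comment should score its own reply above the others."""
    c = comment_encoder(comment_bytes)                                # [batch, dim]
    r = reply_encoder(reply_bytes)                                    # [batch, dim]
    logits = tf.matmul(c, r, transpose_b=True) / temperature          # [batch, batch]
    labels = tf.range(tf.shape(logits)[0])
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))
```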

A Model to Learn Them All
Instead of training a separate model for each language, we opted to train a single cross-lingual model for all supported languages. This allows the support of mixed-language usage in the comments, and enables the model to utilize the learning of common elements in one language for understanding another, such as emoji and numbers. Moreover, having a single model simplifies the logistics of maintenance and updates. While the model has been rolled out to English and Spanish, the flexibility inherent in this approach will enable it to be expanded to other languages in the future.

Inspecting the encodings of a multilingual set of suggestions produced by the model reveals that the model clusters appropriate replies, regardless of the language to which they belong. This cross-lingual capability emerged without exposing the model to any parallel corpus during training. The figure below demonstrates, for three languages, how the replies are clustered by their meaning when the model is probed with an input comment. For example, the English comment “This is a great video,” is surrounded by appropriate replies, such as “Thanks!” Moreover, inspection of the nearest replies in other languages reveals them also to be appropriate and similar in meaning to the English reply. The 2D projection also shows several other cross-lingual clusters that consist of replies of similar meaning. This clustering demonstrates how the model can support a rich cross-lingual user experience in the supported languages.

A 2D projection of the model encodings when presented with a hypothetical comment and a small list of potential replies. The neighborhood surrounding English comments (black color) consists of appropriate replies in English and their counterparts in Spanish and Arabic. Note that the network learned to align English replies with their translations without access to any parallel corpus.

When to Suggest?
Our goal is to help creators, so we have to make sure that SmartReply only makes suggestions when it is very likely to be useful. Ideally, suggestions would only be displayed when it is likely that the creator would reply to the comment and when the model has a high chance of providing a sensible and specific response. To accomplish this, we trained auxiliary models to identify which comments should trigger the SmartReply feature.

Conclusion
We’ve launched YouTube SmartReply, starting with English and Spanish comments, the first cross-lingual and character byte-based SmartReply. YouTube is a global product with a diverse user base that generates heterogeneous content. Consequently, it is important that we continuously improve comments for this global audience, and SmartReply represents a strong step in this direction.

Acknowledgements
SmartReply for YouTube creators was developed by Golnaz Farhadi, Ezequiel Baril, Cheng Lee, Claire Yuan, Coty Morrison‎, Joe Simunic‎, Rachel Bransom‎, Rajvi Mehta, Jorge Gonzalez‎, Mark Williams, Uma Roy and many more. We are grateful for the leadership support from Nikhil Dandekar, Eileen Long, Siobhan Quinn, Yun-hsuan Sung, Rachel Bernstein, and Ray Kurzweil.

SpineNet: A Novel Architecture for Object Detection Discovered with Neural Architecture Search

Posted by Xianzhi Du, Software Engineer and Jaeyoun Kim, Technical Program Manager, Google Research

Convolutional neural networks created for image tasks typically encode an input image into a sequence of intermediate features that capture the semantics of an image (from local to global), where each subsequent layer has a lower spatial dimension. However, this scale-decreased model may not be able to deliver strong features for multi-scale visual recognition tasks where recognition and localization are both important (e.g., object detection and segmentation). Several works including FPN and DeepLabv3+ propose multi-scale encoder-decoder architectures to address this issue, where a scale-decreased network (e.g., a ResNet) is taken as the encoder (commonly referred to as a backbone model). A decoder network is then applied to the backbone to recover the spatial information.

While this architecture has yielded improved success for image recognition and localization tasks, it still relies on a scale-decreased backbone that throws away spatial information by down-sampling, which the decoder then must attempt to recover. What if one were to design an alternate backbone model that avoids this loss of spatial information, and is thus inherently well-suited for simultaneous image recognition and localization?

In our recent CVPR 2020 paper “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization”, we propose a meta architecture called a scale-permuted model that enables two major improvements on backbone architecture design. First, the spatial resolution of intermediate feature maps should be able to increase or decrease anytime so that the model can retain spatial information as it grows deeper. Second, the connections between feature maps should be able to go across feature scales to facilitate multi-scale feature fusion. We then use neural architecture search (NAS) with a novel search space design that includes these features to discover an effective scale-permuted model. We demonstrate that this model is successful in multi-scale visual recognition tasks, outperforming networks with standard, scale-reduced backbones. To facilitate continued work in this space, we have open sourced the SpineNet code to the TensorFlow TPU GitHub repository (TensorFlow 1) and the TensorFlow Model Garden GitHub repository (TensorFlow 2).

A scale-decreased backbone is shown on the left and a scale-permuted backbone is shown on the right. Each rectangle represents a building block. Colors and shapes represent different spatial resolutions and feature dimensions. Arrows represent connections among building blocks.

Design of SpineNet Architecture
In order to efficiently design the architecture for SpineNet, and avoid a time-intensive manual search of what is optimal, we leverage NAS to determine an optimal architecture. The backbone model is learned on the object detection task using the COCO dataset, which requires simultaneous recognition and localization. During architecture search, we learn three things:

  • Scale permutations: The orderings of network building blocks are important because each block can only be built from those that already exist (i.e., with a “lower ordering”). We define the search space of scale permutations by rearranging intermediate and output blocks, respectively.
  • Cross-scale connections: We define two input connections for each block in the search space. The parent blocks can be any block with a lower ordering or a block from the stem network.
  • Block adjustments (optional): We allow the block to adjust its scale level and type.
The architecture search process from a scale-decreased backbone to a scale-permuted backbone.
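
As a hedged sketch of what a candidate in this search space might look like as data, each block can be described by a feature level, a block type, and two lower-ordered parent blocks. The specific numbers below are invented for illustration; they are not the SpineNet-49 architecture.

```python
# Hypothetical encoding of a scale-permuted candidate. Blocks 0 and 1 are the
# stem outputs; every later block may only take inputs with a lower index, and
# its inputs may come from any feature scale (cross-scale connections).
from dataclasses import dataclass
from typing import Tuple

@dataclass
class BlockSpec:
    level: int                 # feature level: spatial size ~ input_size / 2**level
    block_type: str            # e.g. "residual" or "bottleneck"
    inputs: Tuple[int, int]    # indices of the two parent blocks

candidate = [                                                     # block indices 2, 3, 4, 5
    BlockSpec(level=2, block_type="bottleneck", inputs=(0, 1)),
    BlockSpec(level=4, block_type="residual",   inputs=(0, 2)),   # cross-scale connection
    BlockSpec(level=3, block_type="bottleneck", inputs=(2, 3)),
    BlockSpec(level=5, block_type="residual",   inputs=(1, 3)),
]

# The search permutes the ordering of blocks, rewires `inputs` across scales, and
# (optionally) adjusts `level` and `block_type`, keeping compute roughly constant.
```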

Taking the ResNet-50 backbone as the seed for the NAS search, we first learn scale-permutation and cross-scale connections. All candidate models in the search space have roughly the same computation as ResNet-50 since we just permute the ordering of feature blocks to obtain candidate models. The learned scale-permuted model outperforms ResNet-50-FPN by +2.9% average precision (AP) in the object detection task. The efficiency can be further improved (-10% FLOPs) by adding search options to adjust scale and type (e.g., residual block or bottleneck block, used in the ResNet model family) of each candidate feature block.

We name the learned 49-layer scale-permuted backbone architecture SpineNet-49. SpineNet-49 can be further scaled up to SpineNet-96/143/190 by repeating blocks two, three, or four times and increasing the feature dimension. An architecture comparison between ResNet-50-FPN and the final SpineNet-49 is shown below.

The architecture comparison between a ResNet backbone (left) and the SpineNet backbone (right) derived from it using NAS.

Performance
We demonstrate the performance of SpineNet models through comparison with ResNet-FPN. Using similar building blocks, SpineNet models outperform their ResNet-FPN counterparts by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, our largest model, SpineNet-190, achieves 52.1% AP on COCO for a single model without multi-scale testing during inference, significantly outperforming prior detectors. SpineNet also transfers to classification tasks, achieving 5% top-1 accuracy improvement on the challenging iNaturalist fine-grained dataset.

Performance comparisons of SpineNet models and ResNet-FPN models adopting the RetinaNet detection framework on COCO bounding box detection.
Performance comparisons of SpineNet models and ResNet models on ImageNet classification and iNaturalist fine-grained image classification.

Conclusion
In this work, we identify that the conventional scale-decreased model, even with a decoder network, is not effective for simultaneous recognition and localization. We propose the scale-permuted model, a new meta-architecture, to address the issue. To prove the effectiveness of scale-permuted models, we learn SpineNet by Neural Architecture Search in object detection and demonstrate it can be used directly in image classification. In the future, we hope the scale-permuted model will become the meta-architecture design of backbones across many visual tasks beyond detection and classification.

Acknowledgements
Special thanks to the co-authors of the paper: Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. We also would like to acknowledge Yeqing Li, Youlong Cheng, Jing Li, Jianwei Xie, Russell Power, Hongkun Yu, Chad Richards, Liang-Chieh Chen, Anelia Angelova, and the larger Google Brain Team for their help.

Leveraging Temporal Context for Object Detection

Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research

Ecological monitoring helps researchers to understand the dynamics of global ecosystems, quantify biodiversity, and measure the effects of climate change and human activity, including the efficacy of conservation and remediation efforts. In order to monitor effectively, ecologists need high-quality data, often expending significant efforts to place monitoring sensors, such as static cameras, in the field. While it is increasingly cost effective to build and operate networks of such sensors, the manual data analysis of global biodiversity data remains a bottleneck to accurate, global, real-time ecological monitoring. While there are ways to automate this analysis via machine learning, the data from static cameras, widely used to monitor the world around us for purposes ranging from mountain pass road conditions to ecosystem phenology, still pose a strong challenge for traditional computer vision systems — due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger.

In order to perform well in this setting, computer vision models must be robust to objects of interest that are often off-center, out of focus, poorly lit, or at a variety of scales. In addition, a static camera will always take images of the same scene unless it is moved, which causes the data from any one camera to be highly repetitive. Without sufficient data variability, machine learning models may learn to focus on correlations in the background, leading to poor generalization to novel deployments. The machine learning and ecological communities have been working together through venues like LILA BC and Wildlife Insights to curate expert-labeled training data from many research groups, each of which may operate anywhere from one to hundreds of camera traps, in order to increase data variability. This process of data collection and annotation is slow, and is confounded by the need to have diverse, representative data across geographic regions and taxonomic groups.

What’s in this image? Objects in images from static cameras can be very challenging to detect and categorize. Here, a foggy morning has made it very difficult to see a herd of wildebeest walking along the crest of a hill. [Image from Snapshot Serengeti]

In Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection, we present a complementary approach that increases global scalability by improving generalization to novel camera deployments algorithmically. This new object detection architecture leverages contextual clues across time for each camera deployment in a network, improving recognition of objects in novel camera deployments without relying on additional training data from a large number of cameras. Echoing the approach a person might use when faced with challenging images, Context R-CNN leverages up to a month’s worth of images from the same camera for context to determine what objects might be present and identify them. Using this method, the model outperforms a single-frame Faster R-CNN baseline by significant margins across multiple domains, including wildlife camera traps. We have open sourced the code and models for this work as part of the TF Object Detection API to make it easy to train and test Context R-CNN models on new static camera datasets.

Here, we can see how additional examples from the same scene help experts determine that the object is an animal and not background. Context such as the shape & size of the object, its attachment to a herd, and habitual grazing at certain times of day help determine that the species is a wildebeest. Useful examples occur throughout the month.

The Context R-CNN Model
Context R-CNN is designed to take advantage of the high degree of correlation within images taken by a static camera to boost performance on challenging data and improve generalization to new camera deployments without additional human data labeling. It is an adaptation of Faster R-CNN, a popular two-stage object detection architecture. To extract context for a camera, we first use a frozen feature extractor to build up a contextual memory bank from images across a large time horizon (up to a month or more). Next, objects are detected in each image using Context R-CNN which aggregates relevant context from the memory bank to help detect objects under challenging conditions (such as the heavy fog obscuring the wildebeests in our previous example). This aggregation is performed using attention, which is robust to the sparse and irregular sampling rates often seen in static monitoring cameras.

High-level architecture diagram, showing how Context R-CNN incorporates long-term context within the Faster R-CNN model architecture.

The first stage of Faster R-CNN proposes potential objects, and the second stage categorizes each proposal as either background or one of the target classes. In Context R-CNN, we take the proposed objects from the first stage of Faster R-CNN, and for each one we use similarity-based attention to determine how relevant each of the features in our memory bank (M) is to the current object, and construct a per-object context feature by taking a relevance-weighted sum over M and adding it back to the original object features. Then each object, now with added contextual information, is finally categorized using the second stage of Faster R-CNN.
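
A hedged sketch of that attention step, with illustrative shapes and randomly initialized projections, is below; the actual model learns these projections jointly with the detector.

```python
# Relevance-weighted context aggregation over a long-term memory bank M, as a
# NumPy sketch. Shapes and projection matrices are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def add_context(object_features, memory_bank, w_q, w_k, w_v):
    """object_features: [num_boxes, d]; memory_bank: [num_context_features, d]."""
    q = object_features @ w_q                # queries from the current image's proposals
    k = memory_bank @ w_k                    # keys from up to a month of context features
    v = memory_bank @ w_v                    # values to aggregate
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # relevance of each context feature
    context = weights @ v                    # per-box context feature
    return object_features + context         # added back before second-stage classification

d = 256
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(scale=0.01, size=(d, d)) for _ in range(3))
boosted = add_context(rng.normal(size=(8, d)), rng.normal(size=(2000, d)), w_q, w_k, w_v)
```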

Context R-CNN is able to leverage context (spanning up to 1 month) to correctly categorize the challenging wildebeest example we saw above. The green values are the corresponding attention weights for each boxed object.
Compared to a Faster R-CNN baseline (left), Context R-CNN (right) is able to capture challenging objects such as an elephant occluded by a tree, two poorly-lit impala, and a vervet monkey leaving the frame. [Images from Snapshot Serengeti]

Results
We have tested Context R-CNN on Snapshot Serengeti (SS) and Caltech Camera Traps (CCT), both ecological datasets of animal species in camera traps but from highly different geographic regions (Tanzania vs. the Southwestern United States). Improvements over the Faster R-CNN baseline for each dataset can be seen in the table below. Notably, we see a 47.5% relative increase in mean average precision (mAP) on SS, and a 34.3% relative mAP increase on CCT. We also compare Context R-CNN to S3D (a 3D convolution based baseline) and see performance improve from 44.7% mAP to 55.9% mAP (a 25.1% relative increase). Finally, we find that the performance increases as the contextual time horizon increases, from a minute of context to a month.

Comparison to a single frame Faster R-CNN baseline, showing both mean average precision (mAP) and average recall (AR) detection metrics.

Ongoing and Future Work
We are working to implement Context R-CNN within the Wildlife Insights platform, to facilitate large-scale, global ecological monitoring via camera traps. We also host competitions such as the yearly iWildCam species identification competition at the CVPR Fine-Grained Visual Recognition Workshop to help bring these challenges to the attention of the computer vision community. The challenges seen in automatic species identification in static cameras are shared by numerous applications of static cameras outside of the ecological monitoring domain, as well as other static sensors used to monitor biodiversity, such as audio and sonar devices. Our method is general, and we anticipate the per-sensor context approach taken by Context R-CNN would be beneficial for any static sensor.

Acknowledgements
This post reflects the work of the authors as well as the following group of core contributors: Vivek Rathod, Guanhang Wu, Ronny Votel. We are also grateful to Zhichao Lu, David Ross, Tanya Birch and the Wildlife Insights AI team, and Pietro Perona and the Caltech Computational Vision Lab.

Sensing Force-Based Gestures on the Pixel 4

Posted by Philip Quinn and Wenxin Feng, Research Scientists, Android UX

Touch input has traditionally focussed on two-dimensional finger pointing. Beyond tapping and swiping gestures, long pressing has been the main alternative path for interaction. However, a long press is sensed with a time-based threshold where a user’s finger must remain stationary for 400–500 ms. By its nature, a time-based threshold has negative effects for usability and discoverability as the lack of immediate feedback disconnects the user’s action from the system’s response. Fortunately, fingers are dynamic input devices that can express more than just location: when a user touches a surface, their finger can also express some level of force, which can be used as an alternative to a time-based threshold.

While a variety of force-based interactions have been pursued, sensing touch force requires dedicated hardware sensors that are expensive to design and integrate. Further, research indicates that touch force is difficult for people to control, and so most practical force-based interactions focus on discrete levels of force (e.g., a soft vs. firm touch) — which do not require the full capabilities of a hardware force sensor.

For a recent update to the Pixel 4, we developed a method for sensing force gestures that allowed us to deliver a more expressive touch interaction experience. By studying how the human finger interacts with touch sensors, we designed the experience to complement and support the long-press interactions that apps already have, but with a more natural gesture. In this post we describe the core principles of touch sensing and finger interaction, how we designed a machine learning algorithm to recognise press gestures from touch sensor data, and how we integrated it into the user experience for Pixel devices.

Touch Sensor Technology and Finger Biomechanics
A capacitive touch sensor is constructed from two conductive electrodes (a drive electrode and a sense electrode) that are separated by a non-conductive dielectric (e.g., glass). The two electrodes form a tiny capacitor (a cell) that can hold some charge. When a finger (or another conductive object) approaches this cell, it ‘steals’ some of the charge, which can be measured as a drop in capacitance. Importantly, the finger doesn’t have to come into contact with the electrodes (which are protected under another layer of glass) as the amount of charge stolen is inversely proportional to the distance between the finger and the electrodes.

Left: A finger interacts with a touch sensor cell by ‘stealing’ charge from the projected field around two electrodes. Right: A capacitive touch sensor is constructed from rows and columns of electrodes, separated by a dielectric. The electrodes overlap at cells, where capacitance is measured.

The cells are arranged as a matrix over the display of a device, but with a much lower density than the display pixels. For instance, the Pixel 4 has a 2280 × 1080 pixel display, but a 32 × 15 cell touch sensor. When scanned at a high resolution (at least 120 Hz), readings from these cells form a video of the finger’s interaction.

Slowed touch sensor recordings of a user tapping (left), pressing (middle), and scrolling (right).

Capacitive touch sensors don’t respond to changes in force per se, but are tuned to be highly sensitive to changes in distance within a couple of millimeters above the display. That is, a finger contact on the display glass should saturate the sensor near its centre, but will retain a high dynamic range around the perimeter of the finger’s contact (where the finger curls up).

When a user’s finger presses against a surface, its soft tissue deforms and spreads out. The nature of this spread depends on the size and shape of the user’s finger, and its angle to the screen. At a high level, we can observe a couple of key features in this spread (shown in the figures): it is asymmetric around the initial contact point, and the overall centre of mass shifts along the axis of the finger. This is also a dynamic change that occurs over some period of time, which differentiates it from contacts that have a long duration or a large area.

Touch sensor signals are saturated around the centre of the finger’s contact, but fall off at the edges. This allows us to sense small deformations in the finger’s contact shape caused by changes in the finger’s force.
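
As a toy, hedged illustration of the centre-of-mass shift described above (not the production detection logic), one can track the contact’s centroid across successive 32 × 15 capacitance frames:

```python
# Track how the contact's centroid moves over a sequence of touch-sensor frames.
# This heuristic only illustrates the signal; the shipped feature uses the
# learned model described below.
import numpy as np

def contact_centroid(frame):
    """frame: [rows, cols] array of capacitance-drop readings."""
    total = frame.sum() + 1e-8
    rows, cols = np.arange(frame.shape[0]), np.arange(frame.shape[1])
    return np.array([frame.sum(axis=1) @ rows, frame.sum(axis=0) @ cols]) / total

def centroid_shift(frames):
    centroids = np.stack([contact_centroid(f) for f in frames])   # [num_frames, 2]
    return centroids - centroids[0]       # displacement relative to the initial contact
```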

However, the differences between users (and fingers) make it difficult to encode these observations with heuristic rules. We therefore designed a machine learning solution that would allow us to learn these features and their variances directly from user interaction samples.

Machine Learning for Touch Interaction
We approached the analysis of these touch signals as a gesture classification problem. That is, rather than trying to predict an abstract parameter, such as force or contact spread, we wanted to sense a press gesture — as if engaging a button or a switch. This allowed us to connect the classification to a well-defined user experience, and allowed users to perform the gesture during training at a comfortable force and posture.

Any classification model we designed had to operate within users’ high expectations for touch experiences. In particular, touch interaction is extremely latency-sensitive and demands real-time feedback. Users expect applications to be responsive to their finger movements as they make them, and application developers expect the system to deliver timely information about the gestures a user is performing. This means that classification of a press gesture needs to occur in real-time, and be able to trigger an interaction at the moment the finger’s force reaches its apex.

We therefore designed a neural network that combined convolutional (CNN) and recurrent (RNN) components. The CNN could attend to the spatial features we observed in the signal, while the RNN could attend to their temporal development. The RNN also helps provide a consistent runtime experience: each frame is processed by the network as it is received from the touch sensor, and the RNN state vectors are preserved between frames (rather than processing them in batches). The network was intentionally kept simple to minimise on-device inference costs when running concurrently with other applications (taking approximately 50 µs of processing per frame and less than 1 MB of memory using TensorFlow Lite).

An overview of the classification model’s architecture.
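
A hedged Keras sketch of this kind of model is shown below: a small convolutional stack over each 32 × 15 frame feeding a stateful GRU, so each frame can be classified as it arrives while the recurrent state persists between frames. The layer sizes and the two-class output are illustrative, not the production architecture.

```python
# Streaming press-gesture classifier sketch: CNN over each sensor frame, GRU
# state preserved across frames. All sizes are illustrative assumptions.
import tensorflow as tf

frame = tf.keras.Input(shape=(32, 15, 1), batch_size=1)           # one sensor frame per call
x = tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu")(frame)
x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Reshape((1, -1))(x)                           # a length-1 "sequence" per call
x = tf.keras.layers.GRU(32, stateful=True)(x)                     # state carries over between calls
probs = tf.keras.layers.Dense(2, activation="softmax")(x)         # [press, other] in this sketch
press_model = tf.keras.Model(frame, probs)

# Streaming inference: feed frames one at a time as the sensor produces them.
# for sensor_frame in frame_stream:                # sensor_frame: [32, 15] array
#     p_press = press_model(sensor_frame[None, :, :, None])[0, 0]
# press_model.reset_states()                       # when the finger lifts
```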

The model was trained on a dataset of press gestures and other common touch interactions (tapping, scrolling, dragging, and long-pressing without force). As the model would be evaluated after each frame, we designed a loss function that temporally shaped the label probability distribution of each sample, and applied a time-increasing weight to errors. This ensured that the output probabilities were temporally smooth and converged towards the correct gesture classification.
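
One hedged way to realize that loss is a per-frame cross-entropy against temporally shaped soft labels, weighted by a ramp that grows over the course of the gesture; the ramp and the label shaping below are illustrative assumptions.

```python
# Per-frame cross-entropy with a time-increasing weight, so late-frame errors
# cost more and the output probabilities converge toward the final label.
import tensorflow as tf

def temporally_weighted_loss(frame_logits, frame_soft_labels):
    """frame_logits, frame_soft_labels: [num_frames, num_classes]."""
    per_frame = tf.nn.softmax_cross_entropy_with_logits(
        labels=frame_soft_labels, logits=frame_logits)
    num_frames = tf.cast(tf.shape(per_frame)[0], tf.float32)
    weights = 1.0 + tf.range(num_frames, dtype=tf.float32) / num_frames   # ramp from 1 to ~2
    return tf.reduce_sum(weights * per_frame) / tf.reduce_sum(weights)
```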

User Experience Integration
Our UX research found that it was hard for users to discover force-based interactions, and that users frequently confused a force press with a long press because of the difficulty in coordinating the amount of force they were applying with the duration of their contact. Rather than creating a new interaction modality based on force, we therefore focussed on improving the user experience of long press interactions by accelerating them with force in a unified press gesture. A press gesture has the same outcome as a long press gesture, whose time threshold remains effective, but provides a stronger connection between the outcome and the user’s action when force is used.

A user long pressing (left) and firmly pressing (right) on a launcher icon.

This also means that users can take advantage of this gesture without developers needing to update their apps. Applications that use Android’s GestureDetector or View APIs will automatically get these press signals through their existing long-press handlers. Developers that implement custom long-press detection logic can receive these press signals through the MotionEvent classification API introduced in Android Q.

Through this integration of machine-learning algorithms and careful interaction design, we were able to deliver a more expressive touch experience for Pixel users. We plan to continue researching and developing these capabilities to refine the touch experience on Pixel, and explore new forms of touch interaction.

Acknowledgements
This project is a collaborative effort between the Android UX, Pixel software, and Android framework teams.

Presenting a Challenge and Workshop in Efficient Open-Domain Question Answering

Posted by Eunsol Choi, Visiting Faculty Researcher and Tom Kwiatkowski, Research Scientist, Google Research

One of the primary goals of natural language processing is to build systems that can answer a user’s questions. To do this, computers need to be able to understand questions, represent world knowledge, and reason their way to answers. Traditionally, answers have been retrieved from a collection of documents or a knowledge graph. For example, to answer the question, “When was the declaration of independence officially signed?” a system might first find the most relevant article from Wikipedia, and then locate a sentence containing the answer, “August 2, 1776”. However, more recent approaches, like T5, have also shown that neural models, trained on large amounts of web-text, can also answer questions directly, without retrieving documents or facts from a knowledge graph. This has led to significant debate about how knowledge should be stored for use by our question answering systems — in human readable text and structured formats, or in the learned parameters of a neural network.

Today, we are proud to announce the EfficientQA competition and workshop at NeurIPS 2020, organized in cooperation with Princeton University and the University of Washington. The goal is to develop an end-to-end question answering system that contains all of the knowledge required to answer open-domain questions. There are no constraints on how the knowledge is stored — it could be in documents, databases, the parameters of a neural network, or any other form — but entries will be evaluated based on the number of bytes used to access this knowledge, including code, corpora, and model parameters. There will also be an unconstrained track, in which the goal is to achieve the best possible question answering performance regardless of system size. To build small, yet robust systems, participants will have to explore new methods of knowledge representation and reasoning.

An illustration of how the memory budget changes as a neural network and retrieval corpus grow and shrink. It is possible that successful systems will also use other resources such as a knowledge graph.
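
As a rough, unofficial illustration of the two quantities a constrained-track entry has to balance, one could measure a system’s total on-disk footprint and its exact-match accuracy against reference answers as below; this is not the official EfficientQA evaluation code, and the normalization is an assumption.

```python
# Unofficial sketch: bytes used by a submission (code + corpora + parameters)
# and a simple normalized exact-match score against reference answers.
import os
import re
import string

def system_size_bytes(root_dir):
    return sum(os.path.getsize(os.path.join(d, name))
               for d, _, files in os.walk(root_dir) for name in files)

def normalize(text):
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, reference_answers):
    return float(normalize(prediction) in {normalize(a) for a in reference_answers})
```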

Competition Overview
The competition will be evaluated using the open-domain variant of the Natural Questions dataset. We will also provide further human evaluation of all the top performing entries to account for the fact that there are many correct ways to answer a question, not all of which will be covered by any set of reference answers. For example, for the question “What type of car is a Jeep considered?” both “off-road vehicles” and “crossover SUVs” are valid answers.

The competition is divided between four separate tracks: best performing system under 500 Mb; best performing system under 6 Gb; smallest system to get at least 25% accuracy; and the best performing system with no constraints. The winners of each of these tracks will be invited to present their work during the competition track at NeurIPS 2020, which will be hosted virtually. We will also put each of the winning systems up against human trivia experts (the 2017 NeurIPS Human-Computer competition featured Jeopardy! and Who Wants to Be a Millionaire champions) in a real-time contest at the virtual conference.

Participation
To participate, go to the competition site where you will find the data and evaluation code available for download, as well as dates and instructions on how to participate, and a sign-up form for updates. Along with our academic collaborators, we have provided some example systems to help you get started.

We believe that the field of natural language processing will benefit from greater exploration and comparison of approaches to building small question answering systems. We hope that by encouraging the development of very small systems, this competition will pave the way for on-device question answering.

Acknowledgements
Creating this challenge and workshop has been a large team effort including Adam Roberts, Colin Raffel, Chris Alberti, Jordan Boyd-Graber, Jennimaria Palomaki, Kenton Lee, Kelvin Guu, and Michael Collins from Google; as well as Sewon Min and Hannaneh Hajishirzi from the University of Washington; and Danqi Chen from Princeton University.

RepNet: Counting Repetitions in Videos

Posted by Debidatta Dwibedi, Research Scientist, Robotics at Google

Repeating processes, ranging from natural cycles, such as phases of the moon or heartbeats and breathing, to artificial ones, like those found on manufacturing lines or in traffic patterns, are commonplace in our daily lives. Beyond their sheer prevalence, repeating processes interest researchers because of the variety of insights that can be teased out of them. There may be an underlying cause behind something that happens multiple times, or there may be gradual changes in a scene that are useful to understand. Sometimes, repeating processes provide us with unambiguous “action units”, semantically meaningful segments that make up an action. For example, if a person is chopping an onion, the action unit is the manipulation action that is repeated to produce additional slices. These units may be indicative of more complex activity, and may allow us to analyze such actions automatically at a finer time-scale without having a person annotate them. For these reasons, perceptual systems that aim to observe and understand our world for an extended period of time will benefit from a system that understands general repetitions.

In “Counting Out Time: Class Agnostic Video Repetition Counting in the Wild”, we present RepNet, a single model that can understand a broad range of repeating processes, ranging from people exercising or using tools, to animals running and birds flapping their wings, pendulums swinging, and a wide variety of others. In contrast to our previous work, which used cycle-consistency constraints across different videos of the same action to understand them at a fine-grained level, in this work we present a system that can recognize repetitions within a single video. Along with this model, we are releasing a dataset to benchmark class-agnostic counting in videos and a Colab notebook to run RepNet.

RepNet
RepNet is a model that takes as input a video containing periodic action from a variety of classes (including classes unseen during training) and returns the period of the repetitions found therein. In the past, the problem of repetition counting was addressed by directly comparing pixel intensities across frames, but real-world videos contain camera motion, occlusion by objects in the field of view, drastic scale differences, and changes in form, all of which necessitate learning features that are invariant to such noise. To accomplish this, we train a machine learning model end-to-end to directly estimate the period of the repetitions. The model consists of three parts: a frame encoder, an intermediate representation called a temporal self-similarity matrix (described below), and a period predictor.

First, the frame encoder uses the ResNet architecture as a per-frame model to generate embeddings of each frame of the video. The ResNet architecture was chosen because it has been successful for a number of image and video tasks. Passing each frame of a video through a ResNet-based encoder yields a sequence of embeddings.

At this point we calculate a temporal self-similarity matrix (TSM) by comparing each frame’s embedding with every other frame in the video, returning a matrix that is easy for subsequent modules to analyze for counting repetitions. This process surfaces self-similarities in the stream of video frames that enable period estimation, as demonstrated in the video below.

Demonstration of how the TSM processes images of the Earth’s day-night cycle.
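As a rough illustration of this step, the matrix can be built from per-frame embeddings by computing a pairwise similarity for every pair of frames. The sketch below uses negative squared Euclidean distance followed by a row-wise softmax, which is one reasonable choice, though the released model may differ in details such as the similarity function and temperature.

```python
import numpy as np

def temporal_self_similarity(embeddings):
    """Build a TSM by comparing every frame embedding with every other frame.

    embeddings: array of shape [num_frames, dim], one row per frame
                (e.g., produced by a per-frame ResNet encoder).
    Returns a [num_frames, num_frames] similarity matrix.
    """
    # Pairwise negative squared Euclidean distance: similar frames score high.
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    sims = -sq_dists
    # A row-wise softmax keeps each row on a comparable scale for later modules.
    sims = sims - sims.max(axis=1, keepdims=True)
    exp = np.exp(sims)
    return exp / exp.sum(axis=1, keepdims=True)

tsm = temporal_self_similarity(np.random.randn(64, 512))  # 64 frames, 512-dim embeddings
print(tsm.shape)  # (64, 64)
```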

For each frame, we then use Transformers to predict the period of repetition and the periodicity (i.e., whether or not a frame is part of the periodic process) directly from the sequence of similarities in the TSM. Once we have the period, we obtain the per-frame count by dividing the number of frames captured in a periodic segment by the period length. We sum this up to predict the number of repetitions in the video.

Overview of the RepNet model.
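The arithmetic that turns these per-frame predictions into a count is straightforward. The sketch below assumes the per-frame period and periodicity predictions are already available as arrays (hypothetical inputs, not the model's actual output tensors).

```python
import numpy as np

def count_repetitions(period_per_frame, periodicity_per_frame):
    """Convert per-frame period predictions into a repetition count.

    period_per_frame: predicted period length (in frames) for each frame.
    periodicity_per_frame: 1.0 if the frame is part of a periodic segment, else 0.0.
    Each periodic frame contributes 1 / period repetitions; summing over the
    whole video gives the total count.
    """
    per_frame_counts = periodicity_per_frame / np.maximum(period_per_frame, 1.0)
    return float(per_frame_counts.sum())

# A toy 100-frame video: 80 periodic frames with a period of 10 frames ≈ 8 reps.
periods = np.full(100, 10.0)
periodicity = np.concatenate([np.zeros(10), np.ones(80), np.zeros(10)])
print(count_repetitions(periods, periodicity))  # 8.0
```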

Temporal Self-Similarity Matrix
The example of the TSM from the day-night cycle, shown above, is derived from an idealized scenario with fixed period repetitions. TSMs from real videos often reveal fascinating structures in the world, as demonstrated in the three examples below. Jumping jacks are close to the ideal periodic action with a fixed period, while in contrast, the period of a bouncing ball declines as the ball loses energy through repeated bounces. The video of someone mixing concrete demonstrates repetitive action that is preceded and followed by a period without motion. These three behaviors are clearly distinguished in the learned TSM, which requires that the model pay attention to fine changes in the scene.

Jumping Jacks (constant period; video from Kinetics), Bouncing ball (decreasing period; Kinetics), Mixing concrete (aperiodic segments present in video; PERTUBE dataset).

One advantage of using the TSM as an intermediate layer in RepNet is that the subsequent processing by the transformers is done in the self-similarity space and not in the feature space. This encourages generalization to unseen classes. For example, the TSMs produced by actions as different as jumping jacks or swimming are similar as long as the action was repeated at a similar pace. This allows us to train on some classes and yet expect generalization to unseen classes.

Data
One way to train the above model would be to collect a large dataset of videos that capture repetitive activities and label them with the repetition count. The challenge in this is two-fold. First, it requires one to examine a large number of videos to identify those with repeated actions. Following that, each video must be annotated with the number of times an action was repeated. While for certain tasks annotators can skip frames (for example, to classify a video as showing jumping jacks), they still need to see the entire video in order to count how many jumping jacks were performed.

We overcome this challenge by introducing a process for synthetic data generation that produces videos with repetitions using videos that may not contain repeating actions at all. This is accomplished by randomly selecting a segment of the video to repeat an arbitrary number of times, bookended by the original video context.

Our synthetic data generation pipeline that produces videos with repetitions from any video.
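A minimal version of this idea, assuming the source video is already decoded into an array of frames, is sketched below; it omits further refinements the full pipeline may apply, such as varying the playback speed of the repeated segment.

```python
import numpy as np

def make_synthetic_repetition_video(frames, num_reps, rng=None):
    """Build a repeating video from an arbitrary clip.

    frames: array of shape [num_frames, height, width, 3] from any video.
    A randomly chosen segment is repeated num_reps times and bookended by
    the original frames that precede and follow it.
    Returns the new frame sequence and the ground-truth repetition count.
    """
    rng = rng or np.random.default_rng()
    n = len(frames)
    start = int(rng.integers(0, n - 2))
    end = int(rng.integers(start + 1, n - 1))
    segment = frames[start:end]
    video = np.concatenate([frames[:start]] + [segment] * num_reps + [frames[end:]])
    return video, num_reps

video, count = make_synthetic_repetition_video(np.zeros((90, 64, 64, 3)), num_reps=5)
print(video.shape, count)
```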

While this process generates a video that resembles a natural-looking video with repeating processes, it is still too simple for deep learning methods, which can learn to cheat by looking for artifacts, instead of learning to recognize repetitions. To address this, we perform extreme data augmentation, which we call camera motion augmentation. In this method, we modify the video to simulate a camera that smoothly moves around using 2D affine motion as the video progresses.

Left: An example of a synthetic repeating video generated from a random video. Right: An example of a video with camera motion augmentation, which is tougher for the model, but results in better generalization to real repeating videos (both from Kinetics).
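One way to implement such an augmentation is to warp each frame with an affine transform whose translation, rotation, and zoom are interpolated smoothly over the clip. The sketch below uses OpenCV for the warps; the parameter ranges and the exact family of transforms used to train RepNet may differ.

```python
import cv2
import numpy as np

def camera_motion_augmentation(frames, max_shift=20.0, max_angle=5.0, max_zoom=0.1):
    """Simulate a smoothly moving camera with per-frame 2D affine warps."""
    h, w = frames[0].shape[:2]
    rng = np.random.default_rng()
    shift = rng.uniform(-max_shift, max_shift, size=2)   # final x/y translation (pixels)
    angle = rng.uniform(-max_angle, max_angle)           # final rotation (degrees)
    zoom = 1.0 + rng.uniform(-max_zoom, max_zoom)        # final scale factor
    out = []
    for i, frame in enumerate(frames):
        t = i / max(len(frames) - 1, 1)  # interpolate 0 -> 1 over the clip
        M = cv2.getRotationMatrix2D((w / 2, h / 2), t * angle, 1.0 + t * (zoom - 1.0))
        M[:, 2] += t * shift
        out.append(cv2.warpAffine(frame, M, (w, h)))
    return np.stack(out)
```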

Evaluation
Even though we can train a model on synthetic repeating videos, the resulting models must be able to generalize to real videos of repeating processes. In order to evaluate the performance of the trained models on real videos, we collect a dataset of ~9000 videos from the Kinetics dataset. These videos span many action classes and capture diverse scenes, arising from the diversity of data seen on YouTube. We annotate these videos with the count of the action being repeated in the video. To encourage further research in this field, we are releasing the count annotations for this dataset, which we call Countix.

Applications
A class-agnostic counting model has many useful applications. RepNet serves as a single model that can count repetitions from many different domains:

RepNet can count repeated activities from a range of domains, such as slicing onions (left; video from Kinetics dataset), Earth’s diurnal cycle (middle; Himawari satellite data), or even a cheetah in motion (right; video from imgur.com).

RepNet could be used to estimate heartbeat rates from echocardiogram videos even though it has not seen such videos in training:

Predicted heart rates: 45 bpm (left) and 75 bpm (right). True heart rates 46-50 bpm and 78-79 bpm, respectively. RepNet’s prediction of the heart rate across different devices is encouragingly close to the rate measured by the device. (Source for left and right)

RepNet can also be used to monitor repeating activities for any changes in speed, as the examples below show. Such changes in speed can also be useful in other settings for quality or process control.

In this video, we see RepNet counting accelerating cellular oscillations observed under a laser microscope, even though it has never seen such a video during training (from a Nature article).
Left: Person performing a “mountain climber” exercise. Right: The 1D projection of the RepNet embeddings using principal component analysis, capturing the moment that the person changes their speed during the exercise. (Video from Kinetics)

Release
We are releasing Countix annotations for the community to work on the problem of repetition counting. We are also releasing a Colab notebook for running RepNet. With it, you can run RepNet on your own videos, or even use your webcam, to detect periodic activities and count repetitions automatically.

Acknowledgements
This is joint work with Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Special thanks to Tom Small for designing the visual explanation of TSM. The authors thank Anelia Angelova, Relja Arandjelović, Sourish Chaudhuri, Aishwarya Gomatam, Meghana Thotakuri, and Vincent Vanhoucke for their help with this project.

Improving Speech Representations and Personalized Models Using Self-Supervision

Posted by Joel Shor, Software Engineer and Oran Lang, Software Engineer, Google Research, Israel

There are many tasks within speech processing that are easier to solve with large amounts of data. For example, automatic speech recognition (ASR) translates spoken audio into text. In contrast, “non-semantic” tasks focus on the aspects of human speech other than its meaning, encompassing “paralinguistic” tasks, like speech emotion recognition, as well as other kinds of tasks, such as speaker identification, language identification, and certain kinds of voice-based medical diagnoses. In training systems to accomplish these tasks, one common approach is to utilize the largest datasets possible to help ensure good results. However, machine learning techniques that directly rely on massive datasets are often less successful when trained on small datasets.

One way to bridge the performance gap between large and small datasets is to train a representation model on a large dataset, then transfer it to a setting with less data. Representations can improve performance in two ways: they can make it possible to train small models by transforming high-dimensional data (like images and audio) to a lower dimension, and the representation model can also be used as pre-training. In addition, if the representation model is small enough to be run or trained on-device, it can improve performance in a privacy-preserving way by giving users the benefits of a personalized model where the raw data never leaves their device. While representation learning is commonly used in the text domain (e.g. BERT and ALBERT) and in the images domain (e.g. Inception layers and SimCLR), such approaches are underutilized in the speech domain.

Bottom: A large speech dataset is used to train a model, which is then rolled out to other environments. Top Left: On-device personalization — personalized, on-device models combine security and privacy. Top Middle: Small model on embeddings — general-use representations transform high-dimensional, few-example datasets to a lower dimension without sacrificing accuracy; smaller models train faster and are regularized. Top Right: Full model fine-tuning — large datasets can use the embedding model as pre-training to improve performance.

Unambiguously improving generally-useful representations, for non-semantic speech tasks in particular, is difficult without a standard benchmark to compare “speech representation usefulness.” While the T5 framework systematically evaluates text embeddings and the Visual Task Adaptation Benchmark (VTAB) standardizes image embedding evaluation, both leading to progress in representation learning in those respective fields, there has been no such benchmark for non-semantic speech embeddings.

In “Towards Learning a Universal Non-Semantic Representation of Speech”, we make three contributions to representation learning for speech-related applications. First, we present a NOn-Semantic Speech (NOSS) benchmark for comparing speech representations, which includes diverse datasets and benchmark tasks, such as speech emotion recognition, language identification, and speaker identification. These datasets are available in the “audio” section of TensorFlow Datasets. Second, we create and open-source TRIpLet Loss network (TRILL), a new model that is small enough to be executed and fine-tuned on-device, while still outperforming other representations. Third, we perform a large-scale study comparing different representations, and open-source the code used to compute the performance on new representations.

A New Benchmark for Speech Embeddings
For a benchmark to usefully guide model development, it must contain tasks that ought to have similar solutions and exclude those that are significantly different. Previous work either dealt with the variety of possible speech-based tasks independently, or lumped semantic and non-semantic tasks together. Our work improves performance on non-semantic speech tasks, in part, by focusing on neural network architectures that perform well specifically on this subset of speech tasks.

The tasks were selected for the NOSS benchmark on the basis of their 1) diversity — they need to cover a range of use-cases; 2) complexity — they should be challenging; and 3) availability, with particular emphasis on those tasks that are open-source. We combined six datasets of different sizes and tasks.

Datasets for downstream benchmark tasks. *VoxCeleb results in our study were computed using a subset of the dataset that was filtered according to internal policy.

We also introduce three additional intra-speaker tasks to test performance in the personalization scenario. In a dataset with k speakers, we can create k different tasks, each consisting of training and testing on just a single speaker, and then average overall performance across speakers. These three additional intra-speaker tasks measure the ability of an embedding to adapt to a particular speaker, as would be necessary for personalized, on-device models, which are becoming more important as computation moves to smartphones and the Internet of Things.
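In code, this per-speaker evaluation might look like the sketch below, which trains a small classifier (a logistic regression here, purely for illustration) on each speaker's utterance embeddings separately and averages the accuracies; the released evaluation framework defines the exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def intra_speaker_accuracy(embeddings, labels, speaker_ids):
    """Average accuracy over per-speaker tasks: for each of the k speakers,
    train and test only on that speaker's utterances, then average."""
    accuracies = []
    for speaker in np.unique(speaker_ids):
        mask = speaker_ids == speaker
        x_tr, x_te, y_tr, y_te = train_test_split(
            embeddings[mask], labels[mask], test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        accuracies.append(clf.score(x_te, y_te))
    return float(np.mean(accuracies))
```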

To help enable researchers to compare speech embeddings, we’ve added the six datasets in our benchmark to TensorFlow Datasets (in the “audio” section) and open sourced the evaluation framework.

TRILL: A New State of the Art in Non-semantic Speech Classification
Learning an embedding from one dataset and applying it to other tasks is not as common in speech as in other modalities. However, transfer learning, the more general technique of using data from one task to help another (not necessarily with embeddings), has some compelling applications, such as personalizing speech recognizers and voice imitation text-to-speech from few samples. There have been many previously proposed representations of speech, but most of these have been trained on smaller and less diverse data, have been tested primarily on speech recognition, or both.

To create a data-derived representation of speech that is useful across environments and tasks, we started with AudioSet, a large and diverse dataset that includes about 2,500 hours of speech. We then trained an embedding model on a simple, self-supervised criterion derived from previous work on metric learning — embeddings from the same audio should be closer in embedding space than embeddings from different audio. Like BERT and other text embeddings, the self-supervised loss function doesn’t require labels and relies only on the structure of the data itself. This form of self-supervision is the most appropriate for non-semantic speech, since non-semantic phenomena are more stable in time than the sub-second characteristics that ASR and other semantic tasks rely on. This simple, self-supervised criterion captures a large number of acoustic properties that are leveraged in downstream tasks.

TRILL loss: Embeddings from the same audio are closer in embedding space than embeddings from different audio.
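A minimal version of this criterion is a standard triplet loss, sketched below in TensorFlow: the anchor and positive are segments drawn from the same audio clip, the negative comes from a different clip, and the margin value is an assumption chosen here only for illustration.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Push segments from the same clip (anchor, positive) closer together
    than segments from different clips (anchor, negative), by `margin`."""
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```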

The TRILL architecture is based on MobileNet, making it fast enough to run on mobile devices. To achieve high accuracy with this small architecture, we distilled the embedding from a larger, ResNet50-based model without performance degradation.
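One common recipe for this kind of distillation is to regress the student's embeddings onto the teacher's for the same audio input; the sketch below shows that recipe, though the exact objective used to distill TRILL may differ.

```python
import tensorflow as tf

def distillation_loss(student_embedding, teacher_embedding):
    """Train the small MobileNet-based student to reproduce the embeddings
    of the larger ResNet50-based teacher on the same audio input."""
    return tf.reduce_mean(
        tf.square(student_embedding - tf.stop_gradient(teacher_embedding)))
```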

Benchmark Results
We compared the performance of TRILL against other deep learning representations that are not focused on speech recognition and were trained on similarly diverse datasets. In addition, we compared TRILL to the popular OpenSMILE feature extractor, which uses pre-deep-learning techniques (e.g., Fourier transform coefficients, “pitch tracking” using a time-series of pitch measurements, etc.), and to randomly initialized networks, which have been shown to be strong baselines. To aggregate performance across tasks that have different performance characteristics, we first train a small number of simple models for each combination of task and embedding, and choose the best result. Then, to understand the effect that a particular embedding has across all tasks, we calculate a linear regression on the observed accuracies, with both the model and the task as explanatory variables. The effect a model has on accuracy is the coefficient associated with that model in the regression. For a given task, when changing from one model to another, the resulting change in accuracy is expected to be the difference in y-values in the figure below.

Effect of model on accuracy.
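Concretely, this is an ordinary least squares fit with categorical model and task terms, where the coefficient on each model is its estimated effect on accuracy. The sketch below uses toy accuracy numbers purely to illustrate the setup.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Each row: one (embedding model, task) pair and the best accuracy obtained
# by the simple downstream models trained on that combination (toy numbers).
results = pd.DataFrame({
    "model":    ["trill", "trill", "baseline", "baseline"],
    "task":     ["emotion", "speaker_id", "emotion", "speaker_id"],
    "accuracy": [0.80, 0.90, 0.70, 0.85],
})

# Linear regression with model and task as categorical explanatory variables.
fit = smf.ols("accuracy ~ C(model) + C(task)", data=results).fit()
print(fit.params)  # the C(model)[...] coefficient is the model's effect on accuracy
```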

TRILL outperforms the other representations in our study. Factors that contribute to TRILL’s success are the diversity of the training dataset, the large context window of the network, and the generality of the TRILL training loss that broadly preserves acoustic characteristics instead of prematurely focusing on certain aspects. Note that representations from intermediate network layers are often more generally useful. The intermediate representations are larger, have finer temporal granularity, and in the case of the classification networks they retain more general information that isn’t as specific to the classes on which they were trained.

Another benefit of a generally-useful model is that it can be used to initialize a model on a new task. When the sample size of a new task is small, fine-tuning an existing model may lead to better results than training the model from scratch. We achieved a new state-of-the-art result on three out of six benchmark tasks using this technique, despite doing no dataset-specific hyperparameter tuning.

To compare our new representation, we also tested it on the mask sub-challenge of the Interspeech 2020 Computational Paralinguistics Challenge (ComParE). In this challenge, models must predict whether a speaker is wearing a mask, which would affect their speech. The mask effects are sometimes subtle, and audio clips are only one second long. A linear model on TRILL outperformed the best baseline model, which was a fusion of many models on different kinds of features including traditional spectral and deep-learned features.

Summary
The code to evaluate NOSS is available on GitHub, the datasets are on TensorFlow Datasets, and the TRILL models are available on AI Hub.

The NOn-Semantic Speech benchmark helps researchers create speech embeddings that are useful in a wide range of contexts, including for personalization and small-dataset problems. We provide the TRILL model to the research community as a baseline embedding to surpass.

Acknowledgements
The core team behind this work includes Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, and Yinnon Haviv. We’d also like to thank Avinatan Hassidim and Yossi Matias for technical guidance.