View Synthesis with Transformers

View Synthesis with Transformers

A long-standing problem in the intersection of computer vision and computer graphics, view synthesis is the task of creating new views of a scene from multiple pictures of that scene. This has received increased attention [1, 2, 3] since the introduction of neural radiance fields (NeRF). The problem is challenging because to accurately synthesize new views of a scene, a model needs to capture many types of information — its detailed 3D structure, materials, and illumination — from a small set of reference images.

In this post, we present recently published deep learning models for view synthesis. In “Light Field Neural Rendering” (LFNR), presented at CVPR 2022, we address the challenge of accurately reproducing view-dependent effects by using transformers that learn to combine reference pixel colors. Then in “Generalizable Patch-Based Neural Rendering” (GPNR), to be presented at ECCV 2022, we address the challenge of generalizing to unseen scenes by using a sequence of transformers with canonicalized positional encoding that can be trained on a set of scenes to synthesize views of new scenes. These models have some unique features. They perform image-based rendering, combining colors and features from the reference images to render novel views. They are purely transformer-based, operating on sets of image patches, and they leverage a 4D light field representation for positional encoding, which helps to model view-dependent effects.

We train deep learning models that are able to produce new views of a scene given a few images of it. These models are particularly effective when handling view-dependent effects like the refractions and translucency on the test tubes. This animation is compressed; see the original-quality renderings here. Source: Lab scene from the NeX/Shiny dataset.

Overview

The input to the models consists of a set of reference images and their camera parameters (focal length, position, and orientation in space), along with the coordinates of the target ray whose color we want to determine. To produce a new image, we start from the camera parameters of the input images, obtain the coordinates of the target rays (each corresponding to a pixel), and query the model for each.

Instead of processing each reference image completely, we look only at the regions that are likely to influence the target pixel. These regions are determined via epipolar geometry, which maps each target pixel to a line on each reference frame. For robustness, we take small regions around a number of points on the epipolar line, resulting in the set of patches that will actually be processed by the model. The transformers then act on this set of patches to obtain the color of the target pixel.

Transformers are especially useful in this setting since their self-attention mechanism naturally takes sets as inputs, and the attention weights themselves can be used to combine reference view colors and features to predict the output pixel colors. These transformers follow the architecture introduced in ViT.

To predict the color of one pixel, the models take a set of patches extracted around the epipolar line of each reference view. Image source: LLFF dataset.

Light Field Neural Rendering

In Light Field Neural Rendering (LFNR), we use a sequence of two transformers to map the set of patches to the target pixel color. The first transformer aggregates information along each epipolar line, and the second along each reference image. We can interpret the first transformer as finding potential correspondences of the target pixel on each reference frame, and the second as reasoning about occlusion and view-dependent effects, which are common challenges of image-based rendering.

LFNR uses a sequence of two transformers to map a set of patches extracted along epipolar lines to the target pixel color.

LFNR improved the state-of-the-art on the most popular view synthesis benchmarks (Blender and Real Forward-Facing scenes from NeRF and Shiny from NeX) with margins as large as 5dB peak signal-to-noise ratio (PSNR). This corresponds to a reduction of the pixel-wise error by a factor of 1.8x. We show qualitative results on challenging scenes from the Shiny dataset below:

LFNR reproduces challenging view-dependent effects like the rainbow and reflections on the CD, reflections, refractions and translucency on the bottles. This animation is compressed; see the original quality renderings here. Source: CD scene from the NeX/Shiny dataset.
Prior methods such as NeX and NeRF fail to reproduce view-dependent effects like the translucency and refractions in the test tubes on the Lab scene from the NeX/Shiny dataset. See also our video of this scene at the top of the post and the original quality outputs here.

Generalizing to New Scenes

One limitation of LFNR is that the first transformer collapses the information along each epipolar line independently for each reference image. This means that it decides which information to preserve based only on the output ray coordinates and patches from each reference image, which works well when training on a single scene (as most neural rendering methods do), but it does not generalize across scenes. Generalizable methods are important because they can be applied to new scenes without needing to retrain.

We overcome this limitation of LFNR in Generalizable Patch-Based Neural Rendering (GPNR). We add a transformer that runs before the other two and exchanges information between points at the same depth over all reference images. For example, this first transformer looks at the columns of the patches from the park bench shown above and can use cues like the flower that appears at corresponding depths in two views, which indicates a potential match. Another key idea of this work is to canonicalize the positional encoding based on the target ray, because to generalize across scenes, it is necessary to represent quantities in relative and not absolute frames of reference. The animation below shows an overview of the model.

GPNR consists of a sequence of three transformers that map a set of patches extracted along epipolar lines to a pixel color. Image patches are mapped via the linear projection layer to initial features (shown as blue and green boxes). Then those features are successively refined and aggregated by the model, resulting in the final feature/color represented by the gray rectangle. Park bench image source: LLFF dataset.

To evaluate the generalization performance, we train GPNR on a set of scenes and test it on new scenes. GPNR improved the state-of-the-art on several benchmarks (following IBRNet and MVSNeRF protocols) by 0.5–1.0 dB on average. On the IBRNet benchmark, GPNR outperforms the baselines while using only 11% of the training scenes. The results below show new views of unseen scenes rendered with no fine-tuning.

GPNR-generated views of held-out scenes, without any fine tuning. This animation is compressed; see the original quality renderings here. Source: IBRNet collected dataset.
Details of GPNR-generated views on held-out scenes from NeX/Shiny (left) and LLFF (right), without any fine tuning. GPNR reproduces more accurately the details on the leaf and the refractions through the lens when compared against IBRNet.

Future Work

One limitation of most neural rendering methods, including ours, is that they require camera poses for each input image. Poses are not easy to obtain and typically come from offline optimization methods that can be slow, limiting possible applications, such as those on mobile devices. Research on jointly learning view synthesis and input poses is a promising future direction. Another limitation of our models is that they are computationally expensive to train. There is an active line of research on faster transformers which might help improve our models’ efficiency. For the papers, more results, and open-source code, you can check out the projects pages for “Light Field Neural Rendering” and “Generalizable Patch-Based Neural Rendering“.

Potential Misuse

In our research, we aim to accurately reproduce an existing scene using images from that scene, so there is little room to generate fake or non-existing scenes. Our models assume static scenes, so synthesizing moving objects, such as people, will not work.

Acknowledgments

All the hard work was done by our amazing intern – Mohammed Suhail – a PhD student at UBC, in collaboration with Carlos Esteves and Ameesh Makadia from Google Research, and Leonid Sigal from UBC. We are thankful to Corinna Cortes for supporting and encouraging this project.

Our work is inspired by NeRF, which sparked the recent interest in view synthesis, and IBRNet, which first considered generalization to new scenes. Our light ray positional encoding is inspired by the seminal paper Light Field Rendering and our use of transformers follow ViT.

Video results are from scenes from LLFF, Shiny, and IBRNet collected datasets.

Read More

FindIt: Generalized Object Localization with Natural Language Queries

FindIt: Generalized Object Localization with Natural Language Queries

Natural language enables flexible descriptive queries about images. The interaction between text queries and images grounds linguistic meaning in the visual world, facilitating a better understanding of object relationships, human intentions towards objects, and interactions with the environment. The research community has studied object-level visual grounding through a range of tasks, including referring expression comprehension, text-based localization, and more broadly object detection, each of which require different skills in a model. For example, object detection seeks to find all objects from a predefined set of classes, which requires accurate localization and classification, while referring expression comprehension localizes an object from a referring text and often requires complex reasoning on prominent objects. At the intersection of the two is text-based localization, in which a simple category-based text query prompts the model to detect the objects of interest.

Due to their dissimilar task properties, referring expression comprehension, detection, and text-based localization are mostly studied through separate benchmarks with most models only dedicated to one task. As a result, existing models have not adequately synthesized information from the three tasks to achieve a more holistic visual and linguistic understanding. Referring expression comprehension models, for instance, are trained to predict one object per image, and often struggle to localize multiple objects, reject negative queries, or detect novel categories. In addition, detection models are unable to process text inputs, and text-based localization models often struggle to process complex queries that refer to one object instance, such as “Left half sandwich.” Lastly, none of the models can generalize sufficiently well beyond their training data and categories.

To address these limitations, we are presenting “FindIt: Generalized Localization with Natural Language Queries” at ECCV 2022. Here we propose a unified, general-purpose and multitask visual grounding model, called FindIt, that can flexibly answer different types of grounding and detection queries. Key to this architecture is a multi-level cross-modality fusion module that can perform complex reasoning for referring expression comprehension and simultaneously recognize small and challenging objects for text-based localization and detection. In addition, we discover that a standard object detector and detection losses are sufficient and surprisingly effective for all three tasks without the need for task-specific design and losses common in existing works. FindIt is simple, efficient, and outperforms alternative state-of-the-art models on the referring expression comprehension and text-based localization benchmarks, while being competitive on the detection benchmark.

FindIt is a unified model for referring expression comprehension (col. 1), text-based localization (col. 2), and the object detection task (col. 3). FindIt can respond accurately when tested on object types/classes not known during training, e.g. “Find the desk” (col. 4). Compared to existing baselines (MattNet and GPV), FindIt can perform these tasks well and in a single model.

Multi-level Image-Text Fusion
Different localization tasks are created with different semantic understanding objectives. For example, because the referring expression task primarily references prominent objects in the image rather than small, occluded or faraway objects, low resolution images generally suffice. In contrast, the detection task aims to detect objects with various sizes and occlusion levels in higher resolution images. Apart from these benchmarks, the general visual grounding problem is inherently multiscale, as natural queries can refer to objects of any size. This motivates the need for a multi-level image-text fusion model for efficient processing of higher resolution images over different localization tasks.

The premise of FindIt is to fuse the higher level semantic features using more expressive transformer layers, which can capture all-pair interactions between image and text. For the lower-level and higher-resolution features, we use a cheaper dot-product fusion to save computation and memory cost. We attach a detector head (e.g., Faster R-CNN) on top of the fused feature maps to predict the boxes and their classes.

FindIt accepts an image and a query text as inputs, and processes them separately in image/text backbones before applying the multi-level fusion. We feed the fused features to Faster R-CNN to predict the boxes referred to by the text. The feature fusion uses more expressive transformers at higher levels and cheaper dot-product at the lower levels.

Multitask Learning
Apart from the multi-level fusion described above, we adapt the text-based localization and detection tasks to take the same inputs as the referring expression comprehension task. For the text-based localization task, we generate a set of queries over the categories present in the image. For any present category, the text query takes the form “Find the [object],” where [object] is the category name. The objects corresponding to that category are labeled as foreground and the other objects as background. Instead of using the aforementioned prompt, we use a static prompt for the detection task, such as “Find all the objects.”. We found that the specific choice of prompts is not important for text-based localization and detection tasks.

After adaptation, all tasks in consideration share the same inputs and outputs — an image input, a text query, and a set of output bounding boxes and classes. We then combine the datasets and train on the mixture. Finally, we use the standard object detection losses for all tasks, which we found to be surprisingly simple and effective.

Evaluation
We apply FindIt to the popular RefCOCO benchmark for referring expression comprehension tasks. When only the COCO and RefCOCO dataset is available, FindIt outperforms the state-of-the-art-model on all tasks. In the settings where external datasets are allowed, FindIt sets a new state of the art by using COCO and all RefCOCO splits together (no other datasets). On the challenging Google and UMD splits, FindIt outperforms the state of the art by a 10% margin, which, taken together, demonstrate the benefits of multitask learning.

Comparison with the state of the art on the popular referring expression benchmark. FindIt is superior on both the COCO and unconstrained settings (additional training data allowed).

On the text-based localization benchmark, FindIt achieves 79.7%, higher than the GPV (73.0%), and Faster R-CNN baselines (75.2%). Please refer to the paper for more quantitative evaluation.

We further observe that FindIt generalizes better to novel categories and super-categories in the text-based localization task compared to competitive single-task baselines on the popular COCO and Objects365 datasets, shown in the figure below.

FindIt on novel and super categories. Left: FindIt outperforms the single-task baselines especially on the novel categories. Right: FindIt outperforms the single-task baselines on the unseen super categories. “Rec-Single” is the Referring expression comprehension single task model and “Loc-Single” is the text-based localization single task model.

Efficiency
We also benchmark the inference times on the referring expression comprehension task (see Table below). FindIt is efficient and comparable with existing one-stage approaches while achieving higher accuracy. For fair comparison, all running times are measured on one GTX 1080Ti GPU.

Model     Image Size     Backbone     Runtime (ms)
MattNet     1000     R101     378
FAOA     256     DarkNet53     39
MCN     416     DarkNet53     56
TransVG     640     R50     62
FindIt (Ours)     640     R50     107
FindIt (Ours)     384     R50     57

Conclusion
We present Findit, which unifies referring expression comprehension, text-based localization, and object detection tasks. We propose multi-scale cross-attention to unify the diverse localization requirements of these tasks. Without any task-specific design, FindIt surpasses the state of the art on referring expression and text-based localization, shows competitive performance on detection, and generalizes better to out-of-distribution data and novel classes. All of these are accomplished in a single, unified, and efficient model.

Acknowledgements
This work is conducted by Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, and Anelia Angelova. We would like to thank Ashish Vaswani, Prajit Ramachandran, Niki Parmar, David Luan, Tsung-Yi Lin, and other colleagues at Google Research for their advice and helpful discussions. We would like to thank Tom Small for preparing the animation.

Read More

Google at Interspeech 2022

This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.

We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in GatherTown where you can get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below (Google affiliations in bold).

Organizing Committee

Industry Liaisons include: Bhuvana Ramabahdran

Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn

ISCA Fellows

Include: Tara Sainath, Heiga Zen

Publications

Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays

Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

Streaming Intended Query Detection Using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays

Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module
Richard Rose, Olivier Siohan

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

Non-parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno

Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani

Improving Deliberation by Text-Only and Semi-supervised Training
Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

CycleGAN-Based Unpaired Speech Dereverberation
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post)
Joel Shor, Subhashini Venugopalan

Learning Neural Audio Features Without Supervision
Sarthak Yadav, Neil Zeghidour

SpeechPainter: Text-Conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi

SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

Improving Rare Word Recognition with LM-Aware MWER Training
Wang Weiran, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

MAESTRO: Matched Speech Text Representations Through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen

Pseudo Label is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran

Streaming Align-Refine for Non-autoregressive Deliberation
Wang Weiran, Ke Hu, Tara Sainath

Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

4-Bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Reducing Domain Mismatch in Self-Supervised Speech Pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano

On-the-Fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim

A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

XTREME-S: Evaluating Cross-Lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Towards Disentangled Speech Representations
Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O’Malley, Arun Narayanan, Quan Wang

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark

A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno

Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

Workshops, Tutorials & Special Sessions

The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
Organizers include: Arsha Nagrani

Self-Supervised Representation Learning for Speech Processing
Organizers include: Tara Sainath

Learning from Weak Labels
Organizers include: Ankit Shah

RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations
Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey

Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines

Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

Trustworthy Speech Processing
Organizers include: Shrikanth Narayanan



*Work done while at Google.  
Read More

Robust Online Allocation with Dual Mirror Descent

Robust Online Allocation with Dual Mirror Descent

The emergence of digital technologies has transformed decision making across commercial sectors such as airlines, online retailing, and internet advertising. Today, real-time decisions need to be repeatedly made in highly uncertain and rapidly changing environments. Moreover, organizations usually have limited resources, which need to be efficiently allocated across decisions. Such problems are referred to as online allocation problems with resource constraints, and applications abound. Some examples include:

  • Bidding with Budget Constraints: Advertisers increasingly purchase ad slots using auction-based marketplaces such as search engines and ad exchanges. A typical advertiser can participate in a large number of auctions in a given month. Because the supply in these marketplaces is uncertain, advertisers set budgets to control their total spend. Therefore, advertisers need to determine how to optimally place bids while limiting total spend and maximizing conversions.
  • Dynamic Ad Allocation: Publishers can monetize their websites by signing deals with advertisers guaranteeing a number of impressions or by auctioning off slots in the open market. To make this choice, publishers need to trade off, in real-time, the short-term revenue from selling slots in the open market and the long-term benefits of delivering good quality spots to reservation ads.
  • Airline Revenue Management: Planes have a limited number of seats that need to be filled up as much as possible before a flight’s departure. But demand for flights changes over time and airlines would like to sell airline tickets to the customers who are willing to pay the most. Thus, airlines have increasingly adopted sophisticated automated systems to manage the pricing and availability of airline tickets.
  • Personalized Retailing with Limited Inventories: Online retailers can use real-time data to personalize their offerings to customers who visit their store. Because product inventory is limited and cannot be easily replenished, retailers need to dynamically decide which products to offer and at what price to maximize their revenue while satisfying their inventory constraints.

The common feature of these problems is the presence of resource constraints (budgets, contractual obligations, seats, or inventory, respectively in the examples above) and the need to make dynamic decisions in environments with uncertainty. Resource constraints are challenging because they link decisions across time — e.g., in the bidding problem, bidding too high early can leave advertisers with no budget, and thus missed opportunities later. Conversely, bidding too conservatively can result in a low number of conversions or clicks.

Two central resource allocation problems faced by advertisers and publishers in internet advertising markets.

In this post, we discuss state-of-the-art algorithms that can help maximize goals in dynamic, resource-constrained environments. In particular, we have recently developed a new class of algorithms for online allocation problems, called dual mirror descent, that are simple, robust, and flexible. Our papers have appeared in Operations Research, ICML’20, and ICML’21, and we have ongoing work to continue progress in this space. Compared to existing approaches, dual mirror descent is faster as it does not require solving auxiliary optimization problems, is more flexible because it can handle many applications across different sectors with minimal modifications, and is more robust as it enjoys remarkable performance under different environments.

Online Allocation Problems
In an online allocation problem, a decision maker has a limited amount of total resources (B) and receives a certain number of requests over time (T). At any point in time (t), the decision maker receives a reward function (ft) and resource consumption function (bt), and takes an action (xt). The reward and resource consumption functions change over time and the objective is to maximize the total reward within the resource constraints. If all the requests were known in advance, then an optimal allocation could be obtained by solving an offline optimization problem for how to maximize the reward function over time within the resource constraints1.

The optimal offline allocation cannot be implemented in practice because it requires knowing future requests. However, this is still useful for framing the goal of online allocation problems: to design an algorithm whose performance is as close to optimal as possible without knowing future requests.

Achieving the Best of Many Worlds with Dual Mirror Descent
A simple, yet powerful idea to handle resource constraints is introducing “prices” for the resources, which enables accounting for the opportunity cost of consuming resources when making decisions. For example, selling a seat on a plane today means it can’t be sold tomorrow. These prices are useful as an internal accounting system of the algorithm. They serve the purpose of coordinating decisions at different moments in time and allow decomposing a complex problem with resource constraints into simpler subproblems: one per time period with no resource constraints. For example, in a bidding problem, the prices capture an advertiser’s opportunity cost of consuming one unit of budget and allow the advertiser to handle each auction as an independent bidding problem.

This reframes the online allocation problem as a problem of pricing resources to enable optimal decision making. The key innovation of our algorithm is using machine learning to predict optimal prices in an online fashion: we choose prices dynamically using mirror descent, a popular optimization algorithm for training machine learning predictive models. Because prices for resources are referred to as “dual variables” in the field of optimization, we call the resulting algorithm dual mirror descent.

The algorithm works sequentially by assuming uniform resource consumption over time is optimal and updating the dual variables after each action. It starts at a moment in time (t) by taking an action (xt) that maximizes the reward minus the opportunity cost of consuming resources (shown in the top gray box below). The action (e.g., how much to bid or which ad to show) is implemented if there are enough resources available. Then, the algorithm computes the error in the resource consumption (gt), which is the difference between uniform consumption over time and the actual resource consumption (below in the third gray box). A new dual variable for the next time period is computed using mirror descent based on the error, which then informs the next action. Mirror descent seeks to make the error as close as possible to zero, improving the accuracy of its estimate of the dual variable, so that resources are consumed uniformly over time. While the assumption of uniform resource consumption may be surprising, it helps avoid missing good opportunities and often aligns with commercial goals so is effective. Mirror descent also allows a variety of update rules; more details are in the paper.

An overview of the dual mirror descent algorithm.

By design, dual mirror descent has a self-correcting feature that prevents depleting resources too early or waiting too long to consume resources and missing good opportunities. When a request consumes more or less resources than the target, the corresponding dual variable is increased or decreased. When resources are then priced higher or lower, future actions are chosen to consume resources more conservatively or aggressively.

This algorithm is easy to implement, fast, and enjoys remarkable performance under different environments. These are some salient features of our algorithm:

  • Existing methods require periodically solving large auxiliary optimization problems using past data. In contrast, this algorithm does not need to solve any auxiliary optimization problem and has a very simple rule to update the dual variables, which, in many cases, can be run in linear time complexity. Thus, it is appealing for many real-time applications that require fast decisions.
  • There are minimal requirements on the structure of the problem. Such flexibility allows dual mirror descent to handle many applications across different sectors with minimal modifications. Moreover, our algorithms are flexible since they accommodate different objectives, constraints, or regularizers. By incorporating regularizers, decision makers can include important objectives beyond economic efficiency, such as fairness.
  • Existing algorithms for online allocation problems are tailored for either adversarial or stochastic input data. Algorithms for adversarial inputs are robust as they make almost no assumptions on the structure of the data but, in turn, obtain performance guarantees that are too pessimistic in practice. On the other hand, algorithms for stochastic inputs enjoy better performance guarantees by exploiting statistical patterns in the data but can perform poorly when the model is misspecified. Dual mirror descent, however, attains performance close to optimal in both stochastic and adversarial input models while being oblivious to the structure of the input model. Compared to existing work on simultaneous approximation algorithms, our method is more general, applies to a wide range of problems, and requires no forecasts. Below is a comparison of our algorithm to other state-of-the-art methods. Results are based on synthetic data for an ad allocation problem.
Performance of dual mirror descent, a training based method, and an adversarial method relative to the optimal offline solution. Lower values indicate performance closer to the optimal offline allocation. Results are generated using synthetic experiments based on public data for an ad allocation problem.

Conclusion
In this post we introduced dual mirror descent, an algorithm for online allocation problems that is simple, robust, and flexible. It is particularly notable that after a long line of work in online allocation algorithms, dual mirror descent provides a way to analyze a wider range of algorithms with superior robustness priorities compared to previous techniques. Dual mirror descent has a wide range of applications across several commercial sectors and has been used over time at Google to help advertisers capture more value through better algorithmic decision making. We are also exploring further work related to mirror descent and its connections to PI controllers.

Acknowledgements
We would like to thank our co-authors Haihao Lu and Balu Sivan, and Kshipra Bhawalkar for their exceptional support and contributions. We would also like to thank our collaborators in the ad quality team and market algorithm research.


1Formalized in the equation below: 

Read More

PaLI: Scaling Language-Image Learning in 100+ Languages

PaLI: Scaling Language-Image Learning in 100+ Languages

Advanced language models (e.g., GPT, GLaM, PaLM and T5) have demonstrated diverse capabilities and achieved impressive results across tasks and languages by scaling up their number of parameters. Vision-language (VL) models can benefit from similar scaling to address many tasks, such as image captioning, visual question answering (VQA), object recognition, and in-context optical-character-recognition (OCR). Increasing the success rates for these practical tasks is important for everyday interactions and applications. Furthermore, for a truly universal system, vision-language models should be able to operate in many languages, not just one.

In “PaLI: A Jointly-Scaled Multilingual Language-Image Model”, we introduce a unified language-image model trained to perform many tasks and in over 100 languages. These tasks span vision, language, and multimodal image and language applications, such as visual question answering, image captioning, object detection, image classification, OCR, text reasoning, and others. Furthermore, we use a collection of public images that includes automatically collected annotations in 109 languages, which we call the WebLI dataset. The PaLI model pre-trained on WebLI achieves state-of-the-art performance on challenging image and language benchmarks, such as COCO-CaptionsTextCaps, VQAv2, OK-VQA, TextVQA and others. It also outperforms prior models’ multilingual visual captioning and visual question answering benchmarks.

Overview
One goal of this project is to examine how language and vision models interact at scale and specifically the scalability of language-image models. We explore both per-modality scaling and the resulting cross-modal interactions of scaling. We train our largest model to 17 billion (17B) parameters, where the visual component is scaled up to 4B parameters and the language model to 13B. 

The PaLI model architecture is simple, reusable and scalable. It consists of a Transformer encoder that processes the input text, and an auto-regressive Transformer decoder that generates the output text. To process images, the input to the Transformer encoder also includes “visual words” that represent an image processed by a Vision Transformer (ViT). A key component of the PaLI model is reuse, in which we seed the model with weights from previously-trained uni-modal vision and language models, such as mT5-XXL and large ViTs. This reuse not only enables the transfer of capabilities from uni-modal training, but also saves computational cost.

The PaLI model addresses a wide range of tasks in the language-image, language-only and image-only domain using the same API (e.g., visual-question answering, image captioning, scene-text understanding, etc.). The model is trained to support over 100 languages and tuned to perform multilingually for multiple language-image tasks.

Dataset: Language-Image Understanding in 100+ Languages
Scaling studies for deep learning show that larger models require larger datasets to train effectively. To unlock the potential of language-image pretraining, we construct WebLI, a multilingual language-image dataset built from images and text available on the public web.

WebLI scales up the text language from English-only datasets to 109 languages, which enables us to perform downstream tasks in many languages. The data collection process is similar to that employed by other datasets, e.g. ALIGN and LiT, and enabled us to scale the WebLI dataset to 10 billion images and 12 billion alt-texts.

In addition to annotation with web text, we apply the Cloud Vision API to perform OCR on the images, leading to 29 billion image-OCR pairs. We perform near-deduplication of the images against the train, validation and test splits of 68 common vision and vision-language datasets, to avoid leaking data from downstream evaluation tasks, as is standard in the literature. To further improve the data quality, we score image and alt-text pairs based on their cross-modal similarity, and tune the threshold to keep only 10% of the images, for a total of 1 billion images used for training PaLI.

Sampled images from WebLI associated with multilingual alt-text and OCR. The second image is by jopradier (original), used under the CC BY-NC-SA 2.0 license. Remaining images are also used with permission.
Statistics of recognized languages from alt-text and OCR in WebLI.
Image-text pair counts of WebLI and other large-scale vision-language datasets, CLIP, ALIGN and LiT.

Training Large Language-Image Models
Vision-language tasks require different capabilities and sometimes have diverging goals. Some tasks inherently require localization of objects to solve the task accurately, whereas some other tasks might need a more global view. Similarly, different tasks might require either long or compact answers. To address all of these objectives, we leverage the richness of the WebLI pre-training data and introduce a mixture of pre-training tasks, which prepare the model for a variety of downstream applications. To accomplish the goal of solving a wide variety of tasks, we enable knowledge-sharing between multiple image and language tasks by casting all tasks into a single generalized API (input: image + text; output: text), which is also shared with the pretraining setup. The objectives used for pre-training are cast into the same API as a weighted mixture aimed at both maintaining the ability of the reused model components and training the model to perform new tasks (e.g., split-captioning for image description, OCR prediction for scene-text comprehension, VQG and VQA prediction).

The model is trained in JAX with Flax using the open-sourced T5X and Flaxformer framework. For the visual component, we introduce and train a large ViT architecture, named ViT-e, with 4B parameters using the open-sourced BigVision framework. ViT-e follows the same recipe as the ViT-G architecture (which has 2B parameters). For the language component, we concatenate the dense token embeddings with the patch embeddings produced by the visual component, together as the input to the multimodal encoder-decoder, which is initialized from mT5-XXL. During the training of PaLI, the weights of this visual component are frozen, and only the weights of the multimodal encoder-decoder are updated.

Results
We compare PaLI on common vision-language benchmarks that are varied and challenging. The PaLI model achieves state-of-the-art results on these tasks, even outperforming very large models in the literature. For example, it outperforms the Flamingo model, which is several times larger (80B parameters), on several VQA and image-captioning tasks, and it also sustains performance on challenging language-only and vision-only tasks, which were not the main training objective.

PaLI (17B parameters) outperforms the state-of-the-art approaches (including SimVLM, CoCa, GIT2, Flamingo, BEiT3) on multiple vision-and-language tasks. In this plot we show the absolute score differences compared with the previous best model to highlight the relative improvements of PaLI. Comparison is on the official test splits when available. CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.

<!–

PaLI (17B parameters) outperforms the state-of-the-art approaches (including SimVLM, CoCa, GIT2, Flamingo, BEiT3) on multiple vision-and-language tasks. In this plot we show the absolute score differences compared with the previous best model to highlight the relative improvements of PaLI. Comparison is on the official test splits when available. CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.

–>

Model Scaling Results
We examine how the image and language model components interact with each other with regards to model scaling and where the model yields the most gains. We conclude that scaling both components jointly results in the best performance, and specifically, scaling the visual component, which requires relatively few parameters, is most essential. Scaling is also critical for better performance across multilingual tasks.

Scaling both the language and the visual components of the PaLI model contribute to improved performance. The plot shows the score differences compared to the PaLI-3B model: CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.
Multilingual captioning greatly benefits from scaling the PaLI models. We evaluate PaLI on a 35-language benchmark Crossmodal-3600. Here we present the average score over all 35 languages and the individual score for seven diverse languages.

Model Introspection: Model Fairness, Biases, and Other Potential Issues
To avoid creating or reinforcing unfair bias within large language and image models, important first steps are to (1) be transparent about the data that were used and how the model used those data, and (2) test for model fairness and conduct responsible data analyses. To address (1), our paper includes a data card and model card. To address (2), the paper includes results of demographic analyses of the dataset. We consider this a first step and know that it will be important to continue to measure and mitigate potential biases as we apply our model to new tasks, in alignment with our AI Principles.

Conclusion
We presented PaLI, a scalable multi-modal and multilingual model designed for solving a variety of vision-language tasks. We demonstrate improved performance across visual-, language- and vision-language tasks. Our work illustrates the importance of scale in both the visual and language parts of the model and the interplay between the two. We see that accomplishing vision and language tasks, especially in multiple languages, actually requires large scale models and data, and will potentially benefit from further scaling. We hope this work inspires further research in multi-modal and multilingual models.

Acknowledgements
We thank all the authors who conducted this research Soravit (Beer) Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari,Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut. We also thank Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, Jeremiah Harmsen, Zoubin Ghahramani, Erica Moreira, Victor Gomes, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Rich Lee, Austin Tarango, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, and Maysam Moussalem for their suggestions, improvements and support. We thank Tom Small for providing visualizations for the blogpost.

Read More

Our commitment on using AI to accelerate progress on global development goals

Our commitment on using AI to accelerate progress on global development goals

I joined Google earlier this year to lead a new function: Technology & Society. Our aim is to help connect research, people and ideas across Google to shape the future of our technology innovations and their impact on society for the better. A key area of focus is AI, a field I have studied and immersed myself in over the years. I recently met with a team at the Google AI Center in Ghana that is using advanced technology to address an ancient problem: detecting locust outbreaks which threaten food security and livelihoods for millions of people. And in India and Bangladesh, our Crisis Response teams are using our machine-learning-based forecasting to provide over 360 million people with alerts about upcoming floods.

Efforts like these make me optimistic about how AI can contribute to solving societal problems. They also reinforce how high the stakes are for people everywhere, especially as global forces threaten the progress we’ve made on health, prosperity and environmental issues.

AI for the Global Goals

As the United Nations General Assembly begins, the world will come together to discuss issues of global importance, including assessing progress towards the Sustainable Development Goals (SDGs) which provide a roadmap on economic growth, social inclusion and environmental protection. While it’s clear the global community has made significant strides in meeting the 17 interlinked goals since their adoption by 193 countries, challenges persist in every country. Currently, no country is on track to meet all the goals by 2030.

From the launch of the SDGs in 2015, Google has believed in their importance and looked for ways to support progress. We know that advanced technology, such as AI, can be a powerful tool in advancing these goals. Research that I co-led before joining Google found AI could contribute to progress on all the SDGs — a finding confirmed by the UN. In 2018 Google launched AI for Social Good, focusing applied research and grantmaking efforts on some of the most intractable issues. But we know more needs to be done.

So today we’re expanding our efforts with AI for the Global Goals, which will bring together research, technology and funding to accelerate progress on the SDGs. This commitment will include $25 million to support NGOs and social enterprises working with AI to accelerate progress towards these goals. Based on what we’ve learned so far, we believe that with the AI capabilities and financial support we will provide, grantees can cut in half the time or cost to achieve their goals. In addition to funding, where appropriate, we’ll provide Google.org Fellowships, where teams of Google employees work alongside organizations for up to six months. Importantly, projects will be open-sourced so other organizations can build on the work. All of Google’s work and contributions will be guided by our Responsible AI Principles.

Since 2018, we’ve been focusing applied research and grantmaking efforts on some of the most intractable issues with over 50 organizations in countries ranging from Japan to Kenya to Brazil. We’ve supported organizations making progress on emissions monitoring, antimicrobial image analysis and mental health for LGBTQ+ youth. Working side-by-side with these organizations has shown us the creative ways a thriving ecosystem of companies, nonprofits and universities can use AI. We think we can use the same model to help countries make progress on the SDGs.

A critical time for global progress

COVID-19, global conflict, and climate change have set us back. Fewer people have the opportunity to move out of poverty, inequitable access to healthcare and education continues, gender inequality persists, and environmental threats pose immediate and long-term risks. We know that AI and other advanced technology can help tackle these setbacks. For example, in a significant development for biology and human health, DeepMind used AI to predict 200 million protein structures. They open-sourced the structures in partnership with EMBL-EBI, giving over 500,000 biologists tools to accelerate work on drug discovery, treatment and therapies — thereby making it possible to tackle many of the world’s neglected diseases.

As someone who has spent the last several decades working at the nexus of technology and societal good, it matters deeply that progress here will benefit communities everywhere. No single organization alone will develop and deploy all the solutions we’ll need; we all need to do our part. We’re looking forward to continuing to partner with experts around the world and learning what we can accomplish together.

Read More

LOLNeRF: Learn from One Look

LOLNeRF: Learn from One Look

An important aspect of human vision is our ability to comprehend 3D shape from the 2D images we observe. Achieving this kind of understanding with computer vision systems has been a fundamental challenge in the field. Many successful approaches rely on multi-view data, where two or more images of the same scene are available from different perspectives, which makes it much easier to infer the 3D shape of objects in the images.

There are, however, many situations where it would be useful to know 3D structure from a single image, but this problem is generally difficult or impossible to solve. For example, it isn’t necessarily possible to tell the difference between an image of an actual beach and an image of a flat poster of the same beach. However it is possible to estimate 3D structure based on what kind of 3D objects occur commonly and what similar structures look like from different perspectives.

In “LOLNeRF: Learn from One Look”, presented at CVPR 2022, we propose a framework that learns to model 3D structure and appearance from collections of single-view images. LOLNeRF learns the typical 3D structure of a class of objects, such as cars, human faces or cats, but only from single views of any one object, never the same object twice. We build our approach by combining Generative Latent Optimization (GLO) and neural radiance fields (NeRF) to achieve state-of-the-art results for novel view synthesis and competitive results for depth estimation.

We learn a 3D object model by reconstructing a large collection of single-view images using a neural network conditioned on latent vectors, z (left). This allows for a 3D model to be lifted from the image, and rendered from novel viewpoints. Holding the camera fixed, we can interpolate or sample novel identities (right).

Combining GLO and NeRF
GLO is a general method that learns to reconstruct a dataset (such as a set of 2D images) by co-learning a neural network (decoder) and table of codes (latents) that is also an input to the decoder. Each of these latent codes re-creates a single element (such as an image) from the dataset. Because the latent codes have fewer dimensions than the data elements themselves, the network is forced to generalize, learning common structure in the data (such as the general shape of dog snouts).

NeRF is a technique that is very good at reconstructing a static 3D object from 2D images. It represents an object with a neural network that outputs color and density for each point in 3D space. Color and density values are accumulated along rays, one ray for each pixel in a 2D image. These are then combined using standard computer graphics volume rendering to compute a final pixel color. Importantly, all these operations are differentiable, allowing for end-to-end supervision. By enforcing that each rendered pixel (of the 3D representation) matches the color of ground truth (2D) pixels, the neural network creates a 3D representation that can be rendered from any viewpoint.

We combine NeRF with GLO by assigning each object a latent code and concatenating it with standard NeRF inputs, giving it the ability to reconstruct multiple objects. Following GLO, we co-optimize these latent codes along with network weights during training to reconstruct the input images. Unlike standard NeRF, which requires multiple views of the same object, we supervise our method with only single views of any one object (but multiple examples of that type of object). Because NeRF is inherently 3D, we can then render the object from arbitrary viewpoints. Combining NeRF with GLO gives it the ability to learn common 3D structure across instances from only single views while still retaining the ability to recreate specific instances of the dataset.

Camera Estimation
In order for NeRF to work, it needs to know the exact camera location, relative to the object, for each image. Unless this was measured when the image was taken, it is generally unknown. Instead, we use the MediaPipe Face Mesh to extract five landmark locations from the images. Each of these 2D predictions correspond to a semantically consistent point on the object (e.g., the tip of the nose or corners of the eyes). We can then derive a set of canonical 3D locations for the semantic points, along with estimates of the camera poses for each image, such that the projection of the canonical points into the images is as consistent as possible with the 2D landmarks.

We train a per-image table of latent codes alongside a NeRF model. Output is subject to per-ray RGB, mask and hardness losses. Cameras are derived from a fit of predicted landmarks to canonical 3D keypoints.
Example MediaPipe landmarks and segmentation masks (images from CelebA).

Hard Surface and Mask Losses
Standard NeRF is effective for accurately reproducing the images, but in our single-view case, it tends to produce images that look blurry when viewed off-axis. To address this, we introduce a novel hard surface loss, which encourages the density to adopt sharp transitions from exterior to interior regions, reducing blurring. This essentially tells the network to create “solid” surfaces, and not semi-transparent ones like clouds.

We also obtained better results by splitting the network into separate foreground and background networks. We supervised this separation with a mask from the MediaPipe Selfie Segmenter and a loss to encourage network specialization. This allows the foreground network to specialize only on the object of interest, and not get “distracted” by the background, increasing its quality.

Results
We surprisingly found that fitting only five key points gave accurate enough camera estimates to train a model for cats, dogs, or human faces. This means that given only a single view of your beloved cats Schnitzel, Widget and friends, you can create a new image from any other angle.

Top: example cat images from AFHQ. Bottom: A synthesis of novel 3D views created by LOLNeRF.

Conclusion
We’ve developed a technique that is effective at discovering 3D structure from single 2D images. We see great potential in LOLNeRF for a variety of applications and are currently investigating potential use-cases.

Interpolation of feline identities from linear interpolation of learned latent codes for different examples in AFHQ.

Code Release
We acknowledge the potential for misuse and importance of acting responsibly. To that end, we will only release the code for reproducibility purposes, but will not release any trained generative models.

Acknowledgements
We would like to thank Andrea Tagliasacchi, Kwang Moo Yi, Viral Carpenter, David Fleet, Danica Matthews, Florian Schroff, Hartwig Adam and Dmitry Lagun for continuous help in building this technology.

Read More

Learning to Walk in the Wild from Terrain Semantics

Learning to Walk in the Wild from Terrain Semantics

An important promise for quadrupedal robots is their potential to operate in complex outdoor environments that are difficult or inaccessible for humans. Whether it’s to find natural resources deep in the mountains, or to search for life signals in heavily-damaged earthquake sites, a robust and versatile quadrupedal robot could be very helpful. To achieve that, a robot needs to perceive the environment, understand its locomotion challenges, and adapt its locomotion skill accordingly. While recent advances in perceptive locomotion have greatly enhanced the capability of quadrupedal robots, most works focus on indoor or urban environments, thus they cannot effectively handle the complexity of off-road terrains. In these environments, the robot needs to understand not only the terrain shape (e.g., slope angle, smoothness), but also its contact properties (e.g., friction, restitution, deformability), which are important for a robot to decide its locomotion skills. As existing perceptive locomotion systems mostly focus on the use of depth cameras or LiDARs, it can be difficult for these systems to estimate such terrain properties accurately.

In “Learning Semantics-Aware Locomotion Skills from Human Demonstrations”, we design a hierarchical learning framework to improve a robot’s ability to traverse complex, off-road environments. Unlike previous approaches that focus on environment geometry, such as terrain shape and obstacle locations, we focus on environment semantics, such as terrain type (grass, mud, etc.) and contact properties, which provide a complementary set of information useful for off-road environments. As the robot walks, the framework decides the locomotion skill, including the speed and gait (i.e., shape and timing of the legs’ movement) of the robot based on the perceived semantics, which allows the robot to walk robustly on a variety of off-road terrains, including rocks, pebbles, deep grass, mud, and more.

Our framework selects skills (gait and speed) of the robot from the camera RGB image. We first compute the speed from terrain semantics, and then select a gait based on the speed.

Overview
The hierarchical framework consists of a high-level skill policy and a low level motor controller. The skill policy selects a locomotion skill based on camera images, and the motor controller converts the selected skill into motor commands. The high-level skill policy is further decomposed into a learned speed policy and a heuristic-based gait selector. To decide a skill, the speed policy first computes the desired forward speed, based on the semantic information from the onboard RGB camera. For energy efficiency and robustness, quadrupedal robots usually select a different gait for each speed, so we designed the gait selector to compute a desired gait based on the forward speed. Lastly, a low-level convex model-predictive controller (MPC) converts the desired locomotion skill into motor torque commands, and executes them on the real hardware. We train the speed policy directly in the real world using imitation learning because it requires fewer training data compared to standard reinforcement learning algorithms.

The framework consists of a high-level skill policy and a low-level motor controller.

Learning Speed Command from Human Demonstrations
As the central component in our pipeline, the speed policy outputs the desired forward speed of the robot based on the RGB image from the onboard camera. Although many robot learning tasks can leverage simulation as a source of lower-cost data collection, we train the speed policy in the real world because accurate simulation of complex and diverse off-road environments is not yet available. As policy learning in the real world is time-consuming and potentially unsafe, we make two key design choices to improve the data efficiency and safety of our system.

The first is learning from human demonstrations. Standard reinforcement learning algorithms typically learn by exploration, where the agent attempts different actions in an environment and builds preferences based on the rewards received. However, such explorations can be potentially unsafe, especially in off-road environments, since any robot failures can damage both the robot hardware and the surrounding environment. To ensure safety, we train the speed policy using imitation learning from human demonstrations. We first ask a human operator to teleoperate the robot on a variety of off-road terrains, where the operator controls the speed and heading of the robot using a remote joystick. Next, we collect the training data by storing (image, forward_speed) pairs. We then train the speed policy using standard supervised learning to predict the human operator’s speed command. As it turns out, the human demonstration is both safe and high-quality, and allows the robot to learn a proper speed choice for different terrains.

The second key design choice is the training method. Deep neural networks, especially those involving high-dimensional visual inputs, typically require lots of data to train. To reduce the amount of real-world training data required, we first pre-train a semantic segmentation model on RUGD (an off-road driving dataset where the images look similar to those captured by the robot’s onboard camera), where the model predicts the semantic class (grass, mud, etc.) for every pixel in the camera image. We then extract a semantic embedding from the model’s intermediate layers and use that as the feature for on-robot training. With the pre-trained semantic embedding, we can train the speed policy effectively using less than 30 minutes of real-world data, which greatly reduces the amount of effort required.

We pre-train a semantic segmentation model and extract a semantic embedding to be fine-tuned on robot data.

Gait Selection and Motor Control
The next component in the pipeline, the gait selector, computes the appropriate gait based on the speed command from the speed policy. The gait of a robot, including its stepping frequency, swing height, and base height, can greatly affect the robot’s ability to traverse different terrains.

Scientific studies have shown that animals switch between different gaits at different speeds, and this result is further validated in quadrupedal robots, so we designed the gait selector to compute a robust gait for each speed. Compared to using a fixed gait across all speeds, we find that the gait selector further enhances the robot’s navigation performance on off-road terrains (more details in the paper).

The last component of the pipeline is a motor controller, which converts the speed and gait commands into motor torques. Similar to previous work, we use separate control strategies for swing and stance legs. By separating the task of skill learning and motor control, the skill policy only needs to output the desired speed, and does not need to learn low-level locomotion controls, which greatly simplifies the learning process.

Experiment Results
We implemented our framework on an A1 quadrupedal robot and tested it on an outdoor trail with multiple terrain types, including grass, gravel, and asphalt, which pose varying degrees of difficulty for the robot. For example, while the robot needs to walk slowly with high foot swings in deep grass to prevent its foot from getting stuck, on asphalt it can walk much faster with lower foot swings for better energy efficiency. Our framework captures such differences and selects an appropriate skill for each terrain type: slow speed (0.5m/s) on deep grass, medium speed (1m/s) on gravel, and high speed (1.4m/s) on asphalt. It completes the 460m-long trail in 9.6 minutes with an average speed of 0.8m/s (i.e., that’s 1.8 miles or 2.9 kilometers per hour). In contrast, non-adaptive policies either cannot complete the trail safely or walk significantly slower (0.5m/s), illustrating the importance of adapting locomotion skills based on the perceived environments.

The framework selects different speeds based on conditions of the trail.

To test generalizability, we also deployed the robot to a number of trails that are not seen during training. The robot traverses through all of them without failure, and adjusts its locomotion skills based on terrain semantics. In general, the skill policy selects a faster skill on rigid and flat terrains and a slower speed on deformable or uneven terrain. At the time of writing, the robot has traversed over 6km of outdoor trails without failure.

With the framework, the robot walks safely on a variety of outdoor terrains not seen during training.

Conclusion
In this work, we present a hierarchical framework to learn semantic-aware locomotion skills for off-road locomotion. Using less than 30 minutes of human demonstration data, the framework learns to adjust the speed and gait of the robot based on the perceived semantics of the environment. The robot can walk safely and efficiently on a wide variety of off-road terrains. One limitation of our framework is that it only adjusts locomotion skills for standard walking and does not support more agile behaviors such as jumping, which can be essential for traversing more difficult terrains with gaps or hurdles. Another limitation is that our framework currently requires manual steering commands to follow a desired path and reach the goal. In future work, we plan to look into a deeper integration of high-level skill policy with the low-level controller for more agile behaviors, and incorporate navigation and path planning into the framework so that the robot can operate fully autonomously in challenging off-road environments.

Acknowledgements
We would like to thank our paper co-authors: Xiangyun Meng, Wenhao Yu, Tingnan Zhang, Jie Tan, and Byron Boots. We would also like to thank the team members of Robotics at Google for discussions and feedback.

Read More

A Multi-Axis Approach for Vision Transformer and MLP Models

A Multi-Axis Approach for Vision Transformer and MLP Models

Convolutional neural networks have been the dominant machine learning architecture for computer vision since the introduction of AlexNet in 2012. Recently, inspired by the evolution of Transformers in natural language processing, attention mechanisms have been prominently incorporated into vision models. These attention methods boost some parts of the input data while minimizing other parts so that the network can focus on small but important parts of the data. The Vision Transformer (ViT) has created a new landscape of model designs for computer vision that is completely free of convolution. ViT regards image patches as a sequence of words, and applies a Transformer encoder on top. When trained on sufficiently large datasets, ViT demonstrates compelling performance on image recognition.

While convolutions and attention are both sufficient for good performance, neither of them are necessary. For example, MLP-Mixer adopts a simple multi-layer perceptron (MLP) to mix image patches across all the spatial locations, resulting in an all-MLP architecture. It is a competitive alternative to existing state-of-the-art vision models in terms of the trade-off between accuracy and computation required for training and inference. However, both ViT and the MLP models struggle to scale to higher input resolution because the computational complexity increases quadratically with respect to the image size.

Today we present a new multi-axis approach that is simple and effective, improves on the original ViT and MLP models, can better adapt to high-resolution, dense prediction tasks, and can naturally adapt to different input sizes with high flexibility and low complexity. Based on this approach, we have built two backbone models for high-level and low-level vision tasks. We describe the first in “MaxViT: Multi-Axis Vision Transformer”, to be presented in ECCV 2022, and show it significantly improves the state of the art for high-level tasks, such as image classification, object detection, segmentation, quality assessment, and generation. The second, presented in “MAXIM: Multi-Axis MLP for Image Processing” at CVPR 2022, is based on a UNet-like architecture and achieves competitive performance on low-level imaging tasks including denoising, deblurring, dehazing, deraining, and low-light enhancement. To facilitate further research on efficient Transformer and MLP models, we have open-sourced the code and models for both MaxViT and MAXIM.

A demo of image deblurring using MAXIM frame by frame.

Overview
Our new approach is based on multi-axis attention, which decomposes the full-size attention (each pixel attends to all the pixels) used in ViT into two sparse forms — local and (sparse) global. As shown in the figure below, the multi-axis attention contains a sequential stack of block attention and grid attention. The block attention works within non-overlapping windows (small patches in intermediate feature maps) to capture local patterns, while the grid attention works on a sparsely sampled uniform grid for long-range (global) interactions. The window sizes of grid and block attentions can be fully controlled as hyperparameters to ensure a linear computational complexity to the input size.

The proposed multi-axis attention conducts blocked local and dilated global attention sequentially followed by a FFN, with only a linear complexity. The pixels in the same colors are attended together.

Such low-complexity attention can significantly improve its wide applicability to many vision tasks, especially for high-resolution visual predictions, demonstrating greater generality than the original attention used in ViT. We build two backbone instantiations out of this multi-axis attention approach – MaxViT and MAXIM, for high-level and low-level tasks, respectively.

MaxViT
In MaxViT, we first build a single MaxViT block (shown below) by concatenating MBConv (proposed by EfficientNet, V2) with the multi-axis attention. This single block can encode local and global visual information regardless of input resolution. We then simply stack repeated blocks composed of attention and convolutions in a hierarchical architecture (similar to ResNet, CoAtNet), yielding our homogenous MaxViT architecture. Notably, MaxViT is distinguished from previous hierarchical approaches as it can “see” globally throughout the entire network, even in earlier, high-resolution stages, demonstrating stronger model capacity on various tasks.

The meta-architecture of MaxViT.

MAXIM
Our second backbone, MAXIM, is a generic UNet-like architecture tailored for low-level image-to-image prediction tasks. MAXIM explores parallel designs of the local and global approaches using the gated multi-layer perceptron (gMLP) network (patching-mixing MLP with a gating mechanism). Another contribution of MAXIM is the cross-gating block that can be used to apply interactions between two different input signals. This block can serve as an efficient alternative to the cross-attention module as it only employs the cheap gated MLP operators to interact with various inputs without relying on the computationally heavy cross-attention. Moreover, all the proposed components including the gated MLP and cross-gating blocks in MAXIM enjoy linear complexity to image size, making it even more efficient when processing high-resolution pictures.

Results
We demonstrate the effectiveness of MaxViT on a broad range of vision tasks. On image classification, MaxViT achieves state-of-the-art results under various settings: with only ImageNet-1K training, MaxViT attains 86.5% top-1 accuracy; with ImageNet-21K (14M images, 21k classes) pre-training, MaxViT achieves 88.7% top-1 accuracy; and with JFT (300M images, 18k classes) pre-training, our largest model MaxViT-XL achieves a high accuracy of 89.5% with 475M parameters.

Performance comparison of MaxViT with state-of-the-art models on ImageNet-1K. Top: Accuracy vs. FLOPs performance scaling with 224×224 image resolution. Bottom: Accuracy vs. parameters scaling curve under ImageNet-1K fine-tuning setting.

For downstream tasks, MaxViT as a backbone delivers favorable performance on a broad spectrum of tasks. For object detection and segmentation on the COCO dataset, the MaxViT backbone achieves 53.4 AP, outperforming other base-level models while requiring only about 60% the computational cost. For image aesthetics assessment, the MaxViT model advances the state-of-the-art MUSIQ model by 3.5% in terms of linear correlation with human opinion scores. The standalone MaxViT building block also demonstrates effective performance on image generation, achieving better FID and IS scores on the ImageNet-1K unconditional generation task with a significantly lower number of parameters than the state-of-the-art model, HiT.

The UNet-like MAXIM backbone, customized for image processing tasks, has also demonstrated state-of-the-art results on 15 out of 20 tested datasets, including denoising, deblurring, deraining, dehazing, and low-light enhancement, while requiring fewer or comparable number of parameters and FLOPs than competitive models. Images restored by MAXIM show more recovered details with less visual artifacts.

Visual results of MAXIM for image deblurring, deraining, and low-light enhancement.

Summary
Recent works in the last two or so years have shown that ConvNets and Vision Transformers can achieve similar performance. Our work presents a unified design that takes advantage of the best of both worlds — efficient convolution and sparse attention — and demonstrates that a model built on top, namely MaxViT, can achieve state-of-the-art performance on a variety of vision tasks. More importantly, MaxViT scales well to very large data sizes. We also show that an alternative multi-axis design using MLP operators, MAXIM, achieves state-of-the-art performance on a broad range of low-level vision tasks.

Even though we present our models in the context of vision tasks, the proposed multi-axis approach can easily extend to language modeling to capture both local and global dependencies in linear time. Motivated by the work here, we expect that it is worthwhile to study other forms of sparse attention in higher-dimensional or multimodal signals such as videos, point clouds, and vision-language models.

We have open-sourced the code and models of MAXIM and MaxViT to facilitate future research on efficient attention and MLP models.

Acknowledgments
We would like to thank our co-authors: Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, and Alan Bovik. We would also like to acknowledge the valuable discussion and support from Xianzhi Du, Long Zhao, Wuyang Chen, Hanxiao Liu, Zihang Dai, Anurag Arnab, Sungjoon Choi, Junjie Ke, Mauricio Delbracio, Irene Zhu, Innfarn Yoo, Huiwen Chang, and Ce Liu.

Read More

Digitizing Smell: Using Molecular Maps to Understand Odor

Digitizing Smell: Using Molecular Maps to Understand Odor

Did you ever try to measure a smell? …Until you can measure their likenesses and differences you can have no science of odor. If you are ambitious to found a new science, measure a smell.
— Alexander Graham Bell, 1914.

How can we measure a smell? Smells are produced by molecules that waft through the air, enter our noses, and bind to sensory receptors. Potentially billions of molecules can produce a smell, so figuring out which ones produce which smells is difficult to catalog or predict. Sensory maps can help us solve this problem. Color vision has the most familiar examples of these maps, from the color wheel we each learn in primary school to more sophisticated variants used to perform color correction in video production. While these maps have existed for centuries, useful maps for smell have been missing, because smell is a harder problem to crack: molecules vary in many more ways than photons do; data collection requires physical proximity between the smeller and smell (we don’t have good smell “cameras” and smell “monitors”); and the human eye only has three sensory receptors for color while the human nose has > 300 for odor. As a result, previous efforts to produce odor maps have failed to gain traction.

In 2019, we developed a graph neural network (GNN) model that began to explore thousands of examples of distinct molecules paired with the smell labels that they evoke, e.g., “beefy”, “floral”, or “minty”, to learn the relationship between a molecule’s structure and the probability that such a molecule would have each smell label. The embedding space of this model contains a representation of each molecule as a fixed-length vector describing that molecule in terms of its odor, much as the RGB value of a visual stimulus describes its color.

Left: An example of a color map (CIE 1931) in which coordinates can be directly translated into values for hue and saturation. Similar colors lie near each other, and specific wavelengths of light (and combinations thereof) can be identified with positions on the map. Right: Odors in the Principal Odor Map operate similarly. Individual molecules correspond to points (grey), and the locations of these points reflect predictions of their odor character.

Today we introduce the “Principal Odor Map” (POM), which identifies the vector representation of each odorous molecule in the model’s embedding space as a single point in a high-dimensional space. The POM has the properties of a sensory map: first, pairs of perceptually similar odors correspond to two nearby points in the POM (by analogy, red is nearer to orange than to green on the color wheel). Second, the POM enables us to predict and discover new odors and the molecules that produce them. In a series of papers, we demonstrate that the map can be used to prospectively predict the odor properties of molecules, understand these properties in terms of fundamental biology, and tackle pressing global health problems. We discuss each of these promising applications of the POM and how we test them below.

Test 1: Challenging the Model with Molecules Never Smelled Before
First, we asked if the underlying model could correctly predict the odors of new molecules that no one had ever smelled before and that were very different from molecules used during model development. This is an important test — many models perform well on data that looks similar to what the model has seen before, but break down when tested on novel cases.

To test this, we collected the largest ever dataset of odor descriptions for novel molecules. Our partners at the Monell Center trained panelists to rate the smell of each of 400 molecules using 55 distinct labels (e.g., “minty”) that were selected to cover the space of possible smells while being neither redundant nor too sparse. Unsurprisingly, we found that different people had different characterizations of the same molecule. This is why sensory research typically uses panels of dozens or hundreds of people and highlights why smell is a hard problem to solve. Rather than see if the model could match any one person, we asked how close it was to the consensus: the average across all of the panelists. We found that the predictions of the model were closer to the consensus than the average panelist was. In other words, the model demonstrated an exceptional ability to predict odor from a molecule’s structure.

Predictions made by two models, our GNN model (orange) and a baseline chemoinformatic random forest (RF) model (blue), compared with the mean ratings given by trained panelists (green) for the molecule 2,3-dihydrobenzofuran-5-carboxaldehyde. Each bar corresponds to one odor character label (with only the top 17 of 55 shown for clarity). The top five are indicated in color; our model correctly identifies four of the top five, with high confidence, vs. only three of five, with low confidence, for the RF model. The correlation (R) to the full set of 55 labels is also higher in our model.

<!–

Predictions made by two models, our GNN model (orange) and a baseline chemoinformatic random forest (RF) model (blue), compared with the mean ratings given by trained panelists (green) for the molecule 2,3-dihydrobenzofuran-5-carboxaldehyde. Each bar corresponds to one odor character label (with only the top 17 of 55 shown for clarity). The top five are indicated in color; our model correctly identifies four of the top five, with high confidence, vs. only three of five, with low confidence, for the RF model. The correlation (R) to the full set of 55 labels is also higher in our model.

–>

Unlike alternative benchmark models (RF and nearest-neighbor models trained on various sets of chemoinformatic features), our GNN model outperforms the median human panelist at predicting the panel mean rating. In other words, our GNN model better reflects the panel consensus than the typical panelist.

<!–

Unlike alternative benchmark models (RF and nearest-neighbor models trained on various sets of chemoinformatic features), our GNN model outperforms the median human panelist at predicting the panel mean rating. In other words, our GNN model better reflects the panel consensus than the typical panelist.

–>

The POM also exhibited state-of-the-art performance on alternative human olfaction tasks like detecting the strength of a smell or the similarity of different smells. Thus, with the POM, it should be possible to predict the odor qualities of any of billions of as-yet-unknown odorous molecules, with broad applications to flavor and fragrance.

Test 2: Linking Odor Quality Back to Fundamental Biology
Because the Principal Odor Map was useful in predicting human odor perception, we asked whether it could also predict odor perception in animals, and the brain activity that underlies it. We found that the map could successfully predict the activity of sensory receptors, neurons, and behavior in most animals that olfactory neuroscientists have studied, including mice and insects.

What common feature of the natural world makes this map applicable to species separated by hundreds of millions of years of evolution? We realized that the common purpose of the ability to smell might be to detect and discriminate between metabolic states, i.e., to sense when something is ripe vs. rotten, nutritious vs. inert, or healthy vs. sick. We gathered data about metabolic reactions in dozens of species across the kingdoms of life and found that the map corresponds closely to metabolism itself. When two molecules are far apart in odor, according to the map, a long series of metabolic reactions is required to convert one to the other; by contrast, similarly smelling molecules are separated by just one or a few reactions. Even long reaction pathways containing many steps trace smooth paths through the map. And molecules that co-occur in the same natural substances (e.g., an orange) are often very tightly clustered on the map. The POM shows that olfaction is linked to our natural world through the structure of metabolism and, perhaps surprisingly, captures fundamental principles of biology.

Left: We aggregated metabolic reactions found in 17 species across 4 kingdoms to construct a metabolic graph. In this illustration, each circle is a distinct metabolite molecule and an arrow indicates that there is a metabolic reaction that converts one molecule to another. Some metabolites have an odor (color) and others do not (gray), and the metabolic distance between two odorous metabolites is the minimum number of reactions necessary to convert one into the other. In the path shown in bold, the distance is 3. Right: Metabolic distance was highly correlated with distance in the POM, an estimate of perceived odor dissimilarity.

<!–

Left: We aggregated metabolic reactions found in 17 species across 4 kingdoms to construct a metabolic graph. In this illustration, each circle is a distinct metabolite molecule and an arrow indicates that there is a metabolic reaction that converts one molecule to another. Some metabolites have an odor (color) and others do not (gray), and the metabolic distance between two odorous metabolites is the minimum number of reactions necessary to convert one into the other. In the path shown in bold, the distance is 3. Right: Metabolic distance was highly correlated with distance in the POM, an estimate of perceived odor dissimilarity.

–>

Test 3: Extending the Model to Tackle a Global Health Challenge
A map of odor that is tightly connected to perception and biology across the animal kingdom opens new doors. Mosquitos and other insect pests are drawn to humans in part by their odor perception. Since the POM can be used to predict animal olfaction generally, we retrained it to tackle one of humanity’s biggest problems, the scourge of diseases transmitted by mosquitoes and ticks, which kill hundreds of thousands of people each year.

For this purpose, we improved our original model with two new sources of data: (1) a long-forgotten set of experiments conducted by the USDA on human volunteers beginning 80 years ago and recently made discoverable by Google Books, which we subsequently made machine-readable; and (2) a new dataset collected by our partners at TropIQ, using their high-throughput laboratory mosquito assay. Both datasets measure how well a given molecule keeps mosquitos away. Together, the resulting model can predict the mosquito repellency of nearly any molecule, enabling a virtual screen over huge swaths of molecular space. We validated this screen experimentally using entirely new molecules and found over a dozen of them with repellency at least as high as DEET, the active ingredient in most insect repellents. Less expensive, longer lasting, and safer repellents can reduce the worldwide incidence of diseases like malaria, potentially saving countless lives.

We digitized USDA mosquito repellency data for thousands of molecules previously scanned by Google Books, and used it to refine the learned representation (the map) at the heart of the model. We added additional layers, specifically to predict repellency in a mosquito feeder assay, and iteratively trained the model to improve assay predictions while running computational screens for candidate repellents.
Many molecules showing mosquito repellency in the laboratory assay also showed repellency when applied to humans. Several showed repellency greater than the most common repellents used today (DEET and picaridin).

The Road Ahead
We discovered that our modeling approach to smell prediction could be used to draw a Principal Odor Map for tackling odor-related problems more generally. This map was the key to measuring smell: it answered a range of questions about novel smells and the molecules that produce them, it connected smells back to their origins in evolution and the natural world, and it is helping us tackle important human-health challenges that affect millions of people. Going forward, we hope that this approach can be used to find new solutions to problems in food and fragrance formulation, environmental quality monitoring, and the detection of human and animal diseases.

Acknowledgements
This work was performed by the ML olfaction research team, including Benjamin Sanchez-Lengeling, Brian K. Lee, Jennifer N. Wei, Wesley W. Qian, and Jake Yasonik (the latter two were partly supported by the Google Student Researcher program) and our external partners including Emily Mayhew and Joel D. Mainland from the Monell Center, and Koen Dechering and Marnix Vlot from TropIQ. The Google Books team brought the USDA dataset online. Richard C. Gerkin was supported by the Google Visiting Faculty Researcher program and is also an Associate Research Professor at Arizona State University.

Read More