Enhancing Backpropagation via Local Loss Optimization

While model design and training data are key ingredients in a deep neural network’s (DNN’s) success, less-often discussed is the specific optimization method used for updating the model parameters (weights). Training DNNs involves minimizing a loss function that measures the discrepancy between the ground truth labels and the model’s predictions. Training is carried out by backpropagation, which adjusts the model weights via gradient descent steps. Gradient descent, in turn, updates the weights by using the gradient (i.e., derivative) of the loss with respect to the weights.

The simplest weight update corresponds to stochastic gradient descent, which, in every step, moves the weights in the direction of the negative gradient (with an appropriate step size, a.k.a. the learning rate). More advanced optimization methods modify the direction of the negative gradient before updating the weights by using information from the past steps and/or the local properties (such as the curvature) of the loss function around the current weights. For instance, a momentum optimizer encourages moving along the average direction of past updates, and the AdaGrad optimizer scales each coordinate based on the past gradients. These optimizers are commonly known as first-order methods since they generally modify the update direction using only information from the first-order derivative (i.e., the gradient). More importantly, the components of the weight parameters are treated independently from each other.
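To make the contrast concrete, here is a minimal NumPy sketch of these three first-order updates; note how each update touches every weight coordinate independently. The step sizes and decay constants are illustrative, not tuned.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: step along the negative gradient.
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: move along a running average of past update directions.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    # AdaGrad: rescale each coordinate by its own gradient history;
    # the coordinates never interact with each other.
    accum = accum + grad ** 2
    return w - lr * grad / (np.sqrt(accum) + eps), accum
```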

More advanced optimization methods, such as Shampoo and K-FAC, capture the correlations between gradients of parameters and have been shown to improve convergence, reducing the number of iterations and improving the quality of the solution. These methods capture information about the local changes of the derivatives of the loss, i.e., changes in gradients. Using this additional information, higher-order optimizers can discover much more efficient update directions for training models by taking into account the correlations between different groups of parameters. On the downside, calculating higher-order update directions is computationally more expensive than first-order updates. The operation uses more memory for storing statistics and involves matrix inversion, thus hindering the applicability of higher-order optimizers in practice.

In “LocoProp: Enhancing BackProp via Local Loss Optimization”, we introduce a new framework for training DNN models. Our new framework, LocoProp, conceives neural networks as a modular composition of layers. Generally, each layer in a neural network applies a linear transformation on its inputs, followed by a non-linear activation function. In the new construction, each layer is allotted its own weight regularizer, output target, and loss function. The loss function of each layer is designed to match the activation function of the layer. Using this formulation, training minimizes the local losses for a given mini-batch of examples, iteratively and in parallel across layers. Our method performs multiple local updates per batch of examples using a first-order optimizer (like RMSProp), which avoids computationally expensive operations such as the matrix inversions required for higher-order optimizers. However, we show that the combined local updates look rather like a higher-order update. Empirically, we show that LocoProp outperforms first-order methods on a deep autoencoder benchmark and performs comparably to higher-order optimizers, such as Shampoo and K-FAC, without the high memory and computation requirements.

Method
Neural networks are generally viewed as composite functions that transform model inputs into output representations, layer by layer. LocoProp adopts this view while decomposing the network into layers. In particular, instead of updating the weights of the layer to minimize the loss function at the output, LocoProp applies pre-defined local loss functions specific to each layer. For a given layer, the loss function is selected to match the activation function, e.g., a tanh loss would be selected for a layer with a tanh activation. Each layerwise loss measures the discrepancy between the layer’s output (for a given mini-batch of examples) and a notion of a target output for that layer. Additionally, a regularizer term ensures that the updated weights do not drift too far from the current values. The combined layerwise loss function (with a local target) plus regularizer is used as the new objective function for each layer.

Similar to backpropagation, LocoProp applies a forward pass to compute the activations. In the backward pass, LocoProp sets per neuron “targets” for each layer. Finally, LocoProp splits model training into independent problems across layers where several local updates can be applied to each layer’s weights in parallel.
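The NumPy sketch below illustrates this training pattern on a small tanh network, using a simple squared layerwise loss for concreteness. The target rule (a small gradient step on each layer's output) and all constants are our illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def forward(weights, x):
    # Forward pass: record each layer's input and (tanh) output.
    inputs, outputs = [], []
    for W in weights:
        inputs.append(x)
        x = np.tanh(x @ W)
        outputs.append(x)
    return inputs, outputs

def locoprop_step(weights, x, y, lr=0.05, inner_steps=5, reg=0.5):
    inputs, outputs = forward(weights, x)
    # Backward pass: set a per-neuron target for each layer's output by
    # taking a small gradient step that lowers the final squared loss.
    delta = outputs[-1] - y                  # d(final loss)/d(last output)
    targets = [None] * len(weights)
    for i in reversed(range(len(weights))):
        targets[i] = outputs[i] - lr * delta
        delta = (delta * (1.0 - outputs[i] ** 2)) @ weights[i].T
    # Local phase: each layer now solves its own small problem; these
    # loops are independent across layers and could run in parallel.
    for i in range(len(weights)):
        W0 = weights[i].copy()               # anchor for the regularizer
        W = W0.copy()
        for _ in range(inner_steps):
            out = np.tanh(inputs[i] @ W)
            g = inputs[i].T @ ((out - targets[i]) * (1.0 - out ** 2))
            W -= lr * (g + reg * (W - W0))   # stay close to the old weights
        weights[i] = W
    return weights
```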

Perhaps the simplest loss function one can think of for a layer is the squared loss. While the squared loss is a valid choice of a loss function, LocoProp takes into account the possible non-linearity of the activation functions of the layers and applies layerwise losses tailored to the activation function of each layer. This enables the model to emphasize regions at the input that are more important for the model prediction while deemphasizing the regions that do not affect the output as much. Below we show examples of tailored losses for the tanh and ReLU activation functions.

Loss functions induced by the (left) tanh and (right) ReLU activation functions. Each loss is more sensitive to the regions affecting the output prediction. For instance, ReLU loss is zero as long as both the prediction (â) and the target (a) are negative. This is because the ReLU function applied to any negative number equals zero.
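One convenient way to illustrate such activation-tailored ("matching") losses is as an integral of the activation function between the target and the prediction, written on pre-activation values. The numerical sketch below is our own illustration, not code from the paper; it reproduces the behavior in the figure, e.g., a ReLU loss of exactly zero when both values are negative.

```python
import numpy as np

def matching_loss(f, pred_pre, target_pre, n=10001):
    """Loss induced by activation f: the integral of f(z) - f(target) for z
    running from the target to the prediction (both pre-activation values).
    f = identity recovers the squared loss up to a factor of 1/2."""
    z = np.linspace(target_pre, pred_pre, n)
    dz = (pred_pre - target_pre) / (n - 1)
    return float(np.sum(f(z) - f(target_pre)) * dz)

relu = lambda z: np.maximum(z, 0.0)

print(matching_loss(np.tanh, 2.0, 0.5))   # tanh loss: positive
print(matching_loss(relu, -1.0, -3.0))    # 0.0: ReLU is zero on negatives
```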

After forming the objective in each layer, LocoProp updates the layer weights by repeatedly applying gradient descent steps on its objective. The update typically uses a first-order optimizer (like RMSProp). However, we show that the overall behavior of the combined updates closely resembles higher-order updates (shown below). Thus, LocoProp provides training performance close to what higher-order optimizers achieve without the high memory or computation needed for higher-order methods, such as matrix inverse operations. We show that LocoProp is a flexible framework that allows the recovery of well-known algorithms and enables the construction of new algorithms via different choices of losses, targets, and regularizers. LocoProp’s layerwise view of neural networks also allows updating the weights in parallel across layers.

Experiments
In our paper, we describe experiments on the deep autoencoder model, which is a commonly used baseline for evaluating the performance of optimization algorithms. We perform extensive tuning on multiple commonly used first-order optimizers, including SGD, SGD with momentum, AdaGrad, RMSProp, and Adam, as well as the higher-order Shampoo and K-FAC optimizers, and compare the results with LocoProp. Our findings indicate that the LocoProp method performs significantly better than first-order optimizers and is comparable to higher-order optimizers, while being significantly faster when run on a single GPU.

Train loss vs. number of epochs (left) and wall-clock time, i.e., the real time that passes during training, (right) for RMSProp, Shampoo, K-FAC, and LocoProp on the deep autoencoder model.

Summary and Future Directions
We introduced a new framework, called LocoProp, for optimizing deep neural networks more efficiently. LocoProp decomposes neural networks into separate layers with their own regularizer, output target, and loss function and applies local updates in parallel to minimize the local objectives. While using first-order updates for the local optimization problems, the combined updates closely resemble higher-order update directions, both theoretically and empirically.

LocoProp provides flexibility to choose the layerwise regularizers, targets, and loss functions. Thus, it allows the development of new update rules based on these choices. Our code for LocoProp is available online on GitHub. We are currently working on scaling up ideas induced by LocoProp to much larger-scale models; stay tuned!

Acknowledgments
We would like to thank our co-author, Manfred K. Warmuth, for his critical contributions and inspiring vision. We would like to thank Sameer Agarwal for discussions looking at this work from a composite functions perspective, Vineet Gupta for discussions and development of Shampoo, Zachary Nado for discussions on K-FAC, Tom Small for developing the animation used in this blog post, and finally, Yonghui Wu and Zoubin Ghahramani for providing us with a nurturing research environment in the Google Brain Team.


Look and Talk: Natural Conversations with Google Assistant

In natural conversations, we don’t say people’s names every time we speak to each other. Instead, we rely on contextual signaling mechanisms to initiate conversations, and eye contact is often all it takes. Google Assistant, now available in more than 95 countries and over 29 languages, has primarily relied on a hotword mechanism (“Hey Google” or “OK Google”) to help more than 700 million people every month get things done across Assistant devices. As virtual assistants become an integral part of our everyday lives, we’re developing ways to initiate conversations more naturally.

At Google I/O 2022, we announced Look and Talk, a major development in our journey to create natural and intuitive ways to interact with Google Assistant-powered home devices. This is the first multimodal, on-device Assistant feature that simultaneously analyzes audio, video, and text to determine when you are speaking to your Nest Hub Max. Using eight machine learning models together, the algorithm can differentiate intentional interactions from passing glances in order to accurately identify a user’s intent to engage with Assistant. Once within 5ft of the device, the user may simply look at the screen and talk to start interacting with the Assistant.

We developed Look and Talk in alignment with our AI Principles. It meets our strict audio and video processing requirements, and like our other camera sensing features, video never leaves the device. You can always stop, review and delete your Assistant activity at myactivity.google.com. These added layers of protection enable Look and Talk to work just for those who turn it on, while keeping your data safe.

Google Assistant relies on a number of signals to accurately determine when the user is speaking to it. On the right is a list of signals used with indicators showing when each signal is triggered based on the user’s proximity to the device and gaze direction.

Modeling Challenges
The journey of this feature began as a technical prototype built on top of models developed for academic research. Deployment at scale, however, required solving real-world challenges unique to this feature. It had to:

  1. Support a range of demographic characteristics (e.g., age, skin tones).
  2. Adapt to the ambient diversity of the real world, including challenging lighting (e.g., backlighting, shadow patterns) and acoustic conditions (e.g., reverberation, background noise).
  3. Deal with unusual camera perspectives, since smart displays are commonly used as countertop devices and look up at the user(s), unlike the frontal faces typically used in research datasets to train models.
  4. Run in real-time to ensure timely responses while processing video on-device.

The evolution of the algorithm involved experiments with approaches ranging from domain adaptation and personalization to domain-specific dataset development, field-testing and feedback, and repeated tuning of the overall algorithm.

Technology Overview
A Look and Talk interaction has three phases. In the first phase, Assistant uses visual signals to detect when a user is demonstrating an intent to engage with it and then “wakes up” to listen to their utterance. The second phase is designed to further validate and understand the user’s intent using visual and acoustic signals. If any signal in the first or second processing phases indicates that it isn’t an Assistant query, Assistant returns to standby mode. These two phases are the core Look and Talk functionality, and are discussed below. The third phase of query fulfillment is typical query flow, and is beyond the scope of this blog.

Phase One: Engaging with Assistant
The first phase of Look and Talk is designed to assess whether an enrolled user is intentionally engaging with Assistant. Look and Talk uses face detection to identify the user’s presence, filters for proximity using the detected face box size to infer distance, and then uses the existing Face Match system to determine whether they are enrolled Look and Talk users.
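A toy sketch of the proximity gate is below; the pinhole-camera mapping from face box size to distance and all constants (including the ~5 ft threshold expressed in meters) are illustrative assumptions, not the production logic.

```python
def within_range(face_box_height_px, ref_face_height_m=0.24,
                 focal_px=1000.0, max_dist_m=1.5):
    """Proximity filter: infer a rough user distance from the detected face
    box height via a pinhole-camera approximation, then gate on a threshold
    of roughly 5 ft (~1.5 m). All constants here are illustrative."""
    if face_box_height_px <= 0:
        return False
    distance_m = focal_px * ref_face_height_m / face_box_height_px
    return distance_m <= max_dist_m
```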

For an enrolled user within range, a custom eye gaze model determines whether they are looking at the device. This model estimates both the gaze angle and a binary gaze-on-camera confidence from image frames using a multi-tower convolutional neural network architecture, with one tower processing the whole face and another processing patches around the eyes. Since the device screen covers a region underneath the camera that would be natural for a user to look at, we map the gaze angle and binary gaze-on-camera prediction to the device screen area. To ensure that the final prediction is resilient to spurious individual predictions as well as involuntary eye blinks and saccades, we apply a smoothing function to the individual frame-based predictions.

Eye-gaze prediction and post-processing overview.
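The post doesn't specify the exact smoothing function, so the sketch below uses a simple exponential moving average with hysteresis as one plausible stand-in; the two thresholds also anticipate the stricter-then-relaxed attention requirements described next. All constants are assumptions.

```python
class GazeSmoother:
    """Smooth per-frame gaze-on-screen confidences so that single spurious
    predictions, blinks, and saccades do not flip the decision."""

    def __init__(self, alpha=0.3, on_thresh=0.7, off_thresh=0.4):
        self.alpha = alpha
        self.on_thresh = on_thresh    # stricter bar to start an interaction
        self.off_thresh = off_thresh  # relaxed bar once the user is speaking
        self.state = 0.0
        self.engaged = False

    def update(self, frame_confidence):
        # Exponential moving average over the per-frame model outputs.
        self.state = (1 - self.alpha) * self.state + self.alpha * frame_confidence
        # Hysteresis: strong evidence to engage, weaker evidence to stay engaged.
        threshold = self.off_thresh if self.engaged else self.on_thresh
        self.engaged = self.state >= threshold
        return self.engaged
```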

To minimize false triggers (e.g., when a passing user briefly glances at the device), we enforce stricter attention requirements before informing users that the system is ready for interaction. Once the user looking at the device starts speaking, we relax the attention requirement, allowing the user to naturally shift their gaze.

The final signal necessary in this processing phase checks that the Face Matched user is the active speaker. This is provided by a multimodal active speaker detection model that takes as input both video of the user’s face and the audio containing speech, and predicts whether they are speaking. A number of augmentation techniques (including RandAugment, SpecAugment, and augmenting with AudioSet sounds) help improve prediction quality for the in-home domain, boosting end-feature performance by over 10%. The final deployed model is a quantized, hardware-accelerated TFLite model, which uses five frames of context for the visual input and 0.5 seconds for the audio input.

Active speaker detection model overview: The two-tower audiovisual model provides the “speaking” probability prediction for the face. The visual network auxiliary prediction pushes the visual network to be as good as possible on its own, improving the final multimodal prediction.
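As a rough PyTorch sketch of this two-tower shape, with made-up layer sizes (this is not the deployed, quantized TFLite model): the auxiliary visual-only head mirrors the figure above.

```python
import torch
import torch.nn as nn

class ActiveSpeakerNet(nn.Module):
    """Toy two-tower audiovisual active speaker detector: 5 face frames plus
    ~0.5 s of audio, fused into a single "speaking" probability."""

    def __init__(self, d=64):
        super().__init__()
        # Visual tower: a stack of 5 face crops -> embedding.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, d))
        # Audio tower: log-mel spectrogram of the speech -> embedding.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, d))
        self.fused_head = nn.Linear(2 * d, 1)   # main multimodal prediction
        self.visual_head = nn.Linear(d, 1)      # auxiliary visual-only head

    def forward(self, frames, spectrogram):
        v = self.visual(frames)        # frames: (B, 3, 5, H, W)
        a = self.audio(spectrogram)    # spectrogram: (B, 1, T, mel_bins)
        fused = torch.sigmoid(self.fused_head(torch.cat([v, a], dim=-1)))
        aux = torch.sigmoid(self.visual_head(v))
        return fused, aux
```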

Phase Two: Assistant Starts Listening
In phase two, the system starts listening to the content of the user’s query, still entirely on-device, to further assess whether the interaction is intended for Assistant using additional signals. First, Look and Talk uses Voice Match to further ensure that the speaker is enrolled and matches the earlier Face Match signal. Then, it runs a state-of-the-art automatic speech recognition model on-device to transcribe the utterance.

The next critical processing step is the intent understanding algorithm, which predicts whether the user’s utterance was intended to be an Assistant query. This has two parts: 1) a model that analyzes the non-lexical information in the audio (e.g., pitch, speed, hesitation sounds) to determine whether the utterance sounds like an Assistant query, and 2) a text analysis model that determines whether the transcript is an Assistant request. Together, these filter out queries not intended for Assistant. The system also uses contextual visual signals to determine the likelihood that the interaction was intended for Assistant.

Overview of the semantic filtering approach to determine if a user utterance is a query intended for the Assistant.
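A minimal sketch of how these signals might be fused is below; the fusion rule, weights, and thresholds are illustrative assumptions, not the production algorithm.

```python
def is_assistant_query(acoustic_score, text_score, visual_context_score,
                       acoustic_thresh=0.5, text_thresh=0.5, visual_weight=0.2):
    """Toy fusion of the three intent signals: a non-lexical audio score,
    a transcript-based score, and contextual visual evidence."""
    if acoustic_score < acoustic_thresh:   # doesn't *sound* like a query
        return False
    combined = text_score + visual_weight * visual_context_score
    return combined >= text_thresh          # transcript (+ context) agrees
```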

Finally, when the intent understanding model determines that the user utterance was likely meant for Assistant, Look and Talk moves into the fulfillment phase where it communicates with the Assistant server to obtain a response to the user’s intent and query text.

Performance, Personalization and UX
Each model that supports Look and Talk was evaluated and improved in isolation and then tested in the end-to-end Look and Talk system. The huge variety of ambient conditions in which Look and Talk operates necessitates the introduction of personalization parameters for algorithm robustness. By using signals obtained during the user’s hotword-based interactions, the system personalizes parameters to individual users to deliver improvements over the generalized global model. This personalization also runs entirely on-device.

Without a predefined hotword as a proxy for user intent, latency was a significant concern for Look and Talk. Often, a strong enough interaction signal does not occur until well after the user has started speaking, which can add hundreds of milliseconds of latency, and existing models for intent understanding add to this since they require complete, not partial, queries. To bridge this gap, Look and Talk completely forgoes streaming audio to the server, running transcription and intent understanding on-device. The intent understanding models can work off of partial utterances. This results in an end-to-end latency comparable with current hotword-based systems.

The UI experience is based on user research to provide well-balanced visual feedback with high learnability. This is illustrated in the figure below.

Left: The spatial interaction diagram of a user engaging with Look and Talk. Right: The User Interface (UI) experience.

We developed a diverse video dataset with over 3,000 participants to test the feature across demographic subgroups. Modeling improvements driven by diversity in our training data improved performance for all subgroups.

Conclusion
Look and Talk represents a significant step toward making user engagement with Google Assistant as natural as possible. While this is a key milestone in our journey, we hope this will be the first of many improvements to our interaction paradigms that will continue to reimagine the Google Assistant experience responsibly. Our goal is to make getting help feel natural and easy, ultimately saving time so users can focus on what matters most.

Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, UX, and cross-functional contributors. Key contributors from Google Assistant include Alexey Galata, Alice Chuang‎, Barbara Wang, Britanie Hall, Gabriel Leblanc, Gloria McGee, Hideaki Matsui, James Zanoni, Joanna (Qiong) Huang, Krunal Shah, Kavitha Kandappan, Pedro Silva, Tanya Sinha, Tuan Nguyen, Vishal Desai, Will Truong‎, Yixing Cai‎, Yunfan Ye; from Research including Hao Wu, Joseph Roth, Sagar Savla, Sourish Chaudhuri, Susanna Ricco. Thanks to Yuan Yuan and Caroline Pantofaru for their leadership, and everyone on the Nest, Assistant, and Research teams who provided invaluable input toward the development of Look and Talk.


ML-Enhanced Code Completion Improves Developer Productivity

The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs). Conventionally, code completion suggestions are implemented with rule-based semantic engines (SEs), which typically have access to the full repository and understand its semantic structure. Recent research has demonstrated that large language models (e.g., Codex and PaLM) enable longer and more complex code suggestions, and as a result, useful products have emerged (e.g., Copilot). However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.

Today we describe how we combined ML and SE to develop a novel Transformer-based hybrid semantic ML code completion, now available to internal Google developers. We discuss how ML and SEs can be combined by (1) re-ranking SE single token suggestions using ML, (2) applying single and multi-line completions using ML and checking for correctness with the SE, or (3) extending single token semantic suggestions with single and multi-line continuations from ML. We compare the hybrid semantic ML code completion used by 10k+ Googlers (over three months across eight programming languages) with a control group and see a 6% reduction in coding iteration time (time between builds and tests) and a 7% reduction in context switches (i.e., leaving the IDE) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is generated from accepting ML completion suggestions.

Transformers for Completion
A common approach to code completion is to train transformer models, which use a self-attention mechanism for language understanding, to enable code understanding and completion predictions. We treat code similar to language, represented with sub-word tokens and a SentencePiece vocabulary, and use encoder-decoder transformer models running on TPUs to make completion predictions. The input is the code that is surrounding the cursor (~1000-2000 tokens) and the output is a set of suggestions to complete the current or multiple lines. Sequences are generated with a beam search (or tree exploration) on the decoder.

During training on Google’s monorepo, we mask out the remainder of a line and some follow-up lines to mimic code that is being actively developed. We train a single model on eight languages (C++, Java, Python, Go, TypeScript, Proto, Kotlin, and Dart) and observe improved or equal performance across all languages, removing the need for dedicated models. Moreover, we find that a model size of ~0.5B parameters gives a good tradeoff for high prediction accuracy with low latency and resource cost. The model strongly benefits from the quality of the monorepo, which is enforced by guidelines and reviews. For multi-line suggestions, we iteratively apply the single-line model with learned thresholds for deciding whether to start predicting completions for the following line.

Encoder-decoder transformer models are used to predict the remainder of the line or lines of code.
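Here is a simplified sketch of how such a masked training example could be constructed; it works on raw text lines and omits SentencePiece sub-word tokenization and the ~1000-2000-token context window.

```python
import random

def make_training_example(lines, cursor_line, cursor_col, max_followup=3):
    """Build one (input, target) pair by hiding the remainder of the cursor
    line plus a few follow-up lines, mimicking code under active development.
    A simplified sketch of the masking idea, not the actual pipeline."""
    n_followup = random.randint(0, max_followup)
    prefix = lines[:cursor_line] + [lines[cursor_line][:cursor_col]]
    masked = ([lines[cursor_line][cursor_col:]] +
              lines[cursor_line + 1: cursor_line + 1 + n_followup])
    context = "\n".join(prefix)   # encoder input: code before the cursor
    target = "\n".join(masked)    # decoder target: the hidden completion
    return context, target
```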

Re-rank Single Token Suggestions with ML
While a user is typing in the IDE, code completions are interactively requested from the ML model and the SE simultaneously in the backend. The SE typically only predicts a single token. The ML models we use predict multiple tokens until the end of the line, but we only consider the first token to match predictions from the SE. We identify the top three ML suggestions that are also contained in the SE suggestions and boost their rank to the top. The re-ranked results are then shown as suggestions for the user in the IDE.
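A minimal sketch of this boosting rule is below; whitespace splitting stands in for the real sub-word tokenizer, so treat it as an illustration only.

```python
def rerank(se_suggestions, ml_suggestions):
    """Boost up to three ML-favored suggestions to the top of the semantic
    engine's list. Only the first token of each (multi-token) ML suggestion
    is matched against the single-token SE suggestions."""
    ml_first_tokens = [s.split()[0] for s in ml_suggestions if s.split()]
    boosted, seen = [], set()
    for tok in ml_first_tokens:
        if tok in se_suggestions and tok not in seen:
            boosted.append(tok)
            seen.add(tok)
        if len(boosted) == 3:
            break
    rest = [s for s in se_suggestions if s not in seen]
    return boosted + rest
```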

In practice, our SEs are running in the cloud, providing language services (e.g., semantic completion, diagnostics, etc.) with which developers are familiar, and so we collocated the SEs with the TPUs that perform ML inference. The SEs are based on an internal library that offers compiler-like features with low latencies. Due to the design setup, where requests are done in parallel and ML is typically faster to serve (~40 ms median), we do not add any latency to completions. We observe a significant quality improvement in real usage. For 28% of accepted completions, the rank of the completion is higher due to boosting, and in 0.4% of cases it is worse. Additionally, we find that users type >10% fewer characters before accepting a completion suggestion.

Check Single / Multi-line ML Completions for Semantic Correctness
At inference time, ML models are typically unaware of code outside of their input window, and code seen during training might miss recent additions needed for completions in actively changing repositories. This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct, but doesn’t compile. Based on internal user experience research, this issue can lead to the erosion of user trust over time while reducing productivity gains.

We use SEs to perform fast semantic correctness checks within a given latency budget (<100ms for end-to-end completion) and use cached abstract syntax trees to enable a “full” structural understanding. Typical semantic checks include reference resolution (i.e., does this object exist), method invocation checks (e.g., confirming the method was called with a correct number of parameters), and assignability checks (to confirm the type is as expected).
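To make the idea concrete, the toy sketch below uses Python's ast module in place of the internal compiler-like library; the symbol-table format (names mapped to an expected arity, or None for non-callables) is an assumption for illustration.

```python
import ast

def quick_semantic_check(suggestion, symbol_table):
    """Toy post-completion semantic checks: verify the suggestion parses,
    that referenced names resolve, and that calls pass the expected number
    of arguments."""
    try:
        tree = ast.parse(suggestion)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in symbol_table:
            return False                      # reference resolution failed
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            arity = symbol_table.get(node.func.id)
            if arity is not None and len(node.args) != arity:
                return False                  # wrong number of parameters
    return True

# e.g., quick_semantic_check("fetch(url)", {"fetch": 1, "url": None}) -> True
```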

For example, for the coding language Go, ~8% of suggestions contain compilation errors before semantic checks. However, the application of semantic checks filtered out 80% of uncompilable suggestions. The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.

Language servers with access to source code and the ML backend are collocated on the cloud. They both perform semantic checking of ML completion suggestions.

Results
With 10k+ Google-internal developers using the completion setup in their IDE, we measured a user acceptance rate of 25-34%. We determined that the transformer-based hybrid semantic ML code completion completes >3% of code, while reducing the coding iteration time for Googlers by 6% (at a 90% confidence level). The size of the shift corresponds to typical effects observed for transformational features (e.g., a key framework), which usually affect only a subpopulation, whereas ML has the potential to generalize across most major languages and engineers.

Key metrics for single-line code completion measured in production for 10k+ Google-internal developers using it in their daily development across eight languages:
  Fraction of all code added by ML: 2.6%
  Reduction in coding iteration duration: 6%
  Reduction in number of context switches: 7%
  Acceptance rate (for suggestions visible for >750 ms): 25%
  Average characters per accept: 21

Key metrics for multi-line code completion measured in production for 5k+ Google-internal developers using it in their daily development across eight languages:
  Fraction of all code added by ML (with >1 line in suggestion): 0.6%
  Average characters per accept: 73
  Acceptance rate (for suggestions visible for >750 ms): 34%

Providing Long Completions while Exploring APIs
We also tightly integrated the semantic completion with full line completion. When the dropdown with semantic single token completions appears, we display inline the single-line completions returned from the ML model. The latter represent a continuation of the item that is the focus of the dropdown. For example, if a user looks at possible methods of an API, the inline full line completions show the full method invocation also containing all parameters of the invocation.

Integrated full line completions by ML continuing the semantic dropdown completion that is in focus.
Suggestions of multiple line completions by ML.

Conclusion and Future Work
We demonstrate how the combination of rule-based semantic engines and large language models can be used to significantly improve developer productivity with better code completion. As a next step, we want to utilize SEs further by providing extra information to ML models at inference time. One example is to have long predictions go back and forth between the ML model and the SE, where the SE iteratively checks correctness and offers all possible continuations to the ML model. When adding new features powered by ML, we want to be mindful of going beyond just “smart” results and of ensuring a positive impact on productivity.

Acknowledgements
This research is the outcome of a two-year collaboration between Google Core and Google Research, Brain Team. Special thanks to Marc Rasi, Yurun Shen, Vlad Pchelin, Charles Sutton, Varun Godbole, Jacob Austin, Danny Tarlow, Benjamin Lee, Satish Chandra, Ksenia Korovina, Stanislav Pyatykh, Cristopher Claeys, Petros Maniatis, Evgeny Gryaznov, Pavel Sychev, Chris Gorgolewski, Kristof Molnar, Alberto Elizondo, Ambar Murillo, Dominik Schulz, David Tattersall, Rishabh Singh, Manzil Zaheer, Ted Ying, Juanjo Carin, Alexander Froemmgen and Marcus Revaj for their contributions.


Training Generalist Agents with Multi-Game Decision Transformers

Current deep reinforcement learning (RL) methods can train specialist artificial agents that excel at decision-making on various individual tasks in specific environments, such as Go or StarCraft. However, little progress has been made to extend these results to generalist agents that would not only be capable of performing many different tasks, but also of doing so across a variety of environments with potentially distinct embodiments.

Looking across recent progress in the fields of natural language processing, vision, and generative models (such as PaLM, Imagen, and Flamingo), we see that breakthroughs in making general-purpose models are often achieved by scaling up Transformer-based models and training them on large and semantically diverse datasets. It is natural to wonder, can a similar strategy be used in building generalist agents for sequential decision making? Can such models also enable fast adaptation to new tasks, similar to PaLM and Flamingo?

As an initial step to answer these questions, in our recent paper “Multi-Game Decision Transformers” we explore how to build a generalist agent to play many video games simultaneously. Our model trains an agent that can play 41 Atari games simultaneously at close-to-human performance and that can also be quickly adapted to new games via fine-tuning. This approach significantly improves upon the few existing alternatives to learning multi-game agents, such as temporal difference (TD) learning or behavioral cloning (BC).

A Multi-Game Decision Transformer (MGDT) can play multiple games at a desired level of competency from training on a range of trajectories spanning all levels of expertise.

Don’t Optimize for Return, Just Ask for Optimality
In reinforcement learning, reward refers to the incentive signals that are relevant to completing a task, and return refers to cumulative rewards in a course of interactions between an agent and its surrounding environment. Traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc.) are trained to optimize decisions to achieve the optimal return. At every time step, an agent observes the environment (some also consider the interactions that happened in the past) and decides what action to take to help itself achieve a higher return in future interactions.

In this work, we use Decision Transformers as our backbone approach to training an RL agent. A Decision Transformer is a sequence model that predicts future actions by considering past interactions between an agent and the surrounding environment, and (most importantly) a desired return to be achieved in future interactions. Instead of learning a policy to achieve high return magnitude as in traditional reinforcement learning, Decision Transformers map diverse experiences, ranging from expert-level to beginner-level, to their corresponding return magnitude during training. The idea is that training an agent on a range of experiences (from beginner to expert level) exposes the model to a wider range of variations in gameplay, which in turn helps it extract useful rules of gameplay that allow it to succeed under any circumstance. So during inference, the Decision Transformer can achieve any return value in the range it has seen during training, including the optimal return.

But, how do you know if a return is both optimal and stably achievable in a given environment? Previous applications of Decision Transformers relied on customized definitions of the desired return for each individual task, which required manually defining a plausible and informative range of scalar values that are appropriately interpretable signals for each specific game — a task that is non-trivial and rather unscalable. To address this issue, we instead model a distribution of return magnitudes based on past interactions with the environment during training. At inference time, we simply add an optimality bias that increases the probability of generating actions that are associated with higher returns.
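A hedged sketch of this inference-time bias is below: tilt the model's learned distribution over discretized returns toward higher-return bins before sampling. The biasing formula and the inverse-temperature constant kappa are illustrative, not the paper's precise definition.

```python
import numpy as np

def sample_target_return(return_logits, kappa=10.0):
    """Sample a target return bin with an optimality bias: higher-return
    bins get an additive bonus proportional to their normalized position."""
    n_bins = len(return_logits)
    bin_value = np.arange(n_bins) / (n_bins - 1)  # 0 = lowest seen return
    biased = return_logits + kappa * bin_value     # favor high-return bins
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return np.random.choice(n_bins, p=probs)
```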

To more comprehensively capture spatial-temporal patterns of agent-environment interactions, we also modified the Decision Transformer architecture to consider image patches instead of a global image representation. Patches allow the model to focus on local dynamics, which helps model game specific information in further detail.

These pieces together give us the backbone of Multi-Game Decision Transformers:

Each observation image is divided into a set of M patches of pixels, denoted O. Return R, action a, and reward r follow these image patches in each input causal sequence. A Decision Transformer is trained to predict the next input (except for the image patches) to establish causality.
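In code, assembling one such causal sequence might look like the following sketch; token values are assumed to be already discretized integers.

```python
def build_timestep_tokens(patch_tokens, ret_token, action_token, reward_token):
    """One timestep in the order shown above: M image-patch tokens O,
    then return R, action a, and reward r."""
    return list(patch_tokens) + [ret_token, action_token, reward_token]

def build_sequence(trajectory):
    """Concatenate timesteps into one causal training sequence; the model
    is trained to predict each non-patch token from everything before it."""
    seq = []
    for patches, R, a, r in trajectory:
        seq.extend(build_timestep_tokens(patches, R, a, r))
    return seq
```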

Training a Multi-Game Decision Transformer to Play 41 Games at Once
We train one Decision Transformer agent on a large (~1B) and broad set of gameplay experiences from 41 Atari games. In our experiments, this agent, which we call the Multi-Game Decision Transformer (MGDT), clearly outperforms existing reinforcement learning and behavioral cloning methods — by almost 2 times — on learning to play 41 games simultaneously and performs near human-level competency (100% in the following figure corresponds to the level of human gameplay). These results hold when comparing across training methods in both settings where a policy must be learned from static datasets (offline) as well as those where new data can be gathered from interacting with the environment (online).

Each bar is a combined score across 41 games, where 100% indicates human-level performance. Each blue bar is from a model trained on 41 games simultaneously, whereas each gray bar is from 41 specialist agents. Multi-Game Decision Transformer achieves human-level performance, significantly better than other multi-game agents, even comparable to specialist agents.

This result indicates that Decision Transformers are well-suited for multi-task, multi-environment, and multi-embodiment agents.

A concurrent work, “A Generalist Agent”, shows a similar result, demonstrating that large transformer-based sequence models can memorize expert behaviors very well across many more environments. In addition, their work and our work have nicely complementary findings: They show it’s possible to train across a wide range of environments beyond Atari games, while we show it’s possible and useful to train across a wide range of experiences.

In addition to the performance shown above, empirically we found that MGDT trained on a wide variety of experience is better than MGDT trained only on expert-level demonstrations or simply cloning demonstration behaviors.

Scaling Up Multi-Game Model Size to Achieve Better Performance
Arguably, scale has become the main driving force in many recent machine learning breakthroughs, and it is usually achieved by increasing the number of parameters in a transformer-based model. Our observation on Multi-Game Decision Transformers is similar: the performance increases predictably with larger model size. In particular, its performance appears to have not yet hit a ceiling, and compared to other learning systems, performance gains are more significant with increases in model size.

Performance of Multi-Game Decision Transformer (shown by the blue line) increases predictably with larger model size, whereas other models do not.

Pre-trained Multi-Game Decision Transformers Are Fast Learners
Another benefit of MGDTs is that they can learn how to play a new game from very few gameplay demonstrations (which don’t need to all be expert-level). In that sense, MGDTs can be considered pre-trained models capable of being fine-tuned rapidly on small new gameplay data. Compared with other popular pre-training methods, it clearly shows consistent advantages in obtaining higher scores.

Multi-Game Decision Transformer pre-training (DT pre-training, shown in light blue) demonstrates consistent advantages over other popular models in adaptation to new tasks.

Where Is the Agent Looking?
In addition to the quantitative evaluation, it’s insightful (and fun) to visualize the agent’s behavior. By probing the attention heads, we find that the MGDT model consistently places attention weight on areas of the observed images that contain meaningful game entities. We visualize the model’s attention when predicting the next action for various games and find it consistently attends to entities such as the agent’s on-screen avatar, the agent’s free movement space, non-agent objects, and key environment features. For example, in an interactive setting, having an accurate world model requires knowing how and when to focus on known objects (e.g., currently present obstacles) as well as expecting and/or planning over future unknowns (e.g., negative space). This diverse allocation of attention to many key components of each environment ultimately improves performance.

Here we can see the amount of weight the model places on each key asset of the game scene. Brighter red indicates more emphasis on that patch of pixels.

The Future of Large-Scale Generalist Agents
This work is an important step in demonstrating the possibility of training general-purpose agents across many environments, embodiments, and behavior styles. We have shown the benefit of increased scale on performance and the potential with further scaling. These findings seem to point to a generalization narrative similar to other domains like vision and language — we look forward to exploring the great potential of scaling data and learning from diverse experiences.

We look forward to future research towards developing performant agents for multi-environment and multi-embodiment settings. Our code and model checkpoints can soon be accessed here.

Acknowledgements
We’d like to thank all remaining authors of the paper, including Igor Mordatch, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Eric Jang, and Henryk Michalewski.


Simplified Transfer Learning for Chest Radiography Model Development

Every year, nearly a billion chest X-ray (CXR) images are taken globally to aid in the detection and management of health conditions ranging from collapsed lungs to infectious diseases. Generally, CXRs are cheaper and more accessible than other forms of medical imaging. However, existing challenges continue to impede the optimal use of CXRs. For example, in some areas, trained radiologists that can accurately interpret CXR images are in short supply. In addition, interpretation variability between experts, workflow differences between institutions, and the presence of rare conditions familiar only to subspecialists all contribute to making high-quality CXR interpretation a challenge.

Recent research has leveraged machine learning (ML) to explore potential solutions for some of these challenges. There is significant interest and effort devoted to building deep learning models that detect abnormalities in CXRs and improve access, accuracy, and efficiency to identify diseases and conditions that affect the heart and lungs. However, building robust CXR models requires large labeled training datasets, which can be prohibitively expensive and time-consuming to create. In some cases, such as working with underrepresented populations or studying rare medical conditions, only limited data are available. Additionally, CXR images vary in quality across populations, geographies, and institutions, making it difficult to build robust models that perform well globally.

In “Simplified Transfer Learning for Chest Radiography Models Using Less Data”, published in the journal Radiology, we describe how Google Health utilizes advanced ML methods to generate pre-trained “CXR networks” that can convert CXR images to embeddings (i.e., information-rich numerical vectors) to enable the development of CXR models using less data and fewer computational resources. We demonstrate that even with less data and compute, this approach has enabled performance comparable to state-of-the-art deep learning models across various prediction tasks. We are also excited to announce the release of CXR Foundation, a tool that utilizes our CXR-specific network to enable developers to create custom embeddings for their CXR images. We believe this work will help accelerate the development of CXR models, aiding in disease detection and contributing to more equitable health access throughout the world.

Developing a Chest X-ray Network
A common approach to building medical ML models is to pre-train a model on a generic task using non-medical datasets and then refine the model on a target medical task. This process of transfer learning may improve the target task performance or at least speed up convergence by applying the understanding of natural images to medical images. However, transfer learning may still require large labeled medical datasets for the refinement step.

Expanding on this standard approach, our system supports modeling CXR-specific tasks through a three-step model training setup composed of (1) generic image pre-training similar to traditional transfer learning, (2) CXR-specific pre-training, and (3) task-specific training. The first and third steps are common in ML: first pre-training on a large dataset and labels that are not specific to the desired task, and then fine-tuning on the task of interest.

We built a CXR-specific image classifier that employs supervised contrastive learning (SupCon). SupCon pulls together representations of images that have the same label (e.g., abnormal) and pushes apart representations of images that have a different label (e.g., one normal image and one abnormal image). We pre-trained this model on de-identified CXR datasets of over 800,000 images generated in partnership with Northwestern Medicine and Apollo Hospitals in the US and India, respectively. We then leveraged noisy abnormality labels from natural language processing of radiology reports to build our “CXR-specific” network.
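For reference, here is a minimal NumPy sketch of the SupCon loss on a batch of embeddings; it follows the standard SupCon formulation rather than any internal implementation, and the temperature is illustrative.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: pull together pairs with the same label
    (e.g., abnormal/abnormal) and push apart pairs with different labels."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                   # pairwise similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)       # never contrast with self
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = np.maximum(positives.sum(axis=1), 1)
    # Average log-probability of each anchor's positives, negated.
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1) / pos_count
    return float(per_anchor.mean())
```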

This network creates embeddings (i.e., information-rich numerical vectors that can be used to distinguish classes from each other) that can more easily train models for specific medical prediction tasks, such as image finding (e.g., airspace opacity), clinical condition (e.g., tuberculosis), or patient outcome (e.g., hospitalization). For example, the CXR network can generate embeddings for every image in a given CXR dataset. For these images, the generated embeddings and the labels for the desired target task (such as tuberculosis) are used as examples to train a small ML model.

Left: Training a CXR model for a given task generally requires a large number of labeled images and a significant amount of computational resources to create a foundation of neural network layers. Right: With the CXR network and tool providing this foundation, each new task requires only a fraction of the labeled images, computational resources, and neural network parameters compared to rebuilding the entire network from scratch.
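The resulting workflow is small enough to sketch end to end. In the sketch below, `cxr_embed` is a hypothetical stand-in for the released CXR network (it maps an image to an embedding vector); the linear classifier and AUC evaluation mirror the setup described in the Results section below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_task_model(images, labels, cxr_embed):
    """Train a small task-specific model (here, a linear classifier) on
    frozen CXR embeddings; with informative embeddings, even tens of
    labeled examples can suffice for some tasks."""
    X = np.stack([cxr_embed(img) for img in images])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def evaluate_auc(clf, images, labels, cxr_embed):
    """Measure performance with the area under the curve (AUC) metric."""
    X = np.stack([cxr_embed(img) for img in images])
    return roc_auc_score(labels, clf.predict_proba(X)[:, 1])
```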

Effects of CXR Pre-training
We visualized these embedding layers at each step of the process using airspace opacity as an example (see the figure below). Before SupCon-based pre-training, there was poor separation of normal and abnormal CXR embeddings. After SupCon-based pre-training, the positive examples were grouped more closely together, as were the negative examples, indicating that the model had learned that images within each category resemble each other.

Visualizations of the t-distributed stochastic neighbor embedding for generic vs. CXR-specific network embeddings. Embeddings are information-rich numerical vectors that alone can distinguish classes from each other, in this case, airspace opacity positive vs. negative.

Our research suggests that adding the second stage of pre-training enables high-quality models to be trained with up to 600-fold less data in comparison to traditional transfer learning approaches that leverage pre-trained models on generic, non-medical datasets. We found this to be true regardless of model architecture (e.g., ResNet or EfficientNet) or dataset used for natural image pre-training (e.g., ImageNet or JFT-300M). With this approach, researchers and developers can significantly reduce dataset size requirements.

Top: In a deep learning model, the neural network contains multiple layers of artificial neurons, with the first layer taking the CXR image as input, intermediate layers doing additional computation, and the final layer making the classification (e.g., airspace opacity: present vs. absent). The embedding layer is usually one of the last layers. Bottom left: The traditional transfer learning approach involves a two-step training setup where a generic pre-trained network is optimized directly on a prediction task of interest. Our proposed three-step training setup generates a CXR network using a SupCon ML technique (step 2) before optimization for prediction tasks of interest (step 3). Bottom right: Using the embeddings involves either training smaller models (the first two strategies) or fine-tuning the whole network if there are sufficient data (strategy 3).

Results
After training the initial model, we measured performance using the area under the curve (AUC) metric with both linear and non-linear models applied to CXR embeddings, as well as with a non-linear model produced by fine-tuning the entire network. On public datasets, such as ChestX-ray14 and CheXpert, our work substantially and consistently improved the data-accuracy tradeoff for models developed across a range of training dataset sizes and several findings. For example, when evaluating the tool’s ability to develop tuberculosis models, data efficiency gains were more striking: models trained on the embeddings of just 45 images achieved non-inferiority to radiologists in detecting tuberculosis on an external validation dataset. For both tuberculosis and severe COVID-19 outcomes, we show that non-linear classifiers trained on frozen embeddings outperformed a model that was fine-tuned on the entire dataset.

Comparing CXR-specific networks for transfer learning (red), with a baseline transfer learning approach (blue) across a variety of CXR abnormalities (top left), tuberculosis (bottom left), and COVID-19 outcomes (bottom right). This approach improves performance at the same dataset size, or reduces the dataset size required to reach the same performance. Interestingly, using the CXR network with simpler ML models that are faster to train (red) performs better than training the full network (black) at dataset sizes up to 85 images.

Conclusion and Future Work
To accelerate CXR modeling efforts with low data and computational requirements, we are releasing our CXR Foundation tool, along with scripts to train linear and nonlinear classifiers. Via these embeddings, this tool will allow researchers to jump-start CXR modeling efforts using simpler transfer learning methods. This approach can be particularly useful for predictive modeling using small datasets, and for adapting CXR models when there are distribution shifts in patient populations (whether over time or across different institutions). We are excited to continue working with partners, such as Northwestern Medicine and Apollo Hospitals, to explore the impact of this technology further. By enabling researchers with limited data and compute to develop CXR models, we’re hoping more developers can solve the most impactful problems for their populations.

Acknowledgements
Key contributors to this project at Google include Christina Chen, Yun Liu, Dilip Krishnan, Zaid Nabulsi, Atilla Kiraly, Arnav Agharwal, Eric Wu, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Marilyn Zhang, Charles Lau, Neeral Beladia, Daniel Tse, Krish Eswaran, and Shravya Shetty. Significant contributions and input were also made by collaborators Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia-Vicente, and David Melnick. For the ChestX-ray14 dataset, we thank the NIH Clinical Center for making it publicly available. The authors would also like to acknowledge many members of the Google Health Radiology and labeling software teams. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong for coordinating the imaging annotation work; Craig Mermel and Akinori Mitani for providing feedback on the manuscript; Nicole Linton and Lauren Winer for feedback on the blogpost; and Tom Small for the animation.


Google at ICML 2022

Google is a leader in machine learning (ML) research with groups innovating across virtually all aspects of the field, from theory to application. We build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. Core to our approach is to actively engage with the broader research community by open-sourcing datasets and models, publishing our discoveries, and actively participating in leading conferences.

Google is proud to be a Diamond Sponsor of the thirty-ninth International Conference on Machine Learning (ICML 2022), a premier annual conference, which is being held this week in Baltimore, Maryland. Google has a strong presence at this year’s conference with over 100 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.

Registered for ICML 2022? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2022 (Google affiliations in bold).

Organizing Committee

Tutorial Chairs include: Hanie Sedghi

Emeritus Members include: Andrew McCallum

Board Members include: Hugo Larochelle, Csaba Szepesvari, Corinna Cortes

Publications

Individual Preference Stability for Clustering
Saba Ahmadi, Pranjal Awasthi, Samir Khuller, Matthäus Kleindessner, Jamie Morgenstern, Pattara Sukprasert, Ali Vakilian

Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael Mozer

H-Consistency Bounds for Surrogate Loss Minimizers
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, Yutao Zhong

Cooperative Online Learning in Stochastic and Adversarial MDPs
Tal Lancewicki, Aviv Rosenberg, Yishay Mansour

Do More Negative Samples Necessarily Hurt in Contrastive Learning?
Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath

Deletion Robust Submodular Maximization Over Matroids
Paul Dütting, Federico Fusco*, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Tight and Robust Private Mean Estimation with Few Users
Hossein Esfandiari, Vahab Mirrokni, Shyam Narayanan*

Generative Trees: Adversarial and Copycat
Richard Nock, Mathieu Guillame-Bert

Agnostic Learnability of Halfspaces via Logistic Loss
Ziwei Ji*, Kwangjun Ahn*, Pranjal Awasthi, Satyen Kale, Stefani Karp

Adversarially Trained Actor Critic for Offline Reinforcement Learning
Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal

Unified Scaling Laws for Routed Language Models
Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc’Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan

Large Batch Experience Replay
Thibault Lahire, Matthieu Geist, Emmanuel Rachelson

Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li*, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

The Poisson Binomial Mechanism for Unbiased Federated Learning with Secure Aggregation
Wei-Ning Chen, Ayfer Ozgur, Peter Kairouz

Global Optimization Networks
Sen Zhao, Erez Louidor, Maya Gupta

A Joint Exponential Mechanism for Differentially Private Top-k
Jennifer Gillenwater, Matthew Joseph, Andres Munoz Medina, Mónica Ribero

On the Practicality of Deterministic Epistemic Uncertainty
Janis Postels, Mattia Segu, Tao Sun, Luc Van Gool, Fisher Yu, Federico Tombari

Balancing Discriminability and Transferability for Source-Free Domain Adaptation
Jogendra Nath Kundu, Akshay Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Kulkarni, Varun Jampani, Venkatesh Babu Radhakrishnan

Transfer and Marginalize: Explaining Away Label Noise with Privileged Information
Mark Collier, Rodolphe Jenatton, Efi Kokiopoulou, Jesse Berent

In Defense of Dual-Encoders for Neural Ranking
Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Jakkam Reddi, Sanjiv Kumar

Surrogate Likelihoods for Variational Annealed Importance Sampling
Martin Jankowiak, Du Phan

Translatotron 2: High-Quality Direct Speech-to-Speech Translation with Voice Preservation (see blog post)
Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

Differentially Private Approximate Quantiles
Haim Kaplan, Shachar Schnapp, Uri Stemmer

Continuous Control with Action Quantization from Demonstrations
Robert Dadashi, Léonard Hussenot, Damien Vincent, Sertan Girgin, Anton Raichuk, Matthieu Geist, Olivier Pietquin

Data Scaling Laws in NMT: The Effect of Noise and Architecture
Yamini Bansal*, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

Debiaser Beware: Pitfalls of Centering Regularized Transport Maps
Aram-Alexandre Pooladian, Marco Cuturi, Jonathan Niles-Weed

A Context-Integrated Transformer-Based Neural Network for Auction Design
Zhijian Duan, Jingwu Tang, Yutong Yin, Zhe Feng, Xiang Yan, Manzil Zaheer, Xiaotie Deng

Algorithms for the Communication of Samples
Lucas Theis, Noureldin Yosri

Being Properly Improper
Tyler Sypherd, Richard Nock, Lalitha Sankar

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
Chris Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error
Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, Shixiang Shane Gu

Public Data-Assisted Mirror Descent for Private Model Training
Ehsan Amid, Arun Ganesh*, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas Steinke, Vinith M. Suriyakumar*, Om Thakkar, Abhradeep Thakurta

Deep Hierarchy in Bandits
Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh

Scalable Deep Reinforcement Learning Algorithms for Mean Field Games
Mathieu Lauriere, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Perolat, Romuald Elie, Olivier Pietquin, Matthieu Geist

Faster Privacy Accounting via Evolving Discretization
Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi

HyperPrompt: Prompt-Based Task-Conditioning of Transformers
Yun He*, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, Ed H. Chi

Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning
Seyed Kamyar Seyed Ghasemipour, Daniel Freeman, Byron David, Shixiang Shane Gu, Satoshi Kataoka, Igor Mordatch

Latent Diffusion Energy-Based Model for Interpretable Text Modelling
Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

On the Optimization Landscape of Neural Collapse Under MSE Loss: Global Optimality with Unconstrained Features
Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, Zhihui Zhu

Efficient Reinforcement Learning in Block MDPs: A Model-Free Representation Learning Approach
Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun

Robust Training Under Label Noise by Over-Parameterization
Sheng Liu, Zhihui Zhu, Qing Qu, Chong You

FriendlyCore: Practical Differentially Private Aggregation
Eliad Tsfadia, Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer

Adaptive Data Analysis with Correlated Observations
Aryeh Kontorovich, Menachem Sadigurschi, Uri Stemmer

A Resilient Distributed Boosting Algorithm
Yuval Filmus, Idan Mehalel, Shay Moran

On Learning Mixture of Linear Regressions in the Non-Realizable Setting
Avishek Ghosh, Arya Mazumdar, Soumyabrata Pal, Rajat Sen

Online and Consistent Correlation Clustering
Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, Nikos Parotsidis

From Block-Toeplitz Matrices to Differential Equations on Graphs: Towards a General Theory for Scalable Masked Transformers
Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian Weller, Thomas Weingarten

Parsimonious Learning-Augmented Caching
Sungjin Im, Ravi Kumar, Aditya Petety, Manish Purohit

General-Purpose, Long-Context Autoregressive Modeling with Perceiver AR
Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, Jesse Engel

Conformal Prediction Sets with Limited False Positives
Adam Fisch, Tal Schuster, Tommi Jaakkola, Regina Barzilay

Dialog Inpainting: Turning Documents into Dialogs
Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu

Benefits of Overparameterized Convolutional Residual Networks: Function Approximation Under Smoothness Constraint
Hao Liu, Minshuo Chen, Siawpeng Er, Wenjing Liao, Tong Zhang, Tuo Zhao

Congested Bandits: Optimal Routing via Short-Term Resets
Pranjal Awasthi, Kush Bhatia, Sreenivas Gollapudi, Kostas Kollias

Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance
Zhuoning Yuan, Yuexin Wu, Zihao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, Tianbao Yang

Examining Scaling and Transfer of Language Model Architectures for Machine Translation
Biao Zhang*, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (see blog post)
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, Claire Cui

How to Leverage Unlabeled Data in Offline Reinforcement Learning?
Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine

Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning
Harley Wiltzer, David Meger, Marc G. Bellemare

On the Robustness of CountSketch to Adaptive Inputs
Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, Uri Stemmer

Model Selection in Batch Policy Optimization
Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai

The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning
Wei-Ning Chen, Christopher A. Choquette-Choo, Peter Kairouz, Ananda Theertha Suresh

Linear-Time Gromov Wasserstein Distances Using Low Rank Couplings and Costs
Meyer Scetbon, Gabriel Peyré, Marco Cuturi*

Active Sampling for Min-Max Fairness
Jacob Abernethy, Pranjal Awasthi, Matthäus Kleindessner, Jamie Morgenstern, Chris Russell, Jie Zhang

Making Linear MDPs Practical via Contrastive Representation Learning
Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph E. Gonzalez, Dale Schuurmans, Bo Dai

Achieving Minimax Rates in Pool-Based Batch Active Learning
Claudio Gentile, Zhilei Wang, Tong Zhang

Private Adaptive Optimization with Side Information
Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith

Self-Supervised Learning With Random-Projection Quantizer for Speech Recognition
Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

Wide Bayesian Neural Networks Have a Simple Weight Posterior: Theory and Accelerated Sampling
Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

The State of Sparse Training in Deep Reinforcement Learning
Laura Graesser, Utku Evci, Erich Elsen, Pablo Samuel Castro

Constrained Discrete Black-Box Optimization Using Mixed-Integer Programming
Theodore P. Papalexopoulos, Christian Tjandraatmadja, Ross Anderson, Juan Pablo Vielma, David Belanger

Massively Parallel k-Means Clustering for Perturbation Resilient Instances
Vincent Cohen-Addad, Vahab Mirrokni, Peilin Zhong

What Language Model Architecture and Pre-training Objective Works Best for Zero-Shot Generalization?
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt

Synergy and Symmetry in Deep Learning: Interactions Between the Data, Model, and Inference Algorithm
Lechao Xiao, Jeffrey Pennington

Fast Finite Width Neural Tangent Kernel
Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz

The Combinatorial Brain Surgeon: Pruning Weights that Cancel One Another in Neural Networks
Xin Yu, Thiago Serra, Srikumar Ramalingam, Shandian Zhe

Bayesian Imitation Learning for End-to-End Mobile Manipulation
Yuqing Du, Daniel Ho, Alexander A. Alemi, Eric Jang, Mohi Khansari

HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning
Andrey Zhmoginov, Mark Sandler, Max Vladymyrov

Marginal Distribution Adaptation for Discrete Sets via Module-Oriented Divergence Minimization
Hanjun Dai, Mengjiao Yang, Yuan Xue, Dale Schuurmans, Bo Dai

Correlated Quantization for Distributed Mean Estimation and Optimization
Ananda Theertha Suresh, Ziteng Sun, Jae Hun Ro, Felix Yu

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime
Leonardo Cunha, Gauthier Gidel, Fabian Pedregosa, Damien Scieur, Courtney Paquette

Learning Iterative Reasoning through Energy Minimization
Yilun Du, Shuang Li, Josh Tenenbaum, Igor Mordatch

Interactive Correlation Clustering with Existential Cluster Constraints
Rico Angell, Nicholas Monath, Nishant Yadav, Andrew McCallum

Building Robust Ensembles via Margin Boosting
Dinghuai Zhang, Hongyang Zhang, Aaron Courville, Yoshua Bengio, Pradeep Ravikumar, Arun Sai Suggala

Probabilistic Bilevel Coreset Selection
Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Tong Zhang

Model Agnostic Sample Reweighting for Out-of-Distribution Learning
Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, Tong Zhang

Sparse Invariant Risk Minimization
Xiao Zhou, Yong Lin, Weizhong Zhang, Tong Zhang

RUMs from Head-to-Head Contests
Matteo Almanza, Flavio Chierichetti, Ravi Kumar, Alessandro Panconesi, Andrew Tomkins

A Parametric Class of Approximate Gradient Updates for Policy Optimization
Ramki Gummadi, Saurabh Kumar, Junfeng Wen, Dale Schuurmans

On Implicit Bias in Overparameterized Bilevel Optimization
Paul Vicol, Jonathan Lorraine, Fabian Pedregosa, David Duvenaud, Roger Grosse

Feature and Parameter Selection in Stochastic Linear Bandits
Ahmadreza Moradipari, Berkay Turan, Yasin Abbasi-Yadkori, Mahnoosh Alizadeh, Mohammad Ghavamzadeh

Neural Network Poisson Models for Behavioural and Neural Spike Train Data
Moein Khajehnejad, Forough Habibollahi, Richard Nock, Ehsan Arabzadeh, Peter Dayan, Amir Dezfouli

Deep Equilibrium Networks are Sensitive to Initialization Statistics
Atish Agarwala, Samuel Schoenholz

A Regret Minimization Approach to Multi-Agent Control
Udaya Ghai, Udari Madhushani, Naomi Leonard, Elad Hazan

Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

Workshops

Shift Happens: Crowdsourcing Metrics and Test Datasets Beyond ImageNet
Organizing Committee includes: Roland S. Zimmermann
Invited Speakers include: Chelsea Finn, Lucas Beyer

Machine Learning for Audio Synthesis
Organizing Committee includes: Yu Zhang
Invited Speakers include: Chris Donahue

New Frontiers in Adversarial Machine Learning
Organizing Committee includes: Sanmi Koyejo

Spurious Correlations, Invariance, and Stability (SIC)
Organizing Committee includes: Victor Veitch

DataPerf: Benchmarking Data for Data-Centric AI
Organizing Committee includes: Lora Aroyo, Peter Mattson, Praveen Paritosh
DataPerf Speakers include: Lora Aroyo, Peter Mattson, Praveen Paritosh
Invited Speakers include: Jordi Pont-Tuset

Machine Learning for Astrophysics
Invited Speakers include: Dustin Tran

Dynamic Neural Networks
Organizing Committee includes: Carlos Riquelme
Panel Chairs include: Neil Houlsby

Interpretable Machine Learning in Healthcare (IMLH)
Organizing Committee includes: Ramin Zabih
Invited Speakers include: Been Kim

Human-Machine Collaboration and Teaming
Invited Speakers include: Fernanda Viégas, Martin Wattenberg, Yuhuai (Tony) Wu

Pre-training: Perspectives, Pitfalls, and Paths Forward
Organizing Committee includes: Hugo Larochelle, Chelsea Finn
Invited Speakers include: Hanie Sedghi, Charles Sutton

Responsible Decision Making in Dynamic Environments
Invited Speakers include: Craig Boutilier

Principles of Distribution Shift (PODS)
Organizing Committee includes: Hossein Mobahi

Hardware-Aware Efficient Training (HAET)
Invited Speakers include: Tien-Ju Yang

Updatable Machine Learning
Invited Speakers include: Chelsea Finn, Nicolas Papernot
Organizing Committee includes: Ananda Theertha Suresh, Badih Ghazi, Chiyuan Zhang, Kate Donahue, Peter Kairouz, Ziteng Sun

Knowledge Retrieval and Language Models
Invited Speakers include: Fernando Diaz, Quoc Le, Kenton Lee, Ellie Pavlick
Organizing Committee includes: Urvashi Khandelwal, Chiyuan Zhang

Theory and Practice of Differential Privacy
Organizing Committee includes: Badih Ghazi, Matthew Joseph, Peter Kairouz, Om Thakkar, Thomas Steinke, Ziteng Sun

Beyond Bayes: Paths Towards Universal Reasoning Systems
Invited Speakers include: Charles Sutton
Spotlight Talk: Language Model Cascades | David Dohan, Winnie Xu, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-Dickstein, Kevin Murphy, Charles Sutton

Safe Learning for Autonomous Driving (SL4AD)
Invited Speakers include: Chelsea Finn



*Work done while at Google.  

Towards Reliability in Deep Learning Systems

Deep learning models have made impressive progress in vision, language, and other modalities, particularly with the rise of large-scale pre-training. Such models are most accurate when applied to test data drawn from the same distribution as their training set. However, in practice, the data confronting models in real-world settings rarely match the training distribution. In addition, the models may not be well-suited for applications where predictive performance is only part of the equation. For models to be reliable in deployment, they must be able to accommodate shifts in data distribution and make useful decisions in a broad array of scenarios.

In “Plex: Towards Reliability Using Pre-trained Large Model Extensions”, we present a framework for reliable deep learning as a new perspective about a model’s abilities; this includes a number of concrete tasks and datasets for stress-testing model reliability. We also introduce Plex, a set of pre-trained large model extensions that can be applied to many different architectures. We illustrate the efficacy of Plex in the vision and language domains by applying these extensions to the current state-of-the-art Vision Transformer and T5 models, which results in significant improvement in their reliability. We are also open-sourcing the code to encourage further research into this approach.

Uncertainty — Dog vs. Cat classifier: Plex can say “I don’t know” for inputs that are neither cat nor dog.
Robust Generalization — A naïve model is sensitive to spurious correlations (“destination”), whereas Plex is robust.
Adaptation — Plex can actively choose the data from which it learns to improve performance more quickly.

Framework for Reliability
First, we explore how to understand the reliability of a model in novel scenarios. We posit three general categories of requirements for reliable machine learning (ML) systems: (1) they should accurately report uncertainty about their predictions (“know what they don’t know”); (2) they should generalize robustly to new scenarios (distribution shift); and (3) they should be able to efficiently adapt to new data (adaptation). Importantly, a reliable model should aim to do well in all of these areas simultaneously out-of-the-box, without requiring any customization for individual tasks.

  • Uncertainty reflects the imperfect or unknown information that makes it difficult for a model to make accurate predictions. Predictive uncertainty quantification allows a model to compute optimal decisions and helps practitioners recognize when to trust the model’s predictions, thereby enabling graceful failures when the model is likely to be wrong.
  • Robust Generalization involves an estimate or forecast about an unseen event. We investigate four types of out-of-distribution data: covariate shift (when the input distribution changes between training and application and the output distribution is unchanged), semantic (or class) shift, label uncertainty, and subpopulation shift.

    Types of distribution shift using an illustration of ImageNet dogs.
  • Adaptation refers to probing the model’s abilities over the course of its learning process. Benchmarks typically evaluate on static datasets with pre-defined train-test splits. However, in many applications, we are interested in models that can quickly adapt to new datasets and efficiently learn with as few labeled examples as possible.
Reliability framework. We propose to simultaneously stress-test the “out-of-the-box” model performance (i.e., the predictive distribution) across uncertainty, robust generalization, and adaptation benchmarks, without any customization for individual tasks.

We apply 10 types of tasks to capture the three reliability areas — uncertainty, robust generalization, and adaptation — and to ensure that the tasks measure a diverse set of desirable properties in each area. Together the tasks comprise 40 downstream datasets across vision and natural language modalities: 14 datasets for fine-tuning (including few-shot and active learning–based adaptation) and 26 datasets for out-of-distribution evaluation.

Plex: Pre-trained Large Model Extensions for Vision and Language
To improve reliability, we develop ViT-Plex and T5-Plex, building on large pre-trained models for vision (ViT) and language (T5), respectively. A key feature of Plex is more efficient ensembling based on submodels that each make a prediction that is then aggregated. In addition, Plex swaps each architecture’s linear last layer with a Gaussian process or heteroscedastic layer to better represent predictive uncertainty. These ideas were found to work very well for models trained from scratch at the ImageNet scale. We train the models with varying sizes up to 325 million parameters for vision (ViT-Plex L) and 1 billion parameters for language (T5-Plex L) and pre-training dataset sizes up to 4 billion examples.
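
As a concrete (and deliberately simplified) illustration of these two ingredients, the sketch below swaps a classifier’s last linear layer for a heteroscedastic-style head and averages predictions over submodels. The names here (HeteroscedasticHead, ensemble_predict) are our own illustrative stand-ins, not the actual ViT-Plex/T5-Plex implementation, which is available in Uncertainty Baselines:

    import torch
    import torch.nn as nn

    class HeteroscedasticHead(nn.Module):
        """Replaces the final linear layer: predicts mean logits plus an
        input-dependent noise scale, then averages softmax probabilities
        over sampled logits to represent predictive uncertainty."""
        def __init__(self, dim, num_classes, num_samples=8):
            super().__init__()
            self.mean = nn.Linear(dim, num_classes)
            self.log_scale = nn.Linear(dim, num_classes)
            self.num_samples = num_samples

        def forward(self, features):
            mu = self.mean(features)
            scale = self.log_scale(features).exp()
            noise = torch.randn((self.num_samples,) + mu.shape, device=mu.device)
            return torch.softmax(mu + scale * noise, dim=-1).mean(dim=0)

    def ensemble_predict(submodels, x):
        """Efficient-ensembling idea: aggregate the predictive
        distributions of several submodels."""
        return torch.stack([m(x) for m in submodels]).mean(dim=0)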

The following figure illustrates Plex’s performance on a select set of tasks compared to the existing state-of-the-art. The top-performing model for each task is usually a specialized model that is highly optimized for that problem. Plex achieves new state-of-the-art on many of the 40 datasets. Importantly, Plex achieves strong performance across all tasks using the out-of-the-box model output without requiring any custom designing or tuning for each task.

The largest T5-Plex (top) and ViT-Plex (bottom) models evaluated on a highlighted set of reliability tasks compared to specialized state-of-the-art models. The spokes display different tasks, quantifying metric performance on various datasets.

Plex in Action for Different Reliability Tasks
We highlight Plex’s reliability on select tasks below.

Open Set Recognition
We show Plex’s output in the case where the model must defer prediction because the input is one that the model does not support. This task is known as open set recognition. Here, predictive performance is part of a larger decision-making scenario where the model may abstain from making certain predictions. In the following figure, we show structured open set recognition: Plex returns multiple outputs and signals the specific part of the output about which the model is uncertain and is likely out-of-distribution.

Structured open set recognition enables the model to provide nuanced clarifications. Here, T5-Plex L can recognize fine-grained out-of-distribution cases where the request’s vertical (i.e., coarse-level domain of service, such as banking, media, productivity, etc.) and domain are supported but the intent is not.
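
A minimal sketch of the underlying decision rule is below: abstain when no class is confident enough, and flag only the uncertain fields of a structured output. The threshold and field structure are illustrative assumptions, not the exact Plex mechanism:

    import numpy as np

    def predict_or_abstain(probs, threshold=0.5):
        """probs: predictive distribution over the supported classes for
        one field of the output. Returns None to defer when no class is
        confident enough, flagging a likely out-of-distribution input."""
        if probs.max() < threshold:
            return None
        return int(np.argmax(probs))

    def structured_open_set(fields):
        """fields: dict mapping a field name (e.g., 'vertical', 'domain',
        'intent') to its predictive distribution, so only the uncertain
        parts of a request are flagged."""
        return {name: predict_or_abstain(p) for name, p in fields.items()}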

Label Uncertainty
In real-world datasets, there is often inherent ambiguity behind the ground truth label for each input. For example, this may arise due to human rater ambiguity for a given image. In this case, we’d like the model to capture the full distribution of human perceptual uncertainty. We showcase Plex below on examples from an ImageNet variant we constructed that provides a ground truth label distribution.

Plex for label uncertainty. Using a dataset we construct called ImageNet ReaL-H, ViT-Plex L demonstrates the ability to capture the inherent ambiguity (probability distribution) of image labels.
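
One simple way to score this ability (our illustration; the paper’s evaluation may use different metrics) is the divergence between the model’s predictive distribution and the distribution of rater labels:

    import numpy as np

    def rater_divergence(model_probs, rater_probs, eps=1e-8):
        """KL(raters || model): lower values mean the model better
        captures the full distribution of human labels for an image."""
        model_probs = np.clip(np.asarray(model_probs), eps, 1.0)
        rater_probs = np.asarray(rater_probs)
        mask = rater_probs > 0
        return float(np.sum(rater_probs[mask] *
                            np.log(rater_probs[mask] / model_probs[mask])))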

Active Learning
We examine a large model’s ability to not only learn over a fixed set of data points, but also to participate in choosing which data points to learn from in the first place. One such task is known as active learning, where at each training step, the model selects promising inputs among a pool of unlabeled data points on which to train. This procedure assesses an ML model’s label efficiency, where label annotations may be scarce, and so we would like to maximize performance while minimizing the number of labeled data points used. Plex achieves a significant performance improvement over the same model architecture without pre-training. Compared to the state-of-the-art in the literature, BASE, which achieves around 63% accuracy at 100K examples, Plex attains higher accuracy while requiring fewer labeled examples.

Active learning on ImageNet1K. ViT-Plex L is highly label efficient compared to a baseline that doesn’t leverage pre-training. We also find that active learning’s data acquisition strategy is more effective than uniformly selecting data points at random.
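
The sketch below shows the shape of one pool-based acquisition step; the least-confidence rule used here is a common illustrative choice, not necessarily the acquisition strategy used in our experiments:

    import numpy as np

    def select_batch(predict_probs, unlabeled_pool, batch_size=64):
        """predict_probs: callable mapping one input to a (num_classes,)
        predictive distribution. Returns indices of the least-confident
        pool examples, which are then sent out for labeling."""
        confidence = np.array([predict_probs(x).max() for x in unlabeled_pool])
        return np.argsort(confidence)[:batch_size]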

Learn more
Check out our paper here and an upcoming contributed talk about the work at the ICML 2022 pre-training workshop on July 23, 2022. To encourage further research in this direction, we are open-sourcing all code for training and evaluation as part of Uncertainty Baselines. We also provide a demo that shows how to use a ViT-Plex model checkpoint. Layer and method implementations use Edward2.

Acknowledgements
We thank all the co-authors for contributing to the project and paper, including Andreas Kirsch, Clara Huiyi Hu, Du Phan, D. Sculley, Honglin Yuan, Jasper Snoek, Jeremiah Liu, Jie Ren, Joost van Amersfoort, Karan Singhal, Kehang Han, Kelly Buchanan, Kevin Murphy, Mark Collier​​, Mike Dusenberry, Neil Band, Nithum Thain, Rodolphe Jenatton, Tim G. J. Rudner, Yarin Gal, Zachary Nado, Zelda Mariet, Zi Wang, and Zoubin Ghahramani. We also thank Anusha Ramesh, Ben Adlam, Dilip Krishnan, Ed Chi, Rif A. Saurous, and Sharat Chikkerur for their helpful feedback, and Tom Small and Ajay Nainani for helping with visualizations.

Rewriting Image Captions for Visual Question Answering Data Creation

Visual Question Answering (VQA) is a useful machine learning (ML) task that requires a model to answer a visual question about an image. What makes it challenging is its multi-task and open-ended nature; it involves solving multiple technical research questions in computer vision and natural language understanding simultaneously. Yet, progress on this task would enable a wide range of applications, from assisting the blind and the visually-impaired or communicating with robots to enhancing the user’s visual experience with external knowledge.

Effective and robust VQA systems cannot exist without high-quality, semantically and stylistically diverse, large-scale training data of image-question-answer triplets. But creating such data is time-consuming and onerous. Perhaps unsurprisingly, the VQA community has focused more on sophisticated model development than on scalable data creation.

In “All You May Need for VQA are Image Captions,” published at NAACL 2022, we explore VQA data generation by proposing “Visual Question Generation with Question Answering Validation” (VQ2A), a pipeline that works by rewriting a declarative caption into multiple interrogative question-answer pairs. More specifically, we leverage two existing assets — (i) large-scale image-text data and (ii) large-capacity neural text-to-text models — to achieve automatic VQA data generation. As the field has progressed, the research community has been making these assets larger and stronger in isolation (for general purposes such as learning text-only or image-text representations); together, they can achieve more and we adapt them for VQA data creation purposes. We find our approach can generate question-answer pairs with high precision and that this data can successfully be used for training VQA models to improve performance.

The VQ2A technique enables VQA data generation at scale from image captions by rewriting each caption into multiple question-answer pairs.

VQ2A Overview
The first step of the VQ2A approach is to apply heuristics based on named entity recognition, part-of-speech tagging and manually defined rules to generate answer candidates from the image caption. These generated candidates are small pieces of information that may be relevant subjects about which to ask questions. We also add to this list two default answers, “yes” and “no”, which allow us to generate Boolean questions.
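
As a rough sketch of this first step, one could extract candidates with an off-the-shelf NLP pipeline such as spaCy; the actual heuristics and rules used in the paper are more elaborate than this:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline

    def extract_candidates(caption):
        doc = nlp(caption)
        candidates = {ent.text for ent in doc.ents}              # named entities
        candidates |= {chunk.text for chunk in doc.noun_chunks}  # noun phrases
        candidates |= {"yes", "no"}                              # Boolean questions
        return sorted(candidates)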

Then, we use a T5 model that was fine-tuned to generate questions for the candidate, resulting in [question, candidate answer] pairs. We then filter for the highest quality pairs using another T5 model (fine-tuned to answer questions) by asking it to answer the question based on the caption. That is, we compare the candidate answer to the output of this model and, if the two answers are similar enough, we deem the question high quality and keep it; otherwise, we filter it out.

The idea of using both question answering and question generation models to check each other for their round-trip consistency has been previously explored in other contexts. For instance, Q2 uses this idea to evaluate factual consistency in knowledge-grounded dialogues. In the end, the VQ2A approach, as illustrated below, can generate a large number of [image, question, answer] triplets that are high-quality enough to be used as VQA training data.

VQ2A consists of three main steps: (i) candidate answer extraction, (ii) question generation, (iii) question answering and answer validation.
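
Putting the last two steps together, the filtering logic looks roughly like the following; gen_question and answer_question are hypothetical stand-ins for the two fine-tuned T5 models, and token-level F1 is just one plausible answer-similarity measure:

    def token_f1(a, b):
        """Crude answer-similarity measure: token-level F1."""
        a_tokens, b_tokens = a.lower().split(), b.lower().split()
        common = len(set(a_tokens) & set(b_tokens))
        if common == 0:
            return 0.0
        precision, recall = common / len(b_tokens), common / len(a_tokens)
        return 2 * precision * recall / (precision + recall)

    def vq2a_pairs(caption, candidates, gen_question, answer_question,
                   similarity=token_f1, threshold=0.9):
        """Keeps only the round-trip-consistent question-answer pairs."""
        pairs = []
        for answer in candidates:
            question = gen_question(caption, answer)        # question generation
            roundtrip = answer_question(caption, question)  # question answering
            if similarity(answer, roundtrip) >= threshold:
                pairs.append((question, answer))
        return pairs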

Results
Two examples of our generated VQA data are shown below, one based on human-written COCO Captions (COCO) and the other on automatically collected Conceptual Captions (CC3M), which we call VQ2A-COCO and VQ2A-CC3M, respectively. We highlight the variety of question types and styles, which are critical for VQA. Overall, the cleaner the captions (i.e., the more closely related they are to their paired image), the more accurate the generated triplets. Based on 800 samples each, 87.3% of VQ2A-COCO and 66.0% of VQ2A-CC3M are found by human raters to be valid, suggesting that our approach can generate question-answer pairs with high precision.

Generated question-answer pairs based on COCO Captions (top) and Conceptual Captions (bottom). Grey highlighting denotes questions that do not appear in VQAv2, while green highlighting denotes those that do, indicating that our approach is capable of generating novel questions that an existing VQA dataset does not have.

Finally, we evaluate our generated data by using it to train VQA models (highlights shown below). We observe that our automatically-generated VQA data is competitive with manually-annotated target VQA data. First, our VQA models achieve high performance on target benchmarks “out-of-the-box”, when trained only on our generated data (light blue and light red vs. yellow). Once fine-tuned on target data, our VQA models outperform target-only training slightly on large-scale benchmarks like VQAv2 and GQA, but significantly on the small, knowledge-seeking OK-VQA (dark blue/red vs. light blue/red).

VQA accuracy on popular benchmark datasets.

Conclusion
All we may need for VQA are image captions! This work demonstrates that it is possible to automatically generate high-quality VQA data at scale, serving as an essential building block for VQA and vision-and-language models in general (e.g., ALIGN, CoCa). We hope that our work inspires other work on data-centric VQA.

Acknowledgments
We thank Roee Aharoni, Idan Szpektor, and Radu Soricut for their feedback on this blogpost. We also thank our co-authors: Xi Chen, Nan Ding, Idan Szpektor, and Radu Soricut. We acknowledge contributions from Or Honovich, Hagai Taitelbaum, Roee Aharoni, Sebastian Goodman, Piyush Sharma, Nassim Oufattole, Gal Elidan, Sasha Goldshtein, and Avinatan Hassidim. Finally, we thank the authors of Q2, whose pipeline strongly influences this work.

Revisiting Mask Transformer from a Clustering Perspective

Panoptic segmentation is a computer vision problem that serves as a core task for many real-world applications. Due to its complexity, previous work often divides panoptic segmentation into semantic segmentation (assigning semantic labels, such as “person” and “sky”, to every pixel in an image) and instance segmentation (identifying and segmenting only countable objects, such as “pedestrians” and “cars”, in an image), and further divides it into several sub-tasks. Each sub-task is processed individually, and extra modules are applied to merge the results from each sub-task stage. This process is not only complex, but it also introduces many hand-designed priors when processing sub-tasks and when combining the results from different sub-task stages.

Recently, inspired by Transformer and DETR, an end-to-end solution for panoptic segmentation with mask transformers (an extension of the Transformer architecture that is used to generate segmentation masks) was proposed in MaX-DeepLab. This solution adopts a pixel path (consisting of either convolutional neural networks or vision transformers) to extract pixel features, a memory path (consisting of transformer decoder modules) to extract memory features, and a dual-path transformer for interaction between pixel features and memory features. However, the dual-path transformer, which utilizes cross-attention, was originally designed for language tasks, where the input sequence consists of dozens or hundreds of words. Nonetheless, when it comes to vision tasks, specifically segmentation problems, the input sequence consists of tens of thousands of pixels, which not only indicates a much larger magnitude of input scale, but also represents a lower-level embedding compared to language words.

In “CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation”, presented at CVPR 2022, and “kMaX-DeepLab: k-means Mask Transformer”, to be presented at ECCV 2022, we propose to reinterpret and redesign cross-attention from a clustering perspective (i.e., grouping pixels with the same semantic labels together), which better adapts to vision tasks. CMT-DeepLab is built upon the previous state-of-the-art method, MaX-DeepLab, and employs a pixel clustering approach to perform cross-attention, leading to a more dense and plausible attention map. kMaX-DeepLab further redesigns cross-attention to be more like a k-means clustering algorithm, with a simple change on the activation function. We demonstrate that CMT-DeepLab achieves significant performance improvements, while kMaX-DeepLab not only simplifies the modification but also further pushes the state-of-the-art by a large margin, without test-time augmentation. We are also excited to announce the open-source release of kMaX-DeepLab, our best performing segmentation model, in the DeepLab2 library.

Overview
Instead of directly applying cross-attention to vision tasks without modifications, we propose to reinterpret it from a clustering perspective. Specifically, we note that the mask Transformer object queries can be considered cluster centers (which aim to group pixels with the same semantic labels), and the process of cross-attention is similar to the k-means clustering algorithm, which adopts an iterative process of (1) assigning pixels to cluster centers, where multiple pixels can be assigned to a single cluster center and some cluster centers may have no assigned pixels, and (2) updating the cluster centers by averaging the pixels assigned to the same cluster center (cluster centers with no assigned pixels are left unchanged).

In CMT-DeepLab and kMaX-DeepLab, we reformulate the cross-attention from the clustering perspective, which consists of iterative cluster-assignment and cluster-update steps.

Given the popularity of the k-means clustering algorithm, in CMT-DeepLab we redesign cross-attention so that the spatial-wise softmax operation (i.e., the softmax operation that is applied along the image spatial resolution) that in effect assigns cluster centers to pixels is instead applied along the cluster centers. In kMaX-DeepLab, we further simplify the spatial-wise softmax to cluster-wise argmax (i.e., applying the argmax operation along the cluster centers). We note that the argmax operation is the same as the hard assignment (i.e., a pixel is assigned to only one cluster) used in the k-means clustering algorithm.

Reformulating the cross-attention of the mask transformer from the clustering perspective significantly improves the segmentation performance and simplifies the complex mask transformer pipeline to be more interpretable. First, pixel features are extracted from the input image with an encoder-decoder structure. Then, a set of cluster centers are used to group pixels, which are further updated based on the clustering assignments. Finally, the clustering assignment and update steps are iteratively performed, with the last assignment directly serving as segmentation predictions.

To convert a typical mask Transformer decoder (consisting of cross-attention, multi-head self-attention, and a feed-forward network) into our proposed k-means cross-attention, we simply replace the spatial-wise softmax with cluster-wise argmax.
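
The NumPy sketch below contrasts the two attention variants for N cluster centers attending to HW pixel features. It is a conceptual illustration of the idea described above, not the DeepLab2 implementation:

    import numpy as np

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def typical_cross_attention(centers, pixels):
        """centers: (N, D) object queries; pixels: (HW, D) features.
        Spatial-wise softmax, as in a standard transformer decoder."""
        logits = centers @ pixels.T        # (N, HW) affinities
        attn = softmax(logits, axis=1)     # softmax over the spatial axis
        return attn @ pixels               # soft-weighted pixel average

    def kmax_cross_attention(centers, pixels):
        """Cluster-wise argmax: each pixel is hard-assigned to its best
        center, exactly the assignment step of k-means clustering."""
        logits = centers @ pixels.T        # (N, HW)
        assign = (logits == logits.max(axis=0, keepdims=True)).astype(pixels.dtype)
        counts = np.maximum(assign.sum(axis=1, keepdims=True), 1.0)
        # Update step: average the pixels assigned to each center. Empty
        # clusters get a zero update here; a residual connection would
        # leave their centers unchanged.
        return assign @ pixels / counts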

The meta architecture of our proposed kMaX-DeepLab consists of three components: pixel encoder, enhanced pixel decoder, and kMaX decoder. The pixel encoder is any network backbone, used to extract image features. The enhanced pixel decoder includes transformer encoders to enhance the pixel features, and upsampling layers to generate higher resolution features. The series of kMaX decoders transform cluster centers into (1) mask embedding vectors, which multiply with the pixel features to generate the predicted masks, and (2) class predictions for each mask.

The meta architecture of kMaX-DeepLab.

Results
We evaluate the CMT-DeepLab and kMaX-DeepLab using the panoptic quality (PQ) metric on two of the most challenging panoptic segmentation datasets, COCO and Cityscapes, against MaX-DeepLab and other state-of-the-art methods. CMT-DeepLab achieves significant performance improvement, while kMaX-DeepLab not only simplifies the modification but also further pushes the state-of-the-art by a large margin, with 58.0% PQ on COCO val set, and 68.4% PQ, 44.0% mask Average Precision (mask AP), 83.5% mean Intersection-over-Union (mIoU) on Cityscapes val set, without test-time augmentation or using an external dataset.

    Method         PQ
    MaX-DeepLab    51.1% (-6.9%)
    MaskFormer     52.7% (-5.3%)
    K-Net          54.6% (-3.4%)
    CMT-DeepLab    55.3% (-2.7%)
    kMaX-DeepLab   58.0%
Comparison on COCO val set.

    Method             PQ              mask AP         mIoU
    Panoptic-DeepLab   63.0% (-5.4%)   35.3% (-8.7%)   80.5% (-3.0%)
    Axial-DeepLab      64.4% (-4.0%)   36.7% (-7.3%)   80.6% (-2.9%)
    SWideRNet          66.4% (-2.0%)   40.1% (-3.9%)   82.2% (-1.3%)
    kMaX-DeepLab       68.4%           44.0%           83.5%
Comparison on Cityscapes val set.

Designed from a clustering perspective, kMaX-DeepLab not only achieves higher performance but also yields a more plausible visualization of the attention map, making its working mechanism easier to understand. In the example below, kMaX-DeepLab iteratively performs clustering assignments and updates, which gradually improves the mask quality.

kMaX-DeepLab’s attention map can be directly visualized as a panoptic segmentation, which gives better plausibility for the model working mechanism (image credit: coco_url, and license).

Conclusions
We have demonstrated a way to better design mask transformers for vision tasks. With simple modifications, CMT-DeepLab and kMaX-DeepLab reformulate cross-attention to be more like a clustering algorithm. As a result, the proposed models achieve state-of-the-art performance on the challenging COCO and Cityscapes datasets. We hope that the open-source release of kMaX-DeepLab in the DeepLab2 library will facilitate future research on designing vision-specific transformer architectures.

Acknowledgements
We are thankful for the valuable discussions and support from Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Florian Schroff, Hartwig Adam, and Alan Yuille.

Deep Hierarchical Planning from Pixels

Research into how artificial agents can make decisions has evolved rapidly through advances in deep reinforcement learning. Compared to generative ML models like GPT-3 and Imagen, artificial agents can directly influence their environment through actions, such as moving a robot arm based on camera inputs or clicking a button in a web browser. While artificial agents have the potential to be increasingly helpful to people, current methods are held back by the need to receive detailed feedback in the form of frequently provided rewards to learn successful strategies. For example, despite large computational budgets, even powerful programs such as AlphaGo are limited to a few hundred moves until receiving their next reward.

In contrast, complex tasks like making a meal require decision making at all levels, from planning the menu, navigating to the store to pick up groceries, and following the recipe in the kitchen to properly executing the fine motor skills needed at each step along the way based on high-dimensional sensory inputs. Hierarchical reinforcement learning (HRL) promises to automatically break down such complex tasks into manageable subgoals, enabling artificial agents to solve tasks more autonomously from fewer rewards, also known as sparse rewards. However, research progress on HRL has proven to be challenging; current methods rely on manually specified goal spaces or subtasks, and no general solution exists.

To spur progress on this research challenge and in collaboration with the University of California, Berkeley, we present the Director agent, which learns practical, general, and interpretable hierarchical behaviors from raw pixels. Director trains a manager policy to propose subgoals within the latent space of a learned world model and trains a worker policy to achieve these goals. Despite operating on latent representations, we can decode Director’s internal subgoals into images to inspect and interpret its decisions. We evaluate Director across several benchmarks, showing that it learns diverse hierarchical strategies and enables solving tasks with very sparse rewards where previous approaches fail, such as exploring 3D mazes with quadruped robots directly from first-person pixel inputs.

Director learns to solve complex long-horizon tasks by automatically breaking them down into subgoals. Each panel shows the environment interaction on the left and the decoded internal goals on the right.

How Director Works
Director learns a world model from pixels that enables efficient planning in a latent space. The world model maps images to model states and then predicts future model states given potential actions. From predicted trajectories of model states, Director optimizes two policies: The manager chooses a new goal every fixed number of steps, and the worker learns to achieve the goals through low-level actions. However, choosing goals directly in the high-dimensional continuous representation space of the world model would be a challenging control problem for the manager. Instead, we learn a goal autoencoder to compress the model states into smaller discrete codes. The manager then selects discrete codes and the goal autoencoder turns them into model states before passing them as goals to the worker.

Left: The goal autoencoder (blue) compresses the world model (green) state (st) into discrete codes (z). Right: The manager policy (orange) selects a code that the goal decoder (blue) turns into a feature space goal (g). The worker policy (red) learns to achieve the goal from future trajectories (s1, …, s4) predicted by the world model.
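
At act time, the pieces fit together roughly as in the sketch below. The component interfaces (encode, select, decode, act) are hypothetical stand-ins, and the real agent additionally trains every part from imagined rollouts of the world model:

    class DirectorActor:
        """Illustrative wiring of the trained components at act time."""
        def __init__(self, world_model, goal_autoencoder, manager, worker, K=8):
            self.world_model = world_model
            self.goal_autoencoder = goal_autoencoder
            self.manager, self.worker, self.K = manager, worker, K
            self.goal = None

        def act(self, obs, state, step):
            state = self.world_model.encode(obs, state)  # pixels -> model state
            if step % self.K == 0:                       # fixed goal duration
                code = self.manager.select(state)        # discrete goal code
                self.goal = self.goal_autoencoder.decode(code)  # -> latent goal
            action = self.worker.act(state, self.goal)   # low-level control
            return action, state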

All components of Director are optimized concurrently, so the manager learns to select goals that are achievable by the worker. The manager learns to select goals to maximize both the task reward and an exploration bonus, leading the agent to explore and steer towards remote parts of the environment. We found that preferring model states where the goal autoencoder incurs high prediction error is a simple and effective exploration bonus. Unlike prior methods, such as Feudal Networks, our worker receives no task reward and learns purely from maximizing the feature space similarity between the current model state and the goal. This means the worker has no knowledge of the task and instead concentrates all its capacity on achieving goals.
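
The two reward signals can be sketched as follows; cosine similarity stands in for the feature-space similarity, and the reconstruction-error bonus mirrors the description above (both are illustrative simplifications):

    import numpy as np

    def worker_reward(state, goal, eps=1e-8):
        """Feature-space similarity between the current model state and
        the goal; the worker maximizes only this, not the task reward."""
        return float(state @ goal /
                     (np.linalg.norm(state) * np.linalg.norm(goal) + eps))

    def exploration_bonus(state, goal_autoencoder):
        """States the goal autoencoder reconstructs poorly are treated
        as novel, steering the manager toward unexplored regions."""
        recon = goal_autoencoder.decode(goal_autoencoder.encode(state))
        return float(np.mean((state - recon) ** 2))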

Benchmark Results
Whereas prior work in HRL often resorted to custom evaluation protocols — such as assuming diverse practice goals, access to the agents’ global position on a 2D map, or ground-truth distance rewards — Director operates in the end-to-end RL setting. To test the ability to explore and solve long-horizon tasks, we propose the challenging Egocentric Ant Maze benchmark. This suite of tasks requires finding and reaching goals in 3D mazes by controlling the joints of a quadruped robot, given only proprioceptive and first-person camera inputs. The sparse reward is given when the robot reaches the goal, so the agents have to explore autonomously in the absence of task rewards throughout most of their learning.

The Egocentric Ant Maze benchmark measures the ability of agents to explore in a temporally-abstract manner to find the sparse reward at the end of the maze.

We evaluate Director against two state-of-the-art algorithms that are also based on world models: Plan2Explore, which maximizes both task reward and an exploration bonus based on ensemble disagreement, and Dreamer, which simply maximizes the task reward. Both baselines learn non-hierarchical policies from imagined trajectories of the world model. We find that Plan2Explore results in noisy movements that flip the robot onto its back, preventing it from reaching the goal. Dreamer reaches the goal in the smallest maze but fails to explore the larger mazes. In these larger mazes, Director is the only method to find and reliably reach the goal.

To study the ability of agents to discover very sparse rewards in isolation and separately from the challenge of representation learning of 3D environments, we propose the Visual Pin Pad suite. In these tasks, the agent controls a black square, moving it around to step on differently colored pads. At the bottom of the screen, the history of previously activated pads is shown, removing the need for long-term memory. The task is to discover the correct sequence for activating all the pads, at which point the agent receives the sparse reward. Again, Director outperforms previous methods by a large margin.

The Visual Pin Pad benchmark allows researchers to evaluate agents under very sparse rewards and without confounding challenges such as perceiving 3D scenes or long-term memory.

In addition to solving tasks with sparse rewards, we study Director’s performance on a wide range of tasks common in the literature that typically require no long-term exploration. Our experiment includes 12 tasks that cover Atari games, Control Suite tasks, DMLab maze environments, and the research platform Crafter. We find that Director succeeds across all these tasks with the same hyperparameters, demonstrating the robustness of the hierarchy learning process. Additionally, providing the task reward to the worker enables Director to learn precise movements for the task, fully matching or exceeding the performance of the state-of-the-art Dreamer algorithm.

Director solves a wide range of standard tasks with dense rewards with the same hyperparameters, demonstrating the robustness of the hierarchy learning process.

Goal Visualizations
While Director uses latent model states as goals, the learned world model allows us to decode these goals into images for human interpretation. We visualize the internal goals of Director for multiple environments to gain insights into its decision making and find that Director learns diverse strategies for breaking down long-horizon tasks. For example, on the Walker and Humanoid tasks, the manager requests a forward leaning pose and shifting floor patterns, with the worker filling in the details of how the legs need to move. In the Egocentric Ant Maze, the manager steers the ant robot by requesting a sequence of different wall colors. In the 2D research platform Crafter, the manager requests resource collection and tools via the inventory display at the bottom of the screen, and in DMLab mazes, the manager encourages the worker via the teleport animation that occurs right after collecting the desired object.

Left: In Egocentric Ant Maze XL, the manager directs the worker through the maze by targeting walls of different colors. Right: In Visual Pin Pad Six, the manager specifies subgoals via the history display at the bottom and by highlighting different pads.
Left: In Walker, the manager requests a forward leaning pose with both feet off the ground and a shifting floor pattern, with the worker filling in the details of leg movement. Right: In the challenging Humanoid task, Director learns to stand up and walk reliably from pixels and without early episode terminations.
Left: In Crafter, the manager requests resource collection via the inventory display at the bottom of the screen. Right: In DMLab Goals Small, the manager requests the teleport animation that occurs when receiving a reward as a way to communicate the task to the worker.

Future Directions
We see Director as a step forward in HRL research and are preparing its code for release. Director is a practical, interpretable, and generally applicable algorithm that provides an effective starting point for the research community’s future development of hierarchical artificial agents, for example by allowing goals to correspond to only subsets of the full representation vectors, dynamically learning the duration of the goals, or building hierarchical agents with three or more levels of temporal abstraction. We are optimistic that future algorithmic advances in HRL will unlock new levels of performance and autonomy of intelligent agents.
