Permutation-Invariant Neural Networks for Reinforcement Learning

Posted by David Ha, Staff Research Scientist and Yujin Tang, Research Software Engineer, Google Research, Tokyo

“The brain is able to use information coming from the skin as if it were coming from the eyes. We don’t see with the eyes or hear with the ears, these are just the receptors, seeing and hearing in fact goes on in the brain.”
Paul Bach-y-Rita, quoted in Livewired

People have the amazing ability to use one sensory modality (e.g., touch) to supply environmental information normally gathered by another sense (e.g., vision). This adaptive ability, called sensory substitution, is a phenomenon well-known to neuroscience. While difficult adaptations — such as adjusting to seeing things upside-down, learning to ride a “backwards” bicycle, or learning to “see” by interpreting visual information emitted from a grid of electrodes placed on one’s tongue — require anywhere from weeks, months or even years to attain mastery, people are able to eventually adjust to sensory substitutions.

Examples of Sensory Substitution. Left: Tongue Display Unit (Maris and Bach-y-Rita, 2001; Image: Kaczmarek, 2011). Right: “Upside down goggles” initially conceived by Erismann and Kohler in 1931. (Image Wikipedia).

In contrast, most neural networks are not able to adapt to sensory substitutions at all. For instance, most reinforcement learning (RL) agents require their inputs to be in a pre-specified format, or else they will fail. They expect fixed-size inputs and assume that each element of the input carries a precise meaning, such as the pixel intensity at a specified location, or state information, like position or velocity. In popular RL benchmark tasks (e.g., Ant or Cart-pole), an agent trained using current RL algorithms will fail if its sensory inputs are changed or if the agent is fed additional noisy inputs that are unrelated to the task at hand.

In “The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning”, a spotlight paper at NeurIPS 2021, we explore permutation invariant neural network agents, which require each of their sensory neurons (receptors that receive sensory inputs from the environment) to figure out the meaning and context of its input signal, rather than explicitly assuming a fixed meaning. Our experiments show that such agents are robust to observations that contain additional redundant or noisy information, and to observations that are corrupt and incomplete.

Permutation invariant reinforcement learning agents adapting to sensory substitutions. Left: The ordering of the ant’s 28 observations are randomly shuffled every 200 time-steps. Unlike the standard policy, our policy is not affected by the suddenly permuted inputs. Right: Cart-pole agent given many redundant noisy inputs (Interactive web-demo).

In addition to adapting to sensory substitutions in state-observation environments (like the ant and cart-pole examples), we show that these agents can also adapt to sensory substitutions in complex visual-observation environments (such as a CarRacing game that uses only pixel observations) and can perform when the stream of input images is constantly being reshuffled:

We partition the visual input from CarRacing into a 2D grid of small patches, and shuffled their ordering. Without any additional training, our agent still performs even when the original training background (left) is replaced with new images (right).

Method
Our approach takes observations from the environment at each time-step and feeds each element of the observation into distinct, but identical neural networks (called “sensory neurons”), each with no fixed relationship with one another. Each sensory neuron integrates over time information from only their particular sensory input channel. Because each sensory neuron receives only a small part of the full picture, they need to self-organize through communication in order for a global coherent behavior to emerge.

Illustration of observation segmentation.We segment each input into elements, which are then fed to independent sensory neurons. For non-vision tasks where the inputs are usually 1D vectors, each element is a scalar. For vision tasks, we crop each input image into non-overlapping patches.

We encourage neurons to communicate with each other by training them to broadcast messages. While receiving information locally, each individual sensory neuron also continually broadcasts an output message at each time-step. These messages are consolidated and combined into an output vector, called the global latent code, using an attention mechanism similar to that applied in the Transformer architecture. A policy network then uses the global latent code to produce the action that the agent will use to interact with the environment. This action is also fed back into each sensory neuron in the next time-step, closing the communication loop.

Overview of the permutation-invariant RL method. We first feed each individual observation (ot) into a particular sensory neuron (along with the agent’s previous action, at-1). Each neuron then produces and broadcasts a message independently, and an attention mechanism summarizes them into a global latent code (mt) that is given to the agent’s downstream policy network (𝜋) to produce the agent’s action at.

Why is this system permutation invariant? Each sensory neuron is an identical neural network that is not confined to only process information from one particular sensory input. In fact, in our setup, the inputs to each sensory neuron are not defined. Instead, each neuron must figure out the meaning of its input signal by paying attention to the inputs received by the other sensory neurons, rather than explicitly assuming a fixed meaning. This encourages the agent to process the entire input as an unordered set, making the system to be permutation invariant to its input. Furthermore, in principle, the agent can use as many sensory neurons as required, thus enabling it to process observations of arbitrary length. Both of these properties will help the agent adapt to sensory substitutions.

Results
We demonstrate the robustness and flexibility of this approach in simpler, state-observation environments, where the observations the agent receives as inputs are low-dimensional vectors holding information about the agent’s states, such as the position or velocity of its components. The agent in the popular Ant locomotion task has a total of 28 inputs with information that includes positions and velocities. We shuffle the order of the input vector several times during a trial and show that the agent is rapidly able to adapt and is still able to walk forward.

In cart-pole, the agent’s goal is to swing up a cart-pole mounted at the center of the cart and balance it upright. Normally the agent sees only five inputs, but we modify the cartpole environment to provide 15 shuffled input signals, 10 of which are pure noise, and the remainder of which are the actual observations from the environment. The agent is still able to perform the task, demonstrating the system’s capacity to work with a large number of inputs and attend only to channels it deems useful. Such flexibility may find useful applications for processing a large unspecified number of signals, most of which are noise, from ill-defined systems.

We also apply this approach to high-dimensional vision-based environments where the observation is a stream of pixel images. Here, we investigate screen-shuffled versions of vision-based RL environments, where each observation frame is divided into a grid of patches, and like a puzzle, the agent must process the patches in a shuffled order to determine a course of action to take. To demonstrate our approach on vision-based tasks, we created a shuffled version of Atari Pong.

Shuffled Pong results. Left: Pong agent trained to play using only 30% of the patches matches performance of Atari opponent. Right: Without extra training, when we give the agent more puzzle pieces, its performance increases.

Here the agent’s input is a variable-length list of patches, so unlike typical RL agents, the agent only gets to “see” a subset of patches from the screen. In the puzzle pong experiment, we pass to the agent a random sample of patches across the screen, which are then fixed through the remainder of the game. We find that we can discard 70% of the patches (at these fixed-random locations) and still train the agent to perform well against the built-in Atari opponent. Interestingly, if we then reveal additional information to the agent (e.g., allowing it access to more image patches), its performance increases, even without additional training. When the agent receives all the patches, in shuffled order, it wins 100% of the time, achieving the same result with agents that are trained while seeing the entire screen.

We find that imposing additional difficulty during training by using unordered observations has additional benefits, such as improving generalization to unseen variations of the task, like when the background of the CarRacing training environment is replaced with a novel image.

Shuffled CarRacing results. The agent has learned to focus its attention (indicated by the highlighted patches) on the road boundaries. Left: Training environment. Right: Test environment with new background.

Conclusion
The permutation invariant neural network agents presented here can handle ill-defined, varying observation spaces. Our agents are robust to observations that contain redundant or noisy information, or observations that are corrupt and incomplete. We believe that permutation invariant systems open up numerous possibilities in reinforcement learning.

If you’re interested to learn more about this work, we invite readers to read our interactive article (pdf version) or watch our video. We also released code to reproduce our experiments.

Read More

A decade in deep learning, and what’s next

Twenty years ago, Google started using machine learning, and 10 years ago, it helped spur rapid progress in AI using deep learning. Jeff Dean and Marian Croak of Google Research take a look at how we’ve innovated on these techniques and applied them in helpful ways, and look ahead to a responsible and inclusive path forward.

Jeff Dean

From research demos to AI that really works

I was first introduced to neural networks — computer systems that roughly imitate how biological brains accomplish tasks — as an undergrad in 1990. I did my senior thesis on using parallel computation to train neural networks. In those early days, I thought if we could 32X more compute power (using 32 processors at the time!), we could get neural networks to do impressive things. I was way off. It turns out we would need about 1 million times as much computational power before neural networks could scale to real-world problems.

A decade later, as an early employee at Google, I became reacquainted with machine learning when the company was still just a startup. In 2001 we used a simpler version of machine learning, statistical ML, to detect spam and suggest better spellings for people’s web searches. But it would be another decade before we had enough computing power to revive a more computationally-intensive machine learning approach called deep learning. Deep learning uses neural networks with multiple layers (thus the “deep”), so it can learn not just simple statistical patterns, but can learn subtler patterns of patterns — such as what’s in an image or what word was spoken in some audio. One of our first publications in 2012 was on a system that could find patterns among millions of frames from YouTube videos. That meant, of course, that it learned to recognize cats.

To get to the helpful features you use every day — searchable photo albums, suggestions on email replies, language translation, flood alerts, and so on — we needed to make years of breakthroughs on top of breakthroughs, tapping into the best of Google Research in collaboration with the broader research community. Let me give you just a couple examples of how we’ve done this.

A big moment for image recognition

In 2012, a paper wowed the research world for making a huge jump in accuracy on image recognition using deep neural networks, leading to a series of rapid advances by researchers outside and within Google. Further advances led to applications like Google Photos in 2015, letting you search photos by what’s in them. We then developed other deep learning models to help you find addresses in Google Maps, make sense of videos on YouTube, and explore the world around you using Google Lens. Beyond our products, we applied these approaches to health-related problems, such as detecting diabetic retinopathy in 2016, and then cancerous cells in 2017, and breast cancer in 2020. Better understanding of aerial imagery through deep learning let us launch flood forecasting in 2018, now expanded to cover more than 360 million people in 2021. It’s been encouraging to see how helpful these advances in image recognition have been.

Similarly, we’ve used deep learning to accelerate language understanding. With sequence-to-sequence learning in 2014, we began looking at how to understand strings of text using deep learning. This led to neural machine translation in Google Translate in 2016, which was a massive leap in quality, particularly for less prevalent languages. We developed neural language models further for Smart Reply in Gmail in 2017, which made it easier and faster for you to knock through your email, especially on mobile. That same year, Google invented Transformers, leading to BERT in 2018, then T5, and in 2021 MUM, which lets you ask Google much more nuanced questions. And with “sparse” models like GShard, we can dramatically improve on tasks like translation while using less energy.

We’ve driven a similar arc in understanding speech. In 2012, Google used deep neural networks to make major improvements to speech recognition on Android. We kept advancing the state of the art with higher-quality, faster, more efficient speech recognition systems. By 2019, we were able to put the entire neural network on-device so you could get accurate speech recognition even without a connection. And in 2021, we launched Live Translate on the Pixel 6 phone, letting you speak and be translated in 48 languages — all on-device, while you’re traveling with no Internet.

  • image of speech-to-text on phone

    Project Relate: A communication tool for people with speech impairments.

  • image of flood forecasting map on phone

    ML-based flood forecasting helps equip those in harm’s way with accurate and detailed alerts.

  • image of mammogram

    Google Health’s AI system helps radiologistsidentify cancer in mammograms with greater accuracy.

More invention ahead

As our research goes forward, we’re balancing more immediately applied research with more exploratory fundamental research. So we’re looking at how, for example, AI can aid scientific discovery, with a project like mapping the brain of a fly, which could one day help better understand and treat mental illness in people. We’re also pursuing quantum computing, which will likely take a decade or longer to reach wide-scale applications. This is why we publish nearly1000 papers a year, including around 200 related to responsible AI, and we’ve given over 6500 grants to external researchers over the past decade and a half.

Looking ahead from 2021 to 2031, I’m excited about the next-generation AI systems we can build, and how much more helpful they’ll be. We’re planting the seeds today with new architectures like Pathways, with more to come.

Marian Croak

Minding the gap(s)

As we develop these lines of research and turn them into useful technologies, we’re mindful of the broader societal impact of AI, and especially that technology has not always had an equitable impact. This is personal for me — I care deeply about ensuring that people from all different backgrounds and circumstances have a good experience.

So we’re increasing the depth and rigor of how we review and evaluate our research to ensure we’re developing it responsibly. We’re also scaling up what we learn by inventing new tools to understand and calibrate critical AI systems across Google’s products. We’re growing our organization to 200 experts in Responsible AI and Human Centered Technology, and working with hundreds of partners in product, privacy, security, and other teams across Google.

As one example of our work on responsible AI, Google Research began exploring the nascent field of ML fairness in 2016. The teams realized that on top of publishing papers, they could have a greater impact by teaching ML practitioners how to build with fairness in mind, as with the course we launched in 2018. We also started building interactive tools that coders and researchers could use, from the What-If Tool in 2018 to the 2019 launch of our Fairness Indicators tool, all the way to Know Your Data in 2021. All of these are concrete ways that AI developers can test their datasets and models to see what kind of biases and gaps there are, and start to work on mitigations to prevent unfair outcomes.

A principled approach

In fact, fairness is one of the key tenets of our AI Principles. We developed these principles in 2017 and published them in 2018, announcing not only the Principles themselves but a set of responsible AI practices with practical organizational and technical advice from what we’ve learned along the way. I was proud to be involved in the AI Principles review process from early on — I’ve seen firsthand how rigorous the teams at Google are on evaluating the technology we’re developing and deciding how best to deploy it in the real world.

Indeed, there are paths we’ve chosen not to go down — the AI Principles describe a number of areas we avoid. In line with our principles, we’ve taken a very cautious approach on face recognition. We recognize how fraught this area is not only in terms of privacy and surveillance concerns, but also its potential for unfair bias and impacts on historically marginalized groups. I’m glad that we’re taking this so thoughtfully and carefully.

We’re also developing technologies that help engineers apply the AI Principles directly — for example, incorporating privacy design principles. We invented Federated Learning in 2017 as a way to train ML models without your personal data leaving your phone. In 2018 we showed how well this works on Gboard, the free keyboard you can download for your phone — it learns to provide you more useful suggestions, while keeping what you type private on your device.

If you’re curious, you can learn more about all these veins of research, product impact, processes, and external engagement in our 2021 AI Principles Progress Update.

AI by everyone, for everyone

As we look to the decade ahead, it’s incredibly important that AI be built in a way that works well for everyone. That means building as inclusive a team as we can ourselves at Google. It also means ensuring the field as a whole increasingly represents the people whose lives it aims to improve.

I’m proud to lead the Black Leadership Advisory Group (BLAG) at Google. We helped craft and drive programs included in Google’s recent update on racial equity work. For example, we paired up new director-level hires with BLAG members, and the feedback has been really positive, with 80% of respondents saying they’d recommend the program. We’re looking at extending this to other groups, including for Latinx+ and Asian+ Googlers. We’re holding ourselves accountable as leaders too — we now evaluate all VPs and above at Google on progress on diversity, equity, and inclusion. This is crucial if we’re going to have a more representative set of researchers and engineers building future technologies.

For the broader research and computer science communities, we’re providing a wide variety of grants, programs, and collaborations that we hope will welcome a more representative range of researchers. Our Research Scholar Program, begun in 2021, gave grants to more than 50 universities in 15+ countries — and 43% of the principal investigators identify as part of a group that’s been historically marginalized in tech. Similarly, our exploreCSR and CS Research Mentorship programs support thousands of undergrads from marginalized groups. And we’re partnering with groups like the National Science Foundation on their new Institute for Human-AI Collaborations.

We’re doing everything we can to make AI work well for all people. We’ll not only help ensure products across Google are using the latest practices in responsible AI — we’ll also encourage new products and features that serve those who’ve historically missed out on helpful new technologies. One example is Project Relate, which uses machine learning to help people with speech impairments communicate and use technology more easily. Another is Real Tone, which helps our imaging products like our Pixel phone camera and Google Photos more accurately and beautifully represent a diverse range of skin tones. These are just the start.

We’re excited for what’s ahead in AI, for everyone.

Read More

Predicting Text Readability from Scrolling Interactions

Posted by Sian Gooding, Intern, Google Research

Illiteracy affects at least 773 million people globally, both young and old. For these individuals, reading information from unfamiliar sources or on unfamiliar topics can be extremely difficult. Unfortunately, these inequalities have been further magnified by the global pandemic as a result of unequal access to education in reading and writing. In fact, UNESCO reports that over 100 million children are falling behind the minimum proficiency level in reading due to COVID-related school closures.

With increasing world-wide access to technology, reading on a device, such as a tablet or phone, has largely taken the place of traditional formats. This provides a unique opportunity to observe reading interactions, e.g., how a reader scrolls through a text, which can inform our understanding of what can make text difficult to read. This understanding is crucial when designing educational applications for low-proficiency readers and language learners, because it can be used to match learners with appropriately leveled texts as well as to support readers in understanding texts beyond their reading level.

In “Predicting Text Readability from Scrolling Interactions”, presented at CoNLL 2021, we show that data from on-device reading interactions can be used to predict how readable a text is. This novel approach provides insights into subjective readability — whether an individual reader has found a text accessible — and demonstrates that existing readability models can be improved by including feedback from scroll-based reading interactions. In order to encourage research in this area and to help enable more personalized tools for language learning and text simplification, we are releasing the dataset of reading interactions generated from our scrolling behavior–based readability assessment of English-language texts.

Understanding Text Difficulty
There are multiple aspects of a text that impact how difficult it is to read, including the vocabulary level, the syntactic structure, and overall coherence. Traditional machine learning approaches to measure readability have exclusively relied on such linguistic features. However, using these features alone does not work well for online content, because such content often contains abbreviations, emojis, broken text, and short passages, which detrimentally impact the performance of readability models.

To address this, we investigated whether aggregate data about the reading interactions of a group can be used to predict how difficult a text is, as well as how reading interactions may differ based on a readers’ understanding. When reading on a device, readers typically interact with text by scrolling in a vertical fashion, which we hypothesize can be used as a coarse proxy for reading comprehension. With this in mind, we recruited 518 paid participants and asked them to read English-language texts of different difficulty levels. We recorded the reading interactions by measuring different features of the participants’ scrolling behavior, such as the speed, acceleration and number of times areas of text were revisited. We then used this information to produce a set of features for a readability classifier.

Predicting Text Difficulty from Scrolling Behavior
We investigated which types of scrolling behaviors were most impacted by text difficulty and tested the significance using linear mixed effect models. In our set up, we have repeated measures, as multiple participants read the same texts and each participant reads more than one text. Using linear mixed-effect models gives us a higher confidence that the differences in interactions we are observing are because of the text difficulty, and not other random effects.

Our results showed that multiple reading behaviors differed significantly based on the text level, for example, the average, maximum and minimum acceleration of scrolling. We found the most significant features to be the total read time and the maximum reading speeds.

We then used these features as inputs to a machine learning algorithm. We designed and trained a support vector machine (i.e., a binary classifier) to predict whether a text is either advanced or elementary based only on scrolling behaviors as individuals interacted with it. The dataset on which the model was trained contains 60 articles, each of which were read by an average of 17 participants. From these interactions we produced aggregate features by taking the mean of the significant measures across participants.

 

We measured the accuracy of the approach using a metric called f-score, which measures how accurate the model is at classifying a text as either “easy” or “difficult” (where 1.0 reflects perfect classification accuracy). We are able to achieve an f-score of 0.77 on this task, using interaction features alone. This is the first work to show that it is possible to predict the readability of a text using only interaction features.

Improving Readability Models
In order to demonstrate the value of applying readability measures from scrolling behaviors to existing readability models, we integrated scroll-based features into the state-of-the-art automated readability assessment tool, which was released as part of the OneStopEnglish corpus. We found that the addition of interaction features improves the f-score of this model from 0.84 to 0.88. In addition, we were able to significantly outperform this system by using interaction information with simple vocabulary features, such as the number of words in the text, achieving an impressive f-score of 0.96.

In our study, we recorded comprehension scores to evaluate the understanding and readability of text for individuals. Participants were asked three questions per article to assess the reader’s understanding of what they had read. The interaction features of an individual’s scrolling behavior was represented as a high dimensional vector. To explore this data, we visualized the reading interaction features for each participant using t-distributed stochastic neighbor embeddings, which is a statistical method for visualizing high-dimensional data. The results revealed clusters in the comprehension score based on how well individuals understood the text. This shows that there is implicit information in reading interactions about the likelihood that an individual has understood a given text. We refer to this phenomenon as subjective readability. This information can be very useful for educational applications or for simplifying online content.

Plot showing t-SNE projection of scroll interactions in 2-dimensions. The color of each data point corresponds to the comprehension score. Clusters of comprehension scores indicate that there are correlations between reading behaviors and comprehension.

Finally, we investigated the extent to which reading interactions vary across audiences. We compared the average scrolling speed across different reader groups, covering reading proficiency and the reader’s first language. We found that the speed distribution varies depending on the proficiency and first language of the audience. This supports the case that first language and proficiency alter the reading behaviors of audiences, which allows us to contextualize the reading behavior of groups and better understand which areas of text may be harder for them to read.

Histogram showing the average speeds of scrolling (in vertical pixels per millisecond) across readers of different proficiency levels (beginner, intermediate and advanced), with lines showing the smoothed trend for each group. A higher average scroll speed indicates faster reading times. For example, a more challenging text that corresponds to slower scroll speeds by advanced readers is associated with higher scroll speeds by beginners because they engage with the text only superficially.

Histogram showing the average speeds of scrolling (in vertical pixels per millisecond) across audiences by first language of the readers, Tamil or English, with lines showing the smoothed trend for each group. A higher average scroll speed indicates faster reading times. Dark blue bars are where the histograms overlap.

Conclusion
This work is the first to show that reading interactions, such as scrolling behavior, can be used to predict the readability of text, which can yield numerous benefits. Such measures are language agnostic, unobtrusive, and robust to noisy text. Implicit user feedback allows insight into readability at an individual level, thereby allowing for a more inclusive and personalisable assessment of text difficulty. Furthermore, being able to judge the subjective readability of text benefits language learning and educational apps. We conducted a 518 participant study to investigate the impact of text readability on reading interactions and are releasing a novel dataset of the associated reading interactions. We confirm that there are statistically significant differences in the way that readers interact with advanced and elementary texts, and that the comprehension scores of individuals correlate with specific measures of scrolling interaction. For more information our conference presentation is available to view.

Acknowledgements
We thank our collaborators Yevgeni Berzak, Tony Mak and Matt Sharifi, as well as Dmitry Lagun and Blaise Aguera y Arcas for their helpful feedback on the paper.

Read More

RLiable: Towards Reliable Evaluation & Reporting in Reinforcement Learning

Posted by Rishabh Agarwal, Research Scientist and Pablo Samuel Castro, Staff Software Engineer, Google Research, Brain Team

Reinforcement learning (RL) is an area of machine learning that focuses on learning from experiences to solve decision making tasks. While the field of RL has made great progress, resulting in impressive empirical results on complex tasks, such as playing video games, flying stratospheric balloons and designing hardware chips, it is becoming increasingly apparent that the current standards for empirical evaluation might give a false sense of fast scientific progress while slowing it down.

To that end, in “Deep RL at the Edge of the Statistical Precipice”, accepted as an oral presentation at NeurIPS 2021, we discuss how statistical uncertainty of results needs to be considered, especially when using only a few training runs, in order for evaluation in deep RL to be reliable. Specifically, the predominant practice of reporting point estimates ignores this uncertainty and hinders reproducibility of results. Related to this, tables with per-task scores, as are commonly reported, can be overwhelming beyond a few tasks and often omit standard deviations. Furthermore, simple performance metrics like the mean can be dominated by a few outlier tasks, while the median score would remain unaffected even if up to half of the tasks had performance scores of zero. Thus, to increase the field’s confidence in reported results with a handful of runs, we propose various statistical tools, including stratified bootstrap confidence intervals, performance profiles, and better metrics, such as interquartile mean and probability of improvement. To help researchers incorporate these tools, we also release an easy-to-use Python library RLiable with a quickstart colab.

Statistical Uncertainty in RL Evaluation
Empirical research in RL relies on evaluating performance on a diverse suite of tasks, such as Atari 2600 video games, to assess progress. Published results on deep RL benchmarks typically compare point estimates of the mean and median scores aggregated across tasks. These scores are typically relative to some defined baseline and optimal performance (e.g., random agent and “average” human performance on Atari games, respectively) so as to make scores comparable across different tasks.

In most RL experiments, there is randomness in the scores obtained from different training runs, so reporting only point estimates does not reveal whether similar results would be obtained with new independent runs. A small number of training runs, coupled with the high variability in performance of deep RL algorithms, often leads to large statistical uncertainty in such point estimates.

The distribution of median human normalized scores on the Atari 100k benchmark, which contains 26 games, for five recently published algorithms, DER, OTR, CURL, two variants of DrQ, and SPR. The reported point estimates of median scores based on a few runs in publications, as shown by dashed lines, do not provide information about the variability in median scores and typically overestimate (e.g., CURL, SPR, DrQ) or underestimate (e.g., DER) the expected median, which can result in erroneous conclusions.

As benchmarks become increasingly more complex, evaluating more than a few runs will be increasingly demanding due to the increased compute and data needed to solve such tasks. For example, five runs on 50 Atari games for 200 million frames takes 1000+ GPU days. Thus, evaluating more runs is not a feasible solution for reducing statistical uncertainty on computationally demanding benchmarks. While prior work has recommended statistical significance tests as a solution, such tests are dichotomous in nature (either “significant” or “not significant”), so they often lack the granularity needed to yield meaningful insights and are widely misinterpreted.

Number of runs in RL papers over the years. Beginning with the Arcade Learning Environment (ALE), the shift toward computationally-demanding benchmarks has led to the practice of evaluating only a handful of runs per task, increasing the statistical uncertainty in point estimates.

Tools for Reliable Evaluation
Any aggregate metric based on a finite number of runs is a random variable, so to take this into account, we advocate for reporting stratified bootstrap confidence intervals (CIs), which predict the likely values of aggregate metrics if the same experiment were repeated with different runs. These CIs allow us to understand the statistical uncertainty and reproducibility of results. Such CIs use the scores on combined runs across tasks. For example, evaluating 3 runs each on Atari 100k, which contains 26 tasks, results in 78 sample scores for uncertainty estimation.

In each task, colored balls denote scores on different runs. To compute statified bootstrap CIs using the percentile method, bootstrap samples are created by randomly sampling scores with replacement proportionately from each task. Then, the distribution of aggregate scores on these samples is the bootstrapping distribution, whose spread around the center gives us the confidence interval.

Most deep RL algorithms often perform better on some tasks and training runs, but aggregate performance metrics can conceal this variability, as shown below.

Data with varied appearance but identical aggregate statistics. Source: Same Stats, Different Graphs.

Instead, we recommend performance profiles, which are typically used for comparing solve times of optimization software. These profiles plot the score distribution across all runs and tasks with uncertainty estimates using stratified bootstrap confidence bands. These plots show the total runs across all tasks that obtain a score above a threshold (𝝉) as a function of the threshold.

Performance profiles correspond to the empirical tail distribution of scores on runs combined across all tasks. Shaded regions show 95% stratified bootstrap confidence bands.

Such profiles allow for qualitative comparisons at a glance. For example, the curve for one algorithm above another means that one algorithm is better than the other. We can also read any score percentile, e.g., the profiles intersect y = 0.5 (dotted line above) at the median score. Furthermore, the area under the profile corresponds to the mean score.

While performance profiles are useful for qualitative comparisons, algorithms rarely outperform other algorithms on all tasks and thus their profiles often intersect, so finer quantitative comparisons require aggregate performance metrics. However, existing metrics have limitations: (1) a single high performing task may dominate the task mean score, while (2) the task median is unaffected by zero scores on nearly half of the tasks and requires a large number of training runs for small statistical uncertainty. To address the above limitations, we recommend two alternatives based on robust statistics: the interquartile mean (IQM) and the optimality gap, both of which can be read as areas under the performance profile, below.

IQM (red) corresponds to the area under the performance profile, shown in blue, between the 25 and 75 percentile scores on the x-axis. Optimality gap (yellow) corresponds to the area between the profile and horizontal line at y = 1 (human performance), for scores less than 1.

As an alternative to median and mean, IQM corresponds to the mean score of the middle 50% of the runs combined across all tasks. It is more robust to outliers than mean, a better indicator of overall performance than median, and results in smaller CIs, and so, requires fewer runs to claim improvements. Another alternative to mean, optimality gap measures how far an algorithm is from optimal performance.

IQM discards the lowest 25% and highest 25% of the combined scores (colored balls) and computes the mean of the remaining 50% scores.

For directly comparing two algorithms, another metric to consider is the average probability of improvement, which describes how likely an improvement over baseline is, regardless of its size. This metric is computed using the Mann-Whitney U-statistic, averaged across tasks.

Re-evaluating Evaluation
Using the above tools for evaluation, we revisit performance evaluations of existing algorithms on widely used RL benchmarks, revealing inconsistencies in prior evaluation. For example, in the Arcade Learning Environment (ALE), a widely recognized RL benchmark, the performance ranking of algorithms changes depending on the choice of aggregate metric. Since performance profiles capture the full picture, they often illustrate why such inconsistencies exist.

Median (left) and IQM (right) human normalized scores on the ALE as a function of the number of environment frames seen during training. IQM results in significantly smaller CIs than median scores.

On DM Control, a popular continuous control benchmark, there are large overlaps in 95% CIs of mean normalized scores for most algorithms.

DM Control Suite results, averaged across six tasks, on the 100k and 500k step benchmark. Since scores are normalized using maximum performance, mean scores correspond to one minus the optimality gap. The ordering of the algorithms is based on their claimed relative performance — all algorithms except Dreamer claimed improvement over at least one algorithm placed below them. Shaded regions show 95% CIs.

Finally, on Procgen, a benchmark for evaluating generalization in RL, the average probability of improvement shows that some claimed improvements are only 50-70% likely, suggesting that some reported improvements could be spurious.

Each row shows the probability that the algorithm X on the left outperforms algorithm Y on the right, given that X was claimed to be better than Y. Shaded region denotes 95% stratified bootstrap CIs.

Conclusion
Our findings on widely-used deep RL benchmarks show that statistical issues can have a large influence on previously reported results. In this work, we take a fresh look at evaluation to improve the interpretation of reported results and standardize experimental reporting. We’d like to emphasize the importance of published papers providing results for all runs to allow for future statistical analyses. To build confidence in your results, please check out our open-source library RLiable and the quickstart colab.

Acknowledgments
This work was done in collaboration with Max Schwarzer, Aaron Courville and Marc G. Bellemare. We’d like to thank Tom Small for an animated figure used in this post. We are also grateful for feedback by several members of the Google Research, Brain Team and DeepMind.

Read More

MetNet-2: Deep Learning for 12-Hour Precipitation Forecasting

Posted by Nal Kalchbrenner and Lasse Espeholt, Google Research

Deep learning has successfully been applied to a wide range of important challenges, such as cancer prevention and increasing accessibility. The application of deep learning models to weather forecasts can be relevant to people on a day-to-day basis, from helping people plan their day to managing food production, transportation systems, or the energy grid. Weather forecasts typically rely on traditional physics-based techniques powered by the world’s largest supercomputers. Such methods are constrained by high computational requirements and are sensitive to approximations of the physical laws on which they are based.

Deep learning offers a new approach to computing forecasts. Rather than incorporating explicit physical laws, deep learning models learn to predict weather patterns directly from observed data and are able to compute predictions faster than physics-based techniques. These approaches also have the potential to increase the frequency, scope, and accuracy of the predicted forecasts.

Illustration of the computation through MetNet-2. As the computation progresses, the network processes an ever larger context from the input and makes a probabilistic forecast of the likely future weather conditions.

Within weather forecasting, deep learning techniques have shown particular promise for nowcasting — i.e., predicting weather up to 2-6 hours ahead. Previous work has focused on using direct neural network models for weather data, extending neural forecasts from 0 to 8 hours with the MetNet architecture, generating continuations of radar data for up to 90 minutes ahead, and interpreting the weather information learned by these neural networks. Still, there is an opportunity for deep learning to extend improvements to longer-range forecasts.

To that end, in “Skillful Twelve Hour Precipitation Forecasts Using Large Context Neural Networks”, we push the forecasting boundaries of our neural precipitation model to 12 hour predictions while keeping a spatial resolution of 1 km and a time resolution of 2 minutes. By quadrupling the input context, adopting a richer weather input state, and extending the architecture to capture longer-range spatial dependencies, MetNet-2 substantially improves on the performance of its predecessor, MetNet. Compared to physics-based models, MetNet-2 outperforms the state-of-the-art HREF ensemble model for weather forecasts up to 12 hours ahead.

MetNet-2 Features and Architecture
Neural weather models like MetNet-2 map observations of the Earth to the probability of weather events, such as the likelihood of rain over a city in the afternoon, of wind gusts reaching 20 knots, or of a sunny day ahead. End-to-end deep learning has the potential to both streamline and increase quality by directly connecting a system’s inputs and outputs. With this in mind, MetNet-2 aims to minimize both the complexity and the total number of steps involved in creating a forecast.

The inputs to MetNet-2 include the radar and satellite images also used in MetNet. To capture a more comprehensive snapshot of the atmosphere with information such as temperature, humidity, and wind direction — critical for longer forecasts of up to 12 hours — MetNet-2 also uses the pre-processed starting state used in physical models as a proxy for this additional weather information. The radar-based measures of precipitation (MRMS) serve as the ground truth (i.e., what we are trying to predict) that we use in training to optimize MetNet-2’s parameters.

Example ground truth image: Instantaneous precipitation (mm/hr) based on radar (MRMS) capturing a 12 hours-long progression.

MetNet-2’s probabilistic forecasts can be viewed as averaging all possible future weather conditions weighted by how likely they are. Due to its probabilistic nature, MetNet-2 can be likened to physics-based ensemble models, which average some number of future weather conditions predicted by a variety of physics-based models. One notable difference between these two approaches is the duration of the core part of the computation: ensemble models take ~1 hour, whereas MetNet-2 takes ~1 second.

Steps in a MetNet-2 forecast and in a physics-based ensemble.

One of the main challenges that MetNet-2 must overcome to make 12 hour long forecasts is capturing a sufficient amount of spatial context in the input images. For each additional forecast hour we include 64 km of context in every direction at the input. This results in an input context of size 20482 km2 — four times that used in MetNet. In order to process such a large context, MetNet-2 employs model parallelism whereby the model is distributed across 128 cores of a Cloud TPU v3-128. Due to the size of the input context, MetNet-2 replaces the attentional layers of MetNet with computationally more efficient convolutional layers. But standard convolutional layers have local receptive fields that may fail to capture large spatial contexts, so MetNet-2 uses dilated receptive fields, whose size doubles layer after layer, in order to connect points in the input that are far apart one from the other.

Example of input spatial context and target area for MetNet-2.

Results
Because MetNet-2’s predictions are probabilistic, the model’s output is naturally compared with the output of similarly probabilistic ensemble or post-processing models. HREF is one such state-of-the-art ensemble model for precipitation in the United States, which aggregates ten predictions from five different models, twice a day. We evaluate the forecasts using established metrics, such as the Continuous Ranked Probability Score, which captures the magnitude of the probabilistic error of a model’s forecasts relative to the ground truth observations. Despite not performing any physics-based calculations, MetNet-2 is able to outperform HREF up to 12 hours into the future for both low and high levels of precipitation.

Continuous Ranked Probability Score (CRPS; lower is better) for MetNet-2 vs HREF aggregated over a large number of test patches randomly located in the Continental United States.

Examples of Forecasts
The following figures provide a selection of forecasts from MetNet-2 compared with the physics-based ensemble HREF and the ground truth MRMS.

Probability maps for the cumulative precipitation rate of 1 mm/hr on January 3, 2019 over the Pacific NorthWest. The maps are shown for each hour of lead time from 1 to 12. Left: Ground truth, source MRMS. Center: Probability map as predicted by MetNet-2 . Right: Probability map as predicted by HREF.
Comparison of 0.2 mm/hr precipitation on March 30, 2020 over Denver, Colorado. Left: Ground truth, source MRMS. Center: Probability map as predicted by MetNet-2 . Right: Probability map as predicted by HREF.MetNet-2 is able to predict the onset of the storm (called convective initiation) earlier in the forecast than HREF as well as the storm’s starting location, whereas HREF misses the initiation location, but captures its growth phase well.
Comparison of 2 mm/hr precipitation stemming from Hurricane Isaias, an extreme weather event that occurred on August 4, 2020 over the North East coast of the US. Left: Ground truth, source MRMS. Center: Probability map as predicted by MetNet-2. Right: Probability map as predicted by HREF.

Interpreting What MetNet-2 Learns About Weather
Because MetNet-2 does not use hand-crafted physical equations, its performance inspires a natural question: What kind of physical relations about the weather does it learn from the data during training? Using advanced interpretability tools, we further trace the impact of various input features on MetNet-2’s performance at different forecast timelines. Perhaps the most surprising finding is that MetNet-2 appears to emulate the physics described by Quasi-Geostrophic Theory, which is used as an effective approximation of large-scale weather phenomena. MetNet-2 was able to pick up on changes in the atmospheric forces, at the scale of a typical high- or low-pressure system (i.e., the synoptic scale), that bring about favorable conditions for precipitation, a key tenet of the theory.

Conclusion
MetNet-2 represents a step toward enabling a new modeling paradigm for weather forecasting that does not rely on hand-coding the physics of weather phenomena, but rather embraces end-to-end learning from observations to weather targets and parallel forecasting on low-precision hardware. Yet many challenges remain on the path to fully achieving this goal, including incorporating more raw data about the atmosphere directly (rather than using the pre-processed starting state from physical models), broadening the set of weather phenomena, increasing the lead time horizon to days and weeks, and widening the geographic coverage beyond the United States.

Acknowledgements
Shreya Agrawal, Casper Sønderby, Manoj Kumar, Jonathan Heek, Carla Bromberg, Cenk Gazen, Jason Hickey, Aaron Bell, Marcin Andrychowicz, Amy McGovern, Rob Carver, Stephan Hoyer, Zack Ontiveros, Lak Lakshmanan, David McPeek, Ian Gonzalez, Claudio Martella, Samier Merchant, Fred Zyda, Daniel Furrer and Tom Small.


Read More

An Open Source Vibrotactile Haptics Platform for On-Body Applications

Posted by Artem Dementyev, Hardware Engineer, Google Research

Most wearable smart devices and mobile phones have the means to communicate with the user through tactile feedback, enabling applications from simple notifications to sensory substitution for accessibility. Typically, they accomplish this using vibrotactile actuators, which are small electric vibration motors. However, designing a haptic system that is well-targeted and effective for a given task requires experimentation with the number of actuators and their locations in the device, yet most practical applications require standalone on-body devices and integration into small form factors. This combination of factors can be difficult to address outside of a laboratory as integrating these systems can be quite time-consuming and often requires a high level of expertise.

A typical lab setup on the left and the VHP board on the right.

In “VHP: Vibrotactile Haptics Platform for On-body Applications”, presented at ACM UIST 2021, we develop a low-power miniature electronics board that can drive up to 12 independent channels of haptic signals with arbitrary waveforms. The VHP electronics board can be battery-powered, and integrated into wearable devices and small gadgets. It allows all-day wear, has low latency, battery life between 3 and 25 hours, and can run 12 actuators simultaneously. We show that VHP can be used in bracelet, sleeve, and phone-case form factors. The bracelet was programmed with an audio-to-tactile interface to aid lipreading and remained functional when worn for multiple months by developers. To facilitate greater progress in the field of wearable multi-channel haptics with the necessary tools for their design, implementation, and experimentation, we are releasing the hardware design and software for the VHP system via GitHub.

Front and back sides of the VHP circuit board.
Block diagram of the system.

Platform Specifications.
VHP consists of a custom designed circuit board, where the main components are the microcontroller and haptic amplifier, which converts microcontroller’s digital output into signals that drive the actuators. The haptic actuators can be controlled by signals arriving via serial, USB, and Bluetooth Low Energy (BLE), as well as onboard microphones, using an nRF52840 microcontroller, which was chosen because it offers many input and output options and BLE, all in a small package. We added several sensors into the board to provide more experimental flexibility: an on-board digital microphone, an analog microphone amplifier, and an accelerometer. The firmware is a portable C/C++ library that works in the Arduino ecosystem.

To allow for rapid iteration during development, the interface between the board and actuators is critical. The 12 tactile signals’ wiring have to be quick to set up in order to allow for such development, while being flexible and robust to stand up to prolonged use. For the interface, we use a 24-pin FPC (flexible printed circuit) connector on the VHP. We support interfacing to the actuators in two ways: with a custom flexible circuit board and with a rigid breakout board.

VHP board (small board on the right) connected to three different types of tactile actuators via rigid breakout board (large board on the left).

Using Haptic Actuators as Sensors
In our previous blog post, we explored how back-EMF in a haptic actuator could be used for sensing and demonstrated a variety of useful applications. Instead of using back-EMF sensing in the VHP system, we measure the electrical current that drives each vibrotactile actuator and use the current load as the sensing mechanism. Unlike back-EMF sensing, this current-sensing approach allows simultaneous sensing and actuation, while minimizing the additional space needed on the board.

One challenge with the current-sensing approach is that there is a wide variety of vibrotactile actuators, each of which may behave differently and need different presets. In addition, because different actuators can be added and removed during prototyping with the adapter board, it would be useful if the VHP were able to identify the actuator automatically. This would improve the speed of prototyping and make the system more novice-friendly.

To explore this possibility, we collected current-load data from three off-the-shelf haptic actuators and trained a simple support vector machine classifier to recognize the difference in the signal pattern between actuators. The test accuracy was 100% for classifying the three actuators, indicating that each actuator has a very distinct response.

Different actuators have a different current signature during a frequency sweep, thus allowing for automatic identification.

Additionally, vibrotactile actuators require proper contact with the skin for consistent control over stimulation. Thus, the device should measure skin contact and either provide an alert or self-adjust if it is not loaded correctly. To test whether a skin contact measuring technique works in practice, we measured the current load on actuators in a bracelet as it was tightened and loosened around the wrist. As the bracelet strap is tightened, the contact pressure between the skin and the actuator increases and the current required to drive the actuator signal increases commensurately.

Current load sensing is responding to touch, while the actuator is driven at 250 Hz frequency.

Quality of the fit of the bracelet is measured.

Audio-to-Tactile Feedback
To demonstrate the utility of the VHP platform, we used it to develop an audio-to-tactile feedback device to help with lipreading. Lipreading can be difficult for many speech sounds that look similar (visemes), such as “pin” and “min”. In order to help the user differentiate visemes like these, we attach a microphone to the VHP system, which can then pick up the speech sounds and translate the audio to vibrations on the wrist. For audio-to-tactile translation, we used our previously developed algorithms for real-time audio-to-tactile conversion, available via GitHub. Briefly, audio filters are paired with neural networks to recognize certain viesemes (e.g., picking up the hard consonant “p” in “pin”), and are then translated to vibrations in different parts of the bracelet. Our approach is inspired by tactile phonemic sleeve (TAPS), however the major difference is that in our approach the tactile signal is presented continuously and in real-time.

One of the developers who employs lipreading in daily life wore the bracelet daily for several months and found it to give better information to facilitate lipreading than previous devices, allowing improved understanding of lipreading visemes with the bracelet versus lipreading alone. In the future, we plan to conduct full-scale experiments with multiple users wearing the device for an extended time.

Left: Audio-to-tactile sleeve. Middle: Audio-to-tactile bracelet. Right: One of our developers tests out the bracelets, which are worn on both arms.

Potential Applications
The VHP platform enables rapid experimentation and prototyping that can be used to develop techniques for a variety of applications. For example:

  • Rich haptics on small devices: Expanding the number of actuators on mobile phones, which typically only have one or two, could be useful to provide additional tactile information. This is especially useful as fingers are sensitive to vibrations. We demonstrated a prototype mobile phone case with eight vibrotactile actuators. This could be used to provide rich notifications and enhance effects in a mobile game or when watching a video.
  • Lab psychophysical experiments: Because VHP can be easily set up to send and receive haptic signals in real time, e.g., from a Jupyter notebook, it could be used to perform real-time haptic experiments.
  • Notifications and alerts: The wearable VHP could be used to provide haptic notifications from other devices, e.g., alerting if someone is at the door, and could even communicate distinguishable alerts through use of multiple actuators.
  • Sensory substitution: Besides the lipreading assistance example above, there are many other potential applications for accessibility using sensory substitution, such as visual-to-tactile sensing or even sensing magnetic fields.
  • Loading sensing: The ability to sense from the haptic actuator current load is unique to our platform, and enables a variety of features, such as pressure sensing or automatically adjusting actuator output.
Integrating eight voice coils into a phone case. We used loading sensing to understand which voice coils are being touched.

What’s next?
We hope that others can utilize the platform to build a diverse set of applications. If you are interested and have ideas about using our platform or want to receive updates, please fill out this form. We hope that with this platform, we can help democratize the use of haptics and inspire a more widespread use of tactile devices.

Acknowledgments
This work was done by Artem Dementyev, Pascal Getreuer, Dimitri Kanevsky, Malcolm Slaney and Richard Lyon. We thank Alex Olwal, Thad Starner, Hong Tan, Charlotte Reed, Sarah Sterman for valuable feedback and discussion on the paper. Yuhui Zhao, Dmitrii Votintcev, Chet Gnegy, Whitney Bai and Sagar Savla for feedback on the design and engineering.

Read More

Making Better Future Predictions by Watching Unlabeled Videos

Posted by Dave Epstein, Student Researcher and Chen Sun, Staff Research Scientist, Google Research

Machine learning (ML) agents are increasingly deployed in the real world to make decisions and assist people in their daily lives. Making reasonable predictions about the future at varying timescales is one of the most important capabilities for such agents because it enables them to predict changes in the world around them, including other agents’ behaviors, and plan how to act next. Importantly, successful future prediction requires both capturing meaningful transitions in the environment (e.g., dough transforming into bread) and adapting to how transitions unfold over time in order to make decisions.

Previous work in future prediction from visual observations has largely been constrained by the format of its output (e.g., pixels that represent an image) or a manually-defined set of human activities (e.g., predicting if someone will keep walking, sit down, or jump). These are either too detailed and hard to predict or lack important information about the richness of the real world. For example, predicting “person jumping” does not capture why they’re jumping, what they’re jumping onto, etc. Also, with very few exceptions, previous models were designed to make predictions at a fixed offset into the future, which is a limiting assumption because we rarely know when meaningful future states will happen.

For example, in a video about making ice cream (depicted below), the meaningful transition from “cream” to “ice cream” occurs over 35 seconds, so models predicting such transitions would need to look 35 seconds ahead. But this time interval varies a large amount across different activities and videos — meaningful transitions occur at any distance into the future. Learning to make such predictions at flexible intervals is hard because the desired ground truth may be relatively ambiguous. For example, the correct prediction could be the just-churned ice cream in the machine, or scoops of the ice cream in a bowl. In addition, collecting such annotations at scale (i.e., frame-by-frame for millions of videos) is infeasible. However, many existing instructional videos come with speech transcripts, which often offer concise, general descriptions throughout entire videos. This source of data can guide a model’s attention toward important parts of the video, obviating the need for manual labeling and allowing a flexible, data-driven definition of the future.

In “Learning Temporal Dynamics from Cycles in Narrated Video”, published at ICCV 2021, we propose an approach that is self-supervised, using a recent large unlabeled dataset of diverse human action. The resulting model operates at a high level of abstraction, can make predictions arbitrarily far into the future, and chooses how far into the future to predict based on context. Called Multi-Modal Cycle Consistency (MMCC), it leverages narrated instructional video to learn a strong predictive model of the future. We demonstrate how MMCC can be applied, without fine-tuning, to a variety of challenging tasks, and qualitatively examine its predictions. In the example below, MMCC predicts the future (d) from present frame (a), rather than less relevant potential futures (b) or (c).

This work uses cues from vision and language to predict high-level changes (such as cream becoming ice cream) in video (video from HowTo100M).

Viewing Videos as Graphs
The foundation of our method is to represent narrated videos as graphs. We view videos as a collection of nodes, where nodes are either video frames (sampled at 1 frame per second) or segments of narrated text (extracted with automatic speech recognition systems), encoded by neural networks. During training, MMCC constructs a graph from the nodes, using cross-modal edges to connect video frames and text segments that refer to the same state, and temporal edges to connect the present (e.g., strawberry-flavored cream) and the future (e.g., soft-serve ice cream). The temporal edges operate on both modalities equally — they can start from either a video frame, some text, or both, and can connect to a future (or past) state in either modality. MMCC achieves this by learning a latent representation shared by frames and text and then making predictions in this representation space.

Multi-modal Cycle Consistency
To learn the cross-modal and temporal edge functions without supervision, we apply the idea of cycle consistency. Here, cycle consistency refers to the construction of cycle graphs, in which the model constructs a series of edges from an initial node to other nodes and back again: Given a start node (e.g., a sample video frame), the model is expected to find its cross-modal counterpart (i.e., text describing the frame) and combine them as the present state. To do this, at the start of training, the model assumes that frames and text with the same timestamps are counterparts, but then relaxes this assumption later. The model then predicts a future state, and the node most similar to this prediction is selected. Finally, the model attempts to invert the above steps by predicting the present state backward from the future node, and thus connecting the future node back with the start node.

The discrepancy between the model’s prediction of the present from the future and the actual present is the cycle-consistency loss. Intuitively, this training objective requires the predicted future to contain enough information about its past to be invertible, leading to predictions that correspond to meaningful changes to the same entities (e.g., tomato becoming marinara sauce, or flour and eggs in a bowl becoming dough). Moreover, the inclusion of cross-modal edges ensures future predictions are meaningful in either modality.

To learn the temporal and cross-modal edge functions end-to-end, we use the soft attention technique, which first outputs how likely each node is to be the target node of the edge, and then “picks” a node by taking the weighted average among all possible candidates. Importantly, this cyclic graph constraint makes few assumptions for the kind of temporal edges the model should learn, as long as they end up forming a consistent cycle. This enables the emergence of long-term temporal dynamics critical for future prediction without requiring manual labels of meaningful changes.

An example of the training objective: A cycle graph is expected to be constructed between the chicken with soy sauce and the chicken in chili oil because they are two adjacent steps in the chicken’s preparation (video from HowTo100M).

Discovering Cycles in Real-World Video
MMCC is trained without any explicit ground truth, using only long video sequences and randomly sampled starting conditions (a frame or text excerpt) and asking the model to find temporal cycles. After training, MMCC can identify meaningful cycles that capture complex changes in video.

Given frames as input (left), MMCC selects relevant text from video narrations and uses both modalities to predict a future frame (middle). It then finds text relevant to this future and uses it to predict the past (right). Using its knowledge of how objects and scenes change over time, MMCC “closes the cycle” and ends up where it started (videos from HowTo100M).
The model can also start from narrated text rather than frames and still find relevant transitions (videos from HowTo100M).

Zero-Shot Applications
For MMCC to identify meaningful transitions over time in an entire video, we define a “likely transition score” for each pair (A, B) of frames in a video, according to the model’s predictions — the closer B is to our model’s prediction of the future of A, the higher the score assigned. We then rank all pairs according to this score and show the highest-scoring pairs of present and future frames detected in previously unseen videos (examples below).

The highest-scoring pairs from eight random videos, which showcase the versatility of the model across a wide range of tasks (videos from HowTo100M).

We can use this same approach to temporally sort an unordered collection of video frames without any fine-tuning by finding an ordering that maximizes the overall confidence scores between all adjacent frames in the sorted sequence.

Left: Shuffled frames from three videos. Right: MMCC unshuffles the frames. The true order is shown under each frame. Even when MMCC does not predict the ground truth, its predictions often appear reasonable, and so, it can present an alternate ordering (videos from HowTo100M).

Evaluating Future Prediction
We evaluate the model’s ability to anticipate action, potentially minutes in advance, using the top-k recall metric, which here measures a model’s ability to retrieve the correct future (higher is better). On CrossTask, a dataset of instruction videos with labels describing key steps, MMCC outperforms the previous self-supervised state-of-the-art models in inferring possible future actions.

Recall
Model    Top-1       Top-5       Top-10   
Cross-modal    2.9 14.2 24.3
Repr. Ant. 3.0 13.3 26.0
MemDPC 2.9 15.8 27.4
TAP 4.5 17.1 27.9
MMCC 5.4 19.9 33.8

Conclusions
We have introduced a self-supervised method to learn temporal dynamics by cycling through narrated instructional videos. Despite the simplicity of the model’s architecture, it can discover meaningful long-term transitions in vision and language, and can be applied without further training to challenging downstream tasks, such as anticipating far-away action and ordering collections of images. An interesting future direction is transferring the model to agents so they can use it to conduct long-term planning.

Acknowledgements
The core team includes Dave Epstein, Jiajun Wu, Cordelia Schmid, and Chen Sun. We thank Alexei Efros, Mia Chiquier, and Shiry Ginosar for their feedback, and Allan Jabri for inspiration in figure design. Dave would like to thank Dídac Surís and Carl Vondrick for insightful early discussions on cycling through time in video.

Read More

Model Ensembles Are Faster Than You Think

Posted by Xiaofang Wang, Intern and Yair Alon (prev. Movshovitz-Attias), Software Engineer, Google Research

When building a deep model for a new machine learning application, researchers often begin with existing network architectures, such as ResNets or EfficientNets. If the initial model’s accuracy isn’t high enough, a larger model may be a tempting alternative, but may not actually be the best solution for the task at hand. Instead, better performance potentially could be achieved by designing a new model that is optimized for the task. However, such efforts can be challenging and usually require considerable resources.

In “Wisdom of Committees: An Overlooked Approach to Faster and More Accurate Models”, we discuss model ensembles and a subset called model cascades, both of which are simple approaches that construct new models by collecting existing models and combining their outputs. We demonstrate that ensembles of even a small number of models that are easily constructed can match or exceed the accuracy of state-of-the-art models while being considerably more efficient.

What Are Model Ensembles and Cascades?
Ensembles and cascades are related approaches that leverage the advantages of multiple models to achieve a better solution. Ensembles execute multiple models in parallel and then combine their outputs to make the final prediction. Cascades are a subset of ensembles, but execute the collected models sequentially, and merge the solutions once the prediction has a high enough confidence. For simple inputs, cascades use less computation, but for more complex inputs, may end up calling on a greater number of models, resulting in higher computation costs.

Overview of ensembles and cascades. While this example shows 2-model combinations for both ensembles and cascades, any number of models can potentially be used.

Compared to a single model, ensembles can provide improved accuracy if there is variety in the collected models’ predictions. For example, the majority of images in ImageNet are easy for contemporary image recognition models to classify, but there are many images for which predictions vary between models and that will benefit most from an ensemble.

While ensembles are well-known, they are often not considered a core building block of deep model architectures and are rarely explored when researchers are developing more efficient models (with a few notable exceptions [1, 2, 3]). Therefore, we conduct a comprehensive analysis of ensemble efficiency and show that a simple ensemble or cascade of off-the-shelf pre-trained models can enhance both the efficiency and accuracy of state-of-the-art models.

To encourage the adoption of model ensembles, we demonstrate the following beneficial properties:

  1. Simple to build: Ensembles do not require complicated techniques (e.g., early exit policy learning).
  2. Easy to maintain: Ensembles are trained independently, making them easy to maintain and deploy.
  3. Affordable to train: The total training cost of models in an ensemble is often lower than a similarly accurate single model.
  4. On-device speedup: The reduction in computation cost (FLOPS) successfully translates to a speedup on real hardware.

Efficiency and Training Speed
It’s not surprising that ensembles can increase accuracy, but using multiple models in an ensemble may introduce extra computational cost at runtime. So, we investigate whether an ensemble can be more accurate than a single model that has the same computational cost. We analyze a series of models, EfficientNet-B0 to EfficientNet-B7, that have different levels of accuracy and FLOPS when applied to ImageNet inputs. The ensemble predictions are computed by averaging the predictions of each individual model.

We find that ensembles are significantly more cost-effective in the large computation regime (>5B FLOPS). For example, an ensemble of two EfficientNet-B5 models matches the accuracy of a single EfficientNet-B7 model, but does so using ~50% fewer FLOPS. This demonstrates that instead of using a large model, in this situation, one should use an ensemble of multiple considerably smaller models, which will reduce computation requirements while maintaining accuracy. Moreover, we find that the training cost of an ensemble can be much lower (e.g., two B5 models: 96 TPU days total; one B7 model: 160 TPU days). In practice, model ensemble training can be parallelized using multiple accelerators leading to further reductions. This pattern holds for the ResNet and MobileNet families as well.

Ensembles outperform single models in the large computation regime (>5B FLOPS).

Power and Simplicity of Cascades
While we have demonstrated the utility of model ensembles, applying an ensemble is often wasteful for easy inputs where a subset of the ensemble will give the correct answer. In these situations, cascades save computation by allowing for an early exit, potentially stopping and outputting an answer before all models are used. The challenge is to determine when to exit from the cascade.

To highlight the practical benefit of cascades, we intentionally choose a simple heuristic to measure the confidence of the prediction — we take the confidence of the model to be the maximum of the probabilities assigned to each class. For example, if the predicted probabilities for an image being either a cat, dog, or horse were 20%, 80% and 20%, respectively, then the confidence of the model’s prediction (dog) would be 0.8. We use a threshold on the confidence score to determine when to exit from the cascade.

To test this approach, we build model cascades for the EfficientNet, ResNet, and MobileNetV2 families to match either computation costs or accuracy (limiting the cascade to a maximum of four models). By design in cascades, some inputs incur more FLOPS than others, because more challenging inputs go through more models in the cascade than easier inputs. So we report the average FLOPS computed over all test images. We show that cascades outperform single models in all computation regimes (when FLOPS range from 0.15B to 37B) and can enhance accuracy or reduce the FLOPS (sometimes both) for all models tested.

Cascades of EfficientNet (left), ResNet (middle) and MobileNetV2 (right) models on ImageNet. When using similar FLOPS, cascades obtain a higher accuracy than single models (shown by the red arrows pointing up). Cascades can also match the accuracy of single models with significantly fewer FLOPS e.g., 5.4x for B7 (green arrows pointing left).
Summary of accuracy vs. FLOPS for ensembles and cascades. Squares and stars represent ensembles and cascades, respectively,, and the “+” notation indicates the models that comprise the ensemble or cascade. For example, ”B3+B4+B5+B7” at a star refers to a cascade of EfficientNet-B3, B4, B5 and B7 models.

In some cases it is not the average computation cost but the worst-case cost that is the limiting factor. By adding a simple constraint to the cascade building procedure, one can guarantee an upper bound to the computation cost of the cascade. See the paper for more details.

Other than convolutional neural networks, we also consider a Transformer-based architecture, ViT. We build a cascade of ViT-Base and ViT-Large models to match the average computation or accuracy of a single state-of-the-art ViT-Large model, and show that the benefit of cascades also generalizes to Transformer-based architectures.

        Single Models Cascades – Similar Throughput    Cascades – Similar Accuracy
Top-1 (%) Throughput Top-1 (%) Throughput △Top-1 Top-1 (%) Throughput SpeedUp
ViT-L-224 82.0 192 83.1 221 1.1 82.3 409 2.1x
ViT-L-384 85.0 54 86.0 69 1.0 85.2 125 2.3x
Cascades of ViT models on ImageNet. “224” and “384” indicate the image resolution on which the model is trained. Throughput is measured as the number of images processed per second. Our cascades can achieve a 1.0% higher accuracy than ViT-L-384 with a similar throughput or achieve a 2.3x speedup over that model while matching its accuracy.

Earlier works on cascades have also shown efficiency improvements for state-of-the-art models, but here we demonstrate that a simple approach with a handful of models is sufficient.

Inference Latency
In the above analysis, we average FLOPS to measure the computational cost. It is also important to verify that the FLOPS reduction obtained by cascades actually translates into speedup on hardware. We examine this by comparing on-device latency and speed-up for similarly performing single models versus cascades. We find a reduction in the average online latency on TPUv3 of up to 5.5x for cascades of models from the EfficientNet family compared to single models with comparable accuracy. As models become larger the more speed-up we find with comparable cascades.

Average latency of cascades on TPUv3 for online processing. Each pair of same colored bars has comparable accuracy. Notice that cascades provide drastic latency reduction.

Building Cascades from Large Pools of Models
Above, we limit the model types and only consider ensembles/cascades of at most four models. While this highlights the simplicity of using ensembles, it also allows us to check all combinations of models in very little time so we can find optimal model collections with only a few CPU hours on a held out set of predictions.

When a large pool of models exists, we would expect cascades to be even more efficient and accurate, but brute force search is not feasible. However, efficient cascade search methods have been proposed. For example, the algorithm of Streeter (2018), when applied to a large pool of models, produced cascades that matched the accuracy of state-of-the-art neural architecture search–based ImageNet models with significantly fewer FLOPS, for a range of model sizes.

Conclusion
As we have seen, ensemble/cascade-based models obtain superior efficiency and accuracy over state-of-the-art models from several standard architecture families. In our paper we show more results for other models and tasks. For practitioners, this outlines a simple procedure to boost accuracy while retaining efficiency using off-the-shelf models. We encourage you to try it out!

Acknowledgement
This blog presents research done by Xiaofang Wang (while interning at Google Research), Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Alon (prev. Movshovitz-Attias), and Elad Eban. We thank Sergey Ioffe, Shankar Krishnan, Max Moroz, Josh Dillon, Alex Alemi, Jascha Sohl-Dickstein‎, Rif A Saurous, and Andrew Helton for their valuable help and feedback.

Read More

Expanding our ML-based flood forecasting

In 2018 we began our flood forecasting initiative to help combat the catastrophic damage from floods each year by equipping those in harm’s way with accurate and detailed alerts. This work is a part of Google’s broader Crisis Response program which provides people access to trusted information and resources in critical moments. For over a decade, our Crisis Response team has been partnering with front line and emergency workers to develop technology and programs that help keep people safe, informed and out of harm’s way.

Expanding our forecasting reach

In the first three years, we expanded our program to cover much of India and Bangladesh, working in partnership with the Indian Central Water Commision and with the Bangladesh Water Development Board, covering an area with about 220 million people and sending out 40 million potentially life-saving alerts. And in 2021, our operational systems were further expanded to cover an area with over 360 million people. Thanks to better flood prediction technology, we sent out over 115 million alerts — that’s about triple the amount we previously sent out.

Coverage areas of our current operational flood forecasting systems.

Coverage areas of our current operational flood forecasting systems. In these areas, we use our models to help government alerts reach the right people. In some areas we have also increased lead time and spatial accuracy.

We’re hyper-focused on making alerts more local, accessible, actionable and accurate — the more information we can offer about upcoming floods, the better, more timely decisions people can make. Most global flood alerts only provide information on how much a river will rise (e.g. 30 cm), which doesn’t always mean people can know what that would mean for them and their village. Our flood alerts display inundation maps, which show the extent and depth of flooding right on top of Google Maps, so people can visualize this critical information more easily. Our new manifold inundation model and advances across all models allow us to scale up significantly and provide this information to many more people (and we’ll share more about this technology in the near future).

  • Google Flood alerts

    Google Flood alerts

  • Google Flood alerts

    Google Flood alerts

  • Google Flood alerts

    Google Flood alerts

We recently launched the Google Flood Hub to make this flood data even more hyper-local. It allows you to zoom into our inundation maps where you can find information about the same flood, and focus on highly specific areas, such as a village. The Flood Hub provides the same depth and flood extent information in a more visual format that helps people to understand the current and forecasted flood situation in their area instantly. This site will be our primary resource for local, visual forecast information moving forward.

The Google Flood Hub user interface on a mobile device

The Google Flood Hub user interface on a mobile device

We’ve also partnered with multiple local aid organizations such as Federation of Red Cross and Red Crescent Societies, Indian Red Cross Society (IRCS), Bangladesh Red Crescent Society (BDRCS) and Yuganter to help get the alerts out even to people without smartphones or internet access. We worked closely with the organizations’ local teams who traveled between villages to train locals. The training included deeper explanations on how to read the Google alerts and flood maps, as well as how to act and notify others once an alert is issued.

Our flood forecasting system is now live in all of India and Bangladesh, and we are working to expand these life-saving alerts to countries in South Asia and South America. And eventually, we want them to be available everywhere.

Read More

A communication tool for people with speech impairments

For millions of people, being able to speak and be understood can be difficult as a result of conditions that can impact speech, including stroke, ALS, Cerebral Palsy, traumatic brain injury or Parkinson’s disease. Today, we’re inviting an initial group of people to test Project Relate, a new Android app that aims to help people with speech impairments communicate more easily with others and interact with the Google Assistant.

Project Relate is a continuation of years of research from both Google’s Speech and Research teams, made possible by over a million speech samples recorded by participants of our research effort. We are now looking for English-speaking testers in Australia, Canada, New Zealand and the United States to try out the app and provide feedback to help us improve it.

As an early tester of Project Relate, you will be asked to record a set of phrases. The app will use these phrases to automatically learn how to better understand your unique speech patterns, and give you access to the app’s three main features: Listen, Repeat and Assistant.

Listen: Through the Listen feature, the Relate app transcribes your speech to text in real time, so you can copy-paste text into other apps, or let people read what you want to tell them.

Repeat: You can use the Repeat feature to restate what you’ve said using a clear, synthesized voice. We hope this can be especially helpful in face-to-face conversation or even when you want to speak a command to your home assistant device.

Assistant: Speak directly to your Google Assistant from within the Relate app, so you can take care of different tasks, such as turning on the lights or playing a song, with ease.

In creating the app, we worked closely with many people with speech impairments, including Aubrie Lee, a brand manager at Google, whose speech is affected by muscular dystrophy. “I’m used to the look on people’s faces when they can’t understand what I’ve said,” Aubrie shared with us. “Project Relate can make the difference between a look of confusion and a friendly laugh of recognition.” Since Aubrie works on the marketing team that names new products, she also helped us name the app!

If you have a condition that makes your speech difficult to understand, you may be able to help provide feedback on the Project Relate Android app as a trusted tester. To express interest, please fill out our interest form at g.co/ProjectRelate, and the team will get back to you in the coming months.

With your help, we hope to build a future in which people with disabilities can more easily communicate and be understood.

Read More