Evaluating Natural Language Generation with BLEURT

Posted by Thibault Sellam, Software Engineer, and Ankur P. Parikh, Research Scientist, Google Research

In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods call for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on GitHub.
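
For readers who want to try the metric, the snippet below sketches how it can be called from Python, based on the open-source repository’s documented usage; the checkpoint path is a placeholder for whichever released checkpoint you download.

```python
# Minimal usage sketch of the open-source BLEURT library.
# The checkpoint path is a placeholder; substitute the checkpoint you download.
from bleurt import score

checkpoint = "path/to/bleurt_checkpoint"  # placeholder path
references = ["The weather today is nice."]
candidates = ["The weather is good today."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one rating per (reference, candidate) pair; higher is better
```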

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.

An example questionnaire used for human evaluation in machine translation.

The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).
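
To make this concrete, the toy function below computes the clipped n-gram precision at the heart of BLEU. It is a simplified illustration only: the full metric combines several n-gram orders and applies a brevity penalty, both omitted here.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the core quantity behind BLEU.

    Simplified illustration: the full BLEU metric combines several
    n-gram orders and adds a brevity penalty, both omitted here.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Count candidate n-grams that also appear in the reference,
    # clipping each count by its count in the reference.
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

print(ngram_precision("the cat sat on the mat", "the cat is on the mat"))  # 0.6
```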

The advantages and weaknesses of automatic metrics are the opposite of those that come with human evaluation. Automatic metrics are convenient — they can be computed in real-time throughout the training process (e.g., for plotting with TensorBoard). However, they are often inaccurate due to their focus on surface-level similarities, and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward candidates that resemble the reference in their surface form, even if they do not accurately capture meaning, and penalize other paraphrases.

BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.

Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.

Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.

Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT’s robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. Accuracy is likely to drop in the presence of domain drift, i.e., when the text to be evaluated comes from a different domain than the training sentence pairs. It may also drop when there is quality drift, i.e., when the ratings to be predicted are higher than those used during training, which would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.

BLEURT’s data generation process combines random perturbations and scoring with pre-existing metrics and models.
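
The sketch below makes this data generation idea concrete with a deliberately simple word-dropping perturbation and a generic dictionary of scoring functions used as pseudo-labels. The actual perturbations and pre-training signals used in the paper are richer, so treat the function names and details here as illustrative assumptions.

```python
import random

def perturb(sentence):
    """Toy perturbation: randomly drop one word. The paper uses richer
    perturbations; this is only an illustration."""
    tokens = sentence.split()
    if len(tokens) > 1:
        tokens.pop(random.randrange(len(tokens)))
    return " ".join(tokens)

def make_synthetic_pairs(sentences, scorers):
    """Build (reference, perturbed sentence, pseudo-labels) training triples.

    `scorers` maps a signal name to a function that scores a sentence pair,
    e.g., {"bleu": my_sentence_bleu}; the exact set of signals is an
    assumption here.
    """
    examples = []
    for reference in sentences:
        candidate = perturb(reference)
        labels = {name: fn(reference, candidate) for name, fn in scorers.items()}
        examples.append((reference, candidate, labels))
    return examples
```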

Experiments reveal that pre-training significantly increases BLEURT’s accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both. The following figure illustrates BLEURT’s training procedure end-to-end.

Results
We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.

Correlation between different metrics and human ratings on the WMT’19 Metrics Shared Task.

Conclusion
As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for the research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In the use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative, and only give an incomplete picture of NLG systems’ performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT’s representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

Acknowledgements
This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.

Open-Sourcing BiT: Exploring Large-Scale Pre-training for Computer Vision

Posted by Lucas Beyer and Alexander Kolesnikov, Research Engineers, Google Research, Zürich

A common refrain for computer vision researchers is that modern deep neural networks are always hungry for more labeled data — current state-of-the-art CNNs need to be trained on datasets such as OpenImages or Places, which consist of over 1M labelled images. However, for many applications, collecting this amount of labeled data can be prohibitive to the average practitioner.

A common approach to mitigate the lack of labeled data for computer vision tasks is to use models that have been pre-trained on generic data (e.g., ImageNet). The idea is that visual features learned on the generic data can be re-used for the task of interest. Even though this pre-training works reasonably well in practice, it still falls short of the ability to both quickly grasp new concepts and understand them in different contexts. In a similar spirit to how BERT and T5 have shown advances in the language domain, we believe that large-scale pre-training can advance the performance of computer vision models.

In “Big Transfer (BiT): General Visual Representation Learning” we devise an approach for effective pre-training of general features using image datasets at a scale beyond the de-facto standard (ILSVRC-2012). In particular, we highlight the importance of appropriately choosing normalization layers and scaling the architecture capacity as the amount of pre-training data increases. Our approach exhibits unprecedented performance adapting to a wide range of new visual tasks, including the few-shot recognition setting and the recently introduced “real-world” ObjectNet benchmark. We are excited to share the best BiT models pre-trained on public datasets, along with code in TF2, Jax, and PyTorch. This will allow anyone to reach state-of-the-art performance on their task of interest, even with just a handful of labeled images per class.

Pre-training
In order to investigate the effect of data scale, we revisit common design choices of the pre-training setup (such as normalizations of activations and weights, model width/depth and training schedules) using three datasets: ILSVRC-2012 (1.28M images with 1000 classes), ImageNet-21k (14M images with ~21k classes) and JFT (300M images with ~18k classes). Importantly, with these datasets we concentrate on the previously underexplored large data regime.

We first investigate the interplay between dataset size and model capacity. To do this we train classical ResNet architectures, which perform well while being simple and reproducible. We train variants from the standard 50-layer deep “R50x1” up to the 4x wider and 152-layer deep “R152x4” on each of the above-mentioned datasets. A key observation is that in order to profit from more data, one also needs to increase model capacity. This is exemplified by the red arrows in the left-hand panel of the figure below.

Left: In order to make effective use of a larger dataset for pre-training, one needs to increase model capacity. The red arrows exemplify this: small architectures (smaller point) become worse when pre-trained on the larger ImageNet-21k, whereas the larger architectures (larger points) improve. Right: Pre-training on a larger dataset alone does not necessarily result in improved performance, e.g., when going from ILSVRC-2012 to the relatively larger ImageNet-21k. However, by also increasing the computational budget and training for longer, the performance improvement is pronounced.

A second, even more important observation is that the training duration becomes crucial. If one pre-trains on a larger dataset without adjusting the computational budget and training longer, performance is likely to become worse. However, by adapting the schedule to the new dataset, the improvements can be significant.

During our exploration phase, we discovered another modification crucial to improving performance. We show that replacing batch normalization (BN, a commonly used layer that stabilizes training by normalizing activations) with group normalization (GN) is beneficial for pre-training at large scale. First, BN’s state (mean and variance of neural activations) needs adjustment between pre-training and transfer, whereas GN is stateless, thus side-stepping this difficulty. Second, BN uses batch-level statistics, which become unreliable with small per-device batch sizes that are inevitable for large models. Since GN does not compute batch-level statistics, it also side-steps this issue. For more technical details, including the use of a weight standardization technique to ensure stable behavior, please see our paper.

Summary of our pre-training strategy: take a standard ResNet, increase depth and width, replace BatchNorm (BN) with GroupNorm and Weight Standardization (GNWS), and train on a very large and generic dataset for many more iterations.
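
As an illustration of the normalization part of this recipe, the PyTorch sketch below builds a BN-free block from a weight-standardized convolution followed by GroupNorm. The group count and layer shapes are common defaults used for illustration, not necessarily BiT’s exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each output filter's weights are
    normalized to zero mean and unit variance before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def conv_gn_block(in_channels, out_channels, groups=32):
    """A BatchNorm-free block: standardized conv, GroupNorm, ReLU.
    A group count of 32 is a common default, used here for illustration."""
    return nn.Sequential(
        StdConv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_channels),
        nn.ReLU(inplace=True),
    )
```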

Transfer Learning
Following the methods established in the language domain by BERT, we fine-tune the pre-trained BiT model on data from a variety of “downstream” tasks of interest, which may come with very little labeled data. Because the pre-trained model already comes with a good understanding of the visual world, this simple strategy works remarkably well.

Fine-tuning comes with many hyper-parameters to be chosen, such as the learning rate, weight decay, etc. We propose a heuristic for selecting these hyper-parameters that we call “BiT-HyperRule”, which is based only on high-level dataset characteristics, such as image resolution and the number of labeled examples. We successfully apply the BiT-HyperRule on more than 20 diverse tasks, ranging from natural to medical images.
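
As a rough sketch of what such a rule looks like, the function below maps high-level dataset statistics to a training configuration. The overall structure (choose a resolution, a schedule length, and whether to use MixUp) follows the paper, but the specific thresholds and values here are illustrative placeholders rather than the published rule.

```python
def bit_hyper_rule(image_size, num_examples):
    """Illustrative heuristic in the spirit of BiT-HyperRule; the exact
    thresholds and values below are placeholders, not the published rule."""
    # Larger input images are trained at a larger resolution.
    train_resolution = 128 if image_size <= 96 else 384

    # Longer schedules for larger downstream datasets.
    if num_examples < 20_000:
        schedule_steps = 500
    elif num_examples < 500_000:
        schedule_steps = 10_000
    else:
        schedule_steps = 20_000

    # MixUp tends to help only once there is enough labeled data.
    use_mixup = num_examples >= 20_000

    return {"resolution": train_resolution,
            "schedule_steps": schedule_steps,
            "mixup": use_mixup}
```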

Once the BiT model is pre-trained, it can be fine-tuned on any task, even if only few labeled examples are available.

When transferring BiT to tasks with very few examples, we observe that as we simultaneously increase the amount of generic data used for pre-training and the architecture capacity, the ability of the resulting model to adapt to novel data drastically improves. On both 1-shot and 5-shot CIFAR (see the figure below), increasing model capacity yields limited returns when pre-training on ILSVRC (green curves). Yet, with large-scale pre-training on JFT, each step-up in model capacity yields massive returns (brown curves), up to BiT-L, which attains 64% 1-shot and 95% 5-shot accuracy.

The curves depict median accuracy over 5 independent runs (light points) when transferring to CIFAR-10 with only 1 or 5 images per class (10 or 50 images total). It is evident that large architectures pre-trained on large datasets are significantly more data-efficient.

In order to verify that this result holds more generally, we also evaluate BiT on VTAB-1k, which is a suite of 19 diverse tasks with only 1000 labeled examples per task. We transfer the BiT-L model to all these tasks and achieve a score of 76.3% overall, which is a 5.8% absolute improvement over the previous state-of-the-art.

We show that this strategy of large-scale pre-training and simple transfer is effective even when a moderate amount of data is available by evaluating BiT-L on several standard computer vision benchmarks such as Oxford Pets and Flowers, CIFAR, etc. On all of these, BiT-L matches or surpasses state-of-the-art results. Finally, we use BiT as a backbone for RetinaNet on the MSCOCO-2017 detection task and confirm that even for such a structured output task, using large-scale pre-training helps considerably.

Left: Accuracy of BiT-L compared to the previous state-of-the-art general model on various standard computer vision benchmarks. Right: Results in average precision (AP) of using BiT as backbone for RetinaNet on MSCOCO-2017.

It is important to emphasize that across all the different downstream tasks we consider, we do not perform per-task hyper-parameter tuning and rely on the BiT-HyperRule. As we show in the paper, even better results can be achieved by tuning hyperparameters on sufficiently large validation data.

Evaluation with ObjectNet
To further assess the robustness of BiT in a more challenging scenario, we evaluate BiT models that were fine-tuned on ILSVRC-2012 on the recently introduced ObjectNet dataset. This dataset closely resembles real-world scenarios, where objects may appear in atypical contexts, viewpoints, rotations, etc. Interestingly, the benefit from data and architecture scale is even more pronounced: the BiT-L model achieves an unprecedented top-5 accuracy of 80.0%, an almost 25% absolute improvement over the previous state-of-the-art.

Results of BiT on the ObjectNet evaluation dataset. Left: top-5 accuracy, right: top-1 accuracy.

Conclusion
We show that given pre-training on large amounts of generic data, a simple transfer strategy leads to impressive results, both on large datasets as well as tasks with very little data, down to a single image per class. We release the BiT-M model, a R152x4 pre-trained on ImageNet-21k, along with colabs for transfer in Jax, TensorFlow2, and PyTorch. In addition to the code release, we refer the reader to the hands-on TensorFlow2 tutorial on how to use BiT models. We hope that practitioners and researchers find it a useful alternative to commonly used ImageNet pre-trained models.

Acknowledgements
We would like to thank Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby who have co-authored the BiT paper and been involved in all aspects of its development, as well as the Brain team in Zürich. We also would like to thank Andrei Giurgiu for his help in debugging input pipelines. We thank Tom Small for creating the animations used in this blogpost. Finally, we refer the interested reader to the related approaches in this direction by our colleagues in Google Research, Noisy Student, as well as Facebook Research’s highly relevant Exploring the Limits of Weakly Supervised Pretraining.

Announcing the 7th Fine-Grained Visual Categorization Workshop

Posted by Christine Kaeser-Chen, Software Engineer, and Serge Belongie, Visiting Faculty, Google Research

Fine-grained visual categorization refers to the problem of distinguishing between images of closely related entities, e.g., a monarch butterfly (Danaus plexippus) from a viceroy (Limenitis archippus). At the time of the first FGVC workshop in 2011, very few fine-grained datasets existed, and the ones that were available (e.g., the CUB dataset of 200 bird species, launched at that workshop) presented a formidable challenge to the leading classification algorithms of the time. Fast forward to 2020, and the computer vision landscape has undergone breathtaking changes. Deep learning based methods helped CUB-200-2011 accuracy rocket from 17% to 90% and fine-grained datasets have proliferated, with data arriving from a diverse array of institutions, such as art museums, apparel retailers, and cassava farms.

In order to help support even further progress in this field, we are excited to sponsor and co-organize the 7th Workshop on Fine-Grained Visual Categorization (FGVC7), which will take place as a virtual gathering on June 19, 2020, in conjunction with the IEEE conference on Computer Vision and Pattern Recognition (CVPR). We’re excited to highlight this year’s world-class lineup of fine-grained challenges, ranging from fruit tree disease prediction to fashion attributes, and we invite computer vision researchers from across the world to participate in the workshop.

The FGVC workshop at CVPR 2020 focuses on subordinate categories, including (from left to right) wildlife camera traps, plant pathology, birds, herbarium sheets, apparel, and museum artifacts.

Real-World Impact of the FGVC Challenges
In addition to pushing the frontier of fine-grained recognition on ever more challenging datasets, each FGVC workshop cycle provides opportunities for fostering new collaborations between researchers and practitioners. Some of the efforts from the FGVC workshop have made the leap into the hands of real world users.

The 2018 FGVC workshop hosted a Fungi challenge with data for 1,500 mushroom species provided by the Danish Mycological Society. When the competition concluded, the leaderboard was topped by a team from Czech Technical University and the University of West Bohemia.

The mycologists subsequently invited the Czech researchers for a visit to Copenhagen to explore further collaboration and field test a new workflow for collaborative machine learning research in biodiversity. This resulted in a jointly authored conference paper, a mushroom recognition app for Android and iOS, and an open access model published on TensorFlow Hub.

The Svampeatlas app for mushroom recognition is a result of a Danish-Czech collaboration spun out of the FGVC 2018 Fungi challenge. The underlying model is now published on TF Hub. Images used with permission of the Danish Mycological Society.

The 2019 iCassava Disease Challenge is another example of an FGVC team effort finding its way into the real world. In this challenge, Google researchers in Ghana collaborated with Makerere University and the National Crops Resources Research Institute (NaCRRI) to produce an annotated dataset of five cassava disease categories.

Examples of cassava leaf disease represented in the 2019 iCassava challenge.

The teams are testing a new model in the fields in Uganda with local farmers, and the model will be published on TFHub soon.

This Year’s Challenges
FGVC7 will feature six challenges, four of which represent sequels to past offerings, and two of which are brand new.

In iWildCam, the challenge is to identify different species of animals in camera trap images. Like its predecessors in 2018 and 2019, this year’s competition makes use of data from static, motion-triggered cameras used by biologists to study animals in the wild. Participants compete to build models that address diverse regions from around the globe, with a focus on generalization to held-out camera deployments within those regions, which exhibit differences in device model, image quality, local environment, lighting conditions, and species distributions, making generalization difficult.

It has been shown that species classification performance can be dramatically improved by using information beyond the image itself. In addition, since an ecosystem can be monitored in a variety of ways (e.g., camera traps, citizen scientists, remote sensing), each of which has its own strengths and limitations, it is important to facilitate the exploration of techniques for combining these complementary modalities. To this end, the competition provides a time series of remote sensing imagery for each camera trap location, as well as images from the iNaturalist competition datasets for species in the camera trap data.

Side-by-side comparison of image quality from iWildCam (left), captured by wildlife camera traps, and iNaturalist (right), captured by conventional cameras. Images are from the 2020 iWildCam Challenge and the iNaturalist competition datasets from 2017 and 2018.

The Herbarium Challenge, now in its second year, entails plant species identification, based on a large, long-tailed collection of herbarium specimens. Developed in collaboration with the New York Botanical Garden (NYBG), this challenge features over 1 million images representing over 32,000 plant species. Last year’s challenge was based on 46,000 specimens for 680 species. Being able to recognize species from historical herbarium collections can not only help botanists better understand changes in plant life on our planet, but also offers a unique opportunity to identify previously undescribed new species in the collection.

Representative examples of specimens from the 2020 Herbarium challenge. Images used with permission of the New York Botanical Garden.

In this year’s iMat Fashion challenge, participants compete to perform apparel instance segmentation and fine-grained attribute classification. The goal of this competition is to push the state of the art in fine-grained segmentation by joining forces between the fashion and computer vision communities. This challenge is in its third iteration, growing both in size and level of detail over past years’ offerings.

The last of the sequels is iMet, in which participants are challenged with building algorithms for fine-grained attribute classification on works of art. Developed in collaboration with the Metropolitan Museum of Art, the dataset has grown significantly since the 2019 edition, with a wide array of new cataloguing information generated by subject matter experts including multiple object classifications, artist, title, period, date, medium, culture, size, provenance, geographic location, and other related museum objects within the Met’s collection.

Semi-Supervised Aves is one of the new challenges at this year’s workshop. While avian data from iNaturalist has featured prominently in past FGVC challenges, this challenge focuses on the problem of learning from partially labeled data, a form of semi-supervised learning. The dataset is designed to expose some of the challenges encountered in realistic settings, such as the fine-grained similarity between classes, significant class imbalance, and domain mismatch between the labeled and unlabeled data.

Rounding out the set of challenges is Plant Pathology. In this challenge, the participants attempt to spot foliar diseases of apples using a reference dataset of expert-annotated diseased specimens. While this particular challenge is new to the FGVC community, it is the second such challenge to involve plant disease, the first being iCassava at last year’s FGVC.

Invitation to Participate
The results of these competitions will be presented at the FGVC7 workshop by top performing teams. We invite researchers, practitioners, and domain experts to participate in the FGVC workshop to learn more about state-of-the-art advances in fine-grained image recognition. We are excited to encourage the community’s development of cutting edge algorithms for fine-grained visual categorization and foster new collaborations with global impact!

Acknowledgements
We’d like to thank our colleagues and friends on the FGVC7 organizing committee for working together to advance this important area. At Google we would like to thank Hartwig Adam, Kiat Chuan Tan, Arvi Gjoka, Kimberly Wilber, Sara Beery, Mikhail Sirotenko, Denis Brulé, Timnit Gebru, Ernest Mwebaze, Wojciech Sirko, and Maggie Demkin.

How AI could predict sight-threatening eye conditions

Age-related macular degeneration (AMD) is the biggest cause of sight loss in the UK and USA and is the third largest cause of blindness across the globe. The latest research collaboration between Google Health, DeepMind and Moorfields Eye Hospital is published in Nature Medicine today. It shows that artificial intelligence (AI) has the potential to not only spot the presence of AMD in scans, but also predict the disease’s progression. 

Vision loss and wet AMD

Around 75 percent of patients with AMD have an early form called “dry” AMD that usually has relatively mild impact on vision. A minority of patients, however, develop the more sight-threatening form of AMD called exudative, or “wet” AMD. This condition affects around 15 percent of patients, and occurs when abnormal blood vessels develop underneath the retina. These vessels can leak fluid, which can cause permanent loss of central vision if not treated early enough.

Macular degeneration mainly affects central vision, causing “blind spots” directly ahead (Macular Society).

Wet AMD often affects one eye first, so patients become heavily reliant upon their unaffected eye to maintain their normal day-to-day living. Unfortunately, 20 percent of these patients will go on to develop wet AMD in their other eye within two years. The condition often develops suddenly, but further vision loss can be slowed with treatments if wet AMD is recognized early enough. Ophthalmologists regularly monitor their patients for signs of wet AMD using 3D optical coherence tomography (OCT) images of the retina.

The period before wet AMD develops is a critical window for preventive treatment, which is why we set out to build a system that could predict whether a patient with wet AMD in one eye will go on to develop the condition in their second eye. This is a novel clinical challenge, since it’s not a task that is routinely performed.

How AI could predict the development of wet AMD

In collaboration with colleagues at DeepMind and Moorfields Eye Hospital NHS Foundation Trust, we’ve developed an artificial intelligence (AI) model that has the potential to predict whether a patient will develop wet AMD within six months. In the future, this system could potentially help doctors plan studies of earlier intervention, as well as contribute more broadly to clinical understanding of the disease and disease progression. 

We trained and tested our model using a retrospective, anonymized dataset of 2,795 patients. These patients had been diagnosed with wet AMD in one of their eyes, and were attending one of seven clinical sites for regular OCT imaging and treatment. For each patient, our researchers worked with retinal experts to review all prior scans for each eye and determine the scan when wet AMD was first evident. In collaboration with our colleagues at DeepMind we developed an AI system composed of two deep convolutional neural networks, one taking the raw 3D scan as input and the other, built on our previous work, taking a segmentation map outlining the types of tissue present in the retina. Our prediction system used the raw scan and tissue segmentations to estimate a patient’s risk of progressing to wet AMD within the next six months. 
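
As a purely illustrative sketch (not the published model), the PyTorch snippet below shows the general shape of such a two-branch predictor: one branch over the raw 3D scan, one over the tissue-segmentation map, with the fused features driving a six-month risk estimate. All layer sizes and the number of segmentation channels are assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchAMDRisk(nn.Module):
    """Illustrative two-branch predictor: one branch consumes the raw 3D OCT
    scan, the other a tissue-segmentation map, and the fused features estimate
    the risk of conversion to wet AMD within six months. All shapes are
    assumptions for illustration, not the published architecture."""
    def __init__(self, seg_channels=16):
        super().__init__()
        self.raw_branch = nn.Sequential(   # input: (B, 1, D, H, W) raw scan
            nn.Conv3d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.seg_branch = nn.Sequential(   # input: (B, C, D, H, W) segmentation
            nn.Conv3d(seg_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 + 16, 1)  # logit for 6-month conversion risk

    def forward(self, raw_scan, segmentation):
        features = torch.cat([self.raw_branch(raw_scan),
                              self.seg_branch(segmentation)], dim=1)
        return torch.sigmoid(self.head(features))
```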

To test the system, we presented the model with a single, de-identified scan and asked it to predict whether there were any signs that indicated the patient would develop wet AMD in the following six months. We also asked six clinical experts—three retinal specialists and three optometrists, each with at least ten years’ experience—to do the same. Predicting the possibility of a patient developing wet AMD is not a task that is usually performed in clinical practice so this is the first time, to our knowledge, that experts have been assessed on this ability. 

While clinical experts performed better than chance alone, there was substantial variability between their assessments. Our system performed as well as, and in certain cases better than, these clinicians in predicting wet AMD progression. This highlights its potential use for informing studies in the future to assess or help develop treatments to prevent wet AMD progression.

Future work could address several limitations of our research. The sample was representative of practice at multiple sites of the world’s largest eye hospital, but more work is needed to understand the model performance in different demographics and clinical settings. Such work should also understand the impact of unstudied factors—such as additional imaging tests—that might be important for prediction, but were beyond the scope of this work.

What’s next 

These findings demonstrate the potential for AI to help improve understanding of disease progression and predict the future risk of patients developing sight-threatening conditions. This, in turn, could help doctors study preventive treatments.

This is the latest stage in our partnership with Moorfields Eye Hospital NHS Foundation Trust, a long-standing relationship that transitioned from DeepMind to Google Health in September 2019. Our previous collaborations include using AI to quickly detect eye conditions, and showing how Google Cloud AutoML might eventually help clinicians without prior technical experience to accurately detect common diseases from medical images. 

This is early research, rather than a product that could be implemented in routine clinical practice. Any future product would need to go through rigorous prospective clinical trials and regulatory approvals before it could be used as a tool for doctors. This work joins a growing body of research in the area of developing predictive models that could inform clinical research and trials. In line with this, Moorfields will be making the dataset available through the Ryan Initiative for Macular Research. We hope that models like ours will be able to support this area of work to improve patient outcomes. 

Enabling E-Textile Microinteractions: Gestures and Light through Helical Structures

Posted by Alex Olwal, Research Scientist, Google Research

Textiles have the potential to help technology blend into our everyday environments and objects by improving aesthetics, comfort, and ergonomics. Consumer devices have started to leverage these opportunities through fabric-covered smart speakers and braided headphone cords, while advances in materials and flexible electronics have enabled the incorporation of sensing and display into soft form factors, such as jackets, dresses, and blankets.

A scalable interactive E-textile architecture with embedded touch sensing, gesture recognition and visual feedback.

In “E-textile Microinteractions” (Proceedings of ACM CHI 2020), we bring interactivity to soft devices and demonstrate how machine learning (ML) combined with an interactive textile topology enables parallel use of discrete and continuous gestures. This work extends our previously introduced E-textile architecture (Proceedings of ACM UIST 2018). This research focuses on cords, due to their modular use as drawstrings in garments, and as wired connections for data and power across consumer devices. By exploiting techniques from textile braiding, we integrate both gesture sensing and visual feedback along the surface through a repeating matrix topology.

For insight into how this works, please see this video about E-textile microinteractions and this video about the E-textile architecture.

E-textile microinteractions combining continuous sensing with discrete motion and grasps.

The Helical Sensing Matrix (HSM)
Braiding generally refers to the diagonal interweaving of three or more material strands. While braids are traditionally used for aesthetics and structural integrity, they can also be used to enable new sensing and display capabilities.

Whereas cords can be made to detect basic touch gestures through capacitive sensing, we developed a helical sensing matrix (HSM) that enables a larger gesture space. The HSM is a braid consisting of electrically insulated conductive textile yarns and passive support yarns, where conductive yarns in opposite directions take the role of transmit and receive electrodes to enable mutual capacitive sensing. The capacitive coupling at their intersections is modulated by the user’s fingers, and these interactions can be sensed anywhere on the cord since the braided pattern repeats along the length.

Left: A Helical Sensing Matrix based on a 4×4 braid (8 conductive threads spiraled around the core). Magenta/cyan are conductive yarns, used as receive/transmit lines. Grey are passive yarns (cotton). Center: Flattened matrix, which illustrates the infinite number of 4×4 matrices (colored circles 0-F) that repeat along the length of the cord. Right: Yellow are fiber optic lines, which provide visual feedback.

Rotation Detection
A key insight is that the two axial columns in an HSM that share a common set of electrodes (and color in the diagram of the flattened matrix) are 180º opposite each other. Thus, pinching and rolling the cord activates a set of electrodes and allows us to track relative motion across these columns. Rotation detection identifies the current phase with respect to the set of time-varying sinusoidal signals that are offset by 90º. The braid allows the user to initiate rotation anywhere, and is scalable with a small set of electrodes.

Rotation is deduced from horizontal finger motion across the columns. The plots below show the relative capacitive signal strengths, which change with finger proximity.
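
Conceptually, two signals that vary sinusoidally with finger position and sit 90º apart form a quadrature pair, so a rotation angle can be recovered with atan2 and accumulated as a stream of small increments. The sketch below illustrates this principle only; the actual system tracks relative motion across several electrode columns.

```python
import math

def rotation_phase(signal_a, signal_b):
    """Recover an angle from two capacitive signals that are 90 degrees out
    of phase (a quadrature pair). Returns radians in (-pi, pi]."""
    return math.atan2(signal_b, signal_a)

def rotation_delta(previous_phase, current_phase):
    """Smallest signed change between two phases, so that rolling the cord
    yields a continuous stream of small positive or negative increments."""
    delta = current_phase - previous_phase
    return (delta + math.pi) % (2 * math.pi) - math.pi
```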

Interaction Techniques and Design Guidelines
This e-textile architecture makes the cord touch-sensitive, but its softness and malleability limit suitable interactions compared to rigid touch surfaces. With the unique material in mind, our design guidelines emphasize:

  • Simple gestures. We design for short interactions where the user either makes a single discrete gesture or performs a continuous manipulation.
  • Closed-loop feedback. We want to help the user discover functionality and get continuous feedback on their actions. Where possible, we provide visual, tactile, and audio feedback integrated in the device.

Based on these principles, we leverage our e-textile architecture to enable interaction techniques based on our ability to sense proximity, area, contact time, roll and pressure.

Our e-textile enables interaction based on capacitive sensing of proximity, contact area, contact time, roll, and pressure.

The inclusion of fiber optic strands that can display color of varying intensity enables dynamic real-time feedback to the user.

Braided fiber optic strands create the illusion of directional motion.

Motion Gestures (Flicks and Slides) and Grasping Styles (Pinch, Grab, Pat)
We conducted a gesture elicitation study, which showed opportunities for an expanded gesture set. Inspired by these results, we decided to investigate five motion gestures based on flicks and slides, along with single-touch gestures (pinch, grab and pat).

Gesture elicitation study with imagined touch sensing.

We collected data from 12 new participants, resulting in 864 gesture samples (12 participants performed eight gestures each, repeating nine times), each having 16 features linearly interpolated to 80 observations over time. Participants performed the eight gestures in their own style without feedback, as we wanted to accommodate individual differences; the classification is highly dependent on user style (“contact”), preference (“how to pinch/grab”) and anatomy (e.g., hand size). Our pipeline was thus designed for user-dependent training to enable individual styles, with differences across participants such as the inconsistent use of clockwise/counterclockwise, overlap between temporal gestures (e.g., flick vs. flick-and-hold), and similar pinch and grab gestures. For a user-independent system, we would need to address such differences, for example with stricter instructions for consistency, data from a larger population, and data collected in more diverse settings. Real-time feedback during training will also help mitigate differences as the user learns to adjust their behavior.

Twelve participants (horizontal axis) performed 9 repetitions (animation) for the eight gestures (vertical axis). Each sub-image shows 16 overlaid feature vectors, interpolated to 80 observations over time.

We performed cross-validation for each user across the gestures by training on eight repetitions and testing on one, through nine permutations, and achieved a gesture recognition accuracy of ~94%. This result is encouraging, especially given the expressivity enabled by such a low-resolution sensor matrix (eight electrodes).
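
The sketch below illustrates the preprocessing and evaluation protocol described above: each recording of the 16 capacitive features is linearly interpolated to 80 time steps, and accuracy is measured with leave-one-repetition-out cross-validation per participant. The k-nearest-neighbor classifier is an illustrative stand-in, not necessarily the classifier used in the study.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def resample_gesture(frames, num_steps=80):
    """Linearly interpolate a variable-length recording of 16-channel
    features to a fixed (num_steps, 16) array, then flatten it."""
    frames = np.asarray(frames, dtype=float)            # shape: (T, 16)
    t_old = np.linspace(0.0, 1.0, len(frames))
    t_new = np.linspace(0.0, 1.0, num_steps)
    channels = [np.interp(t_new, t_old, frames[:, c]) for c in range(frames.shape[1])]
    return np.stack(channels, axis=1).ravel()

def per_user_cross_validation(X, y, repetition_ids):
    """Train on eight of the nine repetitions of each gesture, test on the
    held-out one, and average accuracy over the nine permutations."""
    accuracies = []
    for held_out in np.unique(repetition_ids):
        train, test = repetition_ids != held_out, repetition_ids == held_out
        clf = KNeighborsClassifier(n_neighbors=3)  # illustrative stand-in classifier
        clf.fit(X[train], y[train])
        accuracies.append(clf.score(X[test], y[test]))
    return float(np.mean(accuracies))
```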

Notable here is that inherent relationships in the repeated sensing matrices are well-suited for machine learning classification. The ML classifier used in our research enables quick training with limited data, which makes a user-dependent interaction system reasonable. In our experience, training for a typical gesture takes less than 30s, which is comparable to the amount of time required to train a fingerprint sensor.

User-Independent, Continuous Twist: Quantifying Precision and Speed
The per-user trained gesture recognition enabled eight new discrete gestures. For continuous interactions, we also wanted to quantify how well user-independent, continuous twist performs for precision tasks. We compared our e-textile with two baselines, a capacitive multi-touch trackpad (“Scroll”) and the familiar headphone cord remote control (“Buttons”). We designed a lab study where the three devices controlled 1D movement in a targeting task.

We analysed three dependent variables for the 1800 trials, covering 12 participants and three techniques: time on task (milliseconds), total motion, and motion during end-of-trial. Participants also provided qualitative feedback through rankings and comments.

Our quantitative analysis suggests that our e-textile’s twisting is faster than existing headphone button controls and comparable in speed to a touch surface. Qualitative feedback also indicated a preference for e-textile interaction over headphone controls.

Left: Weighted average subjective feedback. We mapped the 7-point Likert scale to a score in the range [-3, 3] and multiplied by the number of times the technique received that rating, and computed an average for all the scores. Right: Mean completion times for target distances show that Buttons were consistently slower.

These results are particularly interesting given that our e-textile was more sensitive, compared to the rigid input devices. One explanation might be its expressiveness — users can twist quickly or slowly anywhere on the cord, and the actions are symmetric and reversible. Conventional buttons on headphones require users to find their location and change grips for actions, which adds a high cost to pressing the wrong button. We use a high-pass filter to limit accidental skin contact, but further work is needed to characterize robustness and evaluate long-term performance in actual contexts of use.

Gesture Prototypes: Headphones, Hoodie Drawstrings, and Speaker Cord
We developed different prototypes to demonstrate the capabilities of our e-textile architecture: e-textile USB-C headphones to control media playback on the phone, a hoodie drawstring to invisibly add music control to clothing, and an interactive cord for gesture controls of smart speakers.

Left: Tap = Play/Pause; Center: Double-tap = Next track; Right: Roll = Volume +/-
Interactive speaker cord for simultaneous use of continuous (twisting/rolling) and discrete gestures (pinch/pat) to control music playback.

Conclusions and Future Directions
We introduce an interactive e-textile architecture for embedded sensing and visual feedback, which can enable both precise small-scale and large-scale motion in a compact cord form factor. With this work, we hope to advance textile user interfaces and inspire the use of microinteractions for future wearable interfaces and smart fabrics, where eyes-free access and casual, compact and efficient input is beneficial. We hope that our e-textile will inspire others to augment physical objects with scalable techniques, while preserving industrial design and aesthetics.

Acknowledgements
This work is a collaboration across multiple teams at Google. Key contributors to the project include Alex Olwal, Thad Starner, Jon Moeller, Greg Priest-Dorman, Ben Carroll, and Gowa Mainini. We thank the Google ATAP Jacquard team for our collaboration, especially Shiho Fukuhara, Munehiko Sato, and Ivan Poupyrev. We thank Google Wearables, and Kenneth Albanowski and Karissa Sawyer, in particular. Finally, we would like to thank Mark Zarich for illustrations, Bryan Allen for videography, Frank Li for data processing, Mathieu Le Goc for valuable discussions, and Carolyn Priest-Dorman for textile advice.

AdLingo Ads Builder turns an ad into a conversation

My parents were small business owners in the U.S. Virgin Islands where I grew up. They taught me that, though advertising is important, personal relationships are the best way to get new customers and grow your business. When I started working at Google 14 years ago, online advertising was a one-way messaging channel. People couldn’t ask questions or get personalized information from an ad, so we saw an opportunity to turn an ad into a two-way conversation.

My co-founder Dario Rapisardi and I joined Area 120, Google’s in house incubator for experimental projects, to use conversational AI technology to create such a service. In 2018 we launched AdLingo Ads for brands that leverage the Google Display & Video 360 buying platform. They can turn their ads, shown on the Google Partner Inventory, into an AI-powered conversation with potential customers. If customers are interested in the product promoted in the ad, they can ask questions to get more information.

Today, we’re announcing AdLingo Ads Builder (accessible to our beta partners), a new tool that helps advertisers and agencies build AdLingo Ads ten times faster than before. You can upload the components of your ad, as well as the conversational assistant, with just a few clicks.

As an early example, Purple used AdLingo to help people find the best mattress based on their personal sleep preferences. People found the ad helpful, as each engaged person spent on average 1 minute and 37 seconds in the conversation.

AdLingo Ads Builder (with Purple ad): After selecting from a few simple drop-downs, the ad is ready to preview.

So far we’ve partnered with more than 30 different brands globally. Our product delivers results for advertisers by advancing potential customers from discovering a product to considering its purchase in one single ad, at a competitive cost compared to other channels. For example, Renault used AdLingo for the launch of its new ZOE electric car to address French drivers’ preconceptions about electric vehicles. The campaign helped position Renault as a trusted advisor to consumers.

Renault AdLingo Ad experience: Potential customers can ask questions and learn more about ZOE electric cars.

Online advertising has created huge opportunities for companies to reach customers all over the world, but when I think about my parents’ small business, I remember the importance of building a personal relationship with your customers. In creating AdLingo, we’re on a mission to use conversational AI to foster stronger relationships between customers and businesses.

Announcing Meta-Dataset: A Dataset of Datasets for Few-Shot Learning

Posted by Eleni Triantafillou, Student Researcher, and Vincent Dumoulin, Research Scientist, Google Research

Recently, deep learning has achieved impressive performance on an array of challenging problems, but its success often relies on large amounts of manually annotated training data. This limitation has sparked interest in learning from fewer examples. A well-studied instance of this problem is few-shot image classification: learning new classes from only a few representative images.

In addition to being an interesting problem from a scientific perspective due to the apparent gap between the ability of a person to learn from limited information compared to that of a deep learning algorithm, few-shot classification is also a very important problem from a practical perspective. Because large labeled datasets are often unavailable for tasks of interest, solving this problem would enable, for example, quick customization of models to individual user’s needs, democratizing the use of machine learning. Indeed, there has been an explosion of recent work to tackle few-shot classification, but previous benchmarks fail to reliably assess the relative merits of the different proposed models, inhibiting research progress.

In “Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples” (presented at ICLR 2020), we propose a large-scale and diverse benchmark for measuring the competence of different image classification models in a realistic and challenging few-shot setting, offering a framework in which one can investigate several important aspects of few-shot classification. It is composed of 10 publicly available datasets of natural images (including ImageNet, CUB-200-2011, Fungi, etc.), handwritten characters and doodles. The code is public, and includes a notebook that demonstrates how Meta-Dataset can be used in TensorFlow and PyTorch. In this blog post, we outline some results from our initial research investigation on Meta-Dataset and highlight important research directions.

Background: Few-shot Classification
In standard image classification, a model is trained on a set of images from a particular set of classes, and then tested on a held-out set of images of those same classes. Few-shot classification goes a step further and studies generalization to entirely new classes at test time, no images of which were seen in training.

Specifically, in few-shot classification, the training set contains classes that are entirely disjoint from those that will appear at test time. So the aim of training is to learn a flexible model that can be easily repurposed towards classifying new classes using only a few examples. The end-goal is to perform well on the test-time evaluation that is carried out on a number of test tasks, each of which presents a classification problem between previously unseen classes, from a held out test set of classes. Each test task contains a support set of a few labeled images from which the model can learn about the new classes, and a disjoint query set of examples that the model is then asked to classify.
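
A minimal sketch of how one such test task can be assembled is shown below. Fixed numbers of classes and examples are used for simplicity; as discussed later, Meta-Dataset varies both from task to task.

```python
import random

def sample_episode(examples_by_class, num_classes=5, num_support=5, num_query=10):
    """Build one few-shot task: pick previously unseen classes, then split a
    few of their examples into a labeled support set (for adaptation) and a
    disjoint query set (for evaluation)."""
    classes = random.sample(list(examples_by_class), num_classes)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(examples_by_class[cls], num_support + num_query)
        support += [(image, label) for image in examples[:num_support]]
        query += [(image, label) for image in examples[num_support:]]
    return support, query
```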

In Meta-Dataset, in addition to the tough generalization challenge to new classes inherent in the few-shot learning setup described above, we also study generalization to entirely new datasets, from which no images of any class were seen in training.

Comparison of Meta-Dataset with Previous Benchmarks
A popular dataset for studying few-shot classification is mini-ImageNet, a downsampled version of a subset of classes from ImageNet. This dataset contains 100 classes in total that are divided into training, validation and test class splits. While classes encountered at test time in benchmarks like mini-ImageNet have not been seen during training, they are still substantially similar to the training classes visually. Recent works reveal that this allows a model to perform competitively at test time simply by re-using features learned at training time, without necessarily demonstrating the capability to learn from the few examples presented to the model in the support set. In contrast, performing well on Meta-Dataset requires absorbing diverse information at training time and rapidly adapting it to solve significantly different tasks at test time that possibly originate from entirely unseen datasets.

Test tasks from mini-ImageNet. Each task is a classification problem between previously unseen (test) classes. The model can use the support set of a few labeled examples of the new classes to adapt to the task at hand and then predicts labels for the query examples of these new classes. The evaluation metric is the query set accuracy, averaged over examples within each task and across tasks.

While other recent papers have investigated training on mini-ImageNet and evaluating on different datasets, Meta-Dataset represents the largest-scale organized benchmark for cross-dataset, few-shot image classification to date. It also introduces a sampling algorithm for generating tasks of varying characteristics and difficulty, by varying the number of classes in each task, the number of available examples per class, introducing class imbalances and, for some datasets, varying the degree of similarity between the classes of each task. Some example test tasks from Meta-Dataset are shown below.

Test tasks from Meta-Dataset. Contrary to the mini-ImageNet tasks shown above, different tasks here originate from (the test classes of) different datasets. Further, the number of classes and the support set sizes differ across tasks and the support sets might be class-imbalanced.

Initial Investigation and Findings on Meta-Dataset
We benchmark two main families of few-shot learning models on Meta-Dataset: pre-training and meta-learning.

Pre-training simply trains a classifier (a neural network feature extractor followed by a linear classifier) on the training set of classes using supervised learning. Then, the examples of a test task can be classified either by fine-tuning the pre-trained feature extractor and training a new task-specific linear classifier, or by means of nearest-neighbor comparisons, where the prediction for each query example is the label of its nearest support example. Despite its “baseline” status in the few-shot classification literature, this approach has recently enjoyed a surge of attention and competitive results.
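
The nearest-neighbor variant of this baseline is particularly simple, as the sketch below shows. The cosine similarity and the `embed_fn` feature extractor are assumptions for illustration; fine-tuning a new linear classifier on the support set is the other option mentioned above.

```python
import numpy as np

def nearest_neighbor_predict(embed_fn, support_images, support_labels, query_images):
    """Inference-only few-shot baseline on top of a pre-trained feature
    extractor: each query example receives the label of its nearest support
    example in embedding space (cosine similarity here, an assumption)."""
    s = np.stack([embed_fn(image) for image in support_images])
    q = np.stack([embed_fn(image) for image in query_images])
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    similarities = q @ s.T                     # (num_query, num_support)
    nearest = similarities.argmax(axis=1)
    return np.asarray(support_labels)[nearest]
```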

On the other hand, meta-learners construct a number of “training tasks” and their training objective explicitly reflects the goal of performing well on each task’s query set after having adapted to that task using the associated support set, capturing the ability that is required at test time to solve each test task. Each training task is created by randomly sampling a subset of training classes and some examples of those classes to play the role of support and query sets.

Below, we summarize some of our findings from evaluating pre-training and meta-learning models on Meta-Dataset:

1) Existing approaches have trouble leveraging heterogeneous training data sources.

We compared training models (from both pre-training and meta-learning approaches) using only the training classes of ImageNet to using all training classes from the datasets in Meta-Dataset, in order to measure the generalization gain from using a more expansive collection of training data. We singled out ImageNet for this purpose, because the features learned on ImageNet readily transfer to other datasets. The evaluation tasks applied to all models are derived from a held-out set of classes from the datasets used in training, with at least two additional datasets that are entirely held-out for evaluation (i.e., no classes from these datasets were used for training).

One might expect that training on more data, albeit heterogeneous, would generalize better on the test set. However, this is not always the case. Specifically, the following figure displays the accuracy of different models on test tasks of Meta-Dataset’s ten datasets. We observe that the performance on test tasks coming from handwritten characters / doodles (Omniglot and Quickdraw) is significantly improved when having trained on all datasets, instead of ImageNet only. This is reasonable since these datasets are visually significantly different from ImageNet. However, for test tasks of natural image datasets, similar accuracy can be obtained by training on ImageNet only, revealing that current models cannot effectively leverage heterogeneous data towards improving in this regard.

Comparison of test performance on each dataset after having trained on ImageNet (ILSVRC-2012) only or on all datasets.

2) Some models are more capable than others of exploiting additional data at test time.

We analyzed the performance of different models as a function of the number of available examples in each test task, uncovering an interesting trade-off: different models perform best with a particular number of training (support) samples. We observe that some models outshine the rest when there are very few examples (“shots”) available (e.g., ProtoNet and our proposed fo-Proto-MAML) but don’t exhibit a large improvement when given more, while other models are not well-suited for tasks with very few examples but improve at a quicker rate as more are given (e.g., Finetune baseline). However, since in practice we might not know in advance the number of examples that will be available at test time, one would like to identify a model that can best leverage any number of examples, without disproportionately suffering in a particular regime.

Comparison of test performance averaged across different datasets to the number of examples available per class in test tasks (“shots”). Performance is measured in terms of class precision: the proportion of the examples of a class that are correctly labeled, averaged across classes.

3) The adaptation algorithm of a meta-learner is more heavily responsible for its performance than the fact that it is trained end-to-end (i.e. meta-trained).

We developed a new set of baselines to measure the benefit of meta-learning. Specifically, for several meta-learners, we consider a non-meta-learned counterpart that pre-trains a feature extractor and then, at evaluation time only, applies the same adaptation algorithm as the respective meta-learner on those features. When training on ImageNet only, meta-training often helps a bit or at least doesn’t hurt too much, but when training on all datasets, the results are mixed. This suggests that further work is needed to understand and improve upon meta-learning, especially across datasets.

Comparison of three different meta-learner variants to their corresponding inference-only baselines, when training on ImageNet (ILSVRC-2012) only or all datasets. Each bar represents the difference between meta-training and inference-only, so positive values indicate improved performance from meta-training.
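As an illustration of an inference-only baseline, a Prototypical Networks-style adaptation step only averages the support features of each class and labels queries by the nearest prototype, so it can be applied to any pre-trained feature extractor without meta-training. The sketch below is illustrative, not the paper’s implementation.

```python
import numpy as np

def prototype_predict(embed, support_x, support_y, query_x):
    """Inference-only, ProtoNet-style adaptation on top of pre-trained features.

    Builds one prototype (mean feature vector) per class from the support set,
    then labels each query with the class of its nearest prototype.
    `support_y` is a NumPy array of integer labels; names are illustrative.
    """
    support_feats = embed(support_x)                     # [num_support, dim]
    query_feats = embed(query_x)                         # [num_query, dim]
    classes = np.unique(support_y)
    prototypes = np.stack(
        [support_feats[support_y == c].mean(axis=0) for c in classes])
    dists = ((query_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]
```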

Conclusion
Meta-Dataset introduces new challenges for few-shot classification. Our initial exploration has revealed limitations of existing methods, calling for additional research. Recent works have already reported exciting results on Meta-Dataset, for example using cleverly-designed task conditioning, more sophisticated hyperparameter tuning, a ‘meta-baseline’ that combines the benefits of pre-training and meta-learning, and finally using feature selection to specialize a universal representation for each task. We hope that Meta-Dataset will help drive research in this important sub-field of machine learning.

Acknowledgements
Meta-Dataset was developed by Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol and Hugo Larochelle. We would like to thank Pablo Castro for his valuable guidance on this blog post, Chelsea Finn for fruitful discussions and ensuring the correctness of fo-MAML’s implementation, Zack Nado and Dan Moldovan for the initial dataset code that was adapted, Cristina Vasconcelos for spotting an issue in the ranking of models, and John Bronskill for suggesting that we experiment with a larger inner-loop learning rate for MAML, which indeed significantly improved our fo-MAML results.

Speeding Up Neural Network Training with Data Echoing

Speeding Up Neural Network Training with Data Echoing

Posted by Dami Choi, Student Researcher and George Dahl, Senior Research Scientist, Google Research

Over the past decade, dramatic increases in neural network training speed have made it possible to apply deep learning techniques to many important problems. In the twilight of Moore’s law, as improvements in general purpose processors plateau, the machine learning community has increasingly turned to specialized hardware to produce additional speedups. For example, GPUs and TPUs optimize for highly parallelizable matrix operations, which are core components of neural network training algorithms. These accelerators, at a high level, can speed up training in two ways. First, they can process more training examples in parallel, and second, they can process each training example faster. We know there are limits to the speedups from processing more training examples in parallel, but will building ever faster accelerators continue to speed up training?

Unfortunately, not all operations in the training pipeline run on accelerators, so one cannot simply rely on faster accelerators to continue driving training speedups. For example, earlier stages in the training pipeline like disk I/O and data preprocessing involve operations that do not benefit from GPUs and TPUs. As accelerator improvements outpace improvements in CPUs and disks, these earlier stages will increasingly become a bottleneck, wasting accelerator capacity and limiting training speed.

An example training pipeline representative of many large-scale computer vision programs. The stages that come before applying the mini-batch stochastic gradient descent (SGD) update generally do not benefit from specialized hardware accelerators.

Consider a scenario where the code upstream of the accelerator takes twice as long as the code that runs on the accelerator – a scenario that is already realistic for some workloads today. Even if the code is pipelined to execute the upstream and downstream stages in parallel, the upstream stage will dominate training time and the accelerator will be idle 50% of the time. In this case, building a faster accelerator will not improve training speed at all. It may be possible to speed up the input pipeline by dedicating engineering effort and additional compute resources, but such efforts are time consuming and distract from the main goal of improving predictive performance. For very small datasets, one can precompute the augmented dataset offline and load the entire preprocessed dataset in memory, but this doesn’t work for most ML training scenarios.

In “Faster Neural Network Training with Data Echoing”, we propose a simple technique that reuses (or “echoes”) intermediate outputs from earlier pipeline stages to reclaim idle accelerator capacity. Rather than waiting for more data to become available, we simply utilize data that is already available to keep the accelerators busy.

Left: Without data echoing, downstream computational capacity is idle 50% of the time. Right: Data echoing with echoing factor 2 reclaims downstream computational capacity.

Repeating Data to Train Faster
Imagine a situation where reading and preprocessing a batch of training data takes twice as long as performing a single optimization step on that batch. In this case, after the first optimization step on the preprocessed batch, we can reuse the batch and perform a second step before the next batch is ready. In the best case scenario, where repeated data is as useful as fresh data, we would see a twofold speedup in training. In reality, data echoing provides a slightly smaller speedup because repeated data is not as useful as fresh data – but it can still provide a significant speedup compared to leaving the accelerator idle.

There are typically several ways to implement data echoing in a given neural network training pipeline. The technique we propose involves duplicating data into a shuffle buffer somewhere in the training pipeline, and we are free to insert this buffer after whichever stage produces the bottleneck. When we insert the buffer before batching, we call our technique example echoing, whereas when we insert it after batching, we call it batch echoing. Example echoing shuffles data at the example level, while batch echoing shuffles the sequence of duplicate batches. We can also insert the buffer before data augmentation, such that each copy of repeated data is slightly different (and therefore closer to a fresh example). Which placement of the shuffle buffer provides the greatest speedup depends on the specific training pipeline.
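As a rough illustration, example echoing can be implemented with a few lines of tf.data code. The function and parameter names below are ours, not the paper’s released pipeline, and placing the same repeat-then-shuffle stage after batching would give batch echoing instead.

```python
import tensorflow as tf

def example_echo(dataset, echo_factor, shuffle_buffer_size):
    """Repeat each (image, label) pair `echo_factor` times, then shuffle.

    Insert this after the expensive read/preprocess stages and before batching
    to obtain example echoing. A sketch only, not the paper's exact pipeline.
    """
    repeated = dataset.flat_map(
        lambda image, label:
            tf.data.Dataset.from_tensors((image, label)).repeat(echo_factor))
    return repeated.shuffle(shuffle_buffer_size)

# Hypothetical usage, assuming `parse` and `augment` are the expensive stages:
# ds = tf.data.TFRecordDataset(files).map(parse).map(augment)
# ds = example_echo(ds, echo_factor=2, shuffle_buffer_size=10_000).batch(128)
```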

Data Echoing Across Workloads
So how useful is reusing data? We tried data echoing on five neural network training pipelines spanning 3 different tasks – image classification, language modeling, and object detection – and measured the number of fresh examples needed to reach a particular performance target. We chose targets to match the best result reliably achieved by the baseline during hyperparameter tuning. We found that data echoing allowed us to reach the target performance with fewer fresh examples, demonstrating that reusing data is useful for reducing disk I/O across a variety of tasks. In some cases, repeated data is nearly as useful as fresh data: in the figure below, example echoing before augmentation reduces the number of fresh examples required almost by the repetition factor.

Data echoing, when each data item is repeated twice, either reduces or does not change the number of fresh examples needed to reach the target out-of-sample performance. Dashed lines indicate the values we would expect if repeated examples were as useful as fresh examples.

Reduction in Training Time
Data echoing can speed up training whenever computation upstream of the accelerators dominates training time. We measured the training speedup achieved in a training pipeline bottlenecked by input latency due to streaming training data from cloud storage, which is realistic for many of today’s large-scale production workloads or anyone streaming training data over a network from a remote storage system. We trained a ResNet-50 model on the ImageNet dataset and found that data echoing provided a significant speedup, in this case making training more than 3 times faster.

Data echoing can reduce training time for ResNet-50 on ImageNet. In this experiment, reading a batch of training data from cloud storage took 6 times longer than the code that used each batch of data to perform a training step. The Echoing factor in the legend refers to the number of times each data item was repeated. Dashed lines indicate the expected values if repeated examples were as useful as fresh examples and there was no overhead from echoing.

Data Echoing Preserves Predictive Performance
Although one might be concerned that reusing data would harm the model’s final performance, we found that data echoing did not degrade the quality of the final model for any of the workloads we tested.

Comparing the individual trials that achieved the best out-of-sample performance during training, with and without data echoing, shows that reusing data does not harm final model quality. Here, validation cross entropy is equivalent to log perplexity.

As improvements in specialized accelerators like GPUs and TPUs continue to outpace general purpose processors, we expect data echoing and similar strategies to become increasingly important parts of the neural network training toolkit.

Acknowledgements
The Data Echoing project was conducted by Dami Choi, Alexandre Passos, Christopher J. Shallue, and George E. Dahl while Dami Choi was a Google AI Resident. We would also like to thank Roy Frostig, Luke Metz, Yiding Jiang, and Ting Chen for helpful discussions.

Meet the Googlers working to ensure tech is for everyone

Meet the Googlers working to ensure tech is for everyone

During their early studies and careers, Tiffany Deng, Tulsee Doshi and Timnit Gebru found themselves asking the same questions: Why is it that some products and services work better for some than others, and why isn’t everyone represented around the table when a decision is being made? Their collective passion to create a digital world that works for everyone is what brought the three women to Google, where they lead efforts to make machine learning systems fair and inclusive. 

I sat down with Tiffany, Tulsee and Timnit to discuss why working on machine learning fairness is so important, and how they came to work in this field.  

How would you explain your job to someone who isn’t in tech?

Tiffany: I’d say my job is to make sure we’re not reinforcing any of the entrenched and embedded biases humans might have into products people use, and that every time you pick up a product—a Google product—you as an individual can have a good experience when using it. 

Timnit: I help machines understand imagery and text. Just like a human, when a machine tries to learn a pattern or understand something, it learns from the input it is trained on, and that input, or data in this case, can carry societal bias. This could lead to a biased outcome or prediction made by the machine. And my work is to figure out different ways of mitigating this bias.

Tulsee: My work includes making sure everyone has positive experiences with our products, and that people don’t feel excluded or stereotyped, especially based on their identities. The products should work for you as an individual, and provide the best experience possible. 

What made you want to work in this field?

Tulsee: When I started college, I was unsure of what I wanted to study. I came in with an interest in math, and quickly found myself taking a variety of classes in computer science, among other topics. But no matter which interesting courses I took, I often felt a disconnect between what I was studying and the people the work would help. I kept coming back to wanting to focus on people, and after taking classes like child psychology and philosophy of AI, I decided I wanted to take on a role where I could combine my skill sets with a people-centered approach. I think everyone has an experience of services and technology not working for them, and solving for that is a passion behind much of what I do. 

Tiffany: After graduating from West Point I joined the army as an intelligence officer before becoming a consultant and working for the State Department and the Department of Defense. I then joined Facebook as a privacy manager for a period of time, and that’s when I started working on more ML fairness-related matters. When people ask me how I ended up where I am, I’d say that there’s never a straight path to finding your passion, and all the experiences that I’ve had outside of tech are ones I bring into the work I’m doing today. 

An important “aha moment” for me was about a year and a half ago, when my son had a rash all over his body and we went to the doctor to get help. They told us they weren’t able to diagnose him because his skin wasn’t red, and of course, his skin won’t turn red as he has deep brown skin. Someone telling me they can’t diagnose my son because of his skin—that’s troubling as a parent. I wanted to understand the root cause of the issue—why is this not working for me and my family, the way it does for others? Fast forwarding, when thinking about how AI will someday be ubiquitous and an important component in assisting human decision-making, I wanted to get involved and help ensure that we’re building technology that works equally as well for everyone. 

Timnit: I grew up with a father and two sisters working in electrical engineering, so I followed their path and decided to also pursue studies in the field. After spending some time at Apple working as a circuit designer and starting my own company, I went back to studying image processing and completed a Ph.D. in computer vision. Towards the end of my Ph.D., I read a ProPublica article discussing racial bias in predicting crime recidivism rates. At the same time, I started thinking more about how there were very few, if any, Black people in grad school, and that whenever I went to conferences, Black people weren’t represented in the decisions driving this field of work. That’s how I came to found a nonprofit organization called Black in AI, along with Rediet Abebe, to increase the visibility of Black people working in the field. After graduating with my Ph.D., I did a postdoc at Microsoft Research, and soon after that I took a role at Google as the co-lead of the Ethical AI research team, which was founded by Meg Mitchell.

What are some of the main challenges in this work, and why is it so important? 

Tulsee: The challenge question is interesting, and a hard one. First of all, there is the theoretical and sociological question of the notion of fairness: how does one define what is fair? Addressing fairness concerns requires multiple perspectives and product development approaches, ranging from the technical to design. Because of this, even for use cases where we have a lot of experience, there are still many challenges for product teams in understanding the different approaches for measuring and tackling fairness concerns. This is one of the reasons why I believe tooling and resources are so critical, and why we’re investing in them for both internal and external purposes.

Another important aspect is company culture and how companies define their values and motivate their employees. We are starting to see a growing, industry-wide shift in terms of what success looks like. The more organizations and product creators are rewarded for thinking about a broader set of people when developing products, the more companies will foster a diverse workforce, consult external experts and think about whose voices are represented at the table. We need to remember we’re talking about real people’s experiences, and while working on these issues can sometimes be emotionally difficult, it’s so important to get right. 

Timnit: A general challenge is that the people who are most negatively affected are often the ones whose voices are not heard. Representation is an important issue, and while there are a lot of opportunities for ML technology in society, it’s important to have a diverse set of people and perspectives involved in development so you don’t end up widening the gap between different groups.

This is not an issue that is specific to ML. As an example, consider DNA sequencing. The African continent has the most diverse DNA in the world, but I read that it accounts for less than 1 percent of the DNA studied in sequencing research, and there are examples of researchers who have come to the wrong conclusions based on data that was not representative. Now imagine someone is looking to develop the next generation of drugs; the result could be that they don’t work for certain groups because their DNA hasn’t been properly represented. 

Do you think ML has the potential to help complement human decision making, and drive the world to become more fair?

Timnit: It’s important to recognize the complexity of the human mind, and that humans should not be replaced when it comes to decision making. I don’t think ML can make the world more fair: only humans can do that. And humans choose how to use this technology. In terms of opportunities, there are many ways in which we have already used ML systems to uncover societal bias, and this is something I work on as well. For example, studies by Jennifer Eberhardt and her collaborators at Stanford University, including Vinodkumar Prabhakaran, who has since joined our team, used natural language processing to analyze body camera recordings of police stops in Oakland. They found a pattern of police speaking less respectfully to Black people than to white people. A lot of times, when you show these issues backed up by data and scientific analysis, it can help make a case. At the same time, the history of scientific racism also shows that data can be used to propagate the most harmful societal biases of the day. Blindly trusting data-driven studies or decisions can be dangerous. It’s important to understand the context under which these studies are conducted, and to work with affected communities and other domain experts to formulate the questions that need to be addressed.

Tiffany: I think ML will be incredibly important to help with things like climate change, sustainability and helping save endangered animals. Timnit’s work on using AI to help identify diseased cassava plants is an incredible use of AI, especially in the developing world. The range of problems AI can aid humans with is endless—we just have to ensure we continue to build technological solutions with ethics and inclusion at the forefront of our conversations.


Agile and Intelligent Locomotion via Deep Reinforcement Learning

Agile and Intelligent Locomotion via Deep Reinforcement Learning

Posted by Yuxiang Yang and Deepali Jain, AI Residents, Robotics at Google

Recent advancements in deep reinforcement learning (deep RL) have enabled legged robots to learn many agile skills through automated environment interactions. In the past few years, researchers have greatly improved sample efficiency by using off-policy data, imitating animal behaviors, or performing meta learning. However, sample efficiency remains a bottleneck for most deep reinforcement learning algorithms, especially in the legged locomotion domain. Moreover, most existing works focus only on simple, low-level skills, such as walking forward, walking backward and turning. In order to operate autonomously in the real world, robots still need to combine these skills to generate more advanced behaviors.

Today we present two projects that aim to address the above problems and help close the perception-actuation loop for legged robots. In “Data Efficient Reinforcement Learning for Legged Robots”, we present an efficient way to learn low-level motion control policies. By fitting a dynamics model to the robot and planning for actions in real time, the robot learns multiple locomotion skills using less than 5 minutes of data. Going beyond simple behaviors, we explore automatic path navigation in “Hierarchical Reinforcement Learning for Quadruped Locomotion”. With a policy architecture designed for end-to-end training, the robot learns to combine a high-level planning policy with a low-level motion controller, in order to navigate autonomously through a curved path.

Data Efficient Reinforcement Learning for Legged Robots
A major roadblock in RL is the lack of sample efficiency. Even with a state-of-the-art sample-efficient learning algorithm like Soft Actor-Critic (SAC), learning a reasonable walking policy would still require more than an hour of data, an amount that is difficult to collect in the real world.

In a continued effort to learn walking skills using minimal interaction with the real-world environment, we present another, more sample-efficient, model-based method for learning basic walking skills that dramatically reduces the amount of training data needed. Instead of directly learning a policy that maps from environment state to robot action, we learn a dynamics model of the robot that estimates future states given its current state and action. Since the entire learning process requires less than 5 minutes of data, it can be performed directly on the real robot.

We start by executing random actions on the robot, and fit the model to the data collected. With the model fitted, we control the robot using a model predictive control (MPC) planner. We iterate between collecting more data with MPC and re-training the model to better fit the dynamics of the environment.

Overview of the model-based learning pipeline. The system alternates between fitting the dynamics model and collecting trajectories using model predictive control (MPC).
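To convey the flavor of the planning step, here is a minimal random-shooting MPC sketch over a learned dynamics model. The callables `dynamics_model` and `reward_fn`, along with every constant below, are illustrative assumptions on our part rather than the authors’ implementation.

```python
import numpy as np

def mpc_plan(dynamics_model, reward_fn, state, horizon=25, num_candidates=400,
             action_dim=8, rng=None):
    """Random-shooting MPC over a learned dynamics model.

    `dynamics_model(state, action)` returns the predicted next state and
    `reward_fn(state, action)` scores a transition; both are assumed to be
    supplied by the user. A generic sketch, not the authors' exact planner.
    """
    rng = rng or np.random.default_rng()
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    returns = np.zeros(num_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            returns[i] += reward_fn(s, a)
            s = dynamics_model(s, a)      # roll the learned model forward
    # Execute only the first action of the best candidate sequence, then replan.
    return candidates[np.argmax(returns), 0]
```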

In standard MPC, the controller plans for a sequence of actions at each timestep, and only executes the first of the planned actions. While online replanning with regular feedback from the robot to the controller makes the controller robust to model inaccuracies, it also poses a challenge for the action planner, as planning must finish before the next step of the control loop (usually less than 10 ms for legged robots). To satisfy such a tight time constraint, we introduce a multi-threaded, asynchronous version of MPC, with action planning and execution happening on different threads. As the execution thread applies actions at a high frequency, the planning thread optimizes for actions in the background without interruption. Furthermore, since action planning can take multiple timesteps, the robot state would have changed by the time planning has finished. To compensate for this planning latency, we devise a technique that first predicts the future state at the time the planner is expected to finish its computation, and then uses this predicted state to seed the planning algorithm.

We separate action planning and execution on different threads.
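Below is a minimal sketch of the latency-compensation idea, assuming a learned `dynamics_model(state, action)` and a planning routine such as the one sketched above; the names are ours, not from the released code.

```python
def plan_with_latency_compensation(dynamics_model, current_state, queued_actions, plan_from):
    """Roll the learned model through the actions that will execute while the
    planner runs, then plan from that predicted future state rather than the
    (soon to be stale) current state.

    `queued_actions` is the slice of the previous plan expected to execute during
    planning; `plan_from(state)` could be, e.g., the random-shooting MPC above.
    """
    predicted_state = current_state
    for action in queued_actions:          # actions already scheduled on the execution thread
        predicted_state = dynamics_model(predicted_state, action)
    return plan_from(predicted_state)
```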

Although MPC refreshes the action plan frequently, the planner still needs to work over long action horizons to keep track of the long-term goal and avoid myopic behaviors. To that end, we use a multi-step loss function, a reformulation of the model loss function that reduces error accumulation over time by penalizing prediction errors over a range of future steps rather than a single step.
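In code, the idea looks roughly like the sketch below: the model is unrolled on its own predictions and penalized for drifting from the observed trajectory over several future steps. This is an illustration of the concept, not the paper’s exact loss; `dynamics_model`, the array shapes, and `num_steps` are assumptions.

```python
import numpy as np

def multi_step_loss(dynamics_model, states, actions, num_steps=10):
    """Unroll the learned model and accumulate prediction error over many steps.

    `states` has shape [T+1, state_dim] and `actions` has shape [T, action_dim].
    Starting from each observed state, the model is rolled forward on its own
    predictions for up to `num_steps`, and drift from the observed states is
    penalized. Illustrative sketch only.
    """
    total, count = 0.0, 0
    T = len(actions)
    for t in range(T):
        s = states[t]
        for k in range(min(num_steps, T - t)):
            s = dynamics_model(s, actions[t + k])            # model's own prediction
            total += np.sum((s - states[t + k + 1]) ** 2)    # compare to observed state
            count += 1
    return total / count
```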

Safety is another concern for learning on the real robot. For legged robots, a small mistake, such as missing a foot step, could lead to catastrophic failures, from the robot falling to the motor overheating. To ensure safe exploration, we embed a stable, in-place stepping gait prior that is modulated by a trajectory generator. With the stable walking prior, MPC can then safely explore the action space.

Combining an accurate dynamics model with an online, asynchronous MPC controller, the robot successfully learned to walk using only 4.5 minutes of data (36 episodes). The learned dynamics model is also generalizable: by simply changing the reward function of MPC, the controller is able to optimize for different behaviors, such as walking backwards, or turning, without re-training. As an extension, we use a similar framework to enable even more agile behaviors. For example, in simulation the robot learns to backflip and walk on its rear legs, though these behaviors are yet to be learned by the real robot.

The robot learns to walk using only 4.5 minutes of data.
The robot learns to backflip and walk with rear legs using the same framework.

Combining low-level controller with high-level planning
Although model-based RL has allowed the robot to learn simple locomotion skills efficiently, such skills are insufficient for handling complex, real-world tasks. For example, in order to navigate through an office space, the robot may have to adjust its speed, direction and height multiple times, instead of following a pre-defined speed profile. Traditionally, people solve such complex tasks by breaking them down into multiple hierarchical sub-problems, such as a high-level trajectory planner and a low-level trajectory-following controller. However, manually defining a suitable hierarchy is typically a tedious task, as it requires careful engineering for each sub-problem.

In our second paper, we introduce a hierarchical reinforcement learning (HRL) framework that can be trained to automatically decompose complex reinforcement learning tasks. We break down our policy structure into a high-level and a low-level policy. Instead of designing each policy manually, we only define a simple communication protocol between the policy levels. In this framework, the high-level policy (e.g., a trajectory planner) commands the low-level policy (such as the motion control policy) through a latent command, and decides for how long to hold that command constant before issuing a new one. The low-level policy then interprets the latent command from the high-level policy, and gives motor commands to the robot.

To facilitate learning, we also split the observation space into high-level (e.g., robot position and orientation) and low-level (IMU, motor positions) observations, which are fed to their corresponding policies. This architecture naturally allows the high-level policy to operate at a slower timescale than the low-level policy, which saves computation resources and reduces training complexity.

Framework of Hierarchical Policy: The policy gets observations from the robot and sends motor commands to execute desired actions. It is split into two levels (high and low). The high-level policy gives a latent command to the low-level policy and also decides the duration for which the low-level policy will run.
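A bare-bones version of this interface might look like the following sketch, where `high_level` and `low_level` stand in for the two learned policies and all names and shapes are illustrative.

```python
import numpy as np

class HierarchicalPolicy:
    """High-level policy emits a latent command and a duration; the low-level
    policy turns (low-level observation, latent command) into motor commands.

    `high_level` and `low_level` are assumed callables (e.g., small neural nets)
    returning NumPy arrays; the interface mirrors the description above, but the
    names are illustrative, not the authors' code.
    """

    def __init__(self, high_level, low_level):
        self.high_level = high_level
        self.low_level = low_level
        self.latent, self.steps_left = None, 0

    def act(self, high_obs, low_obs):
        if self.steps_left == 0:
            # Re-plan only occasionally: get a new latent command and how long to hold it.
            self.latent, duration = self.high_level(high_obs)
            self.steps_left = int(duration)
        self.steps_left -= 1
        # The low-level policy runs at every control step.
        return self.low_level(np.concatenate([low_obs, self.latent]))
```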

Since the high-level and low-level policies operate at discrete timescales, the entire policy structure is not end-to-end differentiable, and standard gradient-based RL algorithms like PPO and SAC cannot be used. Instead, we choose to train the hierarchical policy through augmented random search (ARS), a simple evolutionary optimization method that has demonstrated good performance in reinforcement learning tasks. Weights of both levels of the policy are trained together, where the objective is to maximize the total reward from the robot trajectory.
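For reference, a basic ARS update perturbs the flattened policy weights in random directions, evaluates rollouts in both directions, and moves the weights along the best-performing perturbations. The sketch below shows this generic update, with `rollout_return` standing in for an episode rollout; it is not the authors’ exact training configuration.

```python
import numpy as np

def ars_step(theta, rollout_return, num_dirs=16, top_b=8, step_size=0.02,
             noise=0.03, rng=None):
    """One update of basic Augmented Random Search on flattened policy weights.

    `rollout_return(weights)` is assumed to run an episode with the given
    weights (both policy levels flattened into one vector here) and return the
    total reward. Generic ARS sketch, not the paper's exact setup.
    """
    rng = rng or np.random.default_rng()
    deltas = rng.standard_normal((num_dirs, theta.size))
    plus = np.array([rollout_return(theta + noise * d) for d in deltas])
    minus = np.array([rollout_return(theta - noise * d) for d in deltas])
    # Keep the directions with the largest return in either direction.
    order = np.argsort(np.maximum(plus, minus))[::-1][:top_b]
    reward_std = np.concatenate([plus[order], minus[order]]).std() + 1e-8
    update = ((plus[order] - minus[order])[:, None] * deltas[order]).sum(0)
    return theta + step_size / (top_b * reward_std) * update
```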

We test our framework on a path-following task using the same quadruped robot. In addition to straight walking, the robot needs to steer in different directions to complete the task. Note that as the low-level policy does not know the robot’s position in the path, it does not have sufficient information to complete the entire task on its own. However, with the coordination between the high-level and low-level policies, steering behavior emerges automatically in the latent command space, which allows the robot to efficiently complete the path. After successful training in a simulated environment, we validate our results on hardware by transferring an HRL policy to a real robot and recording the resulting trajectories.

Successful trajectory of a robot on a curved path. Left: A plot of the trajectory traversed by the robot with dots along the trajectory marking the positions where the high-level policy sent a new latent command to the low-level policy. Middle: The robot walking along the path in the simulated environment. Right: The robot walking around the path in the real world.

To further demonstrate the learned hierarchical policy, we visualized the behavior of the learned low-level policy under different latent commands. As shown in the plot below, different latent commands can cause the robot to walk straight, or turn left or right at different rates. We also test the generalizability of the low-level policy by transferring it to new tasks from a similar domain, which, in our case, means following paths of different shapes. By fixing the low-level policy weights and only training the high-level policy, the robot could successfully traverse different paths.

Left: Visualization of a learned 2D latent command space. Vector directions correspond to the movement direction of the robot. Vector length is proportional to the distance covered. Right: Transfer of the low-level policy: An HRL policy was trained on a single path (right, top). The learned low-level policy was then reused when training the high-level policy on other paths (e.g., right, bottom).

Conclusion
Reinforcement learning offers a promising future for robotics by automating the controller design process. With model-based RL, we enabled efficient learning of generalizable locomotion behaviors directly on the real robot. With hierarchical RL, the robot learned to coordinate policies at different levels to achieve more complex tasks. In the future, we plan to bring perception into the loop, so that robots can operate truly autonomously in the real world.

Acknowledgements
Both Deepali Jain and Yuxiang Yang are residents in the AI Residency program, mentored by Ken Caluwaerts and Atil Iscen. We would also like to thank Jie Tan and Vikas Sindhwani for support of the research, and Noah Broestl for managing the New York AI Residency Program.