Image-Text Pre-training with Contrastive Captioners

Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large scale models.

In computer vision, pioneering work has shown the effectiveness of single-encoder models pre-trained for image classification to capture generic visual representations that are effective for other downstream tasks. More recently, contrastive dual-encoder (CLIP, ALIGN, Florence) and generative encoder-decoder (SimVLM) approaches trained using web-scale noisy image-text pairs have been explored. Dual-encoder models exhibit remarkable zero-shot image classification capabilities but are less effective for joint vision-language understanding. On the other hand, encoder-decoder methods are good at image captioning and visual question answering but cannot perform retrieval-style tasks.

In “CoCa: Contrastive Captioners are Image-Text Foundation Models”, we present a unified vision backbone model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder approach that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be directly applicable to all types of downstream tasks. Specifically, CoCa achieves state-of-the-art results on a series of vision and vision-language tasks spanning vision recognition, cross-modal alignment, and multimodal understanding. Furthermore, it learns highly generic representations, so that with zero-shot learning or frozen encoders it can perform as well as or better than fully fine-tuned models.

Overview of Contrastive Captioners (CoCa) compared to single-encoder, dual-encoder and encoder-decoder models.

Method
We propose CoCa, a unified training framework that combines contrastive loss and captioning loss on a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging single-encoder, dual-encoder and encoder-decoder paradigms.

To this end, we present a novel encoder-decoder architecture where the encoder is a vision transformer (ViT), and the text decoder transformer is decoupled into two parts, a unimodal text decoder and a multimodal text decoder. We skip cross-attention in unimodal decoder layers to encode text-only representations for contrastive loss, and cascade multimodal decoder layers with cross-attention to image encoder outputs to learn multimodal image-text representations for captioning loss. This design maximizes the model’s flexibility and universality in accommodating a wide spectrum of tasks, and at the same time, it can be efficiently trained with a single forward and backward propagation for both training objectives, resulting in minimal computational overhead. Thus, the model can be trained end-to-end from scratch with training costs comparable to a naïve encoder-decoder model.
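As a rough illustration of how the two objectives can share one forward pass, the sketch below combines a symmetric contrastive loss over image and text embeddings with a token-level captioning loss. It is a minimal sketch and not the CoCa implementation: the function names, the temperature value, and the loss weights `w_con` and `w_cap` are assumptions chosen for illustration.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Pair k of the batch is the positive; all other pairs act as negatives.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    image_to_text = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    text_to_image = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return image_to_text + text_to_image

def captioning_loss(token_logits, target_ids):
    # Mean negative log-likelihood of the target caption tokens.
    nll = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def coca_loss(image_embs, text_embs, token_logits, target_ids,
              w_con=1.0, w_cap=1.0):
    # w_con and w_cap are hypothetical weights, not values from the post.
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In this sketch the unimodal text embeddings would feed `contrastive_loss`, while the multimodal decoder's per-token logits would feed `captioning_loss`, so both terms come from the same forward pass.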

Illustration of forward propagation used by CoCa for both contrastive and captioning losses.

Benchmark Results
The CoCa model can be directly fine-tuned on many tasks with minimal adaptation. By doing so, our model achieves a series of state-of-the-art results on popular vision and multimodal benchmarks, including (1) visual recognition: ImageNet, Kinetics-400/600/700, and MiT; (2) cross-modal alignment: MS-COCO, Flickr30K, and MSR-VTT; and (3) multimodal understanding: VQA, SNLI-VE, NLVR2, and NoCaps.

Comparison of CoCa with other image-text backbone models (without task-specific customization) and multiple state-of-the-art task-specialized models.

It is noteworthy that CoCa attains these results as a single model adapted for all tasks, while often being lighter than prior top-performing specialized models. For example, CoCa obtains 91.0% ImageNet top-1 accuracy while using less than half the parameters of prior state-of-the-art models. In addition, CoCa shows strong generative capability, producing high-quality image captions.

Image classification scaling performance comparing fine-tuned ImageNet top-1 accuracy versus model size.
Text captions generated by CoCa with NoCaps images as input.

Zero-Shot Performance
Besides achieving excellent performance with fine-tuning, CoCa also outperforms previous state-of-the-art models on zero-shot learning tasks, including image classification and cross-modal retrieval. CoCa obtains 86.3% zero-shot accuracy on ImageNet while also robustly outperforming prior models on challenging variant benchmarks, such as ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. As shown in the figure below, CoCa obtains better zero-shot accuracy with smaller model sizes compared to prior methods.
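Zero-shot classification with aligned image and text embeddings is commonly done by embedding a text prompt for each class and picking the class whose embedding is closest to the image embedding. The sketch below shows that idea with toy two-dimensional vectors standing in for real encoder outputs. The function names and the embeddings are assumptions for illustration, not part of the CoCa release.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    # class_text_embs maps a label to the text embedding of a prompt
    # describing that class; no classifier is trained on labeled images.
    return max(class_text_embs,
               key=lambda label: cosine(image_emb, class_text_embs[label]))

# Toy stand-ins for encoder outputs (hypothetical values).
toy_class_embs = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
print(zero_shot_classify([0.9, 0.1], toy_class_embs))  # dog
```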

Image classification scaling performance comparing zero-shot ImageNet top-1 accuracy versus model size.

Frozen Encoder Representation
One particularly exciting observation is that CoCa achieves results comparable to the best fine-tuned models using only a frozen visual encoder, in which features extracted after model training are used to train a classifier, rather than the more computationally intensive effort of fine-tuning a model. On ImageNet, a frozen CoCa encoder with a learned classification head obtains 90.6% top-1 accuracy, which is better than the fully fine-tuned performance of existing backbone models (90.1%). We also find this setup to work extremely well for video recognition. We feed sampled video frames into the CoCa frozen image encoder individually, and fuse output features by attentional pooling before applying a learned classifier. This simple approach using a CoCa frozen image encoder achieves video action recognition top-1 accuracy of 88.0% on the Kinetics-400 dataset and demonstrates that CoCa learns a highly generic visual representation with the combined training objectives.
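The frame-fusion step can be pictured as a single learned query attending over per-frame features. The sketch below is a simplified version under stated assumptions: it uses one query vector, skips the key and value projections a full attention layer would have, and takes frame features as plain lists. The name `attentional_pool` is hypothetical.

```python
import math

def attentional_pool(frame_features, query):
    # Score each frame against the query, softmax the scores, and return
    # the weighted average of the frame features plus the weights.
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in frame_features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * feat[j] for w, feat in zip(weights, frame_features))
              for j in range(d)]
    return pooled, weights
```

With an all-zero query the weights are uniform and the result reduces to mean pooling over frames; a trained query lets the classifier emphasize the most informative frames while the image encoder stays frozen.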

Comparison of Frozen CoCa visual encoder with (multiple) best-performing fine-tuned models.

Conclusion
We present Contrastive Captioner (CoCa), a novel pre-training paradigm for image-text backbone models. This simple method is widely applicable to many types of vision and vision-language downstream tasks, and obtains state-of-the-art performance with minimal or even no task-specific adaptations.

Acknowledgements
We would like to thank our co-authors Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu who have been involved in all aspects of the project. We also would like to thank Yi-Ting Chen, Kaifeng Chen, Ye Xia, Zhen Li, Chao Jia, Yinfei Yang, Zhengdong Zhang, Wei Han, Yuan Cao, Tao Zhu, Futang Peng, Soham Ghosh, Zihang Dai, Xin Li, Anelia Angelova, Jason Baldridge, Izhak Shafran, Shengyang Dai, Abhijit Ogale, Zhifeng Chen, Claire Cui, Paul Natsev, Tom Duerig for helpful discussions, Andrew Dai for help with contrastive models, Christopher Fifty and Bowen Zhang for help with video models, Yuanzhong Xu for help with model scaling, Lucas Beyer for help with data preparation, Andy Zeng for help with MSR-VTT evaluation, Hieu Pham and Simon Kornblith for help with zero-shot evaluations, Erica Moreira and Victor Gomes for help with resource coordination, Liangliang Cao for proofreading, Tom Small for creating the animations used in this blogpost, and others in the Google Brain team for support throughout this project.


How we build with and for people with disabilities

Editor’s note: Today is Global Accessibility Awareness Day. We’re also sharing how we’re making education more accessible and launching a new Android accessibility feature.

Over the past nine years, my job has focused on building accessible products and supporting Googlers with disabilities. Along the way, I’ve been constantly reminded of how vast and diverse the disability community is, and how important it is to continue working alongside this community to build technology and solutions that are truly helpful.

Before delving into some of the accessibility features our teams have been building, I want to share how we’re working to be more inclusive of people with disabilities to create more accessible tools overall.

Nothing about us, without us

In the disability community, people often say “nothing about us without us.” It’s a sentiment that I find sums up what disability inclusion means. The types of barriers that people with disabilities face in society vary depending on who they are, where they live and what resources they have access to. No one’s experience is universal. That’s why it’s essential to include a wide array of people with disabilities at every stage of the development process for any of our accessibility products, initiatives or programs.

We need to work to make sure our teams at Google are reflective of the people we’re building for. To do so, last year we launched our hiring site geared toward people with disabilities — including our Autism Career Program to further grow and strengthen our autistic community. Most recently, we helped launch the Neurodiversity Career Connector along with other companies to create a job portal that connects neurodiverse candidates to companies that are committed to hiring more inclusively.

Beyond our internal communities, we also must partner with communities outside of Google so we can learn what is truly useful to different groups and parlay that understanding into the improvement of current products or the creation of new ones. Those partnerships have resulted in the creation of Project Relate, a communication tool for people with speech impairments, the development of a completely new TalkBack, Android’s built-in screen reader, and the improvement of Select-to-Speak, a Chromebook tool that lets you hear selected text on your screen spoken out loud.

Equitable experiences for everyone

Engaging and listening to these communities — inside and outside of Google — make it possible to create tools and features like the ones we’re sharing today.

The ability to add alt-text, which is a short description of an image that is read aloud by screen readers, directly to images sent through Gmail starts rolling out today. With this update, people who use screen readers will know what’s being sent to them, whether it’s a GIF celebrating the end of the week or a screenshot of an important graph.

Communication tools that are inclusive of everyone are especially important as teams have shifted to fully virtual or hybrid meetings. Again, everyone experiences these changes differently. We’ve heard from some people who are deaf or hard of hearing that this shift has made it easier to identify who is speaking, something that is often more difficult in person. But, in the case of people who use ASL, we’ve heard that it can be difficult to be in a virtual meeting and simultaneously see their interpreter and the person speaking to them.

Multi-pin, a new feature in Google Meet, helps solve this. Now you can pin multiple video tiles at once, for example, the presenter’s screen and the interpreter’s screen. And like many accessibility features, the usefulness extends beyond people with disabilities. The next time someone is watching a panel and wants to pin multiple people to the screen, this feature makes that possible.

We’ve also been working to make video content more accessible to those who are blind or low-vision through audio descriptions that describe verbally what is on the screen visually. All of our English language YouTube Originals content from the past year — and moving forward — will now have English audio descriptions available globally. To turn on the audio description track, at the bottom right of the video player, click on “Settings”, select “Audio track”, and choose “English descriptive”.

For many people with speech impairments, being understood by the technology that powers tools like voice typing or virtual assistants can be difficult. In 2019, we started work to change that through Project Euphonia, a research initiative that works with community organizations and people with speech impairments to create more inclusive speech recognition models. Today, we’re expanding Project Euphonia’s research to include four more languages: French, Hindi, Japanese and Spanish. With this expansion, we can create even more helpful technology for more people — no matter where they are or what language they speak.

I’ve learned so much in my time working in this space and among the things I’ve learned is the absolute importance of building right alongside the very people who will most use these tools in the end. We’ll continue to do that as we work to create a more inclusive and accessible world, both physically and digitally.


Vector-Quantized Image Modeling with Improved VQGAN

In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora.

This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer codes (represented as natural numbers), and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second stage CNN or Transformer is then trained to model the distribution of encoded latent variables. The second stage can also be applied to autoregressively generate an image after the training. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).

In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.

Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this that introduces an adversarial loss to promote high quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allows it to capture distant interactions using fewer layers.

In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.
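The projection-then-lookup step can be sketched as follows: each encoder output is linearly projected to a low-dimensional latent, and the token is the index of the nearest codebook entry. This is a minimal sketch with tiny dimensions (4 to 2 instead of 768 to 32). The function names and the plain Euclidean distance are assumptions, not details confirmed by the post.

```python
def project(vec, weight):
    # weight holds one row per output dimension (a bias-free linear layer).
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def nearest_code(latent, codebook):
    # Index of the codebook entry with the smallest squared distance.
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def quantize(encoder_outputs, weight, codebook):
    # One integer token per encoder output (i.e., per image patch).
    return [nearest_code(project(v, weight), codebook) for v in encoder_outputs]

# Toy example: 4-d encoder outputs, 2-d latent space, 2 codes.
toy_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
print(quantize([[0.9, 1.1, 5.0, 5.0], [0.1, -0.1, 3.0, 3.0]],
               toy_weight, toy_codebook))  # [1, 0]
```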

Overview of the proposed ViT-VQGAN (left) and VIM (right), which, when working together, are capable of both image generation and image understanding. In the first stage, ViT-VQGAN converts images into discrete integers, which the autoregressive Transformer (Stage 2) then learns to model. Finally, the Stage 1 decoder is applied to these tokens to enable generation of high quality images from scratch.

With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8×8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.

VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.
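Token-by-token sampling with an optional class prefix can be sketched as below. The `next_token_probs` callable stands in for the trained Transformer's output softmax, and `toy_next_token_probs` is a hypothetical toy model over a four-code vocabulary. Neither is part of the VIM code.

```python
import random

def sample_image_tokens(next_token_probs, num_tokens, class_token=None, seed=0):
    # Prepend the class-ID token (if any), then sample one image token at a
    # time from the model's distribution over the next token.
    rng = random.Random(seed)
    prefix = [] if class_token is None else [class_token]
    tokens = []
    for _ in range(num_tokens):
        probs = next_token_probs(prefix + tokens)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens

def toy_next_token_probs(context):
    # Hypothetical stand-in for a trained model: the class token shifts
    # probability mass toward codes 2 and 3.
    if context and context[0] == "class:terrier":
        return [0.0, 0.0, 0.5, 0.5]
    return [0.25, 0.25, 0.25, 0.25]

unconditional = sample_image_tokens(toy_next_token_probs, 16)
conditional = sample_image_tokens(toy_next_token_probs, 16,
                                  class_token="class:terrier")
```

For a 256×256 image with one token per 8×8 patch, a full sample would need 32 × 32 = 1024 tokens, which the Stage 1 decoder then maps back to pixels.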

Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.

To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen) and insert a softmax layer (learnable) projecting averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.
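The probing head described above amounts to mean pooling followed by a single linear layer and a softmax. The sketch below shows only the forward pass. In practice `weight` and `bias` would be the learnable parameters and the token features would come from the frozen Transformer block. All names here are assumptions for illustration.

```python
import math

def mean_pool(token_features):
    # Average the frozen per-token features over the sequence dimension.
    n = len(token_features)
    return [sum(col) / n for col in zip(*token_features)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def probe_predict(token_features, weight, bias):
    # weight has one row per class; only weight and bias would be trained.
    pooled = mean_pool(token_features)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)
```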

Experimental Results
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256×256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.

We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.04 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.

Model              Acceptance Rate   FID     IS
Validation data    1.0               1.62    235.0
DCTransformer      1.0               36.5    N/A
BigGAN             1.0               7.53    168.6
BigGAN-deep        1.0               6.84    203.6
IDDPM              1.0               12.3    N/A
ADM-G, 1.0 guid.   1.0               4.59    186.7
VQVAE-2            1.0               ~31     ~45
VQGAN              1.0               17.04   70.6
VQGAN              0.5               10.26   125.5
VQGAN              0.25              7.35    188.6
ViT-VQGAN (Ours)   1.0               4.17    175.1
ViT-VQGAN (Ours)   0.5               3.04    227.4
Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256×256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.
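The relative improvements quoted for the best VQGAN and ViT-VQGAN rows can be checked directly from the table values. The helper name `relative_change` is ours.

```python
def relative_change(old, new):
    # Signed change of `new` relative to `old`.
    return (new - old) / old

# FID: lower is better, so the improvement is the negated change.
fid_improvement = -relative_change(7.35, 3.04)
# IS: higher is better.
is_improvement = relative_change(188.6, 227.4)

print(f"{fid_improvement:.1%}")  # 58.6%
print(f"{is_improvement:.1%}")   # 20.6%
```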

After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer, while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as image representation learning abilities.

Conclusion
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.

Acknowledgements
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam, Vijay Vasudevan, Zhifeng Chen and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.


Contextual Rephrasing in Google Assistant

When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by repeating previously shared contextual information in their follow-up queries to ensure that the assistant understands their requests and can provide relevant answers.

In this post, we present a technology currently deployed on Google Assistant that allows users to speak in a natural manner when referencing context that was defined in previous queries and answers. The technology, based on the latest machine learning (ML) advances, rephrases a user’s follow-up query to explicitly mention the missing contextual information, thus enabling it to be answered as a stand-alone query. While Assistant considers many types of context for interpreting the user input, in this post we are focusing on short-term conversation history.

Context Handling by Rephrasing
One of the approaches taken by Assistant to understand contextual queries is to detect if an input utterance is referring to previous context and then rephrase it internally to explicitly include the missing information. Following on from the previous example in which the user asked who wrote Romeo and Juliet, one may ask follow-up questions like “When?”. Assistant recognizes that this question is referring to both the subject (Romeo and Juliet) and answer from the previous query (William Shakespeare) and can rephrase “When?” to “When did William Shakespeare write Romeo and Juliet?”

While there are other ways to handle context, for instance, by applying rules directly to symbolic representations of the meaning of queries, like intents and arguments, the advantage of the rephrasing approach is that it operates horizontally at the string level across any query answering, parsing, or action fulfillment module.

Conversation on a smart display device, where Assistant understands multiple contextual follow-up queries, allowing the user to have a more natural conversation. The phrases appearing at the bottom of the display are suggestions for follow-up questions that the user can select. However, the user can still ask different questions.

A Wide Variety of Contextual Queries
The natural language processing field, traditionally, has not put much emphasis on a general approach to context, focusing on the understanding of stand-alone queries that are fully specified. Accurately incorporating context is a challenging problem, especially when considering the large variety of contextual query types. The table below contains example conversations that illustrate query variability and some of the many contextual challenges that Assistant’s rephrasing method can resolve (e.g., differentiating between referential and non-referential cases or identifying what context a query is referencing). We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.

System Architecture
At a high level, the rephrasing system generates rephrasing candidates by using different types of candidate generators. Each rephrasing candidate is then scored based on a number of signals, and the one with the highest score is selected.

High level architecture of Google Assistant contextual rephraser.

Candidate Generation
To generate rephrasing candidates we use a hybrid approach that applies different techniques, which we classify into three categories:

  1. Generators based on the analysis of the linguistic structure of the queries use grammatical and morphological rules to perform specific operations — for instance, the replacement of pronouns or other types of referential phrases with antecedents from the context.
  2. Generators based on query statistics combine key terms from the current query and its context to create candidates that match popular queries from historical data or common query patterns.
  3. Generators based on Transformer technologies, such as MUM, learn to generate sequences of words according to a number of training samples. LaserTagger and FELIX are technologies suitable for tasks with high overlap between the input and output texts, are very fast at inference time, and are not vulnerable to hallucination (i.e., generating text that is not related to the input texts). Once presented with a query and its context, they can generate a sequence of text edits to transform the input queries into a rephrasing candidate by indicating which portions of the context should be preserved and which words should be modified.

Candidate Scoring
We extract a number of signals for each rephrasing candidate and use an ML model to select the most promising candidate. Some of the signals depend only on the current query and its context. For example, is the topic of the current query similar to the topic of the previous query? Or, is the current query a good stand-alone query or does it look incomplete? Other signals depend on the candidate itself: How much of the information of the context does the candidate preserve? Is the candidate well-formed from a linguistic point of view? Etc.
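The generate-then-score flow can be sketched in a few lines. Everything below is a toy under stated assumptions: `pronoun_generator` is a crude stand-in for the linguistic generators, and the popularity table stands in for the ML ranker and its signals. None of these names or values come from Assistant.

```python
import re

PRONOUN = re.compile(r"\b(he|she|it|they)\b", re.IGNORECASE)

def pronoun_generator(query, context_entities):
    # One candidate per context entity, replacing the first pronoun.
    if not PRONOUN.search(query):
        return []
    return [PRONOUN.sub(lambda m, e=entity: e, query, count=1)
            for entity in context_entities]

def select_rephrasing(query, context_entities, generators, score):
    # Pool candidates from all generators and keep the highest-scoring one;
    # fall back to the original query when nothing was generated.
    candidates = [c for g in generators for c in g(query, context_entities)]
    if not candidates:
        return query
    return max(candidates, key=score)

# Hypothetical scoring signal: how common a candidate is as a stand-alone query.
TOY_QUERY_POPULARITY = {"Where was William Shakespeare born?": 0.9}

best = select_rephrasing(
    "Where was he born?",
    ["Romeo and Juliet", "William Shakespeare"],
    [pronoun_generator],
    lambda c: TOY_QUERY_POPULARITY.get(c, 0.0),
)
print(best)  # Where was William Shakespeare born?
```

Because the output is an ordinary query string, downstream answering or fulfillment modules need no knowledge of the conversation history.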

Recently, new signals generated by BERT and MUM models have significantly improved the performance of the ranker, fixing about one-third of the recall headroom while minimizing false positives on query sequences that are not contextual (and therefore do not require a rephrasing).

Example conversation on a phone where Assistant understands a sequence of contextual queries.

Conclusion
The solution described here attempts to resolve contextual queries by rephrasing them in order to make them fully answerable in a stand-alone manner, i.e., without having to relate to other information during the fulfillment phase. The benefit of this approach is that it is agnostic to the mechanisms that would fulfill the query, thus making it usable as a horizontal layer to be deployed before any further processing.

Given the variety of contexts naturally used in human languages, we adopted a hybrid approach that combines linguistic rules, large amounts of historic data through logs, and ML models based on state-of-the-art Transformer approaches. By generating a number of rephrasing candidates for each query and its context, and then scoring and ranking them using a variety of signals, Assistant can rephrase and thus correctly interpret most contextual queries. As Assistant can handle most types of linguistic references, we are empowering users to have more natural conversations. To make such multi-turn conversations even less cumbersome, Assistant users can turn on Continued Conversation mode to enable asking follow-up queries without the need to repeat “Hey Google” between each query. We are also using this technology in other virtual assistant settings, for instance, interpreting context from something shown on a screen or playing on a speaker.

Acknowledgements
This post reflects the combined work of Aliaksei Severyn, André Farias, Cheng-Chun Lee, Florian Thöle, Gabriel Carvajal, Gyorgy Gyepesi, Julien Cretin, Liana Marinescu, Martin Bölle, Patrick Siegler, Sebastian Krause, Victor Ähdel, Victoria Fossum, Vincent Zhao. We also thank Amar Subramanya, Dave Orr, Yury Pinsky for helpful discussions and support.


Challenges in Multi-objective Optimization for Automatic Wireless Network Planning

Economics, combinatorics, physics, and signal processing conspire to make it difficult to design, build, and operate high-quality, cost-effective wireless networks. The radio transceivers that communicate with our mobile phones, the equipment that supports them (such as power and wired networking), and the physical space they occupy are all expensive, so it’s important to be judicious in choosing sites for new transceivers. Even when the set of available sites is limited, there are exponentially many possible networks that can be built. For example, given only 50 sites, there are 2^50 (over a million billion) possibilities!
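Each candidate site is either built or not, so the number of possible networks is the number of subsets of the site list. A quick check of the 50-site figure:

```python
num_sites = 50
num_networks = 2 ** num_sites  # every site is independently in or out

print(f"{num_networks:,}")  # 1,125,899,906,842,624
```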

Further complicating things, for every location where service is needed, one must know which transceiver provides the strongest signal and how strong it is. However, the physical characteristics of radio propagation in an environment containing buildings, hills, foliage, and other clutter are incredibly complex, so accurate predictions require sophisticated, computationally-intensive models. Building all possible sites would yield the best coverage and capacity, but even if this were not prohibitively expensive, it would create unacceptable interference among nearby transceivers. Balancing these trade-offs is a core mathematical difficulty.

The goal of wireless network planning is to decide where to place new transceivers to maximize coverage and capacity while minimizing cost and interference. Building an automatic network planning system (a.k.a., auto-planner) that quickly solves national-scale problems at fine-grained resolution without compromising solution quality has been among the most important and difficult open challenges in telecom research for decades.

To address these issues, we are piloting network planning tools built using detailed geometric models derived from high-resolution geographic data that feed into radio propagation models powered by distributed computing. This system provides fast, high-accuracy predictions of signal strength. Our optimization algorithms then intelligently sift through the exponential space of possible networks to output a small menu of candidate networks that each achieve different desirable trade-offs among cost, coverage, and interference, while ensuring enough capacity to meet demand.
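One common way to express a menu of trade-offs is a Pareto filter: keep only the candidates that no other candidate beats on every objective. The sketch below applies that idea to a handful of made-up networks. It is an assumption for illustration, not the optimization algorithm described in this post, and it ignores the capacity constraint.

```python
def dominates(a, b):
    # a and b are (cost, coverage, interference) tuples. Lower cost and
    # interference are better; higher coverage is better.
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] < b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better

def pareto_menu(candidates):
    # Keep candidates that are not dominated by any other candidate.
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Hypothetical candidate networks: (cost, coverage, interference).
toy_networks = [
    (10, 0.80, 0.10),
    (12, 0.80, 0.10),  # same quality as the first, but more expensive
    (8, 0.60, 0.05),
    (15, 0.95, 0.20),
]
print(pareto_menu(toy_networks))
# [(10, 0.8, 0.1), (8, 0.6, 0.05), (15, 0.95, 0.2)]
```

A pairwise filter like this only works on a short list of already-generated candidates; it does not address the exponential search that produces them.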

Example auto-planning project in Charlotte, NC. Blue dots denote selected candidate sites. The heat map indicates signal strength from the propagation engine. The inset emphasizes the non-isotropic path loss in downtown.

Radio Propagation
The propagation of radio waves near Earth’s surface is complicated. Like ripples in a pond, they decay with distance traveled, but they can also penetrate, bounce off, or bend around obstacles, further weakening the signal. Computing radio wave attenuation across a real-world landscape (called path loss) is a hybrid process combining traditional physics-based calculations with learned corrections accounting for obstruction, diffraction, reflection, and scattering of the signal by clutter (e.g., trees and buildings).
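As a rough illustration of the hybrid idea, the sketch below adds a correction term to the standard free-space path loss formula. The function names and the scalar `clutter_correction_db` are assumptions made for illustration; the learned correction models described here are far richer and are not reproduced.

```python
import math

def free_space_path_loss_db(distance_km: float, frequency_mhz: float) -> float:
    """Free-space path loss in dB (distance in km, frequency in MHz)."""
    return 20 * math.log10(distance_km) + 20 * math.log10(frequency_mhz) + 32.44

def hybrid_path_loss_db(distance_km: float, frequency_mhz: float,
                        clutter_correction_db: float) -> float:
    """Physics baseline plus a correction (hypothetical stand-in for a learned model)."""
    return free_space_path_loss_db(distance_km, frequency_mhz) + clutter_correction_db

# Baseline loss for a 1 km link at 1000 MHz, then with an assumed 12 dB clutter penalty.
baseline = free_space_path_loss_db(1.0, 1000.0)
with_clutter = hybrid_path_loss_db(1.0, 1000.0, clutter_correction_db=12.0)
```

In a real system the correction would depend on the path profile through buildings and trees rather than being a single constant.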

We have developed a radio propagation modeling engine that leverages the same high-res geodata that powers Google Earth, Maps and Street View to map the 3D distribution of vegetation and buildings. While accounting for signal origin, frequency, broadcast strength, etc., we train signal correction models using extensive real-world measurements, which account for diverse propagation environments — from flat to hilly terrain and from dense urban to sparse rural areas.

While such hybrid approaches are common, using detailed geodata enables accurate path loss predictions below one-meter resolution. Our propagation engine provides fast point-to-point path loss calculations and scales massively via distributed computation. For instance, computing coverage for 25,000 transceivers scattered across the continental United States can be done at 4 meter resolution in only 1.5 hours, using 1000 CPU cores.

Photorealistic 3D model in Google Earth (top-left) and corresponding clutter height model (top-right). Path profile through buildings and trees from a source to destination in the clutter model (bottom). Gray denotes buildings and green denotes trees.

Auto-Planning Inputs
Once accurate coverage estimates are available, we can use them to optimize network planning, for example, deciding where to place hundreds of new sites to maximize network quality. The auto-planning solver addresses large-scale combinatorial optimization problems such as these, using a fast, robust, scalable approach.

Formally, an auto-planning input instance contains a set of demand points (usually a square grid) where service is to be provided, a set of candidate transceiver sites, and predicted signal strengths from candidate sites to demand points (supplied by the propagation model). Each demand point includes a demand quantity (e.g., estimated from the population of wireless users), and each site includes a cost and capacity. Signal strengths below some threshold are omitted. Finally, the input may include an overall cost budget.

Data Summarization for Large Instances
Auto-planning inputs can be huge, not just because of the number of candidate sites (tens of thousands) and demand points (billions), but also because each instance requires signal strengths to all demand points from all nearby candidate sites. Simple downsampling is insufficient because population density may vary widely over a given region. Therefore, we apply methods like priority sampling to shrink the data. This technique produces a low-variance, unbiased estimate of the original data, preserving an accurate view of the network traffic and interference statistics, and shrinking the input data enough that a city-size instance fits into memory on one machine.
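For readers unfamiliar with priority sampling, here is a minimal sketch of the textbook form of the technique applied to a list of positive demand weights. The function name and interface are assumptions for illustration, not the production summarization code.

```python
import random

def priority_sample(weights, k, rng=random):
    """Return {index: estimated_weight} for a priority sample of size k.

    Weights must be positive. Summing the estimates over any subset of
    indices gives an unbiased estimate of that subset's true total weight.
    """
    if k >= len(weights):
        return {i: float(w) for i, w in enumerate(weights)}
    # Priority = weight / u with u uniform in (0, 1]; heavy items tend to rank high.
    priorities = sorted(((w / (1.0 - rng.random()), i) for i, w in enumerate(weights)),
                        reverse=True)
    threshold = priorities[k][0]  # the (k+1)-th largest priority
    # Each kept item is credited with max(its weight, threshold).
    return {i: max(float(weights[i]), threshold) for _, i in priorities[:k]}

demand = [1, 2, 3, 4, 50]
sample = priority_sample(demand, k=3, rng=random.Random(0))
```

Because sparse and dense areas are kept in proportion to their weight, a dense downtown cell is much less likely to be dropped than it would be under uniform downsampling.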

Multi-objective Optimization via Local Search
Combinatorial optimization remains a difficult task, so we created a domain-specific local search algorithm to optimize network quality. The local search algorithmic paradigm is widely applied to address computationally-hard optimization problems. Such algorithms move from one solution to another through a search space of candidate solutions by applying small local changes, stopping at a time limit or when the solution is locally optimal. To evaluate the quality of a candidate network, we combine the different objective functions into a single one, as described in the following section.

The number of local steps to reach a local optimum, number of candidate moves we evaluate per step, and time to evaluate each candidate can all be large when dealing with realistic networks. To achieve a high-quality algorithm that finishes within hours (rather than days), we must address each of these components. Fast candidate evaluation benefits greatly from dynamic data structures that maintain the mapping between each demand point and the site in the candidate solution that provides the strongest signal to it. We update this “strongest-signal” map efficiently as the candidate solution evolves during local search. The following observations help limit both the number of steps to convergence and evaluations per step.
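One possible shape for such a dynamic "strongest-signal" structure is sketched below. The class name, the dictionary layout of `signal`, and the use of dBm values are assumptions for illustration only.

```python
from collections import defaultdict

class StrongestSignalMap:
    """Tracks, for each demand point, the selected site with the strongest signal."""

    def __init__(self, signal):
        # signal[site][demand_point] = predicted strength (weak links already omitted)
        self.signal = signal
        self.selected = set()
        self.best = {}                 # demand_point -> (strength, site)
        self.reach = defaultdict(set)  # demand_point -> sites that can reach it
        for site, links in signal.items():
            for point in links:
                self.reach[point].add(site)

    def add_site(self, site):
        """Select a site; only the demand points it reaches need updating."""
        self.selected.add(site)
        for point, strength in self.signal[site].items():
            if point not in self.best or strength > self.best[point][0]:
                self.best[point] = (strength, site)

    def remove_site(self, site):
        """Deselect a site; recompute only the points it was serving."""
        self.selected.discard(site)
        for point in self.signal[site]:
            if self.best.get(point, (None, None))[1] != site:
                continue  # this point is served by another site
            candidates = [(self.signal[s][point], s)
                          for s in self.reach[point] if s in self.selected]
            if candidates:
                self.best[point] = max(candidates)
            else:
                del self.best[point]  # the point loses service

signal = {"a": {"p1": -80, "p2": -95}, "b": {"p2": -85, "p3": -90}}
net = StrongestSignalMap(signal)
net.add_site("a")
net.add_site("b")
```

Each move touches only the demand points reachable from the changed site, which is what keeps candidate evaluation cheap.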

Bipartite graph representing candidate sites (left) and demand points (right). Selected sites are circled in red, and each demand point is assigned to its strongest available connection. The topmost demand point has no service because the only site that can reach it was not selected. The third and fourth demand points from the top may have high interference if the signal strengths attached to their gray edges are only slightly lower than the ones on their red edges. The bottommost site has high congestion because many demand points get their service from that site, possibly exceeding its capacity.

Selecting two nearby sites is usually not ideal because they interfere. Our algorithm explicitly forbids such pairs of sites, thereby steering the search toward better solutions while greatly reducing the number of moves considered per step. We identify pairs of forbidden sites based on the demand points they cover, as measured by the weighted Jaccard index. This captures their functional proximity much better than simple geographic distance does, especially in urban or hilly areas where radio propagation is highly non-isotropic.
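The weighted Jaccard index of two sites can be computed from the demand each one covers, as in this minimal sketch. The coverage maps, the helper names, and the 0.5 threshold are assumptions chosen for illustration.

```python
from itertools import combinations

def weighted_jaccard(cover_a, cover_b):
    """Weighted Jaccard index of two {demand_point: weight} maps, in [0, 1]."""
    points = set(cover_a) | set(cover_b)
    numerator = sum(min(cover_a.get(p, 0.0), cover_b.get(p, 0.0)) for p in points)
    denominator = sum(max(cover_a.get(p, 0.0), cover_b.get(p, 0.0)) for p in points)
    return numerator / denominator if denominator > 0 else 0.0

def forbidden_pairs(coverage, threshold=0.5):
    """Pairs of sites whose covered demand overlaps at least `threshold`."""
    return [(a, b) for a, b in combinations(sorted(coverage), 2)
            if weighted_jaccard(coverage[a], coverage[b]) >= threshold]

coverage = {
    "s1": {"p1": 1.0, "p2": 1.0},
    "s2": {"p1": 1.0, "p2": 1.0},
    "s3": {"p9": 1.0},
}
pairs = forbidden_pairs(coverage)
```

Two sites on opposite sides of a hill may be geographically close yet share almost no covered demand, so they would score near zero here and remain a permitted pair.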

Breaking the local search into epochs also helps. The first epoch mostly adds sites to increase the coverage area while avoiding forbidden pairs. As we approach the cost budget, we begin a second epoch that includes swap moves between forbidden pairs to fine-tune the interference. This restriction limits the number of candidate moves per step, while focusing on those that improve interference with less change to coverage.
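A heavily simplified sketch of a two-epoch search of this kind follows. The `score` callback, the greedy choice of the best add move, and the representation of forbidden pairs as frozensets are all assumptions for illustration; the actual move set and evaluation are more elaborate.

```python
def local_search(sites, cost, budget, forbidden, score):
    """Two-epoch local search; score(selected) is higher for better networks."""

    def spend(chosen):
        return sum(cost[s] for s in chosen)

    def conflicts(site, chosen):
        return any(frozenset((site, other)) in forbidden for other in chosen)

    selected = set()

    # Epoch 1: repeatedly add the site that improves the score most,
    # skipping sites that break the budget or form a forbidden pair.
    while True:
        moves = [selected | {s} for s in sites
                 if s not in selected
                 and spend(selected) + cost[s] <= budget
                 and not conflicts(s, selected)]
        best = max(moves, key=score, default=None)
        if best is None or score(best) <= score(selected):
            break
        selected = best

    # Epoch 2: try swapping a selected site for its forbidden partner.
    improved = True
    while improved:
        improved = False
        for pair in forbidden:
            a, b = tuple(pair)
            for old, new in ((a, b), (b, a)):
                if old not in selected or new in selected:
                    continue
                candidate = (selected - {old}) | {new}
                if (spend(candidate) <= budget
                        and not conflicts(new, candidate - {new})
                        and score(candidate) > score(selected)):
                    selected = candidate
                    improved = True
    return selected

# Toy instance: "a" looks best alone, but "b" works better together with "c".
table = {frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 1,
         frozenset("ac"): 5, frozenset("bc"): 9}
result = local_search(["a", "b", "c"], {"a": 1, "b": 1, "c": 1}, budget=2,
                      forbidden={frozenset(("a", "b"))},
                      score=lambda s: table.get(frozenset(s), 0))
```

In the toy instance the first epoch settles on sites a and c, and the swap epoch then exchanges a for its forbidden partner b.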

Three candidate local search moves. Red circles indicate selected sites and the orange edge indicates a forbidden pair.

Outputting a Diverse Set of Good Solutions
As mentioned before, auto-planning must balance three competing objectives: maximizing coverage, while minimizing interference and capacity violations, subject to a cost budget. There is no single correct tradeoff, so our algorithm delegates the final decision to the user by providing a small menu of candidate networks with different emphases. We apply a multiplier to each objective and optimize the sum. Raising the multiplier for a component guides the algorithm to emphasize it. We perform grid search over multipliers and budgets, generating a large number of solutions, filter out any that are worse than another solution along all four components (including cost), and finally select a small subset that represent different tradeoffs.
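The scalarization and filtering steps can be sketched as follows. The component names and the convention that every component is oriented so that lower is better are assumptions for illustration.

```python
def combined_objective(metrics, multipliers):
    """Weighted sum of objective components (each oriented so lower is better)."""
    return sum(multipliers[name] * metrics[name] for name in multipliers)

def drop_dominated(solutions):
    """Remove any solution that is worse than some other solution on every component."""
    def worse_everywhere(a, b):
        return all(a[key] > b[key] for key in a)
    return [a for a in solutions
            if not any(worse_everywhere(a, b) for b in solutions if b is not a)]

solutions = [
    {"uncovered": 1, "interference": 1, "overload": 1, "cost": 1},
    {"uncovered": 2, "interference": 2, "overload": 2, "cost": 2},  # dominated by the first
    {"uncovered": 0, "interference": 3, "overload": 1, "cost": 1},  # a different tradeoff
]
menu = drop_dominated(solutions)
```

Raising one multiplier in `combined_objective` and re-running the search yields a solution that leans toward that component, which is how a grid over multipliers produces a varied menu.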

Menu of candidate solutions, one per row, displaying metrics. Clicking on a solution turns the selected sites pink and displays a plot of the interference distribution across covered area and demand. Sites not selected are blue.

Conclusion
We described our efforts to address the most vexing challenges facing telecom network operators. Using combinatorial optimization in concert with geospatial and radio propagation modeling, we built a scalable auto-planner for wireless telecommunication networks. We are actively exploring how to expand these capabilities to best meet the needs of our customers. Stay tuned!

For questions and other inquiries, please reach out to wireless-network-interest@google.com.

Acknowledgements
These technological advances were enabled by the tireless work of our collaborators: Aaron Archer, Serge Barbosa Da Torre, Imad Fattouch, Danny Liberty, Pishoy Maksy, Zifei Tong, and Mat Varghese. Special thanks to Corinna Cortes, Mazin Gilbert, Rob Katcher, Michael Purdy, Bea Sebastian, Dave Vadasz, Josh Williams, and Aaron Yonas, along with Serge and especially Aaron Archer for their assistance with this blog post.


Language Models Perform Reasoning via Chain of Thought

In recent years, scaling up the size of language models has been shown to be a reliable way to improve performance on a range of natural language processing (NLP) tasks. Today’s language models at the scale of 100B or more parameters achieve strong performance on tasks like sentiment analysis and machine translation, even with little or no training examples. Even the largest language models, however, can still struggle with certain multi-step reasoning tasks, such as math word problems and commonsense reasoning. How might we enable language models to perform such reasoning tasks?

In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” we explore a prompting method for improving the reasoning abilities of language models. Called chain of thought prompting, this method enables models to decompose multi-step problems into intermediate steps. With chain of thought prompting, language models of sufficient scale (~100B parameters) can solve complex reasoning problems that are not solvable with standard prompting methods.

Comparison to Standard Prompting
With standard prompting (popularized by GPT-3) the model is given examples of input–output pairs (formatted as questions and answers) before being asked to predict the answer for a test-time example (shown below on the left). In chain of thought prompting (below, right), the model is prompted to produce intermediate reasoning steps before giving the final answer to a multi-step problem. The idea is that a model-generated chain of thought would mimic an intuitive thought process when working through a multi-step reasoning problem. While producing a thought process has been previously accomplished via fine-tuning, we show that such thought processes can be elicited by including a few examples of chain of thought via prompting only, which does not require a large training dataset or modifying the language model’s weights.
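To make the difference concrete, the sketch below assembles both kinds of few-shot prompt from the same exemplar. The exemplar text, the `Q:`/`A:` layout, and the function name are invented for illustration and are not taken from the paper's prompts.

```python
def build_prompt(exemplars, question, chain_of_thought=True):
    """Format few-shot exemplars followed by the test question."""
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            answer = f"{ex['reasoning']} The answer is {ex['answer']}."
        else:
            answer = f"The answer is {ex['answer']}."
        parts.append(f"Q: {ex['question']}\nA: {answer}")
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

# A made-up exemplar with a hand-written chain of thought.
exemplars = [{
    "question": "A shelf holds 3 boxes with 4 pens each. 2 pens are removed. "
                "How many pens remain?",
    "reasoning": "There are 3 * 4 = 12 pens. After removing 2, 12 - 2 = 10 remain.",
    "answer": "10",
}]
test_question = "A bus has 5 rows of 4 seats. 6 seats are taken. How many are free?"
cot_prompt = build_prompt(exemplars, test_question, chain_of_thought=True)
standard_prompt = build_prompt(exemplars, test_question, chain_of_thought=False)
```

The only change between the two prompts is the reasoning text inside the exemplar answers; no model weights are touched.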

Whereas standard prompting asks the model to directly give the answer to a multi-step reasoning problem, chain of thought prompting induces the model to decompose the problem into intermediate reasoning steps, in this case leading to a correct final answer.

Chain of thought reasoning allows models to decompose complex problems into intermediate steps that are solved individually. Moreover, the language-based nature of chain of thought makes it applicable to any task that a person could solve via language. We find through empirical experiments that chain of thought prompting can improve performance on various reasoning tasks, and that successful chain of thought reasoning is an emergent property of model scale — that is, the benefits of chain of thought prompting only materialize with a sufficient number of model parameters (around 100B).

Arithmetic Reasoning
One class of tasks where language models typically struggle is arithmetic reasoning (i.e., solving math word problems). Two benchmarks in arithmetic reasoning are MultiArith and GSM8K, which test the ability of language models to solve multi-step math problems similar to the one shown in the figure above. We evaluate both the LaMDA collection of language models ranging from 422M to 137B parameters, as well as the PaLM collection of language models ranging from 8B to 540B parameters. We manually compose chains of thought to include in the examples for chain of thought prompting.

For these two benchmarks, using standard prompting leads to relatively flat scaling curves: increasing the scale of the model does not substantially improve performance (shown below). However, we find that when using chain of thought prompting, increasing model scale leads to improved performance that substantially outperforms standard prompting for large model sizes.

Employing chain of thought prompting enables language models to solve arithmetic reasoning problems for which standard prompting has a mostly flat scaling curve.

On the GSM8K dataset of math word problems, PaLM shows remarkable performance when scaled to 540B parameters. As shown in the table below, combining chain of thought prompting with the 540B parameter PaLM model leads to new state-of-the-art performance of 58%, surpassing the prior state of the art of 55% achieved by fine-tuning GPT-3 175B on a large training set and then ranking potential solutions via a specially trained verifier. Moreover, follow-up work on self-consistency shows that the performance of chain of thought prompting can be improved further by taking the majority vote of a broad set of generated reasoning processes, which results in 74% accuracy on GSM8K.
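The majority-vote step of self-consistency can be sketched in a few lines. The assumption that each sampled completion ends with the phrase "The answer is N" is ours, made so the final answers can be parsed.

```python
import re
from collections import Counter

def extract_answer(completion):
    """Pull the number after 'The answer is'; return None if absent."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

def majority_answer(completions):
    """Most common final answer across several sampled reasoning paths."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

samples = [
    "There are 3 * 4 = 12 pens. 12 - 2 = 10. The answer is 10.",
    "3 + 4 = 7 pens, plus 5 more. The answer is 12.",
    "12 pens minus 2 leaves 10. The answer is 10.",
]
voted = majority_answer(samples)
```

Different reasoning paths may contain different mistakes, but correct paths tend to agree on the final answer, which is why voting helps.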

Chain of thought prompting with PaLM achieves a new state of the art on the GSM8K benchmark of math word problems. For a fair comparison against fine-tuned GPT-3 baselines, the chain of thought prompting results shown here also use an external calculator to compute basic arithmetic functions (i.e., addition, subtraction, multiplication and division).

Commonsense Reasoning
In addition to arithmetic reasoning, we consider whether the language-based nature of chain of thought prompting also makes it applicable to commonsense reasoning, which involves reasoning about physical and human interactions under the presumption of general background knowledge. For these evaluations, we use the CommonsenseQA and StrategyQA benchmarks, as well as two domain-specific tasks from BIG-Bench collaboration regarding date understanding and sports understanding. Example questions are below:

As shown below, for CommonsenseQA, StrategyQA, and Date Understanding, performance improved with model scale, and employing chain of thought prompting led to additional small improvements. Chain of thought prompting had the biggest improvement on sports understanding, for which PaLM 540B’s chain of thought performance surpassed that of an unaided sports enthusiast (95% vs 84%).

Chain of thought prompting also improves performance on various types of commonsense reasoning tasks.

Conclusions
Chain of thought prompting is a simple and broadly applicable method for improving the ability of language models to perform various reasoning tasks. Through experiments on arithmetic and commonsense reasoning, we find that successful chain of thought reasoning is an emergent property of model scale. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.

Acknowledgements
It was an honor and privilege to work with Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Quoc Le on this project.


Improving skin tone representation across Google

Seeing yourself reflected in the world around you — in real life, media or online — is so important. And we know that challenges with image-based technologies and representation on the web have historically left people of color feeling overlooked and misrepresented. Last year, we announced Real Tone for Pixel, which is just one example of our efforts to improve representation of diverse skin tones across Google products.

Today, we’re introducing a next step in our commitment to image equity and improving representation across our products. In partnership with Harvard professor and sociologist Dr. Ellis Monk, we’re releasing a new skin tone scale designed to be more inclusive of the spectrum of skin tones we see in our society. Dr. Monk has been studying how skin tone and colorism affect people’s lives for more than 10 years.

The culmination of Dr. Monk’s research is the Monk Skin Tone (MST) Scale, a 10-shade scale that will be incorporated into various Google products over the coming months. We’re openly releasing the scale so anyone can use it for research and product development. Our goal is for the scale to support inclusive products and research across the industry — we see this as a chance to share, learn and evolve our work with the help of others.

Ten circles in a row, ranging from dark to light.

The 10 shades of the Monk Skin Tone Scale.

This scale was designed to be easy to use for development and evaluation of technology while representing a broader range of skin tones. In fact, our research found that participants in the U.S. considered the Monk Skin Tone Scale more representative of their skin tones than the current tech industry standard. This was especially true for people with darker skin tones.

“In our research, we found that a lot of the time people feel they’re lumped into racial categories, but there’s all this heterogeneity with ethnic and racial categories,” Dr. Monk says. “And many methods of categorization, including past skin tone scales, don’t pay attention to this diversity. That’s where a lack of representation can happen…we need to fine-tune the way we measure things, so people feel represented.”

Using the Monk Skin Tone Scale to improve Google products

Updating our approach to skin tone can help us better understand representation in imagery, as well as evaluate whether a product or feature works well across a range of skin tones. This is especially important for computer vision, a type of AI that allows computers to see and understand images. When not built and tested intentionally to include a broad range of skin tones, computer vision systems have been found to not perform as well for people with darker skin.

The MST Scale will help us and the tech industry at large build more representative datasets so we can train and evaluate AI models for fairness, resulting in features and products that work better for everyone — of all skin tones. For example, we use the scale to evaluate and improve the models that detect faces in images.

Here are other ways you’ll see this show up in Google products.

Improving skin tone representation in Search

Every day, millions of people search the web expecting to find images that reflect their specific needs. That’s why we’re also introducing new features using the MST Scale to make it easier for people of all backgrounds to find more relevant and helpful results.

For example, now when you search for makeup related queries in Google Images, you’ll see an option to further refine your results by skin tone. So if you’re looking for “everyday eyeshadow” or “bridal makeup looks” you’ll more easily find results that work better for your needs.

Animated GIF showing a Google Images search for “bridal makeup looks.” The results include an option to filter by skin tone; the cursor selects a darker skin tone, which adjusts to results that are more relevant to this choice.

Seeing yourself represented in results can be key to finding information that’s truly relevant and useful, which is why we’re also rolling out improvements to show a greater range of skin tones in image results for broad searches about people, or ones where people show up in the results. In the future, we’ll incorporate the MST Scale to better detect and rank images to include a broader range of results, so everyone can find what they’re looking for.

Creating a more representative Search experience isn’t something we can do alone, though. How content is labeled online is a key factor in how our systems surface relevant results. In the coming months, we’ll also be developing a standardized way to label web content. Creators, brands and publishers will be able to use this new inclusive schema to label their content with attributes like skin tone, hair color and hair texture. This will make it possible for content creators or online businesses to label their imagery in a way that search engines and other platforms can easily understand.

A photograph of a Black person looking into the camera. Tags hover over various areas of the photo; one over their skin says “Skin tone” with a circle matching their skin tone. Two additional tags over their hair read “Hair color” and “Hair texture.”

Improving skin tone representation in Google Photos

We’ll also be using the MST Scale to improve Google Photos. Last year, we introduced an improvement to our auto enhance feature in partnership with professional image makers. Now we’re launching a new set of Real Tone filters that are designed to work well across skin tones and evaluated using the MST Scale. We worked with a diverse range of renowned image makers, like Kennedi Carter and Joshua Kissi, who are celebrated for beautiful and accurate depictions of their subjects, to evaluate, test and build these filters. These new Real Tone filters allow you to choose from a wider assortment of looks and find one that reflects your style. Real Tone filters will be rolling out on Google Photos across Android, iOS and Web in the coming weeks.

Animated video showing before and after photos of images with the Real Tone Filter.

What’s next?

We’re openly releasing the Monk Skin Tone Scale so that others can use it in their own products and learn from this work, and so that we can partner with and learn from them. We want to get feedback, drive more interdisciplinary research, and make progress together. We encourage you to share your thoughts here. We’re continuing to collaborate with Dr. Monk to evaluate the MST Scale across different regions and product applications, and we’ll iterate and improve on it to make sure the scale works for people and use cases all over the world. And we’ll continue our efforts to make Google’s products work even better for every user.

The best part of working on this project is that it isn’t just ours — while we’re committed to making Google products better and more inclusive, we’re also excited about all the possibilities that exist as we work together to build for everyone across the web.


Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate

Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT has soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people worldwide, they only include around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas.

There are two key bottlenecks towards building functioning translation models for the long tail of languages. The first arises from data scarcity; digitized data for many languages is limited and can be difficult to find on the web due to quality issues with Language Identification (LangID) models. The second challenge arises from modeling limitations. MT models usually train on large amounts of parallel (translated) text, but without such data, models must learn to translate from limited amounts of monolingual text, which is a novel area of research. Both of these challenges need to be addressed for translation models to reach sufficient quality.

In “Building Machine Translation Systems for the Next Thousand Languages”, we describe how to build high-quality monolingual datasets for over a thousand languages that do not have translation datasets available and demonstrate how one can use monolingual data alone to train MT models. As part of this effort, we are expanding Google Translate to include 24 under-resourced languages. For these languages, we created monolingual datasets by developing and using specialized neural language identification models combined with novel filtering approaches. The techniques we introduce supplement massively multilingual models with a self-supervised task to enable zero-resource translation. Finally, we highlight how native speakers have helped us realize this accomplishment.

Meet the Data
Automatically gathering usable textual data for under-resourced languages is much more difficult than it may seem. Tasks like LangID, which work well for high-resource languages, are unsuccessful for under-resourced languages, and many publicly available datasets crawled from the web often contain more noise than usable data for the languages they attempt to support. In our early attempts to identify under-resourced languages on the web by training a standard Compact Language Detector v3 (CLD3) LangID model, we too found that the dataset was too noisy to be usable.

As an alternative, we trained a Transformer-based, semi-supervised LangID model on over 1000 languages. This model supplements the LangID task with the MAsked Sequence-to-Sequence (MASS) task to better generalize over noisy web data. MASS simply garbles the input by randomly removing sequences of tokens from it, and trains the model to predict these sequences. We applied the Transformer-based model to a dataset that had been filtered with a CLD3 model and trained to recognize clusters of similar languages.
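The garbling step of MASS can be illustrated with a small sketch that hides one contiguous span of tokens and keeps it as the prediction target. Using a `[MASK]` placeholder, a single span, and whitespace tokens are simplifying assumptions made here.

```python
import random

def mass_example(tokens, span_len, rng=random, mask="[MASK]"):
    """Mask one contiguous span; return (garbled_input, target_span)."""
    start = rng.randrange(0, len(tokens) - span_len + 1)
    target = tokens[start:start + span_len]
    garbled = tokens[:start] + [mask] * span_len + tokens[start + span_len:]
    return garbled, target

tokens = "the cat sat on the mat".split()
garbled, target = mass_example(tokens, span_len=2, rng=random.Random(0))
```

The model sees `garbled` as input and is trained to produce `target`, so it has to use the surrounding context to recover the missing words.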

We then applied the open sourced Term Frequency-Inverse Internet Frequency (TF-IIF) filtering to the resulting dataset to find and discard sentences that were actually in related high-resource languages, and developed a variety of language-specific filters to eliminate specific pathologies. The result of this effort was a dataset with monolingual text in over 1000 languages, of which 400 had over 100,000 sentences. We performed human evaluations on samples of 68 of these languages and found that the majority (>70%) reflected high-quality, in-language content.

The amount of monolingual data per language versus the amount of parallel (translated) data per language. A small number of languages have large amounts of parallel data, but there is a long tail of languages with only monolingual data.

Meet the Models
Once we had a dataset of monolingual text in over 1000 languages, we then developed a simple yet practical approach for zero-resource translation, i.e., translation for languages with no in-language parallel text and no language-specific translation examples. Rather than limiting our model to an artificial scenario with only monolingual text, we also include all available parallel text data with millions of examples for higher resource languages to enable the model to learn the translation task. Simultaneously, we train the model to learn representations of under-resourced languages directly from monolingual text using the MASS task. In order to solve this task, the model is forced to develop a sophisticated representation of the language in question, developing a complex understanding of how words relate to other words in a sentence.

Relying on the benefits of transfer learning in massively multilingual models, we train a single giant translation model on all available data for over 1000 languages. The model trains on monolingual text for all 1138 languages and on parallel text for a subset of 112 of the higher-resourced languages.

At training time, any input the model sees has a special token indicating which language the output should be in, exactly like the standard formulation for multilingual translation. Our additional innovation is to use the same special tokens for both the monolingual MASS task and the translation task. Therefore, the token translate_to_french may indicate that the source is in English and needs to be translated to French (the translation task), or it may mean that the source is in garbled French and needs to be translated to fluent French (the MASS task). By using the same tags for both tasks, a translate_to_french tag takes on the meaning, “Produce a fluent output in French that is semantically close to the input, regardless of whether the input is garbled in the same language or in another language entirely.” From the model’s perspective, there is not much difference between the two.
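A minimal sketch of the shared-tag idea: one helper builds training examples for both tasks, so the tag alone never reveals which task produced the example. Prepending the tag as plain text, the dictionary layout, and the example sentences are assumptions for illustration.

```python
def tagged_example(target_lang, source_text, target_text):
    """Build one training example; the tag names only the output language."""
    return {"input": f"translate_to_{target_lang} {source_text}",
            "output": target_text}

# Translation task: English source, French target.
translation = tagged_example("french", "The cat sleeps.", "Le chat dort.")

# MASS task: garbled French source, fluent French target, same tag.
denoising = tagged_example("french", "Le [MASK] dort.", "Le chat dort.")
```

For a language with only monolingual text, every example is of the second kind, yet the model can still be asked to translate into it with the same tag at inference time.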

Surprisingly, this simple procedure produces high quality zero-shot translations. The BLEU and ChrF scores for the resulting model are in the 10–40 and 20–60 ranges respectively, indicating mid- to high-quality translation. We observed meaningful translations even for highly inflected languages like Quechua and Kalaallisut, despite these languages being linguistically dissimilar to all other languages in the model. However, we only computed these metrics on the small subset of languages with human-translated evaluation sets. In order to understand the quality of translation for the remaining languages, we developed an evaluation metric based on round-trip translation, which allowed us to see that several hundred languages are reaching high translation quality.

To further improve quality, we use the model to generate large amounts of synthetic parallel data, filter the data based on round-trip translation (comparing a sentence translated into another language and back again), and continue training the model on this filtered synthetic data via back-translation and self-training. Finally, we fine-tune the model on a smaller subset of 30 languages and distill it into a model small enough to be served.
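The round-trip filter can be sketched as below. The character n-gram F1 score is a simplified stand-in for ChrF, and the `to_other` and `back` callables, the threshold, and the toy translators in the usage lines are hypothetical placeholders for real translation models.

```python
from collections import Counter

def char_ngram_f1(a, b, n=3):
    """F1 overlap of character n-grams; a simplified stand-in for ChrF."""
    grams_a = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    grams_b = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    overlap = sum((grams_a & grams_b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(grams_b.values())
    recall = overlap / sum(grams_a.values())
    return 2 * precision * recall / (precision + recall)

def round_trip_filter(sentences, to_other, back, min_score=0.6):
    """Keep (source, translation) pairs whose round trip stays close to the source."""
    kept = []
    for src in sentences:
        hyp = to_other(src)
        if char_ngram_f1(src, back(hyp)) >= min_score:
            kept.append((src, hyp))
    return kept

# Toy "translators": upper-casing stands in for translation, and one
# sentence is deliberately corrupted on the way back.
to_other = str.upper
back = lambda s: "zzzzzz" if s == "BAD SENTENCE" else s.lower()
pairs = round_trip_filter(["good sentence", "bad sentence"], to_other, back)
```

Only pairs that survive the filter would be added to the synthetic parallel data used for further training.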

Translation accuracy scores for 638 of the languages supported in our model, using the metric we developed (RTTLangIDChrF), for both the higher-resource supervised languages and the low-resource zero-resource languages.

Contributions from Native Speakers
Regular communication with native speakers of these languages was critical for our research. We collaborated with over 100 people at Google and other institutions who spoke these languages. Some volunteers helped develop specialized filters to remove out-of-language content overlooked by automatic methods, for instance Hindi mixed with Sanskrit. Others helped with transliterating between different scripts used by the languages, for instance between Meetei Mayek and Bengali, for which sufficient tools didn’t exist; and yet others helped with a gamut of tasks related to evaluation. Native speakers were also key for advising in matters of political sensitivity, like the appropriate name for the language, and the appropriate writing system to use for it. And only native speakers could answer the ultimate question: given the current quality of translation, would it be valuable to the community for Google Translate to support this language?

Closing Notes
This advance is an exciting first step toward supporting more language technologies in under-resourced languages. Most importantly, we want to stress that the quality of translations produced by these models still lags far behind that of the higher-resource languages supported by Google Translate. These models are certainly a useful first tool for understanding content in under-resourced languages, but they will make mistakes and exhibit their own biases. As with any ML-driven tool, one should consider the output carefully.

The complete list of new languages added to Google Translate in this update:

Acknowledgements
We would like to thank Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, and Macduff Hughes for their contributions to the research, engineering, and leadership of this project.

We would also like to extend our deepest gratitude to the following native speakers and members of affected communities, who helped us in a wide variety of ways: Yasser Salah Eddine Bouchareb (Algerian Arabic); Mfoniso Ukwak (Anaang); Bhaskar Borthakur, Kishor Barman, Rasika Saikia, Suraj Bharech (Assamese); Ruben Hilare Quispe (Aymara); Devina Suyanto (Balinese); Allahserix Auguste Tapo, Bakary Diarrassouba, Maimouna Siby (Bambara); Mohammad Jahangir (Baluchi); Subhajit Naskar (Bengali); Animesh Pathak, Ankur Bapna, Anup Mohan, Chaitanya Joshi, Chandan Dubey, Kapil Kumar, Manish Katiyar, Mayank Srivastava, Neeharika, Saumya Pathak, Tanya Sinha, Vikas Singh (Bhojpuri); Bowen Liang, Ellie Chio, Eric Dong, Frank Tang, Jeff Pitman, John Wong, Kenneth Chang, Manish Goregaokar, Mingfei Lau, Ryan Li, Yiwen Luo (Cantonese); Monang Setyawan (Caribbean Javanese); Craig Cornelius (Cherokee); Anton Prokopyev (Chuvash); Rajat Dogra, Sid Dogra (Dogri); Mohamed Kamagate (Dyula); Chris Assigbe, Dan Ameme, Emeafa Doe, Irene Nyavor, Thierry Gnanih, Yvonne Dumor (Ewe); Abdoulaye Barry, Adama Diallo, Fauzia van der Leeuw, Ibrahima Barry (Fulfulde); Isabel Papadimitriou (Greek); Alex Rudnick (Guarani); Mohammad Khdeir (Gulf Arabic); Paul Remollata (Hiligaynon); Ankur Bapna (Hindi); Mfoniso Ukwak (Ibibio); Nze Lawson (Igbo); D.J. Abuy, Miami Cabansay (Ilocano); Archana Koul, Shashwat Razdan, Sujeet Akula (Kashmiri); Jatin Kulkarni, Salil Rajadhyaksha, Sanjeet Hegde Desai, Sharayu Shenoy, Shashank Shanbhag, Shashi Shenoy (Konkani); Ryan Michael, Terrence Taylor (Krio); Bokan Jaff, Medya Ghazizadeh, Roshna Omer Abdulrahman, Saman Vaisipour, Sarchia Khursheed (Kurdish (Sorani)); Suphian Tweel (Libyan Arabic); Doudou Kisabaka (Lingala); Colleen Mallahan, John Quinn (Luganda); Cynthia Mboli (Luyia); Abhishek Kumar, Neeraj Mishra, Priyaranjan Jha, Saket Kumar, Snehal Bhilare (Maithili); Lisa Wang (Mandarin Chinese); Cibu Johny (Malayalam); Viresh Ratnakar (Marathi); Abhi Sanoujam, Gautam Thockchom, Pritam Pebam, Sam Chaomai, Shangkar Mayanglambam, Thangjam Hindustani Devi (Meiteilon (Manipuri)); Hala Ajil (Mesopotamian Arabic); Hamdanil Rasyid (Minangkabau); Elizabeth John, Remi Ralte, S Lallienkawl Gangte, Vaiphei Thatsing, Vanlalzami Vanlalzami (Mizo); George Ouais (MSA); Ahmed Kachkach, Hanaa El Azizi (Moroccan Arabic); Ujjwal Rajbhandari (Newari); Ebuka Ufere, Gabriel Fynecontry, Onome Ofoman, Titi Akinsanmi (Nigerian Pidgin); Marwa Khost Jarkas (North Levantine Arabic); Abduselam Shaltu, Ace Patterson, Adel Kassem, Mo Ali, Yonas Hambissa (Oromo); Helvia Taina, Marisol Necochea (Quechua); AbdelKarim Mardini (Saidi Arabic); Ishank Saxena, Manasa Harish, Manish Godara, Mayank Agrawal, Nitin Kashyap, Ranjani Padmanabhan, Ruchi Lohani, Shilpa Jindal, Shreevatsa Rajagopalan, Vaibhav Agarwal, Vinod Krishnan (Sanskrit); Nabil Shahid (Saraiki); Ayanda Mnyakeni (Sesotho, Sepedi); Landis Baker (Seychellois Creole); Taps Matangira (Shona); Ashraf Elsharif (Sudanese Arabic); Sakhile Dlamini (Swati); Hakim Sidahmed (Tamazight); Melvin Johnson (Tamil); Sneha Kudugunta (Telugu); Alexander Tekle, Bserat Ghebremicael, Nami Russom, Naud Ghebre (Tigrinya); Abigail Annkah, Diana Akron, Maame Ofori, Monica Opoku-Geren, Seth Duodu-baah, Yvonne Dumor (Twi); Ousmane Loum (Wolof); and Daniel Virtheim (Yiddish).


Google I/O 2022: Advancing knowledge and computing

[TL;DR]

Nearly 24 years ago, Google started with two graduate students, one product, and a big mission: to organize the world’s information and make it universally accessible and useful. In the decades since, we’ve been developing our technology to deliver on that mission.

The progress we’ve made is because of our years of investment in advanced technologies, from AI to the technical infrastructure that powers it all. And once a year — on my favorite day of the year 🙂 — we share an update on how it’s going at Google I/O.

Today, I talked about how we’re advancing two fundamental aspects of our mission — knowledge and computing — to create products that are built to help. It’s exciting to build these products; it’s even more exciting to see what people do with them.

Thank you to everyone who helps us do this work, and most especially our Googlers. We are grateful for the opportunity.

– Sundar


Editor’s note: Below is an edited transcript of Sundar Pichai’s keynote address during the opening of today’s Google I/O Developers Conference.

Hi, everyone, and welcome. Actually, let’s make that welcome back! It’s great to return to Shoreline Amphitheatre after three years away. To the thousands of developers, partners and Googlers here with us, it’s great to see all of you. And to the millions more joining us around the world — we’re so happy you’re here, too.

Last year, we shared how new breakthroughs in some of the most technically challenging areas of computer science are making Google products more helpful in the moments that matter. All this work is in service of our timeless mission: to organize the world’s information and make it universally accessible and useful.

I’m excited to show you how we’re driving that mission forward in two key ways: by deepening our understanding of information so that we can turn it into knowledge; and advancing the state of computing, so that knowledge is easier to access, no matter who or where you are.

Today, you’ll see how progress on these two parts of our mission ensures Google products are built to help. I’ll start with a few quick examples. Throughout the pandemic, Google has focused on delivering accurate information to help people stay healthy. Over the last year, people used Google Search and Maps to find where they could get a COVID vaccine nearly two billion times.

A visualization of Google’s flood forecasting system, with three 3D maps stacked on top of one another, showing landscapes and weather patterns in green and brown colors. The maps are floating against a gray background.

Google’s flood forecasting technology sent flood alerts to 23 million people in India and Bangladesh last year.

We’ve also expanded our flood forecasting technology to help people stay safe in the face of natural disasters. During last year’s monsoon season, our flood alerts notified more than 23 million people in India and Bangladesh. And we estimate this supported the timely evacuation of hundreds of thousands of people.

In Ukraine, we worked with the government to rapidly deploy air raid alerts. To date, we’ve delivered hundreds of millions of alerts to help people get to safety. In March I was in Poland, where millions of Ukrainians have sought refuge. Warsaw’s population has increased by nearly 20% as families host refugees in their homes, and schools welcome thousands of new students. Nearly every Google employee I spoke with there was hosting someone.

Adding 24 more languages to Google Translate

In countries around the world, Google Translate has been a crucial tool for newcomers and residents trying to communicate with one another. We’re proud of how it’s helping Ukrainians find a bit of hope and connection until they are able to return home again.

Two boxes, one showing a question in English — “What’s the weather like today?” — the other showing its translation in Quechua. There is a microphone symbol below the English question and a loudspeaker symbol below the Quechua answer.

With machine learning advances, we’re able to add languages like Quechua to Google Translate.

Real-time translation is a testament to how knowledge and computing come together to make people’s lives better. More people are using Google Translate than ever before, but we still have work to do to make it universally accessible. There’s a long tail of languages that are underrepresented on the web today, and translating them is a hard technical problem. That’s because translation models are usually trained with bilingual text — for example, the same phrase in both English and Spanish. However, there’s not enough publicly available bilingual text for every language.

So with advances in machine learning, we’ve developed a monolingual approach where the model learns to translate a new language without ever seeing a direct translation of it. By collaborating with native speakers and institutions, we found these translations were of sufficient quality to be useful, and we’ll continue to improve them.

A list of the 24 new languages Google Translate now has available.

We’re adding 24 new languages to Google Translate.

Today, I’m excited to announce that we’re adding 24 new languages to Google Translate, including the first indigenous languages of the Americas. Together, these languages are spoken by more than 300 million people. Breakthroughs like this are powering a radical shift in how we access knowledge and use computers.

Taking Google Maps to the next level

So much of what’s knowable about our world goes beyond language — it’s in the physical and geospatial information all around us. For more than 15 years, Google Maps has worked to create rich and useful representations of this information to help us navigate. Advances in AI are taking this work to the next level, whether it’s expanding our coverage to remote areas, or reimagining how to explore the world in more intuitive ways.

An overhead image of a map of a dense urban area, showing gray roads cutting through clusters of buildings outlined in blue.

Advances in AI are helping to map remote and rural areas.

Around the world, we’ve mapped around 1.6 billion buildings and over 60 million kilometers of roads to date. Some remote and rural areas have previously been difficult to map, due to scarcity of high-quality imagery and distinct building types and terrain. To address this, we’re using computer vision and neural networks to detect buildings at scale from satellite images. As a result, we have increased the number of buildings on Google Maps in Africa by 5X since July 2020, from 60 million to nearly 300 million.

We’ve also doubled the number of buildings mapped in India and Indonesia this year. Globally, over 20% of the buildings on Google Maps have been detected using these new techniques. We’ve gone a step further, and made the dataset of buildings in Africa publicly available. International organizations like the United Nations and the World Bank are already using it to better understand population density, and to provide support and emergency assistance.

Immersive view in Google Maps fuses together aerial and street level images.

We’re also bringing new capabilities into Maps. Using advances in 3D mapping and machine learning, we’re fusing billions of aerial and street level images to create a new, high-fidelity representation of a place. These breakthrough technologies are coming together to power a new experience in Maps called immersive view: it allows you to explore a place like never before.

Let’s go to London and take a look. Say you’re planning to visit Westminster with your family. You can get into this immersive view straight from Maps on your phone, and you can pan around the sights… here’s Westminster Abbey. If you’re thinking of heading to Big Ben, you can check if there’s traffic, how busy it is, and even see the weather forecast. And if you’re looking to grab a bite during your visit, you can check out restaurants nearby and get a glimpse inside.

What’s amazing is that isn’t a drone flying in the restaurant — we use neural rendering to create the experience from images alone. And Google Cloud Immersive Stream allows this experience to run on just about any smartphone. This feature will start rolling out in Google Maps for select cities globally later this year.

Another big improvement to Maps is eco-friendly routing. Launched last year, it shows you the most fuel-efficient route, giving you the choice to save money on gas and reduce carbon emissions. Eco-friendly routes have already rolled out in the U.S. and Canada — and people have used them to travel approximately 86 billion miles, helping save an estimated half million metric tons of carbon emissions, the equivalent of taking 100,000 cars off the road.
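As a rough sanity check on those figures, here is a minimal sketch of the arithmetic they imply. It uses only the three numbers quoted above; the variable names are illustrative, and the time period behind the "cars off the road" comparison is not stated in the talk, so the per-car figure is only what the equivalence implies.

```python
# Figures quoted in the keynote (all approximate).
miles_travelled = 86e9      # miles driven on eco-friendly routes
co2_saved_tonnes = 0.5e6    # estimated metric tons of carbon emissions saved
cars_equivalent = 100_000   # "cars off the road" comparison

GRAMS_PER_TONNE = 1_000_000

# Average saving per mile implied by the totals.
grams_saved_per_mile = co2_saved_tonnes * GRAMS_PER_TONNE / miles_travelled

# Emissions per car implied by the equivalence (period not stated in the talk).
tonnes_per_car = co2_saved_tonnes / cars_equivalent

print(f"Implied saving: {grams_saved_per_mile:.1f} g per mile")
print(f"Implied emissions per car: {tonnes_per_car:.1f} metric tons")
```

The per-mile saving is small, which fits the point that these routing decisions matter mainly at scale.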

Still image of eco-friendly routing on Google Maps — a 53-minute driving route in Berlin is pictured, with text below the map showing it will add three minutes but save 18% more fuel.

Eco-friendly routes will expand to Europe later this year.

I’m happy to share that we’re expanding this feature to more places, including Europe later this year. In this Berlin example, you could reduce your fuel consumption by 18% taking a route that’s just three minutes slower. These small decisions have a big impact at scale. With the expansion into Europe and beyond, we estimate carbon emission savings will double by the end of the year.

And we’ve added a similar feature to Google Flights. When you search for flights between two cities, we also show you carbon emission estimates alongside other information like price and schedule, making it easy to choose a greener option. These eco-friendly features in Maps and Flights are part of our goal to empower 1 billion people to make more sustainable choices through our products, and we’re excited about the progress here.

New YouTube features to help people easily access video content

Beyond Maps, video is becoming an even more fundamental part of how we share information, communicate, and learn. Often when you come to YouTube, you are looking for a specific moment in a video and we want to help you get there faster.

Last year we launched auto-generated chapters to make it easier to jump to the part you’re most interested in.

This is also great for creators because it saves them time making chapters. We’re now applying multimodal technology from DeepMind. It simultaneously uses text, audio and video to auto-generate chapters with greater accuracy and speed. With this, we now have a goal to 10X the number of videos with auto-generated chapters, from eight million today, to 80 million over the next year.

Often the fastest way to get a sense of a video’s content is to read its transcript, so we’re also using speech recognition models to transcribe videos. Video transcripts are now available to all Android and iOS users.

Animation showing a video being automatically translated. Then text reads "Now available in sixteen languages."

Auto-translated captions on YouTube.

Next up, we’re bringing auto-translated captions on YouTube to mobile. Which means viewers can now auto-translate video captions in 16 languages, and creators can grow their global audience. We’ll also be expanding auto-translated captions to Ukrainian YouTube content next month, part of our larger effort to increase access to accurate information about the war.

Helping people be more efficient with Google Workspace

Just as we’re using AI to improve features in YouTube, we’re building it into our Workspace products to help people be more efficient. Whether you work for a small business or a large institution, chances are you spend a lot of time reading documents. Maybe you’ve felt that wave of panic when you realize you have a 25-page document to read ahead of a meeting that starts in five minutes.

At Google, whenever I get a long document or email, I look for a TL;DR at the top — TL;DR is short for “Too Long, Didn’t Read.” And it got us thinking, wouldn’t life be better if more things had a TL;DR?

That’s why we’ve introduced automated summarization for Google Docs. Using one of our machine learning models for text summarization, Google Docs will automatically parse the words and pull out the main points.

This marks a big leap forward for natural language processing. Summarization requires understanding of long passages, information compression and language generation, which used to be outside of the capabilities of even the best machine learning models.

And docs are only the beginning. We’re launching summarization for other products in Workspace. It will come to Google Chat in the next few months, providing a helpful digest of chat conversations, so you can jump right into a group chat or look back at the key highlights.

Animation showing summary in Google Chat

We’re bringing summarization to Google Chat in the coming months.

And we’re working to bring transcription and summarization to Google Meet as well so you can catch up on some important meetings you missed.

Visual improvements on Google Meet

Of course there are many moments where you really want to be in a virtual room with someone. And that’s why we continue to improve audio and video quality, inspired by Project Starline. We introduced Project Starline at I/O last year. And we’ve been testing it across Google offices to get feedback and improve the technology for the future. And in the process, we’ve learned some things that we can apply right now to Google Meet.

Starline inspired machine learning-powered image processing to automatically improve your image quality in Google Meet. And it works on all types of devices so you look your best wherever you are.

An animation of a man looking directly at the camera then waving and smiling. A white line sweeps across the screen, adjusting the image quality to make it brighter and clearer.

Machine learning-powered image processing automatically improves image quality in Google Meet.

We’re also bringing studio quality virtual lighting to Meet. You can adjust the light position and brightness, so you’ll still be visible in a dark room or sitting in front of a window. We’re testing this feature to ensure everyone looks like their true selves, continuing the work we’ve done with Real Tone on Pixel phones and the Monk Scale.

These are just some of the ways AI is improving our products: making them more helpful, more accessible, and delivering innovative new features for everyone.

Gif shows a phone camera pointed towards a rack of shelves, generating helpful information about food items. Text on the screen shows the words ‘dark’, ‘nut-free’ and ‘highly-rated’.

Today at I/O Prabhakar Raghavan shared how we’re helping people find helpful information in more intuitive ways on Search.

Making knowledge accessible through computing

We’ve talked about how we’re advancing access to knowledge as part of our mission: from better language translation to improved Search experiences across images and video, to richer explorations of the world using Maps.

Now we’re going to focus on how we make that knowledge even more accessible through computing. The journey we’ve been on with computing is an exciting one. Every shift, from desktop to the web to mobile to wearables and ambient computing has made knowledge more useful in our daily lives.

As helpful as our devices are, we’ve had to work pretty hard to adapt to them. I’ve always thought computers should be adapting to people, not the other way around. We continue to push ourselves to make progress here.

Here’s how we’re making computing more natural and intuitive with the Google Assistant.

Introducing LaMDA 2 and AI Test Kitchen

Animation shows demos of how LaMDA can converse on any topic and how AI Test Kitchen can help create lists.

A demo of LaMDA, our generative language model for dialogue application, and the AI Test Kitchen.

We’re continually working to advance our conversational capabilities. Conversation and natural language processing are powerful ways to make computers more accessible to everyone. And large language models are key to this.

Last year, we introduced LaMDA, our generative language model for dialogue applications that can converse on any topic. Today, we are excited to announce LaMDA 2, our most advanced conversational AI yet.

We are at the beginning of a journey to make models like these useful to people, and we feel a deep responsibility to get it right. To make progress, we need people to experience the technology and provide feedback. We opened LaMDA up to thousands of Googlers, who enjoyed testing it and seeing its capabilities. This yielded significant quality improvements, and led to a reduction in inaccurate or offensive responses.

That’s why we’ve made AI Test Kitchen. It’s a new way to explore AI features with a broader audience. Inside the AI Test Kitchen, there are a few different experiences. Each is meant to give you a sense of what it might be like to have LaMDA in your hands and use it for things you care about.

The first is called “Imagine it.” This demo tests if the model can take a creative idea you give it, and generate imaginative and relevant descriptions. These are not products, they are quick sketches that allow us to explore what LaMDA can do with you. The user interfaces are very simple.

Say you’re writing a story and need some inspirational ideas. Maybe one of your characters is exploring the deep ocean. You can ask what that might feel like. Here LaMDA describes a scene in the Mariana Trench. It even generates follow-up questions on the fly. You can ask LaMDA to imagine what kinds of creatures might live there. Remember, we didn’t hand-program the model for specific topics like submarines or bioluminescence. It synthesized these concepts from its training data. That’s why you can ask about almost any topic: Saturn’s rings or even being on a planet made of ice cream.

Staying on topic is a challenge for language models. Say you’re building a learning experience — you want it to be open-ended enough to allow people to explore where curiosity takes them, but stay safely on topic. Our second demo tests how LaMDA does with that.

In this demo, we’ve primed the model to focus on the topic of dogs. It starts by generating a question to spark conversation, “Have you ever wondered why dogs love to play fetch so much?” And if you ask a follow-up question, you get an answer with some relevant details: it’s interesting, it thinks it might have something to do with the sense of smell and treasure hunting.

You can take the conversation anywhere you want. Maybe you’re curious about how smell works and you want to dive deeper. You’ll get a unique response for that too. No matter what you ask, it will try to keep the conversation on the topic of dogs. If I start asking about cricket, which I probably would, the model brings the topic back to dogs in a fun way.

This challenge of staying on-topic is a tricky one, and it’s an important area of research for building useful applications with language models.

These experiences show the potential of language models to one day help us with things like planning, learning about the world, and more.

Of course, there are significant challenges to solve before these models can truly be useful. While we have improved safety, the model might still generate inaccurate, inappropriate, or offensive responses. That’s why we are inviting feedback in the app, so people can help report problems.

We will be doing all of this work in accordance with our AI Principles. Our process will be iterative, opening up access over the coming months, and carefully assessing feedback with a broad range of stakeholders — from AI researchers and social scientists to human rights experts. We’ll incorporate this feedback into future versions of LaMDA, and share our findings as we go.

Over time, we intend to continue adding other emerging areas of AI into AI Test Kitchen. You can learn more at: g.co/AITestKitchen.

Advancing AI language models

LaMDA 2 has incredible conversational capabilities. To explore other aspects of natural language processing and AI, we recently announced a new model. It’s called Pathways Language Model, or PaLM for short. It’s our largest model to date, with 540 billion parameters.

PaLM demonstrates breakthrough performance on many natural language processing tasks, such as generating code from text, answering a math word problem, or even explaining a joke.

It achieves this through greater scale. And when we combine that scale with a new technique called chain-of-thought prompting, the results are promising. Chain-of-thought prompting allows us to describe multi-step problems as a series of intermediate steps.

Let’s take an example of a math word problem that requires reasoning. Normally, how you use a model is you prompt it with a question and answer, and then you start asking questions. In this case: How many hours are in the month of May? So you can see, the model didn’t quite get it right.

In chain-of-thought prompting, we give the model a question-answer pair, but this time with an explanation of how the answer was derived. It’s kind of like when your teacher gives you a step-by-step example to help you understand how to solve a problem. Now, if we ask the model again how many hours are in the month of May, or other related questions, it actually answers correctly and even shows its work.

There are two boxes below a heading saying ‘chain-of-thought prompting’. A box headed ‘input’ guides the model through answering a question about how many tennis balls a person called Roger has. The output box shows the model correctly reasoning through and answering a separate question (‘how many hours are in the month of May?’)

Chain-of-thought prompting leads to better reasoning and more accurate answers.

Chain-of-thought prompting increases accuracy by a large margin. This leads to state-of-the-art performance across several reasoning benchmarks, including math word problems. And we can do it all without ever changing how the model is trained.
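To make the difference concrete, here is a minimal sketch of how the two kinds of prompt could be assembled as plain text. It does not call any model. The helper `build_prompt` and the `Q:`/`A:` layout are hypothetical, and the wording of the tennis-ball exemplar is an assumption, since the talk only shows that the example is about how many tennis balls Roger has.

```python
def build_prompt(exemplars, question):
    """Join worked (question, answer) exemplars and a new question into one prompt.

    Hypothetical helper for illustration; the Q:/A: layout is an assumption.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Assumed wording for the exemplar question shown in the demo.
EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard prompting: the exemplar answer is only the final result.
standard_exemplars = [(EXEMPLAR_QUESTION, "The answer is 11.")]

# Chain-of-thought prompting: the exemplar answer spells out the intermediate steps.
cot_exemplars = [(
    EXEMPLAR_QUESTION,
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.",
)]

NEW_QUESTION = "How many hours are in the month of May?"

standard_prompt = build_prompt(standard_exemplars, NEW_QUESTION)
cot_prompt = build_prompt(cot_exemplars, NEW_QUESTION)

# Reference answer for the new question: May has 31 days of 24 hours each.
HOURS_IN_MAY = 31 * 24

print(cot_prompt)
```

The only change between the two prompts is the text of the exemplar answer, which is why the technique needs no change to how the model is trained.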

PaLM is highly capable and can do so much more. For example, you might be someone who speaks a language that’s not well-represented on the web today, which makes it hard to find information. That is even more frustrating because the answer you are looking for is probably out there. PaLM offers a new approach that holds enormous promise for making knowledge more accessible for everyone.

Let me show you an example in which we can help answer questions in a language like Bengali, which is spoken by a quarter billion people. Just like before, we prompt the model with two examples of questions in Bengali with both Bengali and English answers.

That’s it, now we can start asking questions in Bengali: “What is the national song of Bangladesh?” The answer, by the way, is “Amar Sonar Bangla” — and PaLM got it right, too. This is not that surprising because you would expect that content to exist in Bengali.

You can also try something that is less likely to have related information in Bengali such as: “What are popular pizza toppings in New York City?” The model again answers correctly in Bengali. Though it probably just stirred up a debate amongst New Yorkers about how “correct” that answer really is.

What’s so impressive is that PaLM has never seen parallel sentences between Bengali and English. Nor was it ever explicitly taught to answer questions or translate at all! The model brought all of its capabilities together to answer questions correctly in Bengali. And we can extend the techniques to more languages and other complex tasks.

We’re so optimistic about the potential for language models. One day, we hope we can answer questions on more topics in any language you speak, making knowledge even more accessible, in Search and across all of Google.

Introducing the world’s largest, publicly available machine learning hub

The advances we’ve shared today are possible only because of our continued innovation in our infrastructure. Recently we announced plans to invest $9.5 billion in data centers and offices across the U.S.

One of our state-of-the-art data centers is in Mayes County, Oklahoma. I’m excited to announce that, there, we are launching the world’s largest, publicly-available machine learning hub for our Google Cloud customers.

Still image of a data center with Oklahoma map pin on bottom left corner.

One of our state-of-the-art data centers in Mayes County, Oklahoma.

This machine learning hub has eight Cloud TPU v4 pods, custom-built on the same networking infrastructure that powers Google’s largest neural models. They provide nearly nine exaflops of computing power in aggregate — bringing our customers an unprecedented ability to run complex models and workloads. We hope this will fuel innovation across many fields, from medicine to logistics, sustainability and more.
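For a sense of scale, here is a back-of-the-envelope sketch of what those two numbers imply per pod. It assumes the eight pods are identical and treats "nearly nine" as an upper bound of 9.0; neither assumption is stated in the talk.

```python
pods = 8
aggregate_exaflops = 9.0  # "nearly nine" in aggregate, so this is an upper bound

# Assumes all pods contribute equally.
per_pod_exaflops = aggregate_exaflops / pods

print(f"Up to about {per_pod_exaflops:.3f} exaflops per pod")
```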

And speaking of sustainability, this machine learning hub is already operating at 90% carbon-free energy. This is helping us make progress on our goal to become the first major company to operate all of our data centers and campuses globally on 24/7 carbon-free energy by 2030.

Even as we invest in our data centers, we are working to innovate on our mobile platforms so more processing can happen locally on device. Google Tensor, our custom system on a chip, was an important step in this direction. It’s already running on Pixel 6 and Pixel 6 Pro, and it brings our AI capabilities — including the best speech recognition we’ve ever deployed — right to your phone. It’s also a big step forward in making those devices more secure. Combined with Android’s Private Compute Core, it can run data-powered features directly on device so that it’s private to you.

People turn to our products every day for help in moments big and small. Core to making this possible is protecting your private information each step of the way. Even as technology grows increasingly complex, we keep more people safe online than anyone else in the world, with products that are secure by default, private by design and that put you in control.

We also spent time today sharing updates to platforms like Android. They’re delivering access, connectivity, and information to billions of people through their smartphones and other connected devices like TVs, cars and watches.

And we shared our new Pixel portfolio, including the Pixel 6a, Pixel Buds Pro, Google Pixel Watch, Pixel 7, and Pixel tablet, all built with ambient computing in mind. We’re excited to share a family of devices that work better together, for you.

The next frontier of computing: augmented reality

Today we talked about all the technologies that are changing how we use computers and access knowledge. We see devices working seamlessly together, exactly when and where you need them and with conversational interfaces that make it easier to get things done.

Looking ahead, there’s a new frontier of computing, which has the potential to extend all of this even further, and that is augmented reality. At Google, we have been heavily invested in this area. We’ve been building augmented reality into many Google products, from Google Lens to multisearch, scene exploration, and Live and immersive views in Maps.

These AR capabilities are already useful on phones and the magic will really come alive when you can use them in the real world without the technology getting in the way.

That potential is what gets us most excited about AR: the ability to spend time focusing on what matters in the real world, in our real lives. Because the real world is pretty amazing!

It’s important we design in a way that is built for the real world — and doesn’t take you away from it. And AR gives us new ways to accomplish this.

Let’s take language as an example. Language is just so fundamental to connecting with one another. And yet, understanding someone who speaks a different language, or trying to follow a conversation if you are deaf or hard of hearing can be a real challenge. Let’s see what happens when we take our advancements in translation and transcription and deliver them in your line of sight in one of the early prototypes we’ve been testing.

You can see it in their faces: the joy that comes with speaking naturally to someone. That moment of connection. To understand and be understood. That’s what our focus on knowledge and computing is all about. And it’s what we strive for every day, with products that are built to help.

Each year we get a little closer to delivering on our timeless mission. And we still have so much further to go. At Google, we genuinely feel a sense of excitement about that. And we are optimistic that the breakthroughs you just saw will help us get there. Thank you to all of the developers, partners and customers who joined us today. We look forward to building the future with all of you.


Understanding the world through language

Language is at the heart of how people communicate with each other. It’s also proving to be powerful in advancing AI and building helpful experiences for people worldwide.

From the beginning, we set out to connect words in your search to words on a page so we could make the web’s information more accessible and useful. Over 20 years later, as the web changes, and the ways people consume information expand from text to images to videos and more — the one constant is that language remains a surprisingly powerful tool for understanding information.

In recent years, we’ve seen an incredible acceleration in the field of natural language understanding. While our systems still don’t understand language the way people do, they’re increasingly able to spot patterns in information, identify complex concepts and even draw implicit connections between them. We’re even finding that many of our advanced models can understand information across languages or in non-language-based formats like images and videos.

Building the next generation of language models

In 2017, Google researchers developed the Transformer, the neural network architecture that underlies major advancements like MUM and LaMDA. Last year, we shared our thinking on a new architecture called Pathways, which is loosely inspired by the sparse patterns of neural activity in the brain. When you read a blog post like this one, only the parts of your brain needed to process this information fire up, not every single neuron. With Pathways, we’re now able to train AI models to be similarly effective.

Using this system, we recently introduced PaLM, a new model that achieves state-of-the-art performance on challenging language modeling tasks. It can solve complex math word problems, and answer questions in new languages with very little additional training data.

PaLM also shows improvements in understanding and expressing logic. This is significant because it allows the model to express its reasoning through words. Remember your algebra problem sets? It wasn’t enough to just get the right answer; you had to explain how you got there. With “Chain of Thought” prompting, PaLM can lay out its reasoning step by step. This emerging capability helps improve both accuracy and our understanding of how a model arrives at answers.

Flow chart for the difference between "Standard Prompting" and "Chain of Thought Prompting"
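To make the difference concrete, here is a minimal sketch in Python of how the two kinds of prompts might be assembled. It only builds text and calls no model. The word problems, the worked answers, and the helper name `build_prompt` are all made up for illustration; they are not taken from PaLM or from any Google API.

```python
# Illustrative only: the example problems and build_prompt are hypothetical.

# Standard prompting: the worked example shows just the final answer.
STANDARD_EXAMPLE = (
    "Q: A baker has 4 trays with 6 rolls each and sells 9 rolls. "
    "How many rolls are left?\n"
    "A: The answer is 15.\n"
)

# Chain of thought prompting: the worked example spells out each step.
CHAIN_OF_THOUGHT_EXAMPLE = (
    "Q: A baker has 4 trays with 6 rolls each and sells 9 rolls. "
    "How many rolls are left?\n"
    "A: 4 trays of 6 rolls is 4 * 6 = 24 rolls. "
    "After selling 9, 24 - 9 = 15 rolls remain. The answer is 15.\n"
)


def build_prompt(worked_example: str, question: str) -> str:
    """Place a worked example before a new question, leaving the answer open."""
    return f"{worked_example}\nQ: {question}\nA:"


new_question = (
    "A library has 3 shelves with 8 books each and lends out 5 books. "
    "How many books are left?"
)

standard_prompt = build_prompt(STANDARD_EXAMPLE, new_question)
chain_of_thought_prompt = build_prompt(CHAIN_OF_THOUGHT_EXAMPLE, new_question)

print(chain_of_thought_prompt)
```

The only difference between the two prompts is the worked example: in the second, the answer walks through its intermediate steps, which nudges a model to do the same for the new question.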

Translating the languages of the world

Pathways-related models are enabling us to break down language barriers in a way never before possible. Nowhere is this clearer than in our recently added support for 24 new languages in Google Translate, spoken by over 300 million people worldwide — including the first indigenous languages of the Americas. The amazing part is that the neural model did this using only monolingual text with no translation pairs — which allows us to help communities and languages underrepresented by technology. Machine translation at this level helps the world feel a bit smaller, while allowing us to dream bigger.

Unlocking knowledge about the world across modalities

Today, people consume information through webpages, images, videos, and more. Our advanced language and Pathways-related models are learning to make sense of information stemming from these different modalities through language. With these multimodal capabilities, we’re expanding multisearch in the Google app so you can search more naturally than ever before. As the saying goes — “a picture is worth a thousand words” — it turns out, words are really the key to sharing information about the world.

"Scene exploration" GIF of a store shelf demonstrating multisearch

Improving conversational AI

Despite these advancements, human language continues to be one of the most complex undertakings for computers.

In everyday conversation, we all naturally say “um,” pause to find the right words, or correct ourselves, and yet other people have no trouble understanding what we’re saying. That’s because people can react to conversational cues in as little as 200 milliseconds. Computers aren’t there yet. Moving our speech model from data centers to run on the device made things faster, but we wanted to push the envelope even more.

So we’re introducing improvements to responsiveness on the Assistant with unified neural networks, combining many models into smarter ones capable of understanding more, such as when someone pauses but is not finished speaking. Getting closer to the fluidity of real-time conversation is finally possible with Google’s Tensor chip, which is custom-engineered to handle on-device machine learning tasks super fast.

We’re also investing in building models that are capable of carrying more natural, sensible and specific conversations. Since introducing LaMDA to the world last year, we’ve made great progress, improving the model in key areas of quality, safety and groundedness — areas where we know conversational AI models can struggle. We’ll be releasing the next iteration, LaMDA 2, as a part of the AI Test Kitchen, which we’ll be opening up to small groups of people gradually. Our goal with AI Test Kitchen is to learn, improve, and innovate responsibly on this technology together. It’s still early days for LaMDA, but we want to continue to make progress and do so responsibly with feedback from the community.

GIF showing LaMDA 2 on device

Responsible development of AI models

While language is a remarkably powerful and versatile tool for understanding the world around us, we also know it comes with its limitations and challenges. In 2018, we published our AI Principles as guidelines to help us avoid bias, test rigorously for safety, design with privacy top of mind and make technology accountable to people. We’re investing in research across disciplines to understand the types of harms language models can cause, and to develop the frameworks and methods to ensure we bring in a diversity of perspectives and make meaningful improvements. We also build and use tools that can help us better understand our models (e.g., identifying how different words affect a prediction, tracing an error back to training data and even measuring correlations within a model). And while we work to improve underlying models, we also test rigorously before and after any kind of product deployment.

We’ve come a long way since introducing the world to the Transformer. We’re proud of the tremendous value that it and its predecessors have brought not only to everyday Google products like Search and Translate, but also the breakthroughs they’ve powered in natural language understanding. Our work advancing the future of AI is driven by something as old as time: the power language has to bring people together.
