Large-Scale Matrix Factorization on TPUs

Matrix factorization is one of the oldest, yet still widely used, techniques for learning how to recommend items such as songs or movies from user ratings. In its basic form, it approximates a large, sparse (i.e., mostly empty) matrix of user-item interactions with a product of two smaller, denser matrices representing learned item and user features. These dense matrices, in turn, can be used to recommend items to a user with which they haven’t interacted before.

Despite its algorithmic simplicity, matrix factorization can still achieve competitive performance in recommender benchmarks. Alternating least squares (ALS), and especially its implicit variant, is a fundamental algorithm for learning the parameters of matrix factorization. ALS is known for its high efficiency because it scales linearly in the number of rows, columns, and non-zeros, which makes it well suited to large-scale problems. But for very large real-world matrix factorization datasets, a single-machine implementation does not suffice, so a large distributed system is required. Most distributed implementations of matrix factorization that employ ALS leverage off-the-shelf CPU devices, and rightfully so, due to the inherently sparse nature of the problem (the input matrix is mostly empty).
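To make the alternating update concrete, below is a minimal NumPy sketch of explicit-feedback ALS on a tiny ratings matrix: each side is re-solved in closed form while the other side is held fixed. The matrix sizes, regularization value, and variable names are illustrative choices, not the setup described later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim, lam = 6, 5, 3, 0.1

# Sparse ratings: 0 means "unobserved".
Y = np.zeros((num_users, num_items))
Y[rng.integers(0, num_users, 10), rng.integers(0, num_items, 10)] = rng.uniform(1, 5, 10)

U = rng.normal(scale=0.1, size=(num_users, dim))  # user embeddings
V = rng.normal(scale=0.1, size=(num_items, dim))  # item embeddings

def solve_side(Y, fixed, lam):
    """Solve a regularized least-squares problem for each row, given the fixed side."""
    solved = np.zeros((Y.shape[0], fixed.shape[1]))
    for r in range(Y.shape[0]):
        obs = np.nonzero(Y[r])[0]          # observed columns for this row
        if obs.size == 0:
            continue
        F = fixed[obs]                     # embeddings of the observed columns
        A = F.T @ F + lam * np.eye(F.shape[1])
        b = F.T @ Y[r, obs]
        solved[r] = np.linalg.solve(A, b)  # normal-equations solve
    return solved

for _ in range(10):                        # alternate between the two sides
    U = solve_side(Y, V, lam)
    V = solve_side(Y.T, U, lam)

print("reconstruction RMSE on observed entries:",
      np.sqrt(np.mean((U @ V.T - Y)[Y > 0] ** 2)))
```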

On the other hand, the recent success of deep learning, with its growing computational demands, has spurred a new wave of research and progress on hardware accelerators such as Tensor Processing Units (TPUs). TPUs afford domain-specific hardware speedups, especially for use cases like deep learning, which involves a large number of dense matrix multiplications. In particular, they allow significant speedups for traditional data-parallel workloads, such as training models with Stochastic Gradient Descent (SGD) in SPMD (single program multiple data) fashion. The SPMD approach has gained popularity for computations like training neural networks with gradient descent algorithms, and can be used for both data-parallel and model-parallel computations, where the parameters of the model are distributed across the available devices. Nevertheless, while TPUs have been enormously attractive for SGD-based methods, it is not immediately clear whether a high-performance implementation of ALS, which requires a large number of distributed sparse matrix multiplies, can be developed for a large-scale cluster of TPU devices.

In “ALX: Large Scale Matrix Factorization on TPUs”, we explore a distributed ALS design that makes efficient use of the TPU architecture and can scale well to matrix factorization problems of the order of billions of rows and columns by scaling the number of available TPU cores. The approach we propose leverages a combination of model and data parallelism, where each TPU core both stores a portion of the embedding table and trains over a unique slice of data, grouped in mini-batches. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built and released a real world web link prediction dataset called WebGraph.

The figure shows the flow of data and computation through the ALX framework on TPU devices. Similar to SGD-based training procedures, each TPU core performs identical computation for its own batch of data in SPMD fashion, which allows for synchronous computation in parallel on multiple TPU cores. Each TPU starts with gathering all the relevant item embeddings in the Sharded Gather stage. These materialized embeddings are used to solve for user embeddings which are scattered to the relevant shard of the embedding table in the Sharded Scatter stage.

Dense Batching for Improved Efficiency
We designed ALX specifically for TPUs, exploiting unique properties of TPU architecture while overcoming a few interesting limitations. For instance, each TPU core has limited memory and restricts all tensors to have a static shape, but each example in a mini-batch can have a wildly varying number of items (i.e., inputs can be long and sparse). To resolve this, we break exceedingly long examples into multiple smaller examples of the same shape, a process called dense batching. More details about dense batching can be found in our paper.
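A rough sketch of the idea (not the ALX implementation): each example's variable-length item list is chopped into fixed-size segments and padded, so every tensor in a batch has the static shape the TPU compiler requires. The segment length, padding value, and names below are illustrative.

```python
import numpy as np

def densify(user_items, segment_len=4, pad_id=-1):
    """Split variable-length item lists into fixed-shape segments.

    Returns (user_ids, item_ids) where item_ids has static shape
    [num_segments, segment_len]; pad_id marks unused slots.
    """
    user_ids, segments = [], []
    for user, items in user_items.items():
        for start in range(0, len(items), segment_len):
            chunk = items[start:start + segment_len]
            chunk = chunk + [pad_id] * (segment_len - len(chunk))  # pad to static shape
            user_ids.append(user)
            segments.append(chunk)
    return np.array(user_ids), np.array(segments)

# One "long" user and two short ones.
user_items = {0: [3, 7, 8, 9, 11, 12, 20], 1: [5], 2: [2, 4]}
uids, items = densify(user_items)
print(uids)   # [0 0 1 2] -- user 0 was split into two segments
print(items)  # shape (4, 4), padded with -1
```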

Illustrative example of how sparse batches are densified to increase efficiency on TPUs.

Uniform Sharding of Embedding Tables
With the batching problem solved, we next want to factorize a sparse matrix into two dense embedding matrices (e.g., user and item embeddings) such that the resulting dot product of embeddings approximate the original sparse matrix — this helps us infer predictions for all the positions from the original matrix, including those that were empty, which can be used to recommend items with which users haven’t interacted. Both the resulting embedding tables (W and H in the figure below) can potentially be too large to fit in a single TPU core, thus requiring a distributed training setup for most large-scale use cases.

Most previous attempts of distributed matrix factorization use a parameter server architecture where the model parameters are stored on highly available servers, and the training data is processed in parallel by workers that are solely responsible for the learning task. In our case, since each TPU core has identical compute and memory, it’s wasteful to only use either memory for storing model parameters or compute for training. Thus, we designed our system such that each core is used to do both.

Illustrative example of factorizing a sparse matrix Y into two dense embedding matrices W and H.

In ALX, we uniformly divide both embedding tables, thus fully exploiting both the size of distributed memory available and the dedicated low-latency interconnects between TPUs. This is highly efficient for very large embedding tables and results in good performance for distributed gather and scatter operations.
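The gather/scatter pattern can be mimicked on a single host with NumPy, treating a list of arrays as the per-core shards; the shard count, table size, and modulo placement rule below are illustrative stand-ins for the actual TPU collectives.

```python
import numpy as np

num_shards, num_items, dim = 4, 12, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(num_items, dim))          # full item embedding table

# Uniformly shard the table: row i lives on shard i % num_shards.
shards = [H[s::num_shards].copy() for s in range(num_shards)]

def sharded_gather(item_ids):
    """Fetch rows of H from their home shards (stand-in for a sharded gather)."""
    out = np.empty((len(item_ids), dim))
    for k, i in enumerate(item_ids):
        out[k] = shards[i % num_shards][i // num_shards]
    return out

def sharded_scatter(item_ids, new_rows):
    """Write re-solved rows back to their home shards (stand-in for a sharded scatter)."""
    for i, row in zip(item_ids, new_rows):
        shards[i % num_shards][i // num_shards] = row

gathered = sharded_gather([0, 5, 10])
sharded_scatter([0, 5, 10], gathered * 2.0)     # pretend these rows were re-solved
assert np.allclose(shards[0][0], H[0] * 2.0)
```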

Uniform sharding of both embedding tables (W and H) across TPU cores (in blue).

WebGraph
Since potential applications may involve very large datasets, scalability is an important opportunity for advancement in matrix factorization. To that end, we also release WebGraph, a large real-world web link prediction dataset. This dataset can be easily modeled as a matrix factorization problem in which rows and columns are source and destination links, respectively, and the task is to predict destination links from each source link. We use WebGraph to illustrate the scaling properties of ALX.

The WebGraph dataset was generated from a single crawl performed by CommonCrawl in 2021 where we strip everything and keep only the link->outlinks data. Since the performance of a factorization method depends on the properties of the underlying graph, we created six versions of WebGraph, each varying in the sparsity pattern and locale, to study how well ALS performs on each.

  • To study locale-specific graphs, we filter based on two top level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
  • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of either 10 or 50 inlinks and outlinks.

For easy access, we have made these available as a TensorFlow Datasets package. For reference, the biggest version, WebGraph-sparse, has more than 365M nodes and 30B edges. We create and publish both training and testing splits for evaluation purposes.

Results
We carefully tune the system and quality parameters of ALX, guided by our observations about precision and the choice of linear solver. We observed that by carefully selecting the precision for storage of the embedding tables (bfloat16) and for the input to the linear solvers (float32), we were able to halve the memory required for the embeddings while still avoiding problems arising from lower-precision values during the solve stage. For our linear solvers, we selected conjugate gradients, which we found to be the fastest across the board on TPUs. We use embeddings of dimension 128 and train the model for 16 epochs. In our experience, hyperparameter tuning over both the norm penalty (λ) and the unobserved weight (α) has been indispensable for good recall metrics, as shown in the table below.

Results obtained by running ALX on all versions of WebGraph dataset. Recall values of 1.0 denote perfect recall.
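The precision and solver recipe described above (embeddings stored at reduced precision, cast up to float32 before a conjugate-gradients solve) can be sketched in a few lines of NumPy. NumPy has no native bfloat16, so float16 stands in for it here, and the matrix sizes, regularization value, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def conjugate_gradient(A, b, iters=50):
    """Minimal conjugate-gradients solver for a symmetric positive-definite system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
dim, lam = 128, 0.1
H_lowp = rng.normal(size=(200, dim)).astype(np.float16)   # embeddings stored at low precision
                                                           # (float16 as a stand-in for bfloat16)
F = H_lowp.astype(np.float32)                              # cast up only for the solve stage
A = F.T @ F + lam * np.eye(dim, dtype=np.float32)
b = F.T @ rng.normal(size=200).astype(np.float32)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))   # residual shrinks toward zero as iterations grow
```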

Scaling Analysis
Since the input data are processed in parallel across TPU cores, increasing the number of cores decreases training time, ideally in a linear fashion. But at the same time, a larger number of cores requires more network communication (due to the sharded embedding tables). Thanks to high-speed interconnects, this overhead can be negligible for a small number of cores, but as the number of cores increases, the overhead eventually slows down the ideal linear scaling.

In order to confirm our hypothesis, we analyze scaling properties of the four biggest WebGraph variants in terms of training time as we increase the number of available TPU cores. As shown below, even empirically, we do observe the predicted linear decrease in training time up to a sweet spot, after which the network overhead slows the decline.

Scaling analysis of running time as the number of TPU cores is increased. Each figure plots the time taken to train for one epoch in seconds.

Conclusion
For easy access and reproducibility, the ALX code is open-sourced and can easily be run on Google Cloud. In fact, we illustrate that a sparse matrix like WebGraph-dense of size 135M x 135M (with 22B edges) can be factorized in a Colab connected to 8 TPU cores in less than a day. We have designed the ALX framework with scalability in mind. With 256 TPU cores, one epoch of the largest WebGraph variant, WebGraph-sparse (a 365M x 365M sparse matrix), takes around 20 minutes to finish (5.5 hours for the whole training run). The final model has around 100B parameters. We hope that ALX and WebGraph will be useful to both researchers and practitioners working in these fields. The code for ALX is available on GitHub!

Acknowledgements
The core team includes Steffen Rendle, Walid Krichene and Li Zhang. We thank many Google colleagues for helping at various stages of this project. In particular, we are grateful to the JAX team for numerous discussions, especially James Bradbury and Skye Wanderman-Milne; Blake Hechtman for help with XLA and Rasmus Larsen for useful discussions about performance of linear solvers on TPUs. Finally, we’re also grateful to Nicolas Mayoraz and John Anderson for providing useful feedback.

Read More

VDTTS: Visually-Driven Text-To-Speech

Recent years have seen a tremendous increase in the creation and serving of video content to users across the world in a variety of languages and over numerous platforms. The process of creating high quality content can include several stages from video capturing and captioning to video and audio editing. In some cases dialogue is re-recorded (referred to as dialog replacement, post-sync or dubbing) in a studio in order to achieve high quality and replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, requiring several edits to match the exact timing of mouth movements.

In “More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech”, we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. As opposed to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker’s delivery. This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.

Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.

VDTTS Model
The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs; (2) a multi-source attention mechanism that connects encoders to a decoder; (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter), and produces mel-spectrograms (which are a form of compressed representation in the frequency domain); and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.

The overall architecture of VDTTS. Text and video encoders process the inputs and then a multisource attention mechanism connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.

We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate arbitrary text, making it less susceptible to misuse (e.g., the generation of fake content).

Quality
To showcase the unique strength of VDTTS in this post, we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word timing clues, visual information that is not available to the TTS model.

In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram (shown below). VDTTS preserves this characteristic and generates audio that is much closer to the ground-truth than the audio generated by standard TTS without access to the video.

Similarly, in the second example, the speaker takes long pauses between some of the words. These pauses are captured by VDTTS and are reflected in the video below, whereas the TTS does not capture this aspect of the speaker’s rhythm.

We also plot fundamental frequency (F0) charts to compare the pitch generated by each model to the ground-truth pitch. In both examples, the F0 curve of VDTTS fits the ground-truth much better than the TTS curve, both in the alignment of speech and silence, and also in how the pitch changes over time. See more original videos and VDTTS generated videos.

We present two examples, (a) and (b), from the VoxCeleb2 test set. From top to bottom: input face images, ground-truth (GT) mel-spectrogram, mel-spectrogram output of VDTTS, mel-spectrogram output of a standard TTS model, and two plots showing the normalized F0 (normalized by mean non-zero pitch, i.e., mean is only over voiced periods) of VDTTS and TTS compared to the ground-truth signal.

Video Samples

Original VDTTS VDTTS video-only TTS
Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays the audio predicted using the video frames only. TTS displays the audio predicted using the text only. Top transcript: “of space for people to make their own judgments and to come to their own”. Bottom transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.

Model Performance
We’ve measured the VDTTS model’s performance using the VoxCeleb2 dataset and compared it to TTS and TTS with length hint (a TTS model that receives the scene length) models. We demonstrate that VDTTS outperforms both models by large margins in most of the aspects we measured: higher sync-to-video quality (measured by SyncNet distance), better speech quality (measured by mel cepstral distance, MCD), and lower Gross Pitch Error (GPE), which measures the percentage of frames where the pitch differs by more than 20%, computed over frames for which voice is present in both the predicted and reference audio.
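For reference, a minimal sketch of the Gross Pitch Error computation as defined above; the voicing convention (0 Hz marks an unvoiced frame) and the array names are assumptions for illustration, and frames are taken to be already aligned.

```python
import numpy as np

def gross_pitch_error(f0_pred, f0_ref, tol=0.2):
    """Fraction of frames voiced in both signals whose pitch differs by more than tol (20%)."""
    f0_pred, f0_ref = np.asarray(f0_pred, float), np.asarray(f0_ref, float)
    voiced = (f0_pred > 0) & (f0_ref > 0)          # 0 Hz marks unvoiced frames
    if not voiced.any():
        return 0.0
    rel_err = np.abs(f0_pred[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return float(np.mean(rel_err > tol))

ref  = [0, 110, 115, 120, 0, 200]   # Hz per frame, 0 = unvoiced
pred = [0, 112, 160, 119, 0, 205]
print(gross_pitch_error(pred, ref))  # 0.25: one of the four voiced frames is off by >20%
```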

SyncNet distance comparison between VDTTS, TTS and the TTS with Length hint (a lower metric is better).
Mel cepstral distance comparison between VDTTS, TTS and the TTS with Length hint (a lower metric is better).
Gross Pitch Error comparison between VDTTS, TTS and the TTS with Length hint (a lower metric is better).

Discussion and Future Work
One thing to note is that, intriguingly, VDTTS can produce video-synchronized speech without any explicit losses or constraints to promote this, suggesting that complexities such as synchronization losses or explicit modeling are unnecessary.

While this is a proof-of-concept demonstration, we believe that in the future, VDTTS can be upgraded to be used in scenarios where the input text differs from the original video signal. This kind of model would be a valuable tool for tasks such as translation dubbing.

Acknowledgements
We would like to thank the co-authors of this research: Michelle Tadmor Ramanovich, Ye Jia, Brendan Shillingford, and Miaosen Wang. We are also grateful to the valued contributions, discussions, and feedback from Nadav Bar, Jay Tenenbaum, Zach Gleicher, Paul McCartney, Marco Tagliasacchi, and Yoni Tzafir.

Read More

How AI and imagery build a self-updating map

Building a map is complex, and keeping it up-to-date is even more challenging. Think about how often your city, town or neighborhood changes on a day-to-day basis. Businesses and shops open and close, stretches of highway are added, and roadways change. In today’s Maps 101 installment, we’ll dive into two ways Google Maps uses advancements in AI and imagery to help you see the latest information about the world around you every single day.

Automatically updating business hours

Over the past few years, businesses have experienced a lot of change — including constantly updating operating hours based on changing pandemic-related restrictions. To keep up with this pace of change, we developed a machine learning model that automatically identifies if business hours are likely wrong, then instantly updates them with AI-generated predictions.

Let’s look at Liam’s Lemonade Shop as an example. To start, our systems consider multiple factors — such as when Liam last updated their business profile, what we know about other shops’ hours, and the Popular Times information for the shop, which uses location trends to show when it tends to be busiest. Since it appears that Liam’s business profile hasn’t been updated in over a year and its busiest hours are typically Thursday afternoons — even though Google Maps says that it’s closed at that time — Liam’s business hours are likely out of date.
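The exact model is not described here, but the kind of signal it combines can be illustrated with a deliberately simplified, hypothetical rule: flag hours for review when the profile has not been updated for a long time and Popular Times shows traffic while the listing says the shop is closed. The function name, thresholds, and data layout below are invented for illustration and are not Google Maps code.

```python
from datetime import datetime, timedelta

def hours_look_stale(last_profile_update, posted_hours, popular_times):
    """Hypothetical heuristic: flag hours for review when the profile is old and
    Popular Times shows traffic during an interval the listing says is closed."""
    stale_profile = datetime.now() - last_profile_update > timedelta(days=365)
    busy_while_closed = any(
        visits > 0 and not posted_hours.get((day, hour), False)
        for (day, hour), visits in popular_times.items()
    )
    return stale_profile and busy_while_closed

posted_hours = {("Thu", 14): False}    # listing says closed on Thursday at 2pm...
popular_times = {("Thu", 14): 37}      # ...but location trends say it tends to be busy then
print(hours_look_stale(datetime(2020, 6, 1), posted_hours, popular_times))  # True
```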


To see if business hours need updating, we check a store’s Popular Times information and when its business profile was last updated.

So what’s next? Our algorithms analyze the business hours of other nearby lemonade shops, information from Liam’s website, and Street View images of Liam’s storefront that look specifically for business hour signs to determine the most accurate business hour prediction. At the same time, we enlist the help of the Google Maps community — including Local Guides and even the business owners themselves through their Google Business Profile — to verify the information we predicted. In Argentina, Australia, Chile, France, Japan, Mexico, New Zealand, Peru, and the United States, we also use Duplex conversational technology to call businesses just like Liam’s and ask for their hours directly. With this new AI-first approach, we’re on track to update the hours for over 20 million businesses around the globe in the next six months – helping you know exactly when your favorite store, restaurant or cafe is open for business.

Road information that reflects the real world

We’re also experimenting with ways we can use imagery to make updates to other helpful information. For instance, starting in the U.S., we’re launching a third-party imagery pilot to let you see the most up-to-date speed limit information in your town, which can help keep you safe while driving. Here’s how it works:

Say our systems think that the speed limit information on a particular highway needs to be updated. With the help of third-party imagery partners that already gather roadway imagery to improve delivery routes, we can request a photo of the specific stretch of road that also includes a speed limit sign. If the partner has this photo available, we then use a combination of AI and help from our operations team to identify the sign in the image, extract the new speed limit information, and update Google Maps.


Representative imagery featuring a speed limit sign, with license plates blurred

Over time, this technology will bring more details to the map that can help make your drives safer and more efficient — like where potholes and school zones are or where new construction is happening. And as with all Google Maps features, we designed this pilot with privacy top of mind. For instance, we only reference images taken on public roads, and partners are required to blur information (like faces and license plates) to avoid potentially identifying someone. For an extra layer of privacy, we blur the photo again when we receive it and delete the photo after we use it to update the map.

AI, imagery and Duplex technology will continue to play a critical role in helping make Google Maps the most comprehensive and useful map possible. For more behind-the-scenes looks at the technology that powers Google Maps, check out the rest of our Maps 101 blog series.

Read More

Efficiently Initializing Reinforcement Learning With Prior Policies

Reinforcement learning (RL) can be used to train a policy to perform a task via trial and error, but a major challenge in RL is learning policies from scratch in environments with hard exploration challenges. For example, consider the setting depicted in the door-binary-v0 environment from the adroit manipulation suite, where an RL agent must control a hand in 3D space to open a door placed in front of it.

An RL agent must control a hand in 3D space to open a door placed in front of it. The agent receives a reward signal only when the door is completely open.

Since the agent receives no intermediary rewards, it cannot measure how close it is to completing the task, and so must explore the space randomly until it eventually opens the door. Given how long the task takes and the precise control required, this is extremely unlikely.

For tasks like this, we can avoid exploring the state space randomly by using prior information. This prior information helps the agent understand which states of the environment are good, and should be further explored. We could use offline data (i.e., data collected by human demonstrators, scripted policies, or other RL agents) to train a policy, then use it to initialize a new RL policy. In the case where we use neural networks to represent the policies, this would involve copying the pre-trained policy’s neural network over to the new RL policy. This procedure makes the new RL policy behave like the pre-trained policy. However, naïvely initializing a new RL policy like this often works poorly, especially for value-based RL methods, as shown below.

A policy is pre-trained on the antmaze-large-diverse-v0 D4RL environment with offline data (negative steps correspond to pre-training). We then use the policy to initialize actor-critic fine-tuning (positive steps starting from step 0) with this pre-trained policy as the initial actor. The critic is initialized randomly. The actor’s performance immediately drops and does not recover, as the untrained critic provides a poor learning signal and causes the good initial policy to be forgotten.

With the above in mind, in “Jump-Start Reinforcement Learning” (JSRL), we introduce a meta-algorithm that can use a pre-existing policy of any form to initialize any type of RL algorithm. JSRL uses two policies to learn tasks: a guide-policy, and an exploration-policy. The exploration-policy is an RL policy that is trained online with new experience that the agent collects from the environment, and the guide-policy is a pre-existing policy of any form that is not updated during online training. In this work, we focus on scenarios where the guide-policy is learned from demonstrations, but many other kinds of guide-policies can be used. JSRL creates a learning curriculum by rolling in the guide-policy, which is then followed by the self-improving exploration-policy, resulting in performance that compares to or improves on competitive IL+RL methods.

The JSRL Approach
The guide-policy can take any form: it could be a scripted policy, a policy trained with RL, or even a live human demonstrator. The only requirements are that the guide-policy is reasonable (i.e., better than random exploration), and it can select actions based on observations of the environment. Ideally, the guide-policy can reach poor or medium performance in the environment, but cannot further improve itself with additional fine-tuning. JSRL then allows us to leverage the progress of this guide-policy to take the performance even higher.

At the beginning of training, we roll out the guide-policy for a fixed number of steps so that the agent is closer to goal states. The exploration-policy then takes over and continues acting in the environment to reach these goals. As the performance of the exploration-policy improves, we gradually reduce the number of steps that the guide-policy takes, until the exploration-policy takes over completely. This process creates a curriculum of starting states for the exploration-policy such that in each curriculum stage, it only needs to learn to reach the initial states of prior curriculum stages.
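This roll-in curriculum can be sketched with a toy example: a 1-D chain environment with a sparse reward only at the far end, a hand-written guide-policy that usually moves in the right direction, and tabular Q-learning as the exploration-policy. Everything in the sketch (the environment, both policies, and the fixed decay schedule) is a stand-in chosen for brevity, not the setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, HORIZON = 20, 60                      # 1-D chain: reach state N for a sparse reward

def step(state, action):                 # action in {-1, +1}
    nxt = int(np.clip(state + action, 0, N))
    return nxt, float(nxt == N), nxt == N

guide = lambda s: 1 if rng.random() < 0.8 else -1   # decent but imperfect guide-policy

Q = np.zeros((N + 1, 2))                 # exploration-policy: tabular Q-learning
def explore(s, eps=0.2):
    return rng.choice([-1, 1]) if rng.random() < eps else [-1, 1][int(np.argmax(Q[s]))]

def run_episode(guide_steps):
    s, ret = 0, 0.0
    for t in range(HORIZON):
        a = guide(s) if t < guide_steps else explore(s)
        s2, r, done = step(s, a)
        ai = (a + 1) // 2                # map action {-1, +1} -> index {0, 1}
        # Only the Q-learner (exploration-policy) is updated; the guide-policy stays fixed.
        Q[s, ai] += 0.5 * (r + 0.95 * Q[s2].max() - Q[s, ai])
        s, ret = s2, ret + r
        if done:
            break
    return ret

# Curriculum: shrink the guide-policy roll-in from the full horizon down to zero.
for guide_steps in range(HORIZON, -1, -HORIZON // 10):
    returns = [run_episode(guide_steps) for _ in range(200)]
    print(f"guide_steps={guide_steps:2d}  success_rate={np.mean(returns):.2f}")
```

In the paper the roll-in length is reduced based on the exploration-policy's evaluated performance rather than on a fixed schedule; the fixed loop above only keeps the sketch short.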

Here, the task is for the robot arm to pick up the blue block. The guide-policy can move the arm to the block, but it cannot pick it up. It controls the agent until it grips the block, then the exploration-policy takes over, eventually learning to pick up the block. As the exploration-policy improves, the guide-policy controls the agent less and less.

Comparison to IL+RL Baselines
Since JSRL can use a prior policy to initialize RL, a natural comparison would be to imitation and reinforcement learning (IL+RL) methods that train on offline datasets, then fine-tune the pre-trained policies with new online experience. We show how JSRL compares to competitive IL+RL methods on the D4RL benchmark tasks. These tasks include simulated robotic control environments, along with datasets of offline data from human demonstrators, planners, and other learned policies. Out of the D4RL tasks, we focus on the difficult ant maze and adroit dexterous manipulation environments.

Example ant maze (left) and adroit dexterous manipulation (right) environments.

For each experiment, we train on an offline dataset and then run online fine-tuning. We compare against algorithms designed specifically for each setting, which include AWAC, IQL, CQL, and behavioral cloning. While JSRL can be used in combination with any initial guide-policy or fine-tuning algorithm, we use our strongest baseline, IQL, as a pre-trained guide and for fine-tuning. The full D4RL dataset includes one million offline transitions for each ant maze task. Each transition is a tuple (S, A, R, S’) that specifies the state the agent started in (S), the action the agent took (A), the reward the agent received (R), and the state the agent ended up in (S’) after taking action A. We find that JSRL performs well with as few as ten thousand offline transitions.

Average score (max=100) on the antmaze-medium-diverse-v0 environment from the D4RL benchmark suite. JSRL can improve even with limited access to offline transitions.

Vision-Based Robotic Tasks
Utilizing offline data is especially challenging in complex tasks such as vision-based robotic manipulation due to the curse of dimensionality. The high dimensionality of both the continuous-control action space and the pixel-based state space present scaling challenges for IL+RL methods in terms of the amount of data required to learn good policies. To study how JSRL scales to such settings, we focus on two difficult simulated robotic manipulation tasks: indiscriminate grasping (i.e., lifting any object) and instance grasping (i.e., lifting a specific target object).

A simulated robot arm is placed in front of a table with various categories of objects. When the robot lifts any object, a sparse reward is given for the indiscriminate grasping task. For the instance grasping task, a sparse reward is only given when a specific target object is grasped.

We compare JSRL against methods that are able to scale to complex vision-based robotics settings, such as QT-Opt and AW-Opt. Each method has access to the same offline dataset of successful demonstrations and is allowed to run online fine-tuning for up to 100,000 steps.

In these experiments, we use behavioral cloning as a guide-policy and combine JSRL with QT-Opt for fine-tuning. The combination of QT-Opt+JSRL improves faster than all other methods while achieving the highest success rate.

Mean grasping success for indiscriminate and instance grasping environments using 2k successful demonstrations.

Conclusion
We proposed JSRL, a method for leveraging a prior policy of any form to improve exploration for initializing RL tasks. Our algorithm creates a learning curriculum by rolling in a pre-existing guide-policy, which is then followed by the self-improving exploration-policy. The job of the exploration-policy is greatly simplified since it starts exploring from states closer to the goal. As the exploration-policy improves, the effect of the guide-policy diminishes, leading to a fully capable RL policy. In the future, we plan to apply JSRL to problems such as Sim2Real, and explore how we can leverage multiple guide-policies to train RL agents.

Acknowledgements
This work would not have been possible without Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, and Karol Hausman. Special thanks to Tom Small for creating the animations for this post.

Read More

Reproducibility in Deep Learning and Smooth Activations

Ever queried a recommender system and found that the same search only a few moments later or on a different device yields very different results? This is not uncommon and can be frustrating if a person is looking for something specific. As a designer of such a system, it is also not uncommon for the metrics measured to change from design and testing to deployment, bringing into question the utility of the experimental testing phase. Some level of such irreproducibility can be expected as the world changes and new models are deployed. However, this also happens regularly as requests hit duplicates of the same model or models are being refreshed.

Lack of replicability, where researchers are unable to reproduce published results with a given model, has been identified as a challenge in the field of machine learning (ML). Irreproducibility is a related but more elusive problem, where multiple instances of a given model are trained on the same data under identical training conditions, but yield different results. Only recently has irreproducibility been identified as a difficult problem, but due to its complexity, theoretical studies to understand this problem are extremely rare.

In practice, deep network models are trained in highly parallelized and distributed environments. Nondeterminism in training from random initialization, parallelism, distributed training, data shuffling, quantization errors, hardware types, and more, combined with objectives with multiple local optima contribute to the problem of irreproducibility. Some of these factors, such as initialization, can be controlled, but it is impractical to control others. Optimization trajectories can diverge early in training by following training examples in the order seen, leading to very different models. Several recently published solutions [1, 2, 3] based on advanced combinations of ensembling, self-ensembling, and distillation can mitigate the problem, but usually at the cost of accuracy and increased complexity, maintenance and improvement costs.

In “Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations”, we consider a different practical solution to this problem that does not incur the costs of other solutions, while still improving reproducibility and yielding higher model accuracy. We discover that the Rectified Linear Unit (ReLU), which is very popular as the nonlinearity function (i.e., activation function) used to transform values in neural networks, exacerbates the irreproducibility problem. On the other hand, we demonstrate that smooth activation functions, which have derivatives that are continuous for the whole domain, unlike those of ReLU, are able to substantially reduce irreproducibility levels. We then propose the Smooth reLU (SmeLU) activation function, which gives comparable reproducibility and accuracy benefits to other smooth activations but is much simpler.

The ReLU function (left) as a function of the input signal, and its gradient (right) as a function of the input.

Smooth Activations
An ML model attempts to learn the best model parameters that fit the training data by minimizing a loss, which can be imagined as a landscape with peaks and valleys, where the lowest point attains an optimal solution. For deep models, the landscape may consist of many such peaks and valleys. The activation function used by the model governs the shape of this landscape and how the model navigates it.

ReLU, which is not a smooth function, imposes an objective whose landscape is partitioned into many regions with multiple local minima, each providing different model predictions. With this landscape, the order in which updates are applied is a dominant factor in determining the optimization trajectory, providing a recipe for irreproducibility. Because of its non-continuous gradient, functions expressed by a ReLU network will contain sudden jumps in the gradient, which can occur internally in different layers of the deep network, affecting updates of different internal units, and are likely strong contributors to irreproducibility.

Suppose a sequence of model updates attempts to push the activation of some unit down from a positive value. The gradient of the ReLU function is 1 for positive unit values, so with every update it pushes the unit to become smaller and smaller (to the left in the panel above). At the point the activation of this unit crosses the threshold from a positive value to a negative one, the gradient suddenly changes from magnitude 1 to magnitude 0. Training attempts to keep moving the unit leftwards, but due to the 0 gradient, the unit cannot move further in that direction. Therefore, the model must resort to updating other units that can move.

We find that networks with smooth activations (e.g., GELU, Swish and Softplus) can be substantially more reproducible. They may exhibit a similar objective landscape, but with fewer regions, giving a model fewer opportunities to diverge. Unlike the sudden jumps with ReLU, for a unit with decreasing activations, the gradient gradually reduces to 0, which gives other units opportunities to adjust to the changing behavior. With equal initialization, moderate shuffling of training examples, and normalization of hidden layer outputs, smooth activations are able to increase the chances of converging to the same minimum. Very aggressive data shuffling, however, loses this advantage.

The rate that a smooth activation function transitions between output levels, i.e., its “smoothness”, can be adjusted. Sufficient smoothness leads to improved accuracy and reproducibility. Too much smoothness, though, approaches linear models with a corresponding degradation of model accuracy, thus losing the advantages of using a deep network.

Smooth activations (top) and their gradients (bottom) for different smoothness parameter values β as a function of the input values. β determines the width of the transition region between 0 and 1 gradients. For Swish and Softplus, a greater β gives a narrower region; for SmeLU, a greater β gives a wider region.

Smooth reLU (SmeLU)
Activations like GELU and Swish require complex hardware implementations to support exponential and logarithmic functions. Further, GELU must be computed numerically or approximated. These properties can make deployment error-prone, expensive, or slow. GELU and Swish are not monotonic (they start by slightly decreasing and then switch to increasing), which may interfere with interpretability (or identifiability), nor do they have a full stop or a clean slope 1 region, properties that simplify implementation and may aid in reproducibility. 

The Smooth reLU (SmeLU) activation function is designed as a simple function that addresses the concerns with other smooth activations. It connects a 0 slope on the left with a slope 1 line on the right through a quadratic middle region, constraining continuous gradients at the connection points (as an asymmetric version of a Huber loss function).
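Written out, that definition translates directly into code, where β is the half-width of the quadratic transition region; the gradient is included to show that it ramps continuously from 0 to 1 instead of jumping the way ReLU's does. The function below is a minimal NumPy transcription, not a production kernel.

```python
import numpy as np

def smelu(x, beta=1.0):
    """Smooth reLU: 0 left of -beta, identity right of +beta, quadratic in between."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= -beta, 0.0,
           np.where(x >= beta, x, (x + beta) ** 2 / (4.0 * beta)))

def smelu_grad(x, beta=1.0):
    """Gradient is continuous: 0, then a linear ramp, then 1 (unlike ReLU's jump)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= -beta, 0.0,
           np.where(x >= beta, 1.0, (x + beta) / (2.0 * beta)))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(smelu(xs))       # [0.   0.   0.25 1.   2.  ]
print(smelu_grad(xs))  # [0.   0.   0.5  1.   1.  ]
```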

SmeLU can be viewed as a convolution of ReLU with a box. It provides a cheap and simple smooth solution that is comparable in reproducibility-accuracy tradeoffs to more computationally expensive and complex smooth activations. The figure below illustrates the transition of the loss (objective) surface as we gradually transition from a non-smooth ReLU to a smoother SmeLU. A transition of width 0 is the basic ReLU function for which the loss objective has many local minima. As the transition region widens (SmeLU), the loss surface becomes smoother. If the transition is too wide, i.e., too smooth, the benefit of using a deep network wanes and we approach the linear model solution — the objective surface flattens, potentially losing the ability of the network to express much information.

Loss surfaces (as functions of a 2D input) for two sample loss functions (middle and right) as the activation function’s transition region widens, going from ReLU to an increasingly smoother SmeLU (left). The loss surface becomes smoother as the smoothness of the SmeLU function increases.


Performance
SmeLU has benefited multiple systems, specifically recommendation systems, increasing their reproducibility by reducing, for example, recommendation swap rates. While the use of SmeLU results in accuracy improvements over ReLU, it also replaces other costly methods to address irreproducibility, such as ensembles, which mitigate irreproducibility at the cost of accuracy. Moreover, replacing ensembles in sparse recommendation systems reduces the need for multiple lookups of model parameters that are needed to generate an inference for each of the ensemble components. This substantially improves training and inference efficiency.

To illustrate the benefits of smooth activations, we plot the relative prediction difference (PD) as a function of the change in evaluation loss for the different activations. We define relative PD as the ratio between the absolute difference in predictions of two models and their expected prediction, averaged over all evaluation examples. We have observed that in large-scale systems it is sufficient, and inexpensive, to consider only two models for very consistent results.
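Under one natural reading of this definition (per-example absolute difference divided by the mean of the two predictions, then averaged over examples), relative PD can be computed as in the sketch below; the example predictions are made up, and the exact averaging used in the paper may differ.

```python
import numpy as np

def relative_pd(pred_a, pred_b, eps=1e-8):
    """Relative prediction difference between two identically trained model copies:
    per-example |p_a - p_b| divided by the expected (mean) prediction, averaged."""
    pred_a, pred_b = np.asarray(pred_a, float), np.asarray(pred_b, float)
    expected = 0.5 * (pred_a + pred_b)
    return float(np.mean(np.abs(pred_a - pred_b) / (expected + eps)))

model_a = [0.10, 0.40, 0.85, 0.02]    # predicted probabilities from model copy A
model_b = [0.12, 0.38, 0.80, 0.02]    # the same model retrained under identical settings
print(relative_pd(model_a, model_b))  # ~0.07, i.e., ~7% average relative difference
```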

The figure below shows curves on the PD-accuracy loss plane. For reproducibility, being lower on the curve is better, and for accuracy, being on the left is better. Smooth activations can yield a ballpark 50% reduction in PD relative to ReLU, while still potentially resulting in improved accuracy. SmeLU yields accuracy comparable to other smooth activations, but is more reproducible (lower PD) while still outperforming ReLU in accuracy.

Relative PD as a function of percentage change in the evaluation ranking loss, which measures how accurately items are ranked in a recommendation system (higher values indicate worse accuracy), for different activations.


Conclusion and Future Work
We demonstrated the problem of irreproducibility in real-world practical systems, and how it affects users as well as system and model designers. While this particular issue has been given very little attention when trying to address the lack of replicability of research results, irreproducibility can be a critical problem. We demonstrated that a simple solution of using smooth activations can substantially reduce the problem without degrading other critical metrics like model accuracy. We also introduced a new smooth activation function, SmeLU, which has the added benefits of mathematical simplicity and ease of implementation, making it cheap and less error-prone to deploy.

Understanding reproducibility, especially in deep networks, where objectives are not convex, is an open problem. An initial theoretical framework for the simpler convex case has recently been proposed, but more research must be done to gain a better understanding of this problem which will apply to practical systems that rely on deep networks.

Acknowledgements
We would like to thank Sergey Ioffe for early discussions about SmeLU; Lorenzo Coviello and Angel Yu for help in early adoptions of SmeLU; Shiv Venkataraman for sponsorship of the work; Claire Cui for discussion and support from the very beginning; Jeremiah Willcock, Tom Jablin, and Cliff Young for substantial implementation support; Yuyan Wang, Mahesh Sathiamoorthy, Myles Sussman, Li Wei, Kevin Regan, Steven Okamoto, Qiqi Yan, Todd Phillips, Ed Chi, Sunita Verna, and many many others for many discussions, and for integrations in many different systems; Matt Streeter and Yonghui Wu for feedback on the paper and this post; Tom Small for help with the illustrations in this post.

Read More

Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

In recent years, large neural networks trained for language understanding and generation have achieved impressive results across a wide range of tasks. GPT-3 first showed that large language models (LLMs) can be used for few-shot learning and can achieve impressive results without large-scale task-specific data collection or model parameter updating. More recent LLMs, such as GLaM, LaMDA, Gopher, and Megatron-Turing NLG, achieved state-of-the-art few-shot results on many tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Yet much work remains in understanding the capabilities that emerge with few-shot learning as we push the limits of model scale.

Last year Google Research announced our vision for Pathways, a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators. In “PaLM: Scaling Language Modeling with Pathways”, we introduce the Pathways Language Model (PaLM), a 540-billion parameter, dense decoder-only Transformer model trained with the Pathways system, which enabled us to efficiently train a single model across multiple TPU v4 Pods. We evaluated PaLM on hundreds of language understanding and generation tasks, and found that it achieves state-of-the-art few-shot performance across most tasks, by significant margins in many cases.

As the scale of the model increases, the performance improves across tasks while also unlocking new capabilities.

Training a 540-Billion Parameter Language Model with Pathways
PaLM demonstrates the first large-scale use of the Pathways system to scale training to 6144 chips, the largest TPU-based system configuration used for training to date. The training is scaled using data parallelism at the Pod level across two Cloud TPU v4 Pods, while using standard data and model parallelism within each Pod. This is a significant increase in scale compared to most previous LLMs, which were either trained on a single TPU v3 Pod (e.g., GLaM, LaMDA), used pipeline parallelism to scale to 2240 A100 GPUs across GPU clusters (Megatron-Turing NLG) or used multiple TPU v3 Pods (Gopher) with a maximum scale of 4096 TPU v3 chips.

PaLM achieves a training efficiency of 57.8% hardware FLOPs utilization, the highest yet achieved for LLMs at this scale. This is due to a combination of the parallelism strategy and a reformulation of the Transformer block that allows for attention and feedforward layers to be computed in parallel, enabling speedups from TPU compiler optimizations.
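The reformulation replaces the usual serial block, y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))), with a parallel one, y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)), so the two sublayers can be computed from the same normalized input. The NumPy sketch below contrasts the two with a single attention head and toy dimensions; it is a schematic of the dataflow, not the PaLM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, d_ff = 8, 16, 64

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Toy single-head attention and feedforward weights.
Wq, Wk, Wv, Wo = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
W1, W2 = rng.normal(scale=d**-0.5, size=(d, d_ff)), rng.normal(scale=d_ff**-0.5, size=(d_ff, d))

def attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v @ Wo

def mlp(x):
    return np.maximum(x @ W1, 0.0) @ W2

def serial_block(x):    # standard formulation: attention output feeds the feedforward layer
    x = x + attention(layer_norm(x))
    return x + mlp(layer_norm(x))

def parallel_block(x):  # parallel formulation: both sublayers read the same normalized input
    h = layer_norm(x)
    return x + attention(h) + mlp(h)

x = rng.normal(size=(seq, d))
print(serial_block(x).shape, parallel_block(x).shape)  # both (8, 16)
```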

PaLM was trained using a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and GitHub code. We also created a “lossless” vocabulary that preserves all whitespace (especially important for code), splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.
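A toy illustration of the digit-splitting and byte-fallback rules is below; this is not the actual PaLM tokenizer, and the whitespace marker, vocabulary, and function are invented for the example.

```python
def toy_tokenize(text, vocab):
    """Illustrative pre-tokenization: keep whitespace explicit, split numbers into
    single digits, and fall back to UTF-8 bytes for anything outside the vocabulary."""
    tokens = []
    for piece in text.split(" "):
        if piece.isdigit():
            tokens.extend(list(piece))                  # "2048" -> "2", "0", "4", "8"
        elif piece in vocab:
            tokens.append(piece)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))  # byte fallback
        tokens.append("▁")                              # explicit whitespace marker
    return tokens[:-1]

print(toy_tokenize("cache size 2048 bytes", vocab={"cache", "size", "bytes"}))
# ['cache', '▁', 'size', '▁', '2', '0', '4', '8', '▁', 'bytes']
```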

Breakthrough Capabilities on Language, Reasoning, and Code Tasks
PaLM shows breakthrough capabilities on numerous very difficult tasks. We highlight a few examples for language understanding and generation, reasoning, and code-related tasks below.

Language Understanding and Generation
We evaluated PaLM on 29 widely-used English natural language processing (NLP) tasks. PaLM 540B surpassed the few-shot performance of prior large models, such as GLaM, GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA, on 28 of the 29 tasks, which span question-answering tasks (open-domain closed-book variant), cloze and sentence-completion tasks, Winograd-style tasks, in-context reading comprehension tasks, common-sense reasoning tasks, SuperGLUE tasks, and natural language inference tasks.

PaLM 540B performance improvement over prior state-of-the-art (SOTA) results on 29 English-based NLP tasks.

In addition to English NLP tasks, PaLM also shows strong performance on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English.

We also probe emerging and future capabilities of PaLM on the Beyond the Imitation Game Benchmark (BIG-bench), a recently released suite of more than 150 new language modeling tasks, and find that PaLM achieves breakthrough performance. We compare the performance of PaLM to Gopher and Chinchilla, averaged across a common subset of 58 of these tasks. Interestingly, we note that PaLM’s performance as a function of scale follows a log-linear behavior similar to prior models, suggesting that performance improvements from scale have not yet plateaued. PaLM 540B 5-shot also does better than the average performance of people asked to solve the same tasks.

Scaling behavior of PaLM on a subset of 58 BIG-bench tasks. 

PaLM demonstrates impressive natural language understanding and generation capabilities on several BIG-bench tasks. For example, the model can distinguish cause and effect, understand conceptual combinations in appropriate contexts, and even guess the movie from an emoji.

Examples that showcase PaLM 540B 1-shot performance on BIG-bench tasks: labeling cause and effect, conceptual understanding, guessing movies from emoji, and finding synonyms and counterfactuals.

Reasoning
By combining model scale with chain-of-thought prompting, PaLM shows breakthrough capabilities on reasoning tasks that require multi-step arithmetic or common-sense reasoning. Prior LLMs, like Gopher, saw less benefit from model scale in improving performance.

Standard prompting versus chain-of-thought prompting for an example grade-school math problem. Chain-of-thought prompting decomposes the prompt for a multi-step reasoning problem into intermediate steps (highlighted in yellow), similar to how a person would approach it.
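Concretely, a chain-of-thought few-shot prompt is just a set of exemplars whose answers spell out the intermediate steps, followed by the test question. The single exemplar below is made up and much shorter than the 8-shot prompts used in the evaluation; the test question is merely in the style of GSM8K.

```python
# One made-up exemplar; the actual evaluation uses 8 such exemplars before the test question.
cot_prompt = """Q: A baker made 24 muffins and sold 3 boxes of 4 muffins each. How many are left?
A: The baker sold 3 * 4 = 12 muffins. 24 - 12 = 12 muffins are left. The answer is 12.

Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
A:"""

# The model is expected to continue with intermediate steps before the final answer, e.g.
# "Natalia sold 48 / 2 = 24 clips in May. 48 + 24 = 72. The answer is 72."
print(cot_prompt)
```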

We observed strong performance from PaLM 540B combined with chain-of-thought prompting on three arithmetic datasets and two commonsense reasoning datasets. For example, with 8-shot prompting, PaLM solves 58% of the problems in GSM8K, a benchmark of thousands of challenging grade school level math questions, outperforming the prior top score of 55% achieved by fine-tuning the GPT-3 175B model with a training set of 7500 problems and combining it with an external calculator and verifier.

This new score is especially interesting, as it approaches the 60% average of problems solved by 9-12 year olds, who are the target audience for the question set. We suspect that separate encoding of digits in the PaLM vocabulary helps enable these performance improvements.

Remarkably, PaLM can even generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. For example, it can provide high quality explanations for novel jokes not found on the web.

PaLM explains an original joke with two-shot prompts.

Code Generation
LLMs have also been shown [1, 2, 3, 4] to generalize well to coding tasks, such as writing code given a natural language description (text-to-code), translating code from one language to another, and fixing compilation errors (code-to-code).

PaLM 540B shows strong performance across coding tasks and natural language tasks in a single model, even though it has only 5% code in the pre-training dataset. Its few-shot performance is especially remarkable because it is on par with the fine-tuned Codex 12B while using 50 times less Python code for training. This result reinforces earlier findings that larger models can be more sample efficient than smaller models because they better transfer learning from other programming languages and natural language data.

Examples of a fine-tuned PaLM 540B model on text-to-code tasks, such as GSM8K-Python and HumanEval, and code-to-code tasks, such as Transcoder.

We also see a further increase in performance by fine-tuning PaLM on a Python-only code dataset, which we refer to as PaLM-Coder. For an example code repair task called DeepFix, where the objective is to modify initially broken C programs until they compile successfully, PaLM-Coder 540B demonstrates impressive performance, achieving a compile rate of 82.1%, which outperforms the prior 71.7% state of the art. This opens up opportunities for fixing more complex errors that arise during software development.

An example from the DeepFix Code Repair task. The fine-tuned PaLM-Coder 540B fixes compilation errors (left, in red) to a version of code that compiles (right).

Ethical Considerations
Recent research has highlighted various potential risks associated with LLMs trained on web text. It is crucial to analyze and document such potential undesirable risks through transparent artifacts such as model cards and datasheets, which also include information on intended use and testing. To this end, our paper provides a datasheet, model card and Responsible AI benchmark results, and it reports thorough analyses of the dataset and model outputs for biases and risks. While the analysis helps outline some potential risks of the model, domain- and task-specific analysis is essential to truly calibrate, contextualize, and mitigate possible harms. Further understanding of risks and benefits of these models is a topic of ongoing research, together with developing scalable solutions that can put guardrails against malicious uses of language models.

Conclusion and Future Work
PaLM demonstrates the scaling capability of the Pathways system to thousands of accelerator chips across two TPU v4 Pods by training a 540-billion parameter model efficiently with a well-studied, well-established recipe of a dense decoder-only Transformer model. Pushing the limits of model scale enables breakthrough few-shot performance of PaLM across a variety of natural language processing, reasoning, and code tasks.

PaLM paves the way for even more capable models by combining the scaling capabilities with novel architectural choices and training schemes, and brings us closer to the Pathways vision:

“Enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data, and to do so with remarkable efficiency.”

Acknowledgements
PaLM is the result of a large, collaborative effort by many teams within Google Research and across Alphabet. We’d like to thank the entire PaLM team for their contributions: Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Mishra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, and Jason Wei. PaLM builds on top of work by many, many teams at Google and we would especially like to recognize the T5X team, the Pathways infrastructure team, the JAX team, the Flaxformer team, the XLA team, the Plaque team, the Borg team, and the Datacenter networking infrastructure team. We’d like to thank our co-authors on this blog post, Alexander Spiridonov and Maysam Moussalem, as well as Josh Newlan and Tom Small for the images and animations in this blog post. Finally, we would like to thank our advisors for the project: Noah Fiedel, Slav Petrov, Jeff Dean, Douglas Eck, and Kathy Meier-Hellstern.

Read More

Introducing CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

Automatic translation of speech from one language to speech in another language, called speech-to-speech translation (S2ST), is important for breaking down the communication barriers between people speaking different languages. Conventionally, automatic S2ST systems are built with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, so that the system overall is text-centric. Recently, work on S2ST that doesn’t rely on intermediate text representation is emerging, such as end-to-end direct S2ST (e.g., Translatotron) and cascade S2ST based on learned discrete representations of speech (e.g., Tjandra et al.). While early versions of such direct S2ST systems obtained lower translation quality compared to cascade S2ST models, they are gaining traction as they have the potential both to reduce translation latency and compounding errors, and to better preserve paralinguistic and non-linguistic information from the original speech, such as voice, emotion, tone, etc. However, such models usually have to be trained on datasets with paired S2ST data, but the public availability of such corpora is extremely limited.

To foster research on such a new generation of S2ST, we introduce a Common Voice-based Speech-to-Speech translation corpus, or CVSS, which includes sentence-level speech-to-speech translation pairs from 21 languages into English. Unlike existing public corpora, CVSS can be directly used for training such direct S2ST models without any extra processing. In “CVSS Corpus and Massively Multilingual Speech-to-Speech Translation”, we describe the dataset design and development, and demonstrate the effectiveness of the corpus through training of baseline direct and cascade S2ST models and showing performance of a direct S2ST model that approaches that of a cascade S2ST model.

Building CVSS
CVSS is directly derived from the CoVoST 2 speech-to-text (ST) translation corpus, which is further derived from the Common Voice speech corpus. Common Voice is a massively multilingual transcribed speech corpus designed for ASR in which the speech is collected by contributors reading text content from Wikipedia and other text corpora. CoVoST 2 further provides professional text translation for the original transcript from 21 languages into English and from English into 15 languages. CVSS builds on these efforts by providing sentence-level parallel speech-to-speech translation pairs from 21 languages into English (shown in the table below).

To facilitate research with different focuses, CVSS provides two versions of the English translation speech, both synthesized using state-of-the-art TTS systems, with each version offering unique value that doesn’t exist in other public S2ST corpora:

  • CVSS-C: All the translation speech is in a single canonical speaker’s voice. Despite being synthetic, the speech is highly natural, clean, and consistent in speaking style. These properties ease the modeling of the target speech and enable trained models to produce high quality translation speech suitable for general user-facing applications where speech quality is of higher importance than accurately reproducing the speakers’ voices.
  • CVSS-T: The translation speech captures the voice from the corresponding source speech. Each S2ST pair has a similar voice on the two sides, despite being in different languages. Because of this, the dataset is suitable for building models where accurate voice preservation is desired, such as for movie dubbing.

Together with the source speech, the two S2ST datasets contain 1,872 and 1,937 hours of speech, respectively.

Source Language   Code   Source speech (X)   CVSS-C target speech (En)   CVSS-T target speech (En)
French            fr               309.3                       200.3                       222.3
German            de               226.5                       137.0                       151.2
Catalan           ca               174.8                       112.1                       120.9
Spanish           es               157.6                        94.3                       100.2
Italian           it                73.9                        46.5                        49.2
Persian           fa                58.8                        29.9                        34.5
Russian           ru                38.7                        26.9                        27.4
Chinese           zh                26.5                        20.5                        22.1
Portuguese        pt                20.0                        10.4                        11.8
Dutch             nl                11.2                         7.3                         7.7
Estonian          et                 9.0                         7.3                         7.1
Mongolian         mn                 8.4                         5.1                         5.7
Turkish           tr                 7.9                         5.4                         5.7
Arabic            ar                 5.8                         2.7                         3.1
Latvian           lv                 4.9                         2.6                         3.1
Swedish           sv                 4.3                         2.3                         2.8
Welsh             cy                 3.6                         1.9                         2.0
Tamil             ta                 3.1                         1.7                         2.0
Indonesian        id                 3.0                         1.6                         1.7
Japanese          ja                 3.0                         1.7                         1.8
Slovenian         sl                 2.9                         1.6                         1.9
Total                            1,153.2                       719.1                       784.2
Amount of source and target speech of each X-En pair in CVSS (hours).

In addition to the translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (covering numbers, currencies, acronyms, etc.; see the data samples below, where, e.g., “100%” is normalized as “one hundred percent” and “King George II” is normalized as “king george the second”), which can benefit both model training and evaluation standardization.
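
As a loose, illustrative approximation of this kind of normalization (the actual CVSS normalization comes from a TTS front-end and handles far more cases), a minimal sketch could lowercase the text, expand numbers into words, and strip punctuation; the num2words package and the rules below are assumptions for this example, not the CVSS pipeline:

    import re

    from num2words import num2words  # assumed third-party helper for spelling out numbers

    def rough_normalize(text: str) -> str:
        """Toy approximation of speech-style text normalization.

        Lowercases the text, expands "%" and standalone integers, and strips
        punctuation. The real CVSS normalization handles many more cases
        (currencies, acronyms, ordinals such as "King George II").
        """
        text = text.lower()
        text = text.replace("%", " percent")
        # Spell out standalone integers, e.g. "100" -> "one hundred".
        text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
        # Keep word characters, spaces, and apostrophes; drop other punctuation.
        text = re.sub(r"[^\w\s']", " ", text)
        return " ".join(text.split())

    print(rough_normalize("The musical genre of the song is 100% Disco."))
    # -> the musical genre of the song is one hundred percent disco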

CVSS is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and it can be freely downloaded online.

Data Samples

Example 1:
Source audio (French): [audio clip]
Source transcript (French): Le genre musical de la chanson est entièrement le disco.
CVSS-C translation audio (English): [audio clip]
CVSS-T translation audio (English): [audio clip]
Translation text (English): The musical genre of the song is 100% Disco.
Normalized translation text (English): the musical genre of the song is one hundred percent disco

Example 2:
Source audio (Chinese): [audio clip]
Source transcript (Chinese): 弗雷德里克王子,英国王室成员,为乔治二世之孙,乔治三世之幼弟。
CVSS-C translation audio (English): [audio clip]
CVSS-T translation audio (English): [audio clip]
Translation text (English): Prince Frederick, member of British Royal Family, Grandson of King George II, brother of King George III.
Normalized translation text (English): prince frederick member of british royal family grandson of king george the second brother of king george the third

Baseline Models
On each version of CVSS, we trained a baseline cascade S2ST model as well as two baseline direct S2ST models and compared their performance. These baselines can be used for comparison in future research.

Cascade S2ST: To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU over all 21 language pairs (detailed in the paper) when trained on the corpus without using extra data. This ST model is connected to the same TTS models used for constructing CVSS to compose very strong cascade S2ST baselines (ST → TTS).

Direct S2ST: We built two baseline direct S2ST models using Translatotron and Translatotron 2. When trained from scratch with CVSS, the translation quality from Translatotron 2 (8.7 BLEU) approaches that of the strong cascade S2ST baseline (10.6 BLEU). Moreover, when both use pre-training, the gap decreases to only 0.7 BLEU on ASR-transcribed translations. These results verify the effectiveness of using CVSS to train direct S2ST models.

Translation quality of baseline direct and cascade S2ST models built on CVSS-C, measured by BLEU on ASR transcription from speech translation. The pre-training was done on CoVoST 2 without other extra data sets.
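
As a rough sketch of this evaluation protocol (the exact setup in the paper may differ), one would transcribe the translated speech with an ASR system and score the transcripts against the reference translation text with corpus-level BLEU; sacrebleu is assumed here, and asr_transcribe is a hypothetical stand-in for whatever ASR model is used:

    import sacrebleu  # assumed third-party BLEU implementation

    def bleu_on_asr_transcripts(translated_audio, reference_texts, asr_transcribe):
        """Score S2ST output by transcribing the translated speech with ASR
        and comparing the transcripts against reference translation text.

        asr_transcribe is a hypothetical callable mapping one audio clip to a
        transcript string; any ASR system could be plugged in here.
        """
        hypotheses = [asr_transcribe(audio) for audio in translated_audio]
        # corpus_bleu expects a list of hypotheses and a list of reference lists.
        return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score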

Conclusion
We have released two versions of multilingual-to-English S2ST datasets, CVSS-C and CVSS-T, each with about 1.9K hours of sentence-level parallel S2ST pairs, covering 21 source languages. The translation speech in CVSS-C is in a single canonical speaker’s voice, while the translation speech in CVSS-T is in voices transferred from the source speech. Each of these datasets provides unique value not found in other public S2ST corpora.

We built baseline multilingual direct S2ST models and cascade S2ST models on both datasets, which can be used for comparison in future work. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU when trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches that of the strong cascade baselines when trained from scratch, with only a 0.7 BLEU difference on ASR-transcribed translations when pre-training is used. We hope this work helps accelerate research on direct S2ST.

Acknowledgments
We acknowledge the volunteer contributors and the organizers of the Common Voice and LibriVox projects for their contributions and collection of recordings, and the creators of the Common Voice, CoVoST, CoVoST 2, LibriSpeech and LibriTTS corpora for their previous work. The direct contributors to the CVSS corpus and the paper include Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, and Heiga Zen. We also thank Ankur Bapna, Yiling Huang, Jason Pelecanos, Colin Cherry, Alexis Conneau, Yonghui Wu, Hadar Shemtov and Françoise Beaufays for helpful discussions and support.

“Lift as you lead”: Meet 2 women defining responsible AI

At Google, Marian Croak’s technical research team, The Center for Responsible AI and Human-Centered Technology, and Jen Gennai’s operations and governance team, Responsible Innovation, collaborate often on creating a fairer future for AI systems.

The teams complement each other to support computer scientists, UX researchers and designers, product managers and subject matter experts in the social sciences, human rights and civil rights. Collectively, their teams include more than 200 people around the globe focused on putting our AI Principles – Google’s ethical charter – into practice.

“The intersection of AI systems and society is a critical area of my team’s technical research,” Marian says. “Our approach includes working directly with people who use and are impacted by AI systems. Working together with Jen’s central operations team, the idea is to make AI more useful and reduce potential harm before products launch.”

For Women’s History Month, we wanted to talk to them both about this incredibly meaningful work and how they bring their lived experiences to it.

How do you define “responsible AI”?

Marian: It’s the technical realization of our AI Principles. We need to understand how AI systems are performing in respect to fairness, transparency, interpretability, robustness and privacy. When gaps occur, we fix them. We benchmark and evaluate how product teams are adopting what Jen and I call smart practices. These are trusted practices based on patterns we see across Google as we’re developing new AI applications, and the data-driven results of applying these practices over time.

Jen: There are enormous opportunities to use AI for positive impact — and the potential for harm, too. The key is ethical deployment. “Responsible AI” for me means taking deliberate steps to ensure technology works the way it’s intended to and doesn’t lead to malicious or unintended negative consequences. This involves applying the smart practices Marian mentioned through repeatable processes and a governance structure for accountability.

How do your teams work together?

Marian: They work hand in hand. My team conducts scientific research and creates open source tools like Fairness Indicators and Know Your Data. A large portion of our technical research and product work is centered in societal context and human and civil rights, so Jen’s team is integral to understanding the problems we seek to help solve.

Jen: The team I lead defines Google policies, handles day-to-day operations and central governance structure, and conducts ethical assessments. We’re made up of user researchers, social scientists, ethicists, human rights specialists, policy and privacy advisors and legal experts.

One team can’t work without the other! This complementary relationship allows many different perspectives and lived experiences to inform product design decisions. Here’s an example, which was led by women from a variety of global backgrounds: Marian’s team designed a streamlined, open source format for documenting technical details of datasets, called data cards. When researchers on the Translate team, led by product manager Romina Stella, recently developed a new dataset for studying and preventing gender bias in machine learning, members of my team, Anne P., N’Mah Y. and Reena Jana, reviewed the dataset for alignment with the AI Principles. They recommended that the Translate researchers publish a data card for details on how the dataset was created and tested. The Translate team then worked with UX designer Mahima Pushkarna on Marian’s team to create and launch the card alongside the dataset.

How did you end up working in this very new field?

Marian: I’ve always been drawn to hard problems. This is a very challenging area! It’s so multifaceted and constantly evolving. That excites me. It’s an honor to work with so many passionate people who care so deeply about our world and understanding how to use technology for social good.

I’ll always continue to seek out solutions to these problems because I understand the profound impact this work will have on our society and our world, especially communities underrepresented in the tech industry.

Jen: I spent many years leading User Research and User Advocacy on Google’s Trust and Safety team. An area I focused on was ML Fairness. I never thought I’d get to work on it full time. But in 2016 my leadership team wanted to have a company-wide group concentrating on worldwide positive social benefits of AI. In 2017, I joined the team that was writing and publishing the AI Principles. Today, I apply my operational knowledge to make sure that as a company, we meet the obligations we laid out in the Principles.

What advice do you have for girls and women interested in pursuing careers in responsible tech?

Marian: I’m inspired most when someone tells me I can’t do something. No matter what obstacles you face, believe you have the skills, the knowledge and the passion to make your dreams come true. Find motivation in the small moments, find motivation in those who doubt you, but most importantly, never forget to believe in the greatness of you.

Jen: Don’t limit yourself even if you don’t have a computer science degree. I don’t. I was convinced I’d work in sustainability and environmental non-profits, and now I lead a team working to make advanced technologies work better for everyone. This space requires so many different skills, whether in program management, policy, engineering, UX or business and strategy.

My mantra is “lift as you lead.” Don’t just build a network for yourself; build a supportive network to empower everyone who works with you — and those who come after you, especially those who are currently underrepresented in the tech sector. Your collective presence in this space makes a positive impact! And it’s even stronger when you build a better future together.

Go with the flow state: What music and AI have in common

Carrie Cai, Ben Zevenbergen and Johnny Soraker all work on developing artificial intelligence (AI) responsibly at Google, in the larger research community and across the technology industry. Carrie is a research scientist focusing on human-AI interaction, Ben is an ethicist and policy advisor and Johnny is an AI Principles ethicist. They all work within a global team of experts from a variety of fields, including the social sciences and humanities, focused on the ethical development of AI. They’re driven to make systems that are fair, inclusive and focused on people.

But they have more than their work in common: They’re all accomplished musicians who’ve studied music, composed and published pieces and even played at the professional level. We wanted to know more about their musical backgrounds, and how this creative interest informs their work building AI systems that take everyone into account.

What instrument — or instruments — do you play?

Ben: Guitar, bass and drums.

Johnny: Mainly drums these days, but I’ve also done ambient and electronica.

Carrie: I play piano and I also compose music.

Where did your interest in playing music come from?

Ben: I grew up in a musical family where instruments were always lying around. My parents’ friends would bring their instruments when they came to visit and our house would turn into a music venue. I enrolled in a music degree in my late teens to become a professional drummer. Then, a year later, I serendipitously became a bassist: I went to law school in the Netherlands, and the university band already had someone who was a better drummer than I was — but they needed a bassist, so I grabbed the opportunity.

Carrie: I started out in the Yamaha music program when I was six, where rather than learning technical piano playing skills you focus on ear training, hearing the music and how to play as an ensemble. I think that foundation led me to be a lot more creative with my music than I would have been otherwise. I spent part of my childhood years composing music, too — here are some of my early compositions from my high school days!

Johnny: I’ve played lots of instruments since I was a child, but never had the tenacity to get very good at any of them. Perhaps as a result of this, I got involved with a highly experimental ambient scene in the early 2000s and started the one-man project Metus Mortuus, using samples and DIY equipment to create often disturbing soundscapes. It was really only when I got hooked on the video game “Rock Band,” where you play “fake” instruments along with the notation on screen, that I put in the hours needed to get some basic limb independence and with that a platform for learning real drums.

Did you gravitate toward the drums in the game?

Johnny: No, I hardly ever touched them — I simply couldn’t make my left arm do something my right arm wasn’t doing, but one day I decided to try an experiment: Can I make these stale neural pathways of mine actually learn something new well into adulthood? I started practicing on these toy drums every day, which was painful and frustrating, but occasional breakthroughs kept me going. Eventually I achieved a level of limb independence I hadn’t thought I was capable of. I invested in proper e-drums and I’ve played almost every day since.

What’s your favorite thing about playing?

Johnny: It’s really the ultimate flow experience, where you’re fully immersed in an activity to the extent you lose track of time and only focus on the present moment. There’s lots of empirical research in the field of positive psychology suggesting that regular flow experiences promote better well-being.

Ben: I love playing the bass with a band because it’s the glue between the rhythm section and the melody sections. It’s fun when you purposefully come in a beat later, you really see people not sure whether to dance or not. When you start playing, suddenly the whole audience understands what’s going on. And then they have the audacity to say they never hear the bass!

How has music made its way into your work, if at all?

Carrie: It’s certainly affected how I think about my work today, particularly around how to make AI more controllable to everyday people. For example, we’re now seeing these new, highly capable generative AI models that can compose music sounding indistinguishable from something written by Bach. But we discovered that, just because an AI model is capable of making beautiful music, doesn’t mean humans can always make beautiful music using AI.

When I create music, I’m thinking, “I want the beginning of the song to sound cheerful, then I want it to build tension, before ending in a somber way.” When I’m creating with AI, it can be difficult to express that — I can’t easily say, “Hey AI, make the first part happy and then build tension here.” This can make it difficult for people to feel a sense of artistic agency and authorship when they’re creating any kind of content with AI.

Recently, I collaborated with other human-computer interaction (HCI) and machine learning (ML) researchers at Google to create new tools enabling people to steer and control AI as they compose music with it. We found that these “steering tools” significantly boost users’ sense of creative ownership and trust as they compose with AI.

Do you think there’s anything about the sort of work you do that exercises the same sort of “mental muscles” as music does?

Johnny: Yes, I think the key to ethics — and especially ethics of AI where there often is no precedent — is to be able to approach a problem from different angles and draw connections between the case at hand and relevant, similar cases from the past. This requires you to think creatively. And I feel that the way in which drumming almost literally rewired my brain has made me much better at doing that.

Ben: When you learn to play the drums, one of the hardest things is learning you must separate the movements of your limbs in your mind. It’s pretty difficult to process — which makes it a very nice experience once your mind can asynchronously control parts of your thinking to create interesting rhythms that are still in time. I think for my work on ethics of technical design, I have to frequently understand many interacting but very different disciplines. I’m not sure if it has anything to do with drumming, but I find that I can think about these things in tandem, while they are completely different.

Carrie: I remember once when I was little, I woke up and without even changing out of my pajamas, spent the entire day composing a piece of music. I realize now that that was a flow state — I was working on something that was challenging yet doable. I think that’s a key property of creativity and it’s affected how I work in general. It’s easiest for me to be productive when I’m in that state — working on something that’s challenging, but not so difficult that I won’t want to start it or keep going. That’s helpful in research because there’s so much uncertainty — you never know if your experiments are going to work! But I can take a lesson from how I got into that flow state with music and apply it to research: How can I as a research scientist enter a flow state?

Detecting Signs of Disease from External Images of the Eye

Three years ago we wrote about our work on predicting a number of cardiovascular risk factors from fundus photos (i.e., photos of the back of the eye)[1] using deep learning. That such risk factors could be extracted from fundus photos was a novel discovery and thus a surprising outcome to clinicians and laypersons alike. Since then, we and other researchers have discovered additional novel biomarkers from fundus photos, such as markers for chronic kidney disease and diabetes, and hemoglobin levels to detect anemia.

A unifying goal of work like this is to develop new disease detection or monitoring approaches that are less invasive, more accurate, cheaper and more readily available. However, one obstacle to broad, population-level application of efforts to extract biomarkers from fundus photos is obtaining the fundus photos themselves, which requires specialized imaging equipment and a trained technician.

The eye can be imaged in multiple ways. A common approach for diabetic retinal disease screening is to examine the posterior segment using fundus photographs (left), which have been shown to contain signals of kidney and heart disease, as well as anemia. Another way is to take photographs of the front of the eye (external eye photos; right), which is typically used to track conditions affecting the eyelids, conjunctiva, cornea, and lens.

In “Detection of signs of disease in external photographs of the eyes via deep learning”, in press at Nature Biomedical Engineering, we show that a deep learning model can extract potentially useful biomarkers from external eye photos (i.e., photos of the front of the eye). In particular, for diabetic patients, the model can predict the presence of diabetic retinal disease, elevated HbA1c (a biomarker of diabetic blood sugar control and outcomes), and elevated blood lipids (a biomarker of cardiovascular risk). External eye photos as an imaging modality are particularly interesting because their use may reduce the need for specialized equipment, opening the door to various avenues of improving the accessibility of health screening.

Developing the Model
To develop the model, we used de-identified data from over 145,000 patients from a teleretinal diabetic retinopathy screening program. We trained a convolutional neural network both on these images and on the corresponding ground truth for the variables we wanted the model to predict (i.e., whether the patient has diabetic retinal disease, elevated HbA1c, or elevated lipids) so that the neural network could learn from these examples. After training, the model is able to take external eye photos as input and then output predictions for whether the patient has diabetic retinal disease, or elevated sugars or lipids.

A schematic showing the model generating predictions for an external eye photo.
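
The post does not give the exact architecture, so the following is only a minimal sketch of the general setup: a shared convolutional backbone over external eye photos with one sigmoid head per binary prediction target. The ResNet-50 backbone, input size, and head names are assumptions for illustration, not the model from the paper.

    import tensorflow as tf

    def build_eye_photo_model(image_size=(256, 256, 3)):
        """Minimal sketch: shared CNN backbone with one sigmoid head per target.

        The backbone, input size, and head names are illustrative assumptions,
        not the architecture used in the paper.
        """
        inputs = tf.keras.Input(shape=image_size)
        backbone = tf.keras.applications.ResNet50(
            include_top=False, weights=None, pooling="avg")
        features = backbone(inputs)
        outputs = {
            name: tf.keras.layers.Dense(1, activation="sigmoid", name=name)(features)
            for name in ("retinal_disease", "elevated_hba1c", "elevated_lipids")
        }
        model = tf.keras.Model(inputs=inputs, outputs=outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model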

We measured model performance using the area under the receiver operating characteristic curve (AUC), which quantifies how frequently the model assigns higher scores to patients who are truly positive than to patients who are truly negative (i.e., a perfect model scores 100%, compared to 50% for random guesses). The model detected various forms of diabetic retinal disease with AUCs of 71-82%, elevated HbA1c with AUCs of 67-70%, and elevated lipids with AUCs of 57-68%. These results indicate that, though imperfect, external eye photos can help detect and quantify various parameters of systemic health.
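
To make that pairwise interpretation of AUC concrete, here is a small illustrative sketch (not code from the paper): compare every truly-positive patient's score with every truly-negative patient's score, count the correctly ordered pairs (with half credit for ties), and check that the result matches scikit-learn's roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def pairwise_auc(scores, labels):
        """AUC as the fraction of positive/negative pairs ranked correctly."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=bool)
        pos, neg = scores[labels], scores[~labels]
        wins = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        return (wins + 0.5 * ties) / (len(pos) * len(neg))

    scores = [0.9, 0.8, 0.3, 0.2]   # toy model outputs
    labels = [1, 0, 1, 0]           # toy ground truth
    print(pairwise_auc(scores, labels))    # 0.75
    print(roc_auc_score(labels, scores))   # 0.75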

Much like the CDC’s pre-diabetes screening questionnaire, external eye photos may be able to help “pre-screen” people and identify those who may benefit from further confirmatory testing. If we sort all patients in our study based on their predicted risk and look at the top 5% of that list, 69% of those patients had HbA1c measurements ≥ 9 (indicating poor blood sugar control for patients with diabetes). For comparison, among patients who are at highest risk according to a risk score based on demographics and years with diabetes, only 55% had HbA1c ≥ 9, and among patients selected at random only 33% had HbA1c ≥ 9.
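
As a hedged sketch of this kind of "pre-screening" analysis (the arrays and threshold handling here are hypothetical, not the study's analysis code), one would sort patients by the model's predicted risk, take the top 5%, and compute the fraction whose measured HbA1c is at or above 9:

    import numpy as np

    def hit_rate_in_top_fraction(predicted_risk, hba1c, fraction=0.05, threshold=9.0):
        """Fraction of patients with HbA1c >= threshold among the top-risk slice."""
        predicted_risk = np.asarray(predicted_risk, dtype=float)
        hba1c = np.asarray(hba1c, dtype=float)
        k = max(1, int(len(predicted_risk) * fraction))
        top_idx = np.argsort(-predicted_risk)[:k]   # indices of highest predicted risk
        return float(np.mean(hba1c[top_idx] >= threshold))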

Assessing Potential Bias
We emphasize that this is promising, yet early, proof-of-concept research showcasing a novel discovery. That said, because we believe that it is important to evaluate potential biases in the data and model, we undertook a multi-pronged approach for bias assessment.

First, we conducted various explainability analyses aimed at discovering what parts of the image contribute most to the algorithm’s predictions (similar to our anemia work). Both saliency analyses (which examine which pixels most influenced the predictions) and ablation experiments (which examine the impact of removing various image regions) indicate that the algorithm is most influenced by the center of the image (the areas of the sclera, iris, and pupil of the eye, but not the eyelids). This is demonstrated below where one can see that the AUC declines much more quickly when image occlusion starts in the center (green lines) than when it starts in the periphery (blue lines).

Explainability analysis shows that (top) all predictions focused on different parts of the eye, and that (bottom) occluding the center of the image (corresponding to parts of the eyeball) has a much greater effect than occluding the periphery (corresponding to the surrounding structures, such as eyelids), as shown by the green line’s steeper decline. The “baseline” is a logistic regression model that takes self-reported age, sex, race and years with diabetes as input.
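
A rough sketch of this kind of occlusion (ablation) experiment, assuming a generic model exposing a predict method and images of shape (n, height, width, channels), might mask a growing region starting either at the center or at the image edge and track how the AUC degrades:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def occlusion_auc_curve(model, images, labels, radii, occlude="center"):
        """AUC after masking progressively larger regions of every image.

        model is assumed to expose predict(images) -> scores. With
        occlude="center" the mask is a disk growing outward from the center;
        with occlude="periphery" it is a ring growing inward from the edge.
        """
        _, h, w, _ = images.shape
        yy, xx = np.ogrid[:h, :w]
        dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
        aucs = []
        for r in radii:
            if occlude == "center":
                mask = dist <= r                  # disk growing from the center
            else:
                mask = dist >= dist.max() - r     # ring growing from the edge
            occluded = images.copy()
            occluded[:, mask] = 0.0               # zero out the masked pixels
            scores = np.ravel(model.predict(occluded))
            aucs.append(roc_auc_score(labels, scores))
        return aucs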

Second, our development dataset spanned a diverse set of locations within the U.S., encompassing over 300,000 de-identified photos taken at 301 diabetic retinopathy screening sites. Our evaluation datasets comprised over 95,000 images from 198 sites in 18 US states, including datasets of predominantly Hispanic or Latino patients, a dataset of majority Black patients, and a dataset that included patients without diabetes. We conducted extensive subgroup analyses across groups of patients with different demographic and physical characteristics (such as age, sex, race and ethnicity, presence of cataract, pupil size, and even camera type), and controlled for these variables as covariates. The algorithm was more predictive than the baseline in all subgroups after accounting for these factors.
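
As a loose sketch of such a subgroup comparison (the DataFrame layout and column names are hypothetical, and in practice the demographics-only baseline would be fit on training data rather than on the evaluation set), one could compare per-subgroup AUCs of the image model against the logistic regression baseline described above:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def subgroup_auc_comparison(df, subgroup_col, label_col="elevated_hba1c",
                                model_score_col="model_score"):
        """Per-subgroup AUC for the image model vs. a demographics-only baseline.

        df is a hypothetical DataFrame with one row per patient containing the
        image model's score, a binary label, the covariates below, and a
        subgroup column (e.g., camera type or self-reported race and ethnicity).
        """
        covariates = ["age", "sex", "race", "years_with_diabetes"]
        X = pd.get_dummies(df[covariates], drop_first=True)
        baseline = LogisticRegression(max_iter=1000).fit(X, df[label_col])
        df = df.assign(baseline_score=baseline.predict_proba(X)[:, 1])
        rows = []
        for group, sub in df.groupby(subgroup_col):
            rows.append({
                subgroup_col: group,
                "model_auc": roc_auc_score(sub[label_col], sub[model_score_col]),
                "baseline_auc": roc_auc_score(sub[label_col], sub["baseline_score"]),
            })
        return pd.DataFrame(rows)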

Conclusion
This exciting work demonstrates the feasibility of extracting useful health related signals from external eye photographs, and has potential implications for the large and rapidly growing population of patients with diabetes or other chronic diseases. There is a long way to go to achieve broad applicability, for example understanding what level of image quality is needed, generalizing to patients with and without known chronic diseases, and understanding generalization to images taken with different cameras and under a wider variety of conditions, like lighting and environment. In continued partnership with academic and nonacademic experts, including EyePACS and CGMH, we look forward to further developing and testing our algorithm on larger and more comprehensive datasets, and broadening the set of biomarkers recognized (e.g., for liver disease). Ultimately we are working towards making non-invasive health and wellness tools more accessible to everyone.

Acknowledgements
This work involved the efforts of a multidisciplinary team of software engineers, researchers, clinicians and cross functional contributors. Key contributors to this project include: Boris Babenko, Akinori Mitani, Ilana Traynis, Naho Kitade‎, Preeti Singh, April Y. Maa, Jorge Cuadros, Greg S. Corrado, Lily Peng, Dale R. Webster, Avinash Varadarajan‎, Naama Hammel, and Yun Liu. The authors would also like to acknowledge Huy Doan, Quang Duong, Roy Lee, and the Google Health team for software infrastructure support and data collection. We also thank Tiffany Guo, Mike McConnell, Michael Howell, and Sam Kavusi for their feedback on the manuscript. Last but not least, gratitude goes to the graders who labeled data for the pupil segmentation model, and a special thanks to Tom Small for the ideation and design that inspired the animation used in this blog post.


[1] The information presented here is research and does not reflect a product that is available for sale. Future availability cannot be guaranteed.
