Differentially private heatmaps

Recently, differential privacy (DP) has emerged as a mathematically robust notion of user privacy for data aggregation and machine learning (ML), with practical deployments including the 2020 US Census and various applications in industry. Over the last few years, we have open-sourced libraries for privacy-preserving analytics and ML and have been constantly enhancing their capabilities. Meanwhile, new algorithms have been developed by the research community for several analytic tasks involving private aggregation of data.

One such important data aggregation method is the heatmap. Heatmaps are popular for visualizing aggregated data in two or more dimensions. They are widely used in many fields including computer vision, image processing, spatial data analysis, bioinformatics, and more. Protecting the privacy of user data is critical for many applications of heatmaps. For example, heatmaps for gene microdata are based on private data from individuals. Similarly, a heatmap of popular locations in a geographic area is based on user location check-ins that need to be kept private.

Motivated by such applications, in “Differentially Private Heatmaps” (presented at AAAI 2023), we describe an efficient DP algorithm for computing heatmaps with provable guarantees and evaluate it empirically. At the core of our DP algorithm for heatmaps is a solution to the basic problem of how to privately aggregate sparse input vectors (i.e., input vectors with a small number of non-zero coordinates) with a small error as measured by the Earth Mover’s Distance (EMD). Using a hierarchical partitioning procedure, our algorithm views each input vector, as well as the output heatmap, as a probability distribution over a number of items equal to the dimension of the data. For the problem of sparse aggregation under EMD, we give an efficient algorithm with error asymptotically close to the best possible.

Algorithm description

Our algorithm works by privatizing the aggregated distribution (obtained by averaging over all user inputs), which is sufficient for computing a final heatmap that is private due to the post-processing property of DP. This property ensures that any transformation of the output of a DP algorithm remains differentially private. Our main contribution is a new privatization algorithm for the aggregated distribution, which we will describe next.

The EMD measure, which is a distance-like measure of dissimilarity between two probability distributions originally proposed for computer vision tasks, is well-suited for heatmaps since it takes the underlying metric space into account and considers “neighboring” bins. EMD is used in a variety of applications including deep learning, spatial analysis, human mobility, image retrieval, face recognition, visual tracking, shape matching, and more.
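To make the metric concrete, here is a minimal example (ours, not from the paper) that computes the EMD between two one-dimensional histograms with SciPy; the bins and probabilities are illustrative, and the paper works with EMD over a two-dimensional grid.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D EMD

# Two toy probability distributions over the same 1-D grid of bins.
bins = np.arange(8)  # bin coordinates (the underlying metric space)
p = np.array([0.0, 0.1, 0.4, 0.3, 0.1, 0.1, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.1, 0.4, 0.3, 0.1, 0.1, 0.0])  # p shifted one bin right

# EMD weights the mismatch by how far mass has to move between bins,
# so a small shift yields a small distance.
print(wasserstein_distance(bins, bins, u_weights=p, v_weights=q))  # ~1.0
```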

To achieve DP, we need to add noise to the aggregated distribution. We would also like to preserve statistics at different scales of the grid to minimize the EMD error. So, we create a hierarchical partitioning of the grid, add noise at each level, and then recombine into the final DP aggregated distribution. In particular, the algorithm has the following steps:

  1. Quadtree construction: Our hierarchical partitioning procedure first divides the grid into four cells, then divides each cell into four subcells; it recursively continues this process until each cell is a single pixel. This procedure creates a quadtree over the subcells where the root represents the entire grid and each leaf represents a pixel. The algorithm then calculates the total probability mass for each tree node (obtained by adding up the aggregated distribution’s probabilities of all leaves in the subtree rooted at this node). This step is illustrated below.
    In the first step, we take the (non-private) aggregated distribution (top left) and repeatedly divide it to create a quadtree. Then, we compute the total probability mass of each cell (bottom).
  2. Noise addition: To each tree node’s mass we then add Laplace noise with scale calibrated to the sensitivity of the masses and the desired privacy guarantee (the privacy parameter ε).
  3. Truncation: To help reduce the final amount of noise in our DP aggregated distribution, the algorithm traverses the tree starting from the root and, at each level, keeps only the w nodes with the highest (noisy) masses, discarding the remaining nodes together with their descendants.
  4. Reconstruction: Finally, the algorithm solves a linear program to recover the aggregated distribution. This linear program is inspired by the sparse recovery literature where the noisy masses are viewed as (noisy) measurements of the data.
In step 2, noise is added to each cell’s probability mass. Then in step 3, only the top-w cells are kept (green) whereas the remaining cells are truncated (red). Finally, in the last step, we write a linear program on these top cells to reconstruct the aggregated distribution, which is now differentially private.
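To make the four steps above concrete, below is a simplified sketch that we wrote for illustration; it is not the paper's implementation. It builds the quadtree masses, adds Laplace noise with a placeholder scale, and keeps the top-w noisy nodes at each level. For simplicity it truncates each level independently (rather than also discarding the descendants of truncated nodes) and omits the final linear program.

```python
import numpy as np

rng = np.random.default_rng(0)

def quadtree_masses(grid):
    """Step 1: total probability mass of every quadtree node, level by level.

    Level 0 is the full grid (a single node); the last level has one node per pixel.
    """
    levels = [grid]
    while levels[-1].shape[0] > 1:
        g = levels[-1]
        # Sum each 2x2 block of the finer level to get the next coarser level.
        coarser = g.reshape(g.shape[0] // 2, 2, g.shape[1] // 2, 2).sum(axis=(1, 3))
        levels.append(coarser)
    return levels[::-1]  # root first

def noisy_top_w(levels, eps=1.0, w=4):
    """Steps 2-3: add Laplace noise to every node and keep the top-w nodes per level."""
    noisy = []
    for g in levels:
        # Placeholder noise scale; a real implementation calibrates the scale to
        # the sensitivity of the masses and splits the privacy budget across levels.
        g_noisy = g + rng.laplace(scale=1.0 / eps, size=g.shape)
        keep = np.zeros_like(g_noisy, dtype=bool)
        flat_idx = np.argsort(g_noisy, axis=None)[-w:]
        keep[np.unravel_index(flat_idx, g_noisy.shape)] = True
        noisy.append(np.where(keep, g_noisy, 0.0))
    return noisy

# Toy 8x8 aggregated distribution (averaged over users), normalized to sum to 1.
agg = rng.random((8, 8))
agg /= agg.sum()
dp_levels = noisy_top_w(quadtree_masses(agg), eps=1.0, w=4)
print([lvl.shape for lvl in dp_levels])  # (1, 1), (2, 2), (4, 4), (8, 8)
```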

Experimental results

We evaluate the performance of our algorithm in two different domains: real-world location check-in data and image saliency data. We consider as a baseline the ubiquitous Laplace mechanism: add Laplace noise to each cell, zero out any negative cells, and produce the heatmap from this noisy aggregate. We also consider a “thresholding” variant of this baseline that is better suited to sparse data: keep only the top t% of cell values (based on the probability mass in each cell) after noising, while zeroing out the rest. To evaluate the quality of an output heatmap compared to the true heatmap, we use the Pearson correlation coefficient, KL-divergence, and EMD. Note that when the heatmaps are more similar, the first metric increases but the latter two decrease.
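For reference, here is a minimal sketch of this baseline and its thresholding variant (our own illustration; the noise scale is a placeholder rather than a carefully calibrated sensitivity):

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_baseline(agg, eps=1.0, top_t_percent=None):
    """Add Laplace noise to every cell, zero out negatives, optionally keep only
    the top t% of noisy cells, then renormalize into a heatmap."""
    noisy = agg + rng.laplace(scale=1.0 / eps, size=agg.shape)
    noisy = np.clip(noisy, 0.0, None)          # zero out negative cells
    if top_t_percent is not None:              # "thresholding" variant
        cutoff = np.percentile(noisy, 100 - top_t_percent)
        noisy = np.where(noisy >= cutoff, noisy, 0.0)
    total = noisy.sum()
    return noisy / total if total > 0 else noisy

agg = rng.random((16, 16)); agg /= agg.sum()
dp_heatmap = laplace_baseline(agg, eps=1.0, top_t_percent=5)
```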

The locations dataset is obtained by combining two datasets, Gowalla and Brightkite, both of which contain check-ins by users of location-based social networks. We pre-processed this dataset to consider only check-ins in the continental US resulting in a final dataset consisting of ~500,000 check-ins by ~20,000 users. Considering the top cells (from an initial partitioning of the entire space into a 300 x 300 grid) that have check-ins from at least 200 unique users, we partition each such cell into subgrids with a resolution of ∆ × ∆ and assign each check-in to one of these subgrids.

In the first set of experiments, we fix ∆ = 256. We test the performance of our algorithm for different values of ε (the privacy parameter, where smaller ε means stronger DP guarantees), ranging from 0.1 to 10, by running our algorithms together with the baseline and its variants on all cells, randomly sampling a set of 200 users in each trial, and then computing the distance metrics between the true heatmap and the DP heatmap. The average of these metrics is presented below. Our algorithm (the red line) performs better than all versions of the baseline across all metrics, with improvements that are especially significant when ε is not too large or small (i.e., 0.2 ≤ ε ≤ 5).

Metrics averaged over 60 runs when varying ε for the location dataset. Shaded areas indicate 95% confidence interval.

Next, we study the effect of varying the number n of users. By fixing a single cell (with > 500 users) and ε, we vary n from 50 to 500 users. As predicted by theory, our algorithms and the baseline perform better as n increases. However, the behavior of the thresholding variants of the baseline is less predictable.

We also run another experiment where we fix a single cell and ε, and vary the resolution ∆ from 64 to 256. In agreement with theory, our algorithm’s performance remains nearly constant for the entire range of ∆. However, the baseline suffers across all metrics as ∆ increases while the thresholding variants occasionally improve as ∆ increases.

Effect of the number of users and grid resolution on EMD.

We also experiment on the Salicon image saliency dataset (SALICON). This dataset is a collection of saliency annotations on the Microsoft Common Objects in Context image database. We downsized the images to a fixed resolution of 320 × 240 and each [user, image] pair consists of a sequence of coordinates in the image where the user looked. We repeat the experiments described previously on 38 randomly sampled images (with ≥ 50 users each) from SALICON. As we can see from the examples below, the heatmap obtained by our algorithm is very close to the ground truth.

Example visualization of different algorithms for two different natural images from SALICON for ε = 10 and n = 50 users. The algorithms from left to right are: original heatmap (no privacy), baseline, and ours.

Additional experimental results, including those on other datasets, metrics, privacy parameters and DP models, can be found in the paper.

Conclusion

We presented a privatization algorithm for sparse distribution aggregation under the EMD metric, which in turn yields an algorithm for producing privacy-preserving heatmaps. Our algorithm extends naturally to distributed models that can implement the Laplace mechanism, including the secure aggregation model and the shuffle model. This does not apply to the more stringent local DP model, and it remains an interesting open question to devise practical local DP heatmap/EMD aggregation algorithms for “moderate” number of users and privacy parameters.

Acknowledgments

This work was done jointly with Junfeng He, Kai Kohlhoff, Ravi Kumar, Pasin Manurangsi, and Vidhya Navalpakkam.

Beyond automatic differentiation

Derivatives play a central role in optimization and machine learning. By locally approximating a training loss, derivatives guide an optimizer toward lower values of the loss. Automatic differentiation frameworks such as TensorFlow, PyTorch, and JAX are an essential part of modern machine learning, making it feasible to use gradient-based optimizers to train very complex models.

But are derivatives all we need? By themselves, derivatives only tell us how a function behaves on an infinitesimal scale. To use derivatives effectively, we often need to know more than that. For example, to choose a learning rate for gradient descent, we need to know something about how the loss function behaves over a small but finite window. A finite-scale analogue of automatic differentiation, if it existed, could help us make such choices more effectively and thereby speed up training.

In our new paper “Automatically Bounding The Taylor Remainder Series: Tighter Bounds and New Applications”, we present an algorithm called AutoBound that computes polynomial upper and lower bounds on a given function, which are valid over a user-specified interval. We then begin to explore AutoBound’s applications. Notably, we present a meta-optimizer called SafeRate that uses the upper bounds computed by AutoBound to derive learning rates that are guaranteed to monotonically reduce a given loss function, without the need for time-consuming hyperparameter tuning. We are also making AutoBound available as an open-source library.

The AutoBound algorithm

Given a function f and a reference point x0, AutoBound computes polynomial upper and lower bounds on f that hold over a user-specified interval called a trust region. Like Taylor polynomials, the bounding polynomials are equal to f at x0. The bounds become tighter as the trust region shrinks, and approach the corresponding Taylor polynomial as the trust region width approaches zero.

Automatically-derived quadratic upper and lower bounds on a one-dimensional function f, centered at x0=0.5. The upper and lower bounds are valid over a user-specified trust region, and become tighter as the trust region shrinks.

Like automatic differentiation, AutoBound can be applied to any function that can be implemented using standard mathematical operations. In fact, AutoBound is a generalization of Taylor mode automatic differentiation, and is equivalent to it in the special case where the trust region has a width of zero.

To derive the AutoBound algorithm, we had to address two main challenges:

  1. We had to derive polynomial upper and lower bounds for various elementary functions, given an arbitrary reference point and arbitrary trust region.
  2. We had to come up with an analogue of the chain rule for combining these bounds.

Bounds for elementary functions

For a variety of commonly-used functions, we derive optimal polynomial upper and lower bounds in closed form. In this context, “optimal” means the bounds are as tight as possible, among all polynomials where only the maximum-degree coefficient differs from the Taylor series. Our theory applies to elementary functions, such as exp and log, and common neural network activation functions, such as ReLU and Swish. It builds upon and generalizes earlier work that applied only to quadratic bounds, and only for an unbounded trust region.

Optimal quadratic upper and lower bounds on the exponential function, centered at x0=0.5 and valid over the interval [0, 2].
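As a rough illustration of what such bounds look like for the exponential function, the sketch below (ours, not the AutoBound library's code) computes quadratic bounds that match exp's value and slope at x0. It relies on the fact that, for exp, the remainder ratio used below is increasing on the trust region, so the tightest quadratic coefficients are attained at the interval endpoints.

```python
import numpy as np

def quad_bounds_exp(x0, a, b):
    """Quadratic bounds on exp over [a, b] that match exp's value and slope at x0.

    The quadratic coefficient is the min / max over the trust region of
    (exp(x) - exp(x0) - exp(x0) * (x - x0)) / (x - x0)**2, which for exp is
    attained at the endpoints because that ratio is increasing in x.
    """
    def ratio(x):
        return (np.exp(x) - np.exp(x0) - np.exp(x0) * (x - x0)) / (x - x0) ** 2

    c_lo, c_hi = ratio(a), ratio(b)
    lower = lambda x: np.exp(x0) + np.exp(x0) * (x - x0) + c_lo * (x - x0) ** 2
    upper = lambda x: np.exp(x0) + np.exp(x0) * (x - x0) + c_hi * (x - x0) ** 2
    return lower, upper

lower, upper = quad_bounds_exp(x0=0.5, a=0.0, b=2.0)
xs = np.linspace(0.0, 2.0, 5)
print(np.all(lower(xs) <= np.exp(xs) + 1e-9))  # True
print(np.all(upper(xs) >= np.exp(xs) - 1e-9))  # True
```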

A new chain rule

To compute upper and lower bounds for arbitrary functions, we derived a generalization of the chain rule that operates on polynomial bounds. To illustrate the idea, suppose we have a function that can be written as

f(x) = g(h(x))

and suppose we already have polynomial upper and lower bounds on g and h. How do we compute bounds on f?

The key turns out to be representing the upper and lower bounds for a given function as a single polynomial whose highest-degree coefficient is an interval rather than a scalar. We can then plug the bound for h into the bound for g, and convert the result back to a polynomial of the same form using interval arithmetic. Under suitable assumptions about the trust region over which the bound on g holds, it can be shown that this procedure yields the desired bound on f.

The interval polynomial chain rule applied to the functions h(x) = sqrt(x) and g(y) = exp(y), with x0=0.25 and trust region [0, 0.5].
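The following self-contained toy version illustrates the idea on the same example, f(x) = exp(sqrt(x)); it is our own sketch, not the AutoBound implementation. It assumes the remainder ratios of exp and sqrt are monotone on the trust region (so endpoint evaluation gives valid coefficient intervals) and uses a deliberately simple, hence loose, interval product.

```python
import numpy as np

# --- tiny interval arithmetic helpers (sound but not necessarily tight) ---
def iadd(A, B):   return (A[0] + B[0], A[1] + B[1])
def iscale(s, A): return (min(s * A[0], s * A[1]), max(s * A[0], s * A[1]))
def imul(A, B):
    p = [A[0] * B[0], A[0] * B[1], A[1] * B[0], A[1] * B[1]]
    return (min(p), max(p))

def quad_coeff_interval(f, f0, f1, x0, trust):
    """Interval [c_lo, c_hi] such that f(x) = f0 + f1*(x-x0) + c(x)*(x-x0)**2
    with c(x) in the interval for all x in the trust region. Assumes the
    remainder ratio is monotone on the trust region (true for exp and sqrt),
    so its extremes are attained at the endpoints."""
    ratio = lambda x: (f(x) - f0 - f1 * (x - x0)) / (x - x0) ** 2
    lo, hi = sorted((ratio(trust[0]), ratio(trust[1])))
    return (lo, hi)

# Bounds for h(x) = sqrt(x) at x0 = 0.25 over [0, 0.5].
x0, X = 0.25, (0.0, 0.5)
h0, h1 = 0.5, 1.0                       # sqrt(0.25) and its derivative there
Ch = quad_coeff_interval(np.sqrt, h0, h1, x0, X)

# Bounds for g(y) = exp(y) at y0 = h0 over the range of sqrt on [0, 0.5].
Y = (0.0, np.sqrt(0.5))
g0, g1 = np.exp(h0), np.exp(h0)
Cg = quad_coeff_interval(np.exp, g0, g1, h0, Y)

# Chain rule: substitute e = h1*d + Ch*d**2 (with d = x - x0) into g's bound and
# collapse back to the same degree-2 form using interval arithmetic:
# f(x) = g0 + g1*h1*d + [g1*Ch + Cg*(h1 + Ch*d)**2] * d**2.
D = (X[0] - x0, X[1] - x0)
inner = iadd((h1, h1), imul(Ch, D))      # interval enclosing h1 + Ch*d
Cf = iadd(iscale(g1, Ch), imul(Cg, imul(inner, inner)))
f0, f1 = g0, g1 * h1

# Check that the resulting interval polynomial really brackets exp(sqrt(x)).
xs = np.linspace(X[0], X[1], 101)
d = xs - x0
lower = f0 + f1 * d + Cf[0] * d ** 2
upper = f0 + f1 * d + Cf[1] * d ** 2
assert np.all(lower <= np.exp(np.sqrt(xs)) + 1e-9)
assert np.all(upper >= np.exp(np.sqrt(xs)) - 1e-9)
```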

Our chain rule applies not only to one-dimensional functions but also to multivariate functions, such as matrix multiplications and convolutions.

Propagating bounds

Using our new chain rule, AutoBound propagates interval polynomial bounds through a computation graph from the inputs to the outputs, analogous to forward-mode automatic differentiation.

Forward propagation of interval polynomial bounds for the function f(x) = exp(sqrt(x)). We first compute (trivial) bounds on x, then use the chain rule to compute bounds on sqrt(x) and exp(sqrt(x)).

To compute bounds on a function f(x), AutoBound requires memory proportional to the dimension of x. For this reason, practical applications apply AutoBound to functions with a small number of inputs. However, as we will see, this does not prevent us from using AutoBound for neural network optimization.

Automatically deriving optimizers, and other applications

What can we do with AutoBound that we couldn’t do with automatic differentiation alone?

Among other things, AutoBound can be used to automatically derive problem-specific, hyperparameter-free optimizers that converge from any starting point. These optimizers iteratively reduce a loss by first using AutoBound to compute an upper bound on the loss that is tight at the current point, and then minimizing the upper bound to obtain the next point.

Minimizing a one-dimensional logistic regression loss using quadratic upper bounds derived automatically by AutoBound.

Optimizers that use upper bounds in this way are called majorization-minimization (MM) optimizers. Applied to one-dimensional logistic regression, AutoBound rederives an MM optimizer first published in 2009. Applied to more complex problems, AutoBound derives novel MM optimizers that would be difficult to derive by hand.
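For intuition, here is a toy majorization-minimization loop for one-dimensional logistic regression. In place of AutoBound it uses the classical hand-derived curvature bound (the sigmoid's derivative is at most 1/4), so each step minimizes a quadratic upper bound on the loss that is tight at the current iterate.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)                                            # 1-D features
y = (rng.random(200) < 1 / (1 + np.exp(-2.0 * x))).astype(float)    # labels in {0, 1}

def loss(w):   # logistic loss
    z = x * w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):
    return np.sum(x * (1 / (1 + np.exp(-x * w)) - y))

# Global curvature bound: loss''(w) <= sum(x_i**2) / 4 for every w.
c = np.sum(x ** 2) / 4.0

w = 0.0
for _ in range(50):
    # Minimize the quadratic upper bound around the current iterate;
    # each step is guaranteed not to increase the loss.
    w = w - grad(w) / c
print(w, loss(w))
```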

We can use a similar idea to take an existing optimizer such as Adam and convert it to a hyperparameter-free optimizer that is guaranteed to monotonically reduce the loss (in the full-batch setting). The resulting optimizer uses the same update direction as the original optimizer, but modifies the learning rate by minimizing a one-dimensional quadratic upper bound derived by AutoBound. We refer to the resulting meta-optimizer as SafeRate.

Performance of SafeRate when used to train a single-hidden-layer neural network on a subset of the MNIST dataset, in the full-batch setting.

Using SafeRate, we can create more robust variants of existing optimizers, at the cost of a single additional forward pass that increases the wall time for each step by a small factor (about 2x in the example above).
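A rough sketch of the SafeRate recipe, again with a hand-derived curvature bound standing in for AutoBound (so this illustrates the mechanism, not the actual meta-optimizer): given an update direction, bound the one-dimensional function g(eta) = loss(w - eta*d) by a quadratic that is tight at eta = 0, and take the eta that minimizes the bound.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
w_true = rng.normal(size=10)
y = (rng.random(500) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def loss(w):
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)

def grad(w):
    z = X @ w
    return X.T @ (1 / (1 + np.exp(-z)) - y) / len(y)

w = np.zeros(10)
for step in range(100):
    d = grad(w)                         # update direction (plain gradient here)
    # Quadratic upper bound on g(eta) = loss(w - eta * d), tight at eta = 0:
    # g(eta) <= g(0) - eta * (d @ d) + 0.5 * c * eta**2, with the hand-derived
    # curvature bound c = mean((X @ d)**2) / 4 standing in for AutoBound.
    c = np.mean((X @ d) ** 2) / 4.0
    eta = (d @ d) / c                   # minimizer of the upper bound
    w = w - eta * d                     # guaranteed not to increase the loss
print(loss(w))
```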

In addition to the applications just discussed, AutoBound can be used for verified numerical integration and to automatically prove sharper versions of Jensen’s inequality, a fundamental mathematical inequality used frequently in statistics and other fields.

Improvement over classical bounds

Bounding the Taylor remainder term automatically is not a new idea. A classical technique produces degree k polynomial bounds on a function f that are valid over a trust region [a, b] by first computing an expression for the kth derivative of f (using automatic differentiation), then evaluating this expression over [a,b] using interval arithmetic.
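Here is a tiny version of that classical recipe for a quadratic (k = 2) bound; for simplicity it bounds the second derivative by its values at the interval endpoints, which is valid only when the second derivative is monotone on [a, b] (true for exp), whereas a general tool would evaluate the derivative expression with interval arithmetic.

```python
import numpy as np

def classical_quad_bounds(f, df, d2f, x0, a, b):
    """Taylor remainder bound: f(x) = f(x0) + f'(x0)(x-x0) + f''(xi)/2 (x-x0)**2
    for some xi in [a, b], so bounding f'' over [a, b] bounds f. Here f'' is
    bounded by its endpoint values, which assumes f'' is monotone on [a, b]."""
    m, M = sorted((d2f(a), d2f(b)))
    lower = lambda x: f(x0) + df(x0) * (x - x0) + 0.5 * m * (x - x0) ** 2
    upper = lambda x: f(x0) + df(x0) * (x - x0) + 0.5 * M * (x - x0) ** 2
    return lower, upper

lower, upper = classical_quad_bounds(np.exp, np.exp, np.exp, x0=0.5, a=0.0, b=2.0)
xs = np.linspace(0.0, 2.0, 5)
print(lower(xs) <= np.exp(xs), upper(xs) >= np.exp(xs))
```

On the earlier exp example, this gives a quadratic coefficient of about 3.7 for the upper bound, noticeably looser than the roughly 1.45 obtained from the endpoint-ratio construction above.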

While elegant, this approach has some inherent limitations that can lead to very loose bounds, as illustrated by the dotted blue lines in the figure below.

Quadratic upper and lower bounds on the loss of a multi-layer perceptron with two hidden layers, as a function of the initial learning rate. The bounds derived by AutoBound are much tighter than those obtained using interval arithmetic evaluation of the second derivative.

Looking forward

Taylor polynomials have been in use for over three hundred years, and are omnipresent in numerical optimization and scientific computing. Nevertheless, Taylor polynomials have significant limitations, which can limit the capabilities of algorithms built on top of them. Our work is part of a growing literature that recognizes these limitations and seeks to develop a new foundation upon which more robust algorithms can be built.

Our experiments so far have only scratched the surface of what is possible using AutoBound, and we believe it has many applications we have not discovered. To encourage the research community to explore such possibilities, we have made AutoBound available as an open-source library built on top of JAX. To get started, visit our GitHub repo.

Acknowledgements

This post is based on joint work with Josh Dillon. We thank Alex Alemi and Sergey Ioffe for valuable feedback on an earlier draft of the post.

Robotic deep RL at scale: Sorting waste and recyclables with a fleet of robots

Reinforcement learning (RL) can enable robots to learn complex behaviors through trial-and-error interaction, getting better and better over time. Several of our prior works explored how RL can enable intricate robotic skills, such as robotic grasping, multi-task learning, and even playing table tennis. Although robotic RL has come a long way, we still don’t see RL-enabled robots in everyday settings. The real world is complex, diverse, and changes over time, presenting a major challenge for robotic systems. However, we believe that RL should offer us an excellent tool for tackling precisely these challenges: by continually practicing, getting better, and learning on the job, robots should be able to adapt to the world as it changes around them.

In “Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators”, we discuss how we studied this problem through a recent large-scale experiment, where we deployed a fleet of 23 RL-enabled robots over two years in Google office buildings to sort waste and recycling. Our robotic system combines scalable deep RL from real-world data with bootstrapping from training in simulation and auxiliary object perception inputs to boost generalization, while retaining the benefits of end-to-end training, which we validate with 4,800 evaluation trials across 240 waste station configurations.

Problem setup

When people don’t sort their trash properly, batches of recyclables can become contaminated and compost can be improperly discarded into landfills. In our experiment, a robot roamed around an office building searching for “waste stations” (bins for recyclables, compost, and trash). The robot was tasked with approaching each waste station to sort it, moving items between the bins so that all recyclables (cans, bottles) were placed in the recyclable bin, all the compostable items (cardboard containers, paper cups) were placed in the compost bin, and everything else was placed in the landfill trash bin. Here is what that looks like:

This task is not as easy as it looks. Just being able to pick up the vast variety of objects that people deposit into waste bins presents a major learning challenge. Robots also have to identify the appropriate bin for each object and sort them as quickly and efficiently as possible. In the real world, the robots can encounter a variety of situations with unique objects, like the examples from real office buildings below:

Learning from diverse experience

Learning on the job helps, but before even getting to that point, we need to bootstrap the robots with a basic set of skills. To this end, we use four sources of experience: (1) a set of simple hand-designed policies that have a very low success rate, but serve to provide some initial experience, (2) a simulated training framework that uses sim-to-real transfer to provide some initial bin sorting strategies, (3) “robot classrooms” where the robots continually practice at a set of representative waste stations, and (4) the real deployment setting, where robots practice in real office buildings with real trash.

A diagram of RL at scale. We bootstrap policies from data generated with a script (top-left). We then train a sim-to-real model and generate additional data in simulation (top-right). At each deployment cycle, we add data collected in our classrooms (bottom-right). We further deploy and collect data in office buildings (bottom-left).

Our RL framework is based on QT-Opt, which we previously applied to learn bin grasping in laboratory settings, as well as a range of other skills. In simulation, we bootstrap from simple scripted policies and use RL, with a CycleGAN-based transfer method that uses RetinaGAN to make the simulated images appear more life-like.
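QT-Opt handles continuous actions by approximating the argmax over actions with the cross-entropy method (CEM). A minimal sketch of that action-selection step is below, with a toy stand-in Q-function; the real Q-function is a learned neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def cem_argmax_q(q_fn, state, action_dim, iters=3, samples=64, elites=6):
    """Approximate argmax_a Q(state, a) with the cross-entropy method, as in
    QT-Opt: sample actions from a Gaussian, keep the best-scoring ones, refit
    the Gaussian, and repeat."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(samples, action_dim))
        scores = np.array([q_fn(state, a) for a in actions])
        elite = actions[np.argsort(scores)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy stand-in Q-function that peaks at action = 0.3 in every dimension.
q_fn = lambda s, a: -np.sum((a - 0.3) ** 2)
best_action = cem_argmax_q(q_fn, state=None, action_dim=4)
```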

From here, it’s off to the classroom. While real-world office buildings can provide the most representative experience, the throughput in terms of data collection is limited — some days there will be a lot of trash to sort, some days not so much. Our robots collect a large portion of their experience in “robot classrooms.” In the classroom shown below, 20 robots practice the waste sorting task:

While these robots are training in the classrooms, other robots are simultaneously learning on the job in 3 office buildings, with 30 waste stations:

Sorting performance

In the end, we gathered 540k trials in the classrooms and 32.5k trials from deployment. Overall system performance improved as more data was collected. We evaluated our final system in the classrooms to allow for controlled comparisons, setting up scenarios based on what the robots saw during deployment. The final system could accurately sort about 84% of the objects on average, with performance increasing steadily as more data was added. In the real world, we logged statistics from three real-world deployments between 2021 and 2022, and found that our system could reduce contamination in the waste bins by between 40% and 50% by weight. Our paper provides further insights on the technical design, ablations studying various design decisions, and more detailed statistics on the experiments.

Conclusion and future work

Our experiments showed that RL-based systems can enable robots to address real-world tasks in real office environments, with a combination of offline and online data enabling robots to adapt to the broad variability of real-world situations. At the same time, learning in more controlled “classroom” environments, both in simulation and in the real world, can provide a powerful bootstrapping mechanism to get the RL “flywheel” spinning to enable this adaptation. There is still a lot left to do: our final RL policies do not succeed every time, and larger and more powerful models will be needed to improve their performance and extend them to a broader range of tasks. Other sources of experience, including from other tasks, other robots, and even Internet videos may serve to further supplement the bootstrapping experience that we obtained from simulation and classrooms. These are exciting problems to tackle in the future. Please see the full paper here, and the supplementary video materials on the project webpage.

Acknowledgements

This research was conducted by multiple researchers at Robotics at Google and Everyday Robots, with contributions from Alexander Herzog, Kanishka Rao, Karol Hausman, Yao Lu, Paul Wohlhart, Mengyuan Yan, Jessica Lin, Montserrat Gonzalez Arenas, Ted Xiao, Daniel Kappler, Daniel Ho, Jarek Rettinghouse, Yevgen Chebotar, Kuang-Huei Lee, Keerthana Gopalakrishnan, Ryan Julian, Adrian Li, Chuyuan Kelly Fu, Bob Wei, Sangeetha Ramesh, Khem Holden, Kim Kleiven, David Rendleman, Sean Kirmani, Jeff Bingham, Jon Weisz, Ying Xu, Wenlong Lu, Matthew Bennice, Cody Fong, David Do, Jessica Lam, Yunfei Bai, Benjie Holson, Michael Quinlan, Noah Brown, Mrinal Kalakrishnan, Julian Ibarz, Peter Pastor, Sergey Levine and the entire Everyday Robots team.

UniPi: Learning universal policies via text-guided video generation

Building models that solve a diverse set of tasks has become a dominant paradigm in the domains of vision and language. In natural language processing, large pre-trained models, such as PaLM, GPT-3 and Gopher, have demonstrated remarkable zero-shot learning of new language tasks. Similarly, in computer vision, models like CLIP and Flamingo have shown robust performance on zero-shot classification and object recognition. A natural next step is to use such tools to construct agents that can complete different decision-making tasks across many environments.

However, training such agents faces the inherent challenge of environmental diversity, since different environments operate with distinct state and action spaces (e.g., the joint space and continuous controls in MuJoCo are fundamentally different from the image space and discrete actions in Atari). This environmental diversity hampers knowledge sharing, learning, and generalization across tasks and environments. Furthermore, it is difficult to construct reward functions across environments, as different tasks generally have different notions of success.

In “Learning Universal Policies via Text-Guided Video Generation”, we propose a Universal Policy (UniPi) that addresses environmental diversity and reward specification challenges. UniPi leverages text for expressing task descriptions and video (i.e., image sequences) as a universal interface for conveying action and observation behavior in different environments. Given an input image frame paired with text describing a current goal (i.e., the next high-level step), UniPi uses a novel video generator (trajectory planner) to generate a video snippet showing what an agent’s trajectory should look like to achieve that goal. The generated video is fed into an inverse dynamics model that extracts underlying low-level control actions, which are then executed in simulation or by a real robot agent. We demonstrate that UniPi enables the use of language and video as a universal control interface for generalizing to novel goals and tasks across diverse environments.

Video policies generated by UniPi.
UniPi may be applied to downstream multi-task settings that require combinatorial language generalization, long-horizon planning, or internet-scale knowledge. In the bottom example, UniPi takes the image of the white robot arm from the internet and generates video snippets according to the text description of the goal.

UniPi implementation

To generate a valid and executable plan, a text-to-video model must synthesize a constrained video plan starting at the current observed image. We found it more effective to explicitly constrain a video synthesis model during training (as opposed to only constraining videos at sampling time) by providing the first frame of each video as explicit conditioning context.

At a high level, UniPi has four major components: 1) consistent video generation with first-frame tiling, 2) hierarchical planning through temporal super resolution, 3) flexible behavior synthesis, and 4) task-specific action adaptation. We explain the implementation and benefit of each component in detail below.

Video generation through tiling

Existing text-to-video models like Imagen typically generate videos where the underlying environment state changes significantly throughout the duration. To construct an accurate trajectory planner, it is important that the environment remains consistent across all time points. We enforce environment consistency in conditional video synthesis by providing the observed image as additional context when denoising each frame in the synthesized video. To achieve context conditioning, UniPi directly concatenates each intermediate frame sampled from noise with the conditioned observed image across sampling steps, which serves as a strong signal to maintain the underlying environment state across time.
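A schematic sketch of this conditioning scheme is below (with a stand-in denoiser, not the actual model): at every sampling step, the observed first frame is concatenated channel-wise with the noisy frames, so the sampler always sees the true environment state.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 8, 24, 40, 3                  # an 8-frame video plan of small frames

def denoise_step(noisy_frames, cond_frame):
    """Stand-in for one reverse-diffusion step of the text- and image-conditioned
    video model. The key point is the input layout: the observed first frame is
    concatenated channel-wise with every noisy frame at every sampling step."""
    cond = np.repeat(cond_frame[None], len(noisy_frames), axis=0)
    model_input = np.concatenate([noisy_frames, cond], axis=-1)  # 2*C channels
    # A real model is a neural denoiser; this toy update just pulls the noisy
    # frames toward the conditioning half of the input so the loop runs.
    return 0.9 * model_input[..., :C] + 0.1 * model_input[..., C:]

observed = rng.random((H, W, C))           # current observation (first frame)
video = rng.normal(size=(T, H, W, C))      # start the plan from pure noise
for _ in range(50):                        # sampling loop
    video = denoise_step(video, observed)
```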

Text-conditional video generation enables UniPi to train general purpose policies on a wide range of data sources (simulated, real robots and YouTube).

Hierarchical planning

When constructing plans in high-dimensional environments with long time horizons, directly generating a set of actions to reach a goal state quickly becomes intractable due to the exponential growth of the underlying search space as the plan gets longer. Planning methods often circumvent this issue by leveraging a natural hierarchy in planning. Specifically, planning methods first construct coarse plans (the intermediate key frames spread out across time) operating on low-dimensional states and actions, which are then refined into plans in the underlying state and action spaces.

Similar to planning, our conditional video generation procedure exhibits a natural temporal hierarchy. UniPi first generates videos at a coarse level by sparsely sampling videos (“abstractions”) of desired agent behavior along the time axis. UniPi then refines the videos to represent valid behavior in the environment by super-resolving videos across time. Meanwhile, coarse-to-fine super-resolution further improves consistency via interpolation between frames.

Given an input observation and text instruction, we plan a set of images representing agent behavior. Images are converted to actions using an inverse dynamics model.
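The temporal hierarchy can be pictured as generating a few key frames first and then filling in the frames between them. In the sketch below (ours, purely illustrative) the key-frame generator is a stand-in and the temporal super-resolution stage is replaced by plain linear interpolation, just to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C = 24, 40, 3

def generate_keyframes(observation, num_key=4):
    """Stand-in for the coarse video generator: a few frames sparsely sampled
    along the time axis, starting from the current observation."""
    return np.stack([observation + 0.05 * k * rng.normal(size=observation.shape)
                     for k in range(num_key)])

def temporal_super_resolve(keyframes, factor=4):
    """Stand-in for the temporal super-resolution model: insert `factor - 1`
    frames between consecutive key frames (here by linear interpolation)."""
    out = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for i in range(factor):
            out.append(a + (b - a) * i / factor)
    out.append(keyframes[-1])
    return np.stack(out)

obs = rng.random((H, W, C))
coarse_plan = generate_keyframes(obs)            # 4 key frames
fine_plan = temporal_super_resolve(coarse_plan)  # 13 frames
```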

Flexible behavioral modulation

When planning a sequence of actions for a given sub-goal, one can readily incorporate external constraints to modulate a generated plan. Such test-time adaptability can be implemented by composing a probabilistic prior incorporating properties of the desired plan to specify desired constraints across the synthesized action trajectory, which is also compatible with UniPi. In particular, the prior can be specified using a learned classifier on images to optimize a particular task, or as a Dirac delta distribution on a particular image to guide a plan towards a particular set of states. To train the text-conditioned video generation model, we utilize the video diffusion algorithm, where pre-trained language features from the Text-To-Text Transfer Transformer (T5) are encoded.

Task-specific action adaptation

Given a set of synthesized videos, we train a small task-specific inverse dynamics model to translate frames into a set of low-level control actions. This model is independent of the planner and can be trained on a separate, smaller, and potentially suboptimal dataset generated by a simulator.

Given the input frame and text description of the current goal, the video generator synthesizes image frames, and the inverse dynamics model translates them into a control action sequence that predicts the corresponding future actions. An agent then executes the inferred low-level control actions via closed-loop control.
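A minimal sketch of such an inverse dynamics model (our own toy PyTorch version; the architecture and the 7-dimensional action are assumptions): it maps each pair of consecutive synthesized frames to a low-level action.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Maps a pair of consecutive frames to a low-level control action."""
    def __init__(self, action_dim=7):        # e.g., joint commands (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim),
        )

    def forward(self, frame_t, frame_tp1):
        # Channel-wise concatenation of the two consecutive frames.
        return self.net(torch.cat([frame_t, frame_tp1], dim=1))

model = InverseDynamics()
plan = torch.rand(9, 3, 64, 64)               # 9 synthesized frames
actions = model(plan[:-1], plan[1:])          # 8 actions, one per transition
```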

Capabilities and evaluation of UniPi

We measure the task success rate on novel language-based goals, and find that UniPi generalizes well to both seen and novel combinations of language prompts, compared to baselines such as Transformer BC, Trajectory Transformer (TT), and Diffuser.

UniPi generalizes well to both seen and novel combinations of language prompts in Place (e.g., “place X in Y”) and Relation (e.g., “place X to the left of Y”) tasks.

Below, we illustrate generated videos on unseen combinations of goals. UniPi is able to synthesize a diverse set of behaviors that satisfy unseen language subgoals:

Generated videos for unseen language goals at test time.

Multi-environment transfer

We measure the task success rate of UniPi and baselines on novel tasks not seen during training. UniPi again outperforms the baselines by a large margin:

UniPi generalizes well to new environments when trained on a set of different multi-task environments.

Below, we illustrate generated videos on unseen tasks. UniPi is further able to synthesize a diverse set of behaviors that satisfy unseen language tasks:

Generated video plans on different new test tasks in the multitask setting.

Real world transfer

Below, we further illustrate generated videos given language instructions on unseen real images. Our approach is able to synthesize a diverse set of different behaviors which satisfy language instructions:

Using internet pre-training enables UniPi to synthesize videos of tasks not seen during training. In contrast, a model trained from scratch incorrectly generates plans of different tasks:

To evaluate the quality of videos generated by UniPi when pre-trained on non-robot data, we use the Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD) metrics. We used Contrastive Language-Image Pre-training scores (CLIPScores) to measure the language-image alignment. We demonstrate that pre-trained UniPi achieves significantly better (lower) FID and FVD scores and a higher (better) CLIPScore compared to UniPi without pre-training, suggesting that pre-training on non-robot data helps with generating plans for robots. We report the CLIPScore, FID, and FVD scores for UniPi trained on Bridge data, with and without pre-training:

Model (24×40)         CLIPScore ↑       FID ↓             FVD ↓
No pre-training       24.43 ± 0.04      17.75 ± 0.56      288.02 ± 10.45
Pre-trained           24.54 ± 0.03      14.54 ± 0.57      264.66 ± 13.64

Using existing internet data improves video plan predictions under all metrics considered.

The future of large-scale generative models for decision making

The positive results of UniPi point to the broader direction of using generative models and the wealth of data on the internet as powerful tools to learn general-purpose decision making systems. UniPi is only one step towards what generative models can bring to decision making. Other examples include using generative foundation models to provide photorealistic or linguistic simulators of the world in which artificial agents can be trained indefinitely. Generative models as agents can also learn to interact with complex environments such as the internet, so that much broader and more complex tasks can eventually be automated. We look forward to future research in applying internet-scale foundation models to multi-environment and multi-embodiment settings.

Acknowledgements

We’d like to thank all remaining authors of the paper including Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. We would like to thank George Tucker, Douglas Eck, and Vincent Vanhoucke for the feedback on this post and on the original paper.

Developing an aging clock using deep learning on retinal images

Aging is a process that is characterized by physiological and molecular changes that increase an individual’s risk of developing diseases and eventually dying. Being able to measure and estimate the biological signatures of aging can help researchers identify preventive measures to reduce disease risk and impact. Researchers have developed “aging clocks” based on markers such as blood proteins or DNA methylation to measure individuals’ biological age, which is distinct from one’s chronological age. These aging clocks help predict the risk of age-related diseases. But because protein and methylation markers require a blood draw, non-invasive ways to find similar measures could make aging information more accessible.

Perhaps surprisingly, the features on our retinas reflect a lot about us. Images of the retina, which has vascular connections to the brain, are a valuable source of biological and physiological information. Its features have been linked to several aging-related diseases, including diabetic retinopathy, cardiovascular disease, and Alzheimer’s disease. Moreover, previous work from Google has shown that retinal images can be used to predict age, risk of cardiovascular disease, or even sex or smoking status. Could we extend those findings to aging, and maybe in the process identify a new, useful biomarker for human disease?

In a new paper “Longitudinal fundus imaging and its genome-wide association analysis provide evidence for a human retinal aging clock”, we show that deep learning models can accurately predict biological age from a retinal image and reveal insights that better predict age-related disease in individuals. We discuss how the model’s insights can improve our understanding of how genetic factors influence aging. Furthermore, we’re releasing the code modifications for these models, which build on ML frameworks for analyzing retina images that we have previously publicly released.

Predicting chronological age from retinal images

We trained a model to predict chronological age using hundreds of thousands of retinal images from a telemedicine-based blindness prevention program that were captured in primary care clinics and de-identified. A subset of these images has been used in a competition by Kaggle and academic publications, including prior Google work with diabetic retinopathy.

We evaluated the resulting model performance both on a held-out set of 50,000 retinal images and on a separate UKBiobank dataset containing approximately 120,000 images. The model predictions, named eyeAge, strongly correspond with the true chronological age of individuals (shown below; Pearson correlation coefficient of 0.87). This is the first time that retinal images have been used to create such an accurate aging clock.

Left: A retinal image showing the macula (dark spot in the middle), optic disc (bright spot at the right), and blood vessels (dark red lines extending from the optic disc). Right: Comparison of an individual’s true chronological age with the retina model predictions, “eyeAge”.

Analyzing the predicted and real age gap

Even though eyeAge correlates with chronological age well across many samples, the figure above also shows individuals for which the eyeAge differs substantially from chronological age, both in cases where the model predicts a value much younger or older than the chronological age. This could indicate that the model is learning factors in the retinal images that reflect real biological effects that are relevant to the diseases that become more prevalent with biological age.

To test whether this difference reflects underlying biological factors, we explored its correlation with conditions such as chronic obstructive pulmonary disease (COPD) and myocardial infarction, as well as other biomarkers of health like systolic blood pressure. We observed that a predicted age higher than the chronological age correlates with these diseases and biomarkers. For example, we showed a statistically significant (p=0.0028) correlation between eyeAge and all-cause mortality; that is, a higher eyeAge was associated with a greater chance of death during the study.

Revealing genetic factors for aging

To further explore the utility of the eyeAge model for generating biological insights, we related model predictions to genetic variants, which are available for individuals in the large UKBiobank study. Importantly, an individual’s germline genetics (the variants inherited from your parents) are fixed at birth, making this measure independent of age. This analysis generated a list of genes associated with accelerated biological aging (labeled in the figure below). The top identified gene from our genome-wide association study is ALKAL2, and interestingly, the corresponding gene in fruit flies had previously been shown to be involved in extending their life span. Our collaborator, Professor Pankaj Kapahi from the Buck Institute for Research on Aging, found in laboratory experiments that reducing the expression of the gene in flies resulted in improved vision, providing an indication of ALKAL2’s influence on the aging of the visual system.

Manhattan plot representing significant genes associated with gap between chronological age and eyeAge. Significant genes displayed as points above the dotted threshold line.

Applications

Our eyeAge clock has many potential applications. As demonstrated above, it enables researchers to discover markers for aging and age-related diseases and to identify genes whose functions might be changed by drugs to promote healthier aging. It may also help researchers further understand the effects of lifestyle habits and interventions such as exercise, diet, and medication on an individual’s biological aging. Additionally, the eyeAge clock could be useful in the pharmaceutical industry for evaluating rejuvenation and anti-aging therapies. By tracking changes in the retina over time, researchers may be able to determine the effectiveness of these interventions in slowing or reversing the aging process.

Our approach to use retinal imaging for tracking biological age involves collecting images at multiple time points and analyzing them longitudinally to accurately predict the direction of aging. Importantly, this method is non-invasive and does not require specialized lab equipment. Our findings also indicate that the eyeAge clock, which is based on retinal images, is independent from blood-biomarker–based aging clocks. This allows researchers to study aging through another angle, and when combined with other markers, provides a more comprehensive understanding of an individual’s biological age. Also unlike current aging clocks, the less invasive nature of imaging (compared to blood tests) might enable eyeAge to be used for actionable biological and behavioral interventions.

Conclusion

We show that deep learning models can accurately predict an individual’s chronological age using only images of their retina. Moreover, when the predicted age differs from chronological age, this difference can identify accelerated onset of age-related disease. Finally, we show that the models learn insights which can improve our understanding of how genetic factors influence aging.

We’ve publicly released the code modifications used for these models which build on ML frameworks for analyzing retina images that we have previously publicly released.

It is our hope that this work will help scientists create better processes to identify disease and disease risk early, and lead to more effective drug and lifestyle interventions to promote healthy aging.

Acknowledgments

This work is the outcome of the combined efforts of multiple groups. We thank all contributors: Sara Ahadi, Boris Babenko, Cory McLean, Drew Bryant, Orion Pritchard, Avinash Varadarajan, Marc Berndl and Ali Bashir (Google Research), Kenneth Wilson, Enrique Carrera and Pankaj Kapahi (Buck Institute of Aging Research), and Ricardo Lamy and Jay Stewart (University of California, San Francisco). We would also like to thank Michelle Dimon and John Platt for reviewing the manuscript, and Preeti Singh for helping with publication logistics.

Towards ML-enabled cleaning robots

Over the past several years, the capabilities of robotic systems have improved dramatically. As the technology continues to improve and robotic agents are more routinely deployed in real-world environments, their capacity to assist in day-to-day activities will take on increasing importance. Repetitive tasks like wiping surfaces, folding clothes, and cleaning a room seem well-suited for robots, but remain challenging for robotic systems designed for structured environments like factories. Performing these types of tasks in more complex environments, like offices or homes, requires dealing with greater levels of environmental variability captured by high-dimensional sensory inputs, from images plus depth and force sensors.

For example, consider the task of wiping a table to clean a spill or brush away crumbs. While this task may seem simple, in practice, it encompasses many interesting challenges that are omnipresent in robotics. Indeed, at a high-level, deciding how to best wipe a spill from an image observation requires solving a challenging planning problem with stochastic dynamics: How should the robot wipe to avoid dispersing the spill perceived by a camera? But at a low-level, successfully executing a wiping motion also requires the robot to position itself to reach the problem area while avoiding nearby obstacles, such as chairs, and then to coordinate its motions to wipe clean the surface while maintaining contact with the table. Solving this table wiping problem would help researchers address a broader range of robotics tasks, such as cleaning windows and opening doors, which require both high-level planning from visual observations and precise contact-rich control.

   

Learning-based techniques such as reinforcement learning (RL) offer the promise of solving these complex visuo-motor tasks from high-dimensional observations. However, applying end-to-end learning methods to mobile manipulation tasks remains challenging due to the increased dimensionality and the need for precise low-level control. Additionally, on-robot deployment either requires collecting large amounts of data, using accurate but computationally expensive models, or on-hardware fine-tuning.

In “Robotic Table Wiping via Reinforcement Learning and Whole-body Trajectory Optimization”, we present a novel approach to enable a robot to reliably wipe tables. By carefully decomposing the task, our approach combines the strengths of RL — the capacity to plan in high-dimensional observation spaces with complex stochastic dynamics — and the ability to optimize trajectories, effectively finding whole-body robot commands that ensure the satisfaction of constraints, such as physical limits and collision avoidance. Given visual observations of a surface to be cleaned, the RL policy selects wiping actions that are then executed using trajectory optimization. By leveraging a new stochastic differential equation (SDE) simulator of the wiping task to train the RL policy for high-level planning, the proposed end-to-end approach avoids the need for task-specific training data and is able to transfer zero-shot to hardware.

Combining the strengths of RL and of optimal control

We propose an end-to-end approach for table wiping that consists of four components: (1) sensing the environment, (2) planning high-level wiping waypoints with RL, (3) computing trajectories for the whole-body system (i.e., for each joint) with optimal control methods, and (4) executing the planned wiping trajectories with a low-level controller.

System Architecture

The novel component of this approach is an RL policy that effectively plans high-level wiping waypoints given image observations of spills and crumbs. To train the RL policy, we completely bypass the problem of collecting large amounts of data on the robotic system and avoid using an accurate but computationally expensive physics simulator. Our proposed approach relies on a stochastic differential equation (SDE) to model latent dynamics of crumbs and spills, which yields an SDE simulator with four key features:

  • It can describe both dry objects pushed by the wiper and liquids absorbed during wiping.
  • It can simultaneously capture multiple isolated spills.
  • It models the uncertainty of the changes to the distribution of spills and crumbs as the robot interacts with them.
  • It is faster than real-time: simulating a wipe only takes a few milliseconds.

The SDE simulator allows simulating dry crumbs (left), which are pushed during each wipe, and spills (right), which are absorbed while wiping. The simulator allows modeling particles with different properties, such as with different absorption and adhesion coefficients and different uncertainty levels.
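To give a flavor of what an SDE-based particle simulator can look like, here is a toy Euler-Maruyama sketch we wrote (not the paper's model): crumbs near the wipe path are pushed along the wipe direction, and Gaussian noise captures the uncertainty in how each particle reacts.

```python
import numpy as np

rng = np.random.default_rng(0)

def wipe_step(particles, wipe_start, wipe_end, width=0.05, sigma=0.02, dt=0.1):
    """One Euler-Maruyama step of a toy crumb SDE: particles within `width` of
    the wipe line are pushed along the wipe direction, plus Gaussian noise that
    models the uncertainty of how each particle reacts."""
    direction = wipe_end - wipe_start
    direction = direction / np.linalg.norm(direction)
    # Distance of each particle from the wipe line (2-D cross product).
    rel = particles - wipe_start
    dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])
    pushed = dist < width
    drift = np.where(pushed[:, None], direction[None, :], 0.0)
    noise = sigma * np.sqrt(dt) * rng.normal(size=particles.shape)
    return np.clip(particles + drift * dt + noise * pushed[:, None], 0.0, 1.0)

# 200 crumbs on a unit table, one wipe across the middle.
crumbs = rng.random((200, 2))
for _ in range(10):  # integrate the SDE over the duration of the wipe
    crumbs = wipe_step(crumbs, np.array([0.1, 0.5]), np.array([0.9, 0.5]))
```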

This SDE simulator is able to rapidly generate large amounts of data for RL training. We validate the SDE simulator using observations from the robot by predicting the evolution of perceived particles for a given wipe. By comparing the result with perceived particles after executing the wipe, we observe that the model correctly predicts the general trend of the particle dynamics. A policy trained with this SDE model should be able to perform well in the real world.

Using this SDE model, we formulate a high-level wiping planning problem and train a vision-based wiping policy using RL. We train entirely in simulation without collecting a dataset using the robot. We simply randomize the initial state of the SDE to cover a wide range of particle dynamics and spill shapes that we may see in the real world.

In deployment, we first convert the robot’s image observations into black and white to better isolate the spills and crumb particles. We then use these “thresholded” images as the input to the RL policy. With this approach we do not require a visually-realistic simulator, which would be complex and potentially difficult to develop, and we are able to minimize the sim-to-real gap.
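For example, a minimal thresholding step might look like the following (the cutoff is an illustrative value and assumes a light-colored table):

```python
import numpy as np

def threshold_observation(rgb_image, cutoff=0.2):
    """Convert an RGB observation (values in [0, 1]) into a binary mask.
    Pixels darker than the cutoff are treated as spills or crumbs."""
    gray = rgb_image.mean(axis=-1)          # simple grayscale conversion
    return (gray < cutoff).astype(np.uint8)
```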

The RL policy’s inputs are thresholded image observations of the cleanliness state of the table. Its outputs are the desired wiping actions. The policy uses a ResNet50 neural network architecture followed by two fully-connected (FC) layers.
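A minimal PyTorch sketch of the described architecture (the hidden width and the action dimensionality are assumptions on our part):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class WipingPolicy(nn.Module):
    """ResNet50 backbone followed by two fully-connected layers, mapping a
    thresholded image of the table to wiping-action parameters (the hidden
    width and the 4-D action, e.g. wipe start/end coordinates, are assumed)."""
    def __init__(self, action_dim=4, hidden=256):
        super().__init__()
        self.backbone = models.resnet50(weights=None)
        self.backbone.fc = nn.Identity()             # expose 2048-D features
        self.head = nn.Sequential(
            nn.Linear(2048, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, thresholded_image):
        return self.head(self.backbone(thresholded_image))

policy = WipingPolicy()
obs = torch.rand(1, 3, 224, 224)   # thresholded observation tiled to 3 channels
action = policy(obs)
```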

The desired wiping motions from the RL policy are executed with a whole-body trajectory optimizer that efficiently computes base and arm joint trajectories. This approach allows satisfying constraints, such as avoiding collisions, and enables zero-shot sim-to-real deployment.

   

Experimental results

We extensively validate our approach in simulation and on hardware. In simulation, our RL policies outperform heuristics-based baselines, requiring significantly fewer wipes to clean spills and crumbs. We also test our policies on problems that were not observed at training time, such as multiple isolated spill areas on the table, and find that the RL policies generalize well to these novel problems.

     
Example of wiping actions selected by the RL policy (left) and wiping performance compared with a baseline (middle, right). The baseline wipes to the center of the table, rotating after each wipe. We report the total dirty surface of the table (middle) and the spread of crumb particles (right) after each additional wipe.

Our approach enables the robot to reliably wipe spills and crumbs (without accidentally pushing debris from the table) while avoiding collisions with obstacles like chairs.

For further results, please check out the video below:

Conclusion

The results from this work demonstrate that complex visuo-motor tasks such as table wiping can be reliably accomplished without expensive end-to-end training and on-robot data collection. The key consists of decomposing the task and combining the strengths of RL, trained using an SDE model of spill and crumb dynamics, with the strengths of trajectory optimization. We see this work as an important step towards general-purpose home-assistive robots. For more details, please check out the original paper.

Acknowledgements

We’d like to thank our coauthors Sumeet Singh, Mario Prats, Jeffrey Bingham, Jonathan Weisz, Benjie Holson, Xiaohan Zhang, Vikas Sindhwani, Yao Lu, Fei Xia, Peng Xu, Tingnan Zhang, and Jie Tan. We’d also like to thank Benjie Holson, Jake Lee, April Zitkovich, and Linda Luu for their help and support in various aspects of the project. We’re particularly grateful to the entire team at Everyday Robots for their partnership on this work, and for developing the platform on which these experiments were conducted.

Directing ML toward natural hazard mitigation through collaboration

Floods are the most common type of natural disaster, affecting more than 250 million people globally each year. As part of Google’s Crisis Response and our efforts to address the climate crisis, we are using machine learning (ML) models for Flood Forecasting to alert people in areas that are impacted before disaster strikes.

Collaboration between researchers in the industry and academia is essential for accelerating progress towards mutual goals in ML-related research. Indeed, Google’s current ML-based flood forecasting approach was developed in collaboration with researchers (1, 2) at the Johannes Kepler University in Vienna, Austria, the University of Alabama, and the Hebrew University of Jerusalem, among others.

Today we discuss our recent Machine Learning Meets Flood Forecasting Workshop, which highlights efforts to bring together researchers from Google and other universities and organizations to advance our understanding of flood behavior and prediction, and build more robust solutions for early detection and warning. We also discuss the Caravan project, which is helping to create an open-source repository for global streamflow data, and is itself an example of a collaboration that developed from the previous Flood Forecasting Meets Machine Learning Workshop.

2023 Machine Learning Meets Flood Forecasting Workshop

The fourth annual Google Machine Learning Meets Flood Forecasting Workshop was held in January. This 2-day virtual workshop hosted over 100 participants from 32 universities, 20 governmental and non-governmental agencies, and 11 private companies. This forum provided an opportunity for hydrologists, computer scientists, and aid workers to discuss challenges and efforts toward improving global flood forecasts, to keep up with state-of-the-art technology advances, and to integrate domain knowledge into ML-based forecasting approaches.

The event included talks from six invited speakers, a series of small-group discussion sessions focused on hydrological modeling, inundation mapping, and hazard alerting–related topics, as well as a presentation by Google on the FloodHub, which provides free, public access to Google’s flood forecasts, up to 7 days in advance.

Invited speakers at the workshop included:

The presentations can be viewed on YouTube:

2023 Flood Forecasting Meets Machine Learning Talks Day 1

2023 Flood Forecasting Meets Machine Learning Talks Day 2

Some of the top challenges highlighted during the workshop were related to the integration of physical and hydrological science with ML to help build trust and reliability; filling gaps in observations of inundated areas with models and satellite data; measuring the skill and reliability of flood warning systems; and improving the communication of flood warnings to diverse, global populations. In addition, participants stressed that addressing these and other challenges will require collaboration between a number of different organizations and scientific disciplines.

The Caravan project

One of the main challenges in conducting successful ML research and creating advanced tools for flood forecasting is the need for large amounts of data for computationally expensive training and evaluation. Today, many countries and organizations collect streamflow data (typically either water levels or flow rates), but it is not standardized or held in a central repository, which makes it difficult for researchers to access.

During the 2019 Machine Learning Meets Flood Forecasting Workshop, a group of researchers identified the need for an open-source, global streamflow data repository, and developed ideas around leveraging free computational resources from Google Earth Engine to address the flood forecasting community’s challenge of data collection and accessibility. Following two years of collaborative work between researchers from Google, the School of Geography at the University of Exeter, the Institute for Machine Learning at Johannes Kepler University, and the Institute for Atmospheric and Climate Science at ETH Zurich, the Caravan project was created.

In “Caravan – A global community dataset for large-sample hydrology”, published in Nature Scientific Data, we describe the project in more detail. Caravan is built around a global dataset for developing and training hydrological models (see figure below). It provides open-source Python scripts that match streamflow data uploaded by users with essential weather and geographical data previously made public on Google Earth Engine. The repository originally contained data from more than 13,000 watersheds in Central Europe, Brazil, Chile, Australia, the United States, Canada, and Mexico, and has since benefited from community contributions from the Geological Survey of Denmark and Greenland, which include streamflow data from most of the watersheds in Denmark. The goal is to continue growing this repository so that researchers can access most of the world’s streamflow data. For more information about contributing to the Caravan dataset, reach out to caravan@google.com.

Locations of the 13,000 streamflow gauges in the Caravan dataset and the distribution of those gauges in GEnS global climate zones.
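
The actual Caravan tooling runs on Google Earth Engine; purely as a hypothetical illustration of the kind of matching the scripts perform, the pandas sketch below joins a user-supplied streamflow time series with already-exported weather and geographical forcing data on gauge ID and date. The file names and column names are invented for the example and are not the project’s real schema.

```python
import pandas as pd

# Hypothetical files and columns -- not the actual Caravan schema.
# streamflow.csv: gauge_id, date, streamflow_mm_per_day
# forcing.csv:    gauge_id, date, precipitation, temperature, ...
streamflow = pd.read_csv("streamflow.csv", parse_dates=["date"])
forcing = pd.read_csv("forcing.csv", parse_dates=["date"])

# Match each streamflow observation with the weather and geographical
# attributes recorded for the same gauge on the same day.
matched = streamflow.merge(forcing, on=["gauge_id", "date"], how="inner")

# The result is a model-ready table: one row per gauge per day, with
# forcing variables as inputs and streamflow as the training target.
print(matched.head())
```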

The path forward

Google plans to continue to host these workshops to help broaden and deepen collaboration between industry and academia in the development of environmental AI models. We are looking forward to seeing what advances might come out of the most recent workshop. Hydrologists and researchers interested in participating in future workshops are encouraged to contact flood-forecasting-meets-ml@google.com.

How Project Starline improves remote communication

How Project Starline improves remote communication

As companies settle into a new normal of hybrid and distributed work, remote communication technology remains critical for connecting and collaborating with colleagues. While this technology has improved, the core user experience often falls short: conversation can feel stilted, attention can be difficult to maintain, and usage can be fatiguing.

Project Starline renders people at natural scale on a 3D display and enables natural eye contact.

At Google I/O 2021 we announced Project Starline, a technology project that combines advances in hardware and software to create a remote communication experience that feels like you’re together, even when you’re thousands of miles apart. This perception of co-presence is created by representing users in 3D at natural scale, enabling eye contact, and providing spatially accurate audio. But to what extent do these technological innovations translate to meaningful, observable improvement in user value compared to traditional video conferencing?

In this blog we share results from a number of studies across a variety of methodologies, finding converging evidence that Project Starline outperforms traditional video conferencing in terms of conversation dynamics, video meeting fatigue, and attentiveness. Some of these results have been published previously, while others are shared here for the first time as preliminary findings.

Improved conversation dynamics

In our qualitative studies, users often describe conversations in Project Starline as “more natural.” However, when asked to elaborate, many have difficulty articulating this concept in a way that fully captures their experience. Because human communication relies partly on unconscious processes like nonverbal behavior, people might have a hard time reflecting on how those processes are affected by a novel technology. To address this challenge, we conducted a series of behavioral lab experiments to shed light on what “more natural” might mean for Project Starline. These experiments employed within-subjects designs in which participants experienced multiple conditions (e.g., meeting in Project Starline vs. traditional video conferencing) in randomized order. This allowed us to control for between-subject differences by comparing how the same individual responded to a variety of conditions, thus increasing statistical power and reducing the sample size necessary to detect statistical differences (sample sizes in our behavioral experiments range from ~20 to 30).

In one study, preliminary data suggest Project Starline improves conversation dynamics by increasing rates of turn-taking. We recruited pairs of participants who had never met each other to have unstructured conversations in both Project Starline and traditional video conferencing. We analyzed the audio from each conversation and found that Project Starline facilitated significantly more dynamic “back and forth” conversations compared to traditional video conferencing. Specifically, participants averaged about 2-3 more speaker hand-offs in Project Starline conversations compared to those in traditional video conferencing across a two-minute subsample of their conversation (a uniform selection at the end of each conversation to help standardize for interpersonal rapport). Participants also rated their Project Starline conversations as significantly more natural (“smooth,” “easy,” “not awkward”) and higher in quality, and reported that it was easier to recognize when it was their turn to speak than in conversations using traditional video conferencing.
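
As a rough sketch of how such a turn-taking metric could be computed (this is not the study’s actual analysis pipeline), the snippet below counts speaker hand-offs in the final two minutes of a conversation, assuming diarization output in the form of (speaker, start, end) segments.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (speaker_id, start_sec, end_sec)

def count_handoffs(segments: List[Segment], window_sec: float = 120.0) -> int:
    """Count speaker changes in the last `window_sec` seconds of a conversation.

    Assumes `segments` come from a diarization step and are sorted by start time.
    This is an illustrative metric, not the study's analysis code.
    """
    if not segments:
        return 0
    conversation_end = max(end for _, _, end in segments)
    window_start = conversation_end - window_sec
    # Keep segments that overlap the final window, in chronological order.
    windowed = [seg for seg in segments if seg[2] > window_start]
    handoffs = 0
    for prev, curr in zip(windowed, windowed[1:]):
        if curr[0] != prev[0]:  # the floor passed to a different speaker
            handoffs += 1
    return handoffs

# Toy example: speakers A and B alternating yields 3 hand-offs in the window.
example = [("A", 0.0, 40.0), ("B", 40.0, 70.0), ("A", 70.0, 100.0), ("B", 100.0, 120.0)]
print(count_handoffs(example))  # -> 3
```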

In another study, participants had conversations with a confederate in both Project Starline and traditional video conferencing. We recorded these conversations to analyze select nonverbal behaviors. In Project Starline, participants were more animated, using significantly more hand gestures (+43%), head nods (+26%), and eyebrow movements (+49%). Participants also reported a significantly better ability to perceive and convey nonverbal cues in Project Starline than in traditional video conferencing. Together with the turn-taking results, these data help explain why conversations in Project Starline may feel more natural.

We recorded participants to quantify their nonverbal behaviors and found that they were more animated in Project Starline (left) compared to traditional video conferencing (right).

Reduced video meeting fatigue

A well-documented challenge of video conferencing, especially within the workplace, is video meeting fatigue. The causes of video meeting fatigue are complex, but one possibility is that video communication is cognitively taxing because it becomes more difficult to convey and interpret nonverbal behavior. Considering previous findings that suggested Project Starline might improve nonverbal communication, we examined whether video meeting fatigue might also be improved (i.e., reduced) compared to traditional video conferencing.

Our study found preliminary evidence that Project Starline indeed reduces video meeting fatigue. Participants held 30-minute mock meetings in Project Starline and traditional video conferencing. Meeting content was standardized across participants using an exercise adapted from academic literature that emulates key elements of a work meeting, such as brainstorming and persuasion. We then measured video meeting fatigue via the Zoom Exhaustion and Fatigue (ZEF) Scale. Additionally, we measured participants’ reaction times on a complex cognitive task originally used in cognitive psychology. We repurposed this task as a proxy for video meeting fatigue based on the assumption that more fatigue would lead to slower reaction times. Participants reported significantly less video meeting fatigue on the ZEF Scale (-31%) and had faster reaction times (-12%) on the cognitive task after using Project Starline compared to traditional video conferencing.

Increased attentiveness

Another challenge with video conferencing is focusing attention on the meeting at hand, rather than on other browser windows or secondary devices.

In our earlier study on nonverbal behavior, we included an exploratory information-retention task. We asked participants to write as much as they could remember about each conversation (one in Project Starline, and one in traditional video conferencing). We found that participants wrote 28% more in this task (by character count) after their conversation in Project Starline. This could be because they paid closer attention when in Project Starline, or possibly because they found conversations in Project Starline more engaging.

To explore the concept of attentiveness further, we conducted a study in which participants wore eye-tracking glasses. This allowed us to calculate the percentage of time participants spent focusing on their conversation partner’s face, an important source of social information in human interaction. Participants had a conversation with a confederate in Project Starline, traditional video conferencing, and in person. We found that participants spent a significantly higher proportion of time looking at their conversation partner’s face in Project Starline (+14%) than they did in traditional video conferencing. In fact, visual attentiveness in Project Starline mirrored that of the in-person condition: participants spent roughly the same proportion of time focusing on their meeting partner’s face in the Project Starline and in-person conditions.

The use of eye-tracking glasses and facial detection software allowed us to quantify participants’ gaze patterns. The video above illustrates how a hypothetical participant’s eye tracking data (red dot) correspond to their meeting partner’s face (white box).
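
As a simplified sketch of this kind of gaze metric (not the team’s actual eye-tracking pipeline), the snippet below computes the fraction of gaze samples that fall inside time-aligned face bounding boxes; the data format is an assumption made for the example.

```python
import numpy as np

def fraction_of_gaze_on_face(gaze_xy: np.ndarray, face_boxes: np.ndarray) -> float:
    """Fraction of gaze samples that land inside the partner's face box.

    gaze_xy:    (N, 2) gaze points in image coordinates.
    face_boxes: (N, 4) per-frame boxes as (x_min, y_min, x_max, y_max),
                time-aligned with the gaze samples.
    This only loosely mirrors the metric described above; the real pipeline
    relies on eye-tracking glasses and face-detection software.
    """
    x, y = gaze_xy[:, 0], gaze_xy[:, 1]
    inside = (
        (x >= face_boxes[:, 0]) & (x <= face_boxes[:, 2])
        & (y >= face_boxes[:, 1]) & (y <= face_boxes[:, 3])
    )
    return float(inside.mean())

# Toy example: 3 of 4 gaze samples fall inside a fixed face box.
gaze = np.array([[0.50, 0.50], [0.60, 0.55], [0.90, 0.90], [0.52, 0.48]])
boxes = np.tile([0.40, 0.40, 0.70, 0.70], (4, 1))
print(fraction_of_gaze_on_face(gaze, boxes))  # -> 0.75
```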

User value in real meetings

The lab-based, experimental approach used in the studies above allows for causal inference while minimizing confounding variables. However, one limitation of these studies is that they are low in external validity — that is, they took place in a lab environment, and the extent to which their results extend to the real world is unclear. Thus, we studied actual users within Google who used Project Starline for their day-to-day work meetings and collected their feedback.

An internal pilot revealed that users derive meaningful value from using Project Starline. We used post-meeting surveys to capture immediate feedback on individual meetings and longer monthly surveys to capture holistic feedback on the experience, and conducted in-depth qualitative interviews with a subset of users. We evaluated Project Starline on four concepts: presence, nonverbal behavior, attentiveness, and personal connection. We found strong evidence that Project Starline delivered across these four metrics, with over 87% of participants expressing that their meetings in Project Starline were better than their previous experiences with traditional video conferencing.

Conclusion

Together, these findings offer a compelling case for Project Starline’s value to users: improved conversation dynamics, reduced video meeting fatigue, and increased attentiveness. Participants expressed that Project Starline was a significant improvement over traditional video conferencing in highly controlled lab experiments, as well as when they used Project Starline for their actual work meetings. We’re excited to see these findings converge across multiple methodologies (surveys, qualitative interviews, experiments) and measurements (self-report, behavioral, qualitative), and we’re eager to continue exploring the implications of Project Starline on human interaction.

Acknowledgments

We’d like to thank Melba Tellez, Eric Baczuk, Jinghua Zhang, Matthew DuVall, and Travis Miller for contributing to visual assets and illustrations.

Pre-trained Gaussian processes for Bayesian optimization

Pre-trained Gaussian processes for Bayesian optimization

Bayesian optimization (BayesOpt) is a powerful tool widely used for global optimization tasks, such as hyperparameter tuning, protein engineering, synthetic chemistry, robot learning, and even baking cookies. BayesOpt is a great strategy for these problems because they all involve optimizing black-box functions that are expensive to evaluate. A black-box function’s underlying mapping from inputs (configurations of the thing we want to optimize) to outputs (a measure of performance) is unknown. However, we can attempt to understand its internal workings by evaluating the function for different combinations of inputs. Because each evaluation can be computationally expensive, we need to find the best inputs in as few evaluations as possible. BayesOpt works by repeatedly constructing a surrogate model of the black-box function and strategically evaluating the function at the most promising or informative input location, given the information observed so far.

Gaussian processes are popular surrogate models for BayesOpt because they are easy to use, can be updated with new data, and provide a confidence level about each of their predictions. The Gaussian process model constructs a probability distribution over possible functions. This distribution is specified by a mean function (what these possible functions look like on average) and a kernel function (how much these functions can vary across inputs). The performance of BayesOpt depends on whether the confidence intervals predicted by the surrogate model contain the black-box function. Traditionally, experts use domain knowledge to quantitatively define the mean and kernel parameters (e.g., the range or smoothness of the black-box function) to express their expectations about what the black-box function should look like. However, for many real-world applications like hyperparameter tuning, it is very difficult to understand the landscapes of the tuning objectives. Even for experts with relevant experience, it can be challenging to narrow down appropriate model parameters.
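
To make the surrogate concrete, here is a minimal sketch of a Gaussian process posterior with a zero mean function and an RBF kernel whose parameters are hand-specified, i.e., the generic construction described above rather than HyperBO’s pre-trained model.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; lengthscale and variance are the kernel parameters."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4, **kernel_params):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train, **kernel_params) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query, **kernel_params)
    k_qq = rbf_kernel(x_query, x_query, **kernel_params)
    solve = np.linalg.solve(k_tt, k_tq)          # K_tt^{-1} K_tq
    mean = solve.T @ y_train
    cov = k_qq - k_tq.T @ solve
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Toy 1-D example: condition on three observations and report 95% intervals.
x_train = np.array([[0.1], [0.4], [0.9]])
y_train = np.array([0.2, -0.1, 0.5])
x_query = np.linspace(0.0, 1.0, 5)[:, None]
mean, std = gp_posterior(x_train, y_train, x_query, lengthscale=0.2)
for x, m, s in zip(x_query.ravel(), mean, std):
    print(f"x={x:.2f}  mean={m:+.3f}  95% CI=({m - 1.96 * s:+.3f}, {m + 1.96 * s:+.3f})")
```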

In “Pre-trained Gaussian processes for Bayesian optimization”, we consider the challenge of hyperparameter optimization for deep neural networks using BayesOpt. We propose Hyper BayesOpt (HyperBO), a highly customizable interface with an algorithm that removes the need for quantifying model parameters for Gaussian processes in BayesOpt. For new optimization problems, experts can simply select previous tasks that are relevant to the current task they are trying to solve. HyperBO pre-trains a Gaussian process model on data from those selected tasks, and automatically defines the model parameters before running BayesOpt. HyperBO enjoys theoretical guarantees on the alignment between the pre-trained model and the ground truth, as well as the quality of its solutions for black-box optimization. We share strong results for HyperBO both on our new tuning benchmarks for near–state-of-the-art deep learning models and on classic multi-task black-box optimization benchmarks (HPO-B). We also demonstrate that HyperBO is robust to the selection of relevant tasks and has low requirements on the amount of data and tasks for pre-training.

In the traditional BayesOpt interface, experts need to carefully select the mean and kernel parameters for a Gaussian process model. HyperBO replaces this manual specification with a selection of related tasks, making Bayesian optimization easier to use. The selected tasks are used for pre-training, where we optimize a Gaussian process such that it can gradually generate functions that are similar to the functions corresponding to those selected tasks. The similarity manifests in individual function values and variations of function values across the inputs.

Loss functions for pre-training

We pre-train a Gaussian process model by minimizing the Kullback–Leibler divergence (a commonly used divergence measure) between the ground truth model and the pre-trained model. Since the ground truth model is unknown, we cannot directly compute this loss function. To address this, we introduce two data-driven approximations: (1) Empirical Kullback–Leibler divergence (EKL), which is the divergence between an empirical estimate of the ground truth model and the pre-trained model; (2) Negative log likelihood (NLL), which is the sum of negative log likelihoods of the pre-trained model for all training functions. The computational cost of EKL or NLL scales linearly with the number of training functions. Moreover, stochastic gradient–based methods like Adam can be employed to optimize the loss functions, which further lowers the cost of computation. In well-controlled environments, optimizing EKL and NLL leads to the same result, but their optimization landscapes can be very different. For example, in the simplest case where the function only has one possible input, its Gaussian process model becomes a Gaussian distribution, described by the mean (m) and variance (s). Hence the loss function only has those two parameters, m and s, and we can visualize EKL and NLL as follows:

We simulate the loss landscapes of EKL (left) and NLL (right) for a simple model with parameters m and s. The colors represent a heatmap of the EKL or NLL values, where red corresponds to higher values and blue denotes lower values. These two loss landscapes are very different, but they both aim to match the pre-trained model with the ground truth model.
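
To make this simplest case concrete, the sketch below evaluates both losses for the one-input setting, treating the model as a Gaussian with parameters m and s (read here as a variance) and using the closed-form KL divergence between Gaussians for the EKL term; this is an illustrative reading of the definitions above, not the paper’s implementation.

```python
import numpy as np

def nll(y, m, s):
    """Negative log likelihood of observed function values y under N(m, s), with s a variance."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * s) + (y - m) ** 2 / s)

def gaussian_kl(m1, s1, m2, s2):
    """KL( N(m1, s1) || N(m2, s2) ) for means m1, m2 and variances s1, s2."""
    return 0.5 * (np.log(s2 / s1) + (s1 + (m1 - m2) ** 2) / s2 - 1.0)

def ekl(y, m, s):
    """Empirical KL: divergence between a Gaussian fit to y and the model N(m, s)."""
    return gaussian_kl(np.mean(y), np.var(y), m, s)

# Function values observed at the single input across training functions.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=0.7, size=50)   # ground truth roughly m*=1.5, s*=0.49

# Evaluate both losses on a small grid of (m, s); both are minimized near the
# ground-truth parameters even though their landscapes look quite different.
for m in (0.0, 1.5, 3.0):
    for s in (0.1, 0.5, 2.0):
        print(f"m={m:.1f} s={s:.1f}  NLL={nll(y, m, s):8.2f}  EKL={ekl(y, m, s):6.2f}")
```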

Pre-training improves Bayesian optimization

In the BayesOpt algorithm, decisions on where to evaluate the black-box function are made iteratively. The decision criteria are based on the confidence levels provided by the Gaussian process, which are updated in each iteration by conditioning on previous data points acquired by BayesOpt. Intuitively, the updated confidence levels should be just right: not overly confident or too unsure, since in either case BayesOpt cannot make decisions that match what an expert would do.
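
The toy loop below illustrates that iteration with one common acquisition rule, an upper confidence bound over a 1-D grid of candidates. The surrogate here is a hand-specified RBF Gaussian process (which HyperBO would replace with its pre-trained model), and the objective is an invented stand-in for an expensive black-box function.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def posterior(xs, ys, xq, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at query points xq."""
    k = rbf(xs, xs) + noise * np.eye(len(xs))
    kq = rbf(xs, xq)
    mean = kq.T @ np.linalg.solve(k, ys)
    var = 1.0 - np.sum(kq * np.linalg.solve(k, kq), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def black_box(x):
    # Stand-in for an expensive objective (e.g., a score to maximize over a hyperparameter).
    return np.sin(3.0 * x) + 0.5 * x

candidates = np.linspace(0.0, 2.0, 201)[:, None]
xs = [np.array([0.3]), np.array([1.7])]               # initial evaluations
ys = [float(black_box(x[0])) for x in xs]

for step in range(8):
    mean, std = posterior(np.stack(xs), np.array(ys), candidates)
    ucb = mean + 2.0 * std                            # upper-confidence-bound acquisition
    x_next = candidates[np.argmax(ucb)]               # most promising/informative candidate
    xs.append(x_next)
    ys.append(float(black_box(x_next[0])))
    print(f"step {step}: queried x={x_next[0]:.3f}, best value so far={max(ys):.3f}")
```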

In HyperBO, we replace the hand-specified model in traditional BayesOpt with the pre-trained Gaussian process. Under mild conditions and with enough training functions, we can mathematically verify good theoretical properties of HyperBO: (1) Alignment: the pre-trained Gaussian process is guaranteed to be close to the ground truth model when both are conditioned on observed data points; (2) Optimality: HyperBO is guaranteed to find a near-optimal solution to the black-box optimization problem for any functions distributed according to the unknown ground truth Gaussian process.

We visualize the Gaussian process (areas shaded in purple are 95% and 99% confidence intervals) conditional on observations (black dots) from an unknown test function (orange line). Compared to traditional BayesOpt without pre-training, the predicted confidence levels in HyperBO capture the unknown test function much better, which is a critical prerequisite for Bayesian optimization.

Empirically, to define the structure of pre-trained Gaussian processes, we choose to use very expressive mean functions modeled by neural networks, and apply well-defined kernel functions on inputs encoded to a higher dimensional space with neural networks.
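
A rough sketch of that structure is below, with small randomly initialized networks standing in for the ones HyperBO would pre-train and with layer sizes chosen arbitrarily: the mean function is an MLP on the raw input, and the kernel is an RBF applied to a neural-network encoding of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HIDDEN, D_ENC = 4, 16, 8   # arbitrary sizes for illustration

# Random weights stand in for networks that would be pre-trained on related tasks.
W1, b1 = rng.normal(size=(D_IN, D_HIDDEN)) / np.sqrt(D_IN), np.zeros(D_HIDDEN)
W_mu, b_mu = rng.normal(size=(D_HIDDEN, 1)) / np.sqrt(D_HIDDEN), np.zeros(1)
W_enc, b_enc = rng.normal(size=(D_HIDDEN, D_ENC)) / np.sqrt(D_HIDDEN), np.zeros(D_ENC)

def hidden(x):
    return np.tanh(x @ W1 + b1)

def mean_fn(x):
    """Expressive mean function: an MLP mapping each input to a scalar mean."""
    return (hidden(x) @ W_mu + b_mu).ravel()

def kernel_fn(xa, xb, lengthscale=1.0):
    """Well-defined RBF kernel applied to neural-network encodings of the inputs."""
    za = hidden(xa) @ W_enc + b_enc
    zb = hidden(xb) @ W_enc + b_enc
    d2 = ((za[:, None, :] - zb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

x = rng.normal(size=(5, D_IN))     # five hyperparameter configurations
print(mean_fn(x))                  # prior mean at each configuration
print(kernel_fn(x, x).shape)       # (5, 5) prior covariance (up to a scale)
```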

To evaluate HyperBO on challenging and realistic black-box optimization problems, we created the PD1 benchmark, which contains a dataset for multi-task hyperparameter optimization for deep neural networks. PD1 was developed by training tens of thousands of configurations of near–state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. PD1 contains approximately 50,000 hyperparameter evaluations from 24 different tasks (e.g., tuning Wide ResNet on CIFAR100) with roughly 12,000 machine days of computation.

We demonstrate that when pre-training for only a few hours on a single CPU, HyperBO can significantly outperform BayesOpt with carefully hand-tuned models on unseen challenging tasks, including tuning ResNet50 on ImageNet. Even with only ~100 data points per training function, HyperBO can perform competitively against baselines.

Tuning validation error rates of ResNet50 on ImageNet and Wide ResNet (WRN) on the Street View House Numbers (SVHN) dataset and CIFAR100. By pre-training on only ~20 tasks and ~100 data points per task, HyperBO can significantly outperform traditional BayesOpt (with a carefully hand-tuned Gaussian process) on previously unseen tasks.

Conclusion and future work

HyperBO is a framework that pre-trains a Gaussian process and subsequently performs Bayesian optimization with a pre-trained model. With HyperBO, we no longer have to hand-specify the exact quantitative parameters in a Gaussian process. Instead, we only need to identify related tasks and their corresponding data for pre-training. This makes BayesOpt both more accessible and more effective. An important future direction is to enable HyperBO to generalize over heterogeneous search spaces, for which we are developing new algorithms by pre-training a hierarchical probabilistic model.

Acknowledgements

The following members of the Google Research Brain Team conducted this research: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. We’d like to thank Zelda Mariet and Matthias Feurer for help and consultation on transfer learning baselines. We’d also like to thank Rif A. Saurous for constructive feedback, and Rodolphe Jenatton and David Belanger for feedback on previous versions of the manuscript. In addition, we thank Sharat Chikkerur, Ben Adlam, Balaji Lakshminarayanan, Fei Sha and Eytan Bakshy for comments, and Setareh Ariafar and Alexander Terenin for conversations on animation. Finally, we thank Tom Small for designing the animation for this post.
