Google Research, 2022 & beyond: Natural sciences

Google Research, 2022 & beyond: Natural sciences

(This is Part 7 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

It’s an incredibly exciting time to be a scientist. With the amazing advances in machine learning (ML) and quantum computing, we now have powerful new tools that enable us to act on our curiosity, collaborate in new ways, and radically accelerate progress toward breakthrough scientific discoveries.

Since joining Google Research eight years ago, I’ve had the privilege of being part of a community of talented researchers fascinated by applying cutting-edge computing to push the boundaries of what is possible in applied science. Our teams are exploring topics across the physical and natural sciences. So, for this year’s blog post I want to focus on high-impact advances we’ve made recently in the fields of biology and physics, from helping to organize the world’s protein and genomics information to benefit people’s lives to improving our understanding of the nature of the universe with quantum computers. We are inspired by the great potential of this work.

Using machine learning to unlock mysteries in biology

Many of our researchers are fascinated by the extraordinary complexity of biology, from the mysteries of the brain, to the potential of proteins, and to the genome, which encodes the very language of life. We’ve been working alongside scientists from other leading organizations around the world to tackle important challenges in the fields of connectomics, protein function prediction, and genomics, and to make our innovations accessible and useful to the greater scientific community.

Neurobiology

One exciting application of our Google-developed ML methods was to explore how information travels through the neuronal pathways in the brains of zebrafish, which provides insight into how the fish engage in social behavior like swarming. In collaboration with researchers from the Max Planck Institute for Biological Intelligence, we were able to computationally reconstruct a portion of zebrafish brains imaged with 3D electron microscopy — an exciting advance in the use of imaging and computational pipelines to map out the neuronal circuitry in small brains, and another step forward in our long-standing contributions to the field of connectomics.

Reconstruction of the neural circuitry of a larval zebrafish brain, courtesy of the Max Planck Institute for Biological Intelligence.

The technical advances necessary for this work will have applications even beyond neuroscience. For example, to address the difficulty of working with such large connectomics datasets, we developed and released TensorStore, an open-source C++ and Python software library designed for storage and manipulation of n-dimensional data. We look forward to seeing the ways it is used in other fields for the storage of large datasets.

We’re also using ML to shed light on how human brains perform remarkable feats like language by comparing human language processing and autoregressive deep language models (DLMs). For this study, a collaboration with colleagues at Princeton University and New York University Grossman School of Medicine, participants listened to a 30-minute podcast while their brain activity was recorded using electrocorticography. The recordings suggested that the human brain and DLMs share computational principles for processing language, including continuous next-word prediction, reliance on contextual embeddings, and calculation of post-onset surprise based on word match (we can measure how surprised the human brain is by the word, and correlate that surprise signal with how well the word is predicted by the DLM). These results provide new insights into language processing in the human brain, and suggest that DLMs can be used to reveal valuable insights about the neural basis of language.

Biochemistry

ML has also allowed us to make significant advances in understanding biological sequences. In 2022, we leveraged recent advances in deep learning to accurately predict protein function from raw amino acid sequences. We also worked in close collaboration with the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) to carefully assess model performance and add hundreds of millions of functional annotations to the public protein databases UniProt, Pfam/InterPro, and MGnify. Human annotation of protein databases can be a laborious and slow process and our ML methods enabled a giant leap forward — for example, increasing the number of Pfam annotations by a larger number than all other efforts during the past decade combined. The millions of scientists worldwide who access these databases each year can now use our annotations for their research.

Google Research contributions to Pfam exceed in size all expansion efforts made to the database over the last decade.

Although the first draft of the human genome was released in 2003, it was incomplete and had many gaps due to technical limitations in the sequencing technologies. In 2022 we celebrated the remarkable achievements of the Telomere-2-Telomere (T2T) Consortium in resolving these previously unavailable regions — including five full chromosome arms and nearly 200 million base pairs of novel DNA sequences — which are interesting and important for questions of human biology, evolution, and disease. Our open source genomics variant caller, DeepVariant, was one of the tools used by the T2T Consortium to prepare their release of a complete 3.055 billion base pair sequence of a human genome. The T2T Consortium is also using our newer open source method DeepConsensus, which provides on-device error correction for Pacific Biosciences long-read sequencing instruments, in their latest research toward comprehensive pan-genome resources that can represent the breadth of human genetic diversity.

Using quantum computing for new physics discoveries

When it comes to making scientific discoveries, quantum computing is still in its infancy, but has a lot of potential. We’re exploring ways of advancing the capabilities of quantum computing so that it can become a tool for scientific discovery and breakthroughs. In collaboration with physicists from around the world, we are also starting to use our existing quantum computers to create interesting new experiments in physics.

As an example of such experiments, consider the problem where a sensor measures something, and a computer then processes the data from the sensor. Traditionally, this means the sensor’s data is processed as classical information on our computers. Instead, one idea in quantum computing is to directly process quantum data from sensors. Feeding data from quantum sensors directly to quantum algorithms without going through classical measurements may provide a large advantage. In a recent Science paper written in collaboration with researchers from multiple universities, we show that quantum computing can extract information from exponentially fewer experiments than classical computing, as long as the quantum computer is coupled directly to the quantum sensors and is running a learning algorithm. This “quantum machine learning” can yield an exponential advantage in dataset size, even with today’s noisy intermediate-scale quantum computers. Because experimental data is often the limiting factor in scientific discovery, quantum ML has the potential to unlock the vast power of quantum computers for scientists. Even better, the insights from this work are also applicable to learning on the output of quantum computations, such as the output of quantum simulations that may otherwise be difficult to extract.

Even without quantum ML, a powerful application of quantum computers is to experimentally explore quantum systems that would be otherwise impossible to observe or simulate. In 2022, the Quantum AI team used this approach to observe the first experimental evidence of multiple microwave photons in a bound state using superconducting qubits. Photons typically do not interact with one another, and require an additional element of non-linearity to cause them to interact. The results of our quantum computer simulations of these interactions surprised us — we thought the existence of these bound states relied on fragile conditions, but instead we found that they were robust even to relatively strong perturbations that we applied.

Occupation probability versus discrete time step for n-photon bound states. We observe that the majority of the photons (darker colors) remain bound together.

Given the initial successes we have had in applying quantum computing to make physics breakthroughs, we are hopeful about the possibility of this technology to enable future groundbreaking discoveries that could have as significant a societal impact as the creation of transistors or GPS. The future of quantum computing as a scientific tool is exciting!

Acknowledgements

I would like to thank everyone who worked hard on the advances described in this post, including the Google Applied Sciences, Quantum AI, Genomics and Brain teams and their collaborators across Google Research and externally. Finally, I would like to thank the many Googlers who provided feedback in the writing of this post, including Lizzie Dorfman, Erica Brand, Elise Kleeman, Abe Asfaw, Viren Jain, Lucy Colwell, Andrew Carroll, Ariel Goldstein and Charina Chou.

Top

Google Research, 2022 & beyond

This was the seventh blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Read More

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever variety has more data available online, which disproportionately affects speakers of under-resourced language varieties.

In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset used to measure MT systems’ ability to support regional varieties through a case study on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide.

Challenge: Few-Shot Generalization

Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data doesn’t specify what regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.

An illustration of a few-shot MT system translating the English sentence, “The bus arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).

Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system. While our work is specific to regional varieties of two languages, we anticipate that methods that perform well will be readily applicable to other languages and regional varieties. In principle, those methods should also work for other language distinctions, such as formality and style.

Data Collection

The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40b dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. In order to highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, (2) Entity, and (3) Random.

  1. The Lexical bucket focuses on regional differences in word choice, such as the “ônibus” vs. “autocarro” distinction when translating a sentence with the word “bus” into Brazilian vs. European Portuguese, respectively. We manually collected 20-30 terms that have regionally distinctive translations according to blogs and educational websites, and filtered and vetted the translations with feedback from volunteer native speakers from each region. Given the resulting list of English terms, we extracted texts of up to 100 sentences each from the associated English Wikipedia articles (e.g., bus). The same process was carried out independently for Mandarin.
  2. The Entity bucket is populated in a similar way and concerns people, locations or other entities strongly associated with one of the two regions in question for a given language. Consider an illustrative sentence like, “In Lisbon, I often took the bus.” In order to translate this correctly into Brazilian Portuguese, a model must overcome two potential pitfalls:
    1. The strong geographical association between Lisbon and Portugal might influence a model to generate a European Portuguese translation instead, e.g., by selecting “autocarro” rather than “ônibus“.
    2. Replacing “Lisbon” with “Brasília” might be a naive way for a model to localize its output toward Brazilian Portuguese, but would be semantically inaccurate, even in an otherwise fluent translation.
  3. The Random bucket is used to check that a model correctly handles other diverse phenomena, and consists of text from 100 randomly sampled articles from Wikipedia’s “featured” and “good” collections.

Evaluation Methodology

To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multi-dimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; so a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not the other. We found that in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.

While human evaluation is the best way to be sure of model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers can use to evaluate their models on our benchmark, and considered chrF, BLEU, and BLEURT. Using the translations from a few baseline models that were also evaluated by our MQM raters, we discovered that BLEURT has the best correlation with human judgments, and that the strength of that correlation (0.65 Pearson correlation coefficient, ρ) is comparable to the inter-annotator consistency (0.70 intraclass correlation).

Metric       Pearson’s ρ
chrF       0.48
BLEU       0.58
BLEURT       0.65

Correlation between different automatic metrics and human judgements of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.

System Performance

Our evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.

Google’s recent language model, PaLM, was rated best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructive prompt into the model and then generate text from it to fill in the blank (see the example shown below).

    Translate the following texts from English to European Portuguese.
English: [English example 1].
European Portuguese: [correct translation 1].
...
English: [input].
European Portuguese: _____"

PaLM obtained strong results using a single example, and had marginal quality gains on Portuguese when increasing to ten examples. This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way. Our results also suggest language models like PaLM may be particularly adept at memorizing region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.

MQM performance across dataset buckets using human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so for all Portuguese buckets and the Mandarin lexical bucket only.

Conclusion

In the near future, we hope to see a world where language generation systems, especially machine translation, can support all speaker communities. We want to meet users where they are, generating language fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare performance for region-aware MT models. Validated via our thorough human-evaluation studies, the language varieties in FRMT have significant differences that outputs from region-aware MT models should reflect. We are excited to see how researchers utilize this benchmark in development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.

Acknowledgements

We gratefully acknowledge our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For essential feedback around specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue. For logistical support in collecting human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.

Read More

FriendlyCore: A novel differentially private aggregation framework

FriendlyCore: A novel differentially private aggregation framework

Differential privacy (DP) machine learning algorithms protect user data by limiting the effect of each data point on an aggregated output with a mathematical guarantee. Intuitively the guarantee implies that changing a single user’s contribution should not significantly change the output distribution of the DP algorithm.

However, DP algorithms tend to be less accurate than their non-private counterparts because satisfying DP is a worst-case requirement: one has to add noise to “hide” changes in any potential input point, including “unlikely points’’ that have a significant impact on the aggregation. For example, suppose we want to privately estimate the average of a dataset, and we know that a sphere of diameter, Λ, contains all possible data points. The sensitivity of the average to a single point is bounded by Λ, and therefore it suffices to add noise proportional to Λ to each coordinate of the average to ensure DP.

A sphere of diameter Λ containing all possible data points.

Now assume that all the data points are “friendly,” meaning they are close together, and each affects the average by at most 𝑟, which is much smaller than Λ. Still, the traditional way for ensuring DP requires adding noise proportional to Λ to account for a neighboring dataset that contains one additional “unfriendly” point that is unlikely to be sampled.

Two adjacent datasets that differ in a single outlier. A DP algorithm would have to add noise proportional to Λ to each coordinate to hide this outlier.

In “FriendlyCore: Practical Differentially Private Aggregation”, presented at ICML 2022, we introduce a general framework for computing differentially private aggregations. The FriendlyCore framework pre-processes data, extracting a “friendly” subset (the core) and consequently reducing the private aggregation error seen with traditional DP algorithms. The private aggregation step adds less noise since we do not need to account for unfriendly points that negatively impact the aggregation.

In the averaging example, we first apply FriendlyCore to remove outliers, and in the aggregation step, we add noise proportional to 𝑟 (not Λ). The challenge is to make our overall algorithm (outlier removal + aggregation) differentially private. This constrains our outlier removal scheme and stabilizes the algorithm so that two adjacent inputs that differ by a single point (outlier or not) should produce any (friendly) output with similar probabilities.

FriendlyCore Framework

We begin by formalizing when a dataset is considered friendly, which depends on the type of aggregation needed and should capture datasets for which the sensitivity of the aggregate is small. For example, if the aggregate is averaging, the term friendly should capture datasets with a small diameter.

To abstract away the particular application, we define friendliness using a predicate 𝑓 that is positive on points 𝑥 and 𝑦 if they are “close” to each other. For example,in the averaging application 𝑥 and 𝑦 are close if the distance between them is less than 𝑟. We say that a dataset is friendly (for this predicate) if every pair of points 𝑥 and 𝑦 are both close to a third point 𝑧 (not necessarily in the data).

Once we have fixed 𝑓 and defined when a dataset is friendly, two tasks remain. First, we construct the FriendlyCore algorithm that extracts a large friendly subset (the core) of the input stably. FriendlyCore is a filter satisfying two requirements: (1) It has to remove outliers to keep only elements that are close to many others in the core, and (2) for neighboring datasets that differ by a single element, 𝑦, the filter outputs each element except 𝑦 with almost the same probability. Furthermore, the union of the cores extracted from these neighboring datasets is friendly.

The idea underlying FriendlyCore is simple: The probability that we add a point, 𝑥, to the core is a monotonic and stable function of the number of elements close to 𝑥. In particular, if 𝑥 is close to all other points, it’s not considered an outlier and can be kept in the core with probability 1.

Second, we develop the Friendly DP algorithm that satisfies a weaker notion of privacy by adding less noise to the aggregate. This means that the outcomes of the aggregation are guaranteed to be similar only for neighboring datasets 𝐶 and 𝐶’ such that the union of 𝐶 and 𝐶’ is friendly.

Our main theorem states that if we apply a friendly DP aggregation algorithm to the core produced by a filter with the requirements listed above, then this composition is differentially private in the regular sense.

Clustering and other applications

Other applications of our aggregation method are clustering and learning the covariance matrix of a Gaussian distribution. Consider the use of FriendlyCore to develop a differentially private k-means clustering algorithm. Given a database of points, we partition it into random equal-size smaller subsets and run a good non-private k-means clustering algorithm on each small set. If the original dataset contains k large clusters then each smaller subset will contain a significant fraction of each of these k clusters. It follows that the tuples (ordered sets) of k-centers we get from the non-private algorithm for each small subset are similar. This dataset of tuples is expected to have a large friendly core (for an appropriate definition of closeness).

We use our framework to aggregate the resulting tuples of k-centers (k-tuples). We define two such k-tuples to be close if there is a matching between them such that a center is substantially closer to its mate than to any other center.

In this picture, any pair of the red, blue, and green tuples are close to each other, but none of them is close to the pink tuple. So the pink tuple is removed by our filter and is not in the core.

We then extract the core by our generic sampling scheme and aggregate it using the following steps:

  1. Pick a random k-tuple 𝑇 from the core.
  2. Partition the data by putting each point in a bucket according to its closest center in 𝑇.
  3. Privately average the points in each bucket to get our final k-centers.

Empirical results

Below are the empirical results of our algorithms based on FriendlyCore. We implemented them in the zero-Concentrated Differential Privacy (zCDP) model, which gives improved accuracy in our setting (with similar privacy guarantees as the more well-known (𝜖, 𝛿)-DP).

Averaging

We tested the mean estimation of 800 samples from a spherical Gaussian with an unknown mean. We compared it to the algorithm CoinPress. In contrast to FriendlyCore, CoinPress requires an upper bound 𝑅 on the norm of the mean. The figures below show the effect on accuracy when increasing 𝑅 or the dimension 𝑑. Our averaging algorithm performs better on large values of these parameters since it is independent of 𝑅 and 𝑑.

Left: Averaging in 𝑑= 1000, varying 𝑅. Right: Averaging with 𝑅= √𝑑, varying 𝑑.

Clustering

We tested the performance of our private clustering algorithm for k-means. We compared it to the Chung and Kamath algorithm that is based on recursive locality-sensitive hashing (LSH-clustering). For each experiment, we performed 30 repetitions and present the medians along with the 0.1 and 0.9 quantiles. In each repetition, we normalize the losses by the loss of k-means++ (where a smaller number is better).

The left figure below compares the k-means results on a uniform mixture of eight separated Gaussians in two dimensions. For small values of 𝑛 (the number of samples from the mixture), FriendlyCore often fails and yields inaccurate results. Yet, increasing 𝑛 increases the success probability of our algorithm (because the generated tuples become closer to each other) and yields very accurate results, while LSH-clustering lags behind.

Left: k-means results in 𝑑= 2 and k= 8, for varying 𝑛(number of samples). Right: A graphical illustration of the centers in one of the iterations for 𝑛= 2 X 105. Green points are the centers of our algorithm and the red points are the centers of LSH-clustering.

FriendlyCore also performs well on large datasets, even without clear separation into clusters. We used the Fonollosa and Huerta gas sensors dataset that contains 8M rows, consisting of a 16-dimensional point defined by 16 sensors’ measurements at a given point in time. We compared the clustering algorithms for varying k. FriendlyCore performs well except for k= 5 where it fails due to the instability of the non-private algorithm used by our method (there are two different solutions for k= 5 with similar cost that makes our approach fail since we do not get one set of tuples that are close to each other).

k-means results on gas sensors’ measurements over time, varying k.

Conclusion

FriendlyCore is a general framework for filtering metric data before privately aggregating it. The filtered data is stable and makes the aggregation less sensitive, enabling us to increase its accuracy with DP. Our algorithms outperform private algorithms tailored for averaging and clustering, and we believe this technique can be useful for additional aggregation tasks. Initial results show that it can effectively reduce utility loss when we deploy DP aggregations. To learn more, and see how we apply it for estimating the covariance matrix of a Gaussian distribution, see our paper.

Acknowledgements

This work was led by Eliad Tsfadia in collaboration with Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Avinatan Hassidim and Yossi Matias.

Read More

Google Research, 2022 & beyond: Robotics

Google Research, 2022 & beyond: Robotics

(This is Part 6 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robotics can be broadly useful in helping with practical day-to-day tasks in people-centered spaces — spaces designed for people, not machines — they need to be able to safely & competently provide assistance to people.

In 2022, we focused on challenges that come with enabling robots to be more helpful to people: 1) allowing robots and humans to communicate more efficiently and naturally; 2) enabling robots to understand and apply common sense knowledge in real-world situations; and 3) scaling the number of low-level skills robots need to effectively perform tasks in unstructured environments.

An undercurrent this past year has been the exploration of how large, generalist models, like PaLM, can work alongside other approaches to surface capabilities allowing robots to learn from a breadth of human knowledge and allowing people to engage with robots more naturally. As we do this, we’re transforming robot learning into a scalable data problem so that we can scale learning of generalized low-level skills, like manipulation. In this blog post, we’ll review key learnings and themes from our explorations in 2022.

Bringing the capabilities of LLMs to robotics

An incredible feature of large language models (LLMs) is their ability to encode descriptions and context into a format that’s understandable by both people and machines. When applied to robotics, LLMs let people task robots more easily — just by asking — with natural language. When combined with vision models and robotics learning approaches, LLMs give robots a way to understand the context of a person’s request and make decisions about what actions should be taken to complete it.

One of the underlying concepts is using LLMs to prompt other pretrained models for information that can build context about what is happening in a scene and make predictions about multimodal tasks. This is similar to the socratic method in teaching, where a teacher asks students questions to lead them through a rational thought process. In “Socratic Models”, we showed that this approach can achieve state-of-the-art performance in zero-shot image captioning and video-to-text retrieval tasks. It also enables new capabilities, like answering free-form questions about and predicting future activity from video, multimodal assistive dialogue, and as we’ll discuss next, robot perception and planning.

In “Towards Helpful Robots: Grounding Language in Robotic Affordances”, we partnered with Everyday Robots to ground the PaLM language model in a robotics affordance model to plan long horizon tasks. In previous machine-learned approaches, robots were limited to short, hard-coded commands, like “Pick up the sponge,” because they struggled with reasoning about the steps needed to complete a task — which is even harder when the task is given as an abstract goal like, “Can you help clean up this spill?”

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

For this approach to work, one needs to have both an LLM that can predict the sequence of steps to complete long horizon tasks and an affordance model representing the skills a robot can actually do in a given situation. In “Extracting Skill-Centric State Abstractions from Value Functions”, we showed that the value function in reinforcement learning (RL) models can be used to build the affordance model — an abstract representation of the actions a robot can perform under different states. This lets us connect long-horizons of real-world tasks, like “tidy the living room”, to the short-horizon skills needed to complete the task, like correctly picking, placing, and arranging items.

Having both an LLM and an affordance model doesn’t mean that the robot will actually be able to complete the task successfully. However, with Inner Monologue, we closed the loop on LLM-based task planning with other sources of information, like human feedback or scene understanding, to detect when the robot fails to complete the task correctly. Using a robot from Everyday Robots, we show that LLMs can effectively replan if the current or previous plan steps failed, allowing the robot to recover from failures and complete complex tasks like “Put a coke in the top drawer,” as shown in the video below.

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

An emergent capability from closing the loop on LLM-based task planning that we saw with Inner Monologue is that the robot can react to changes in the high-level goal mid-task. For example, a person might tell the robot to change its behavior as it is happening, by offering quick corrections or redirecting the robot to another task. This behavior is especially useful to let people interactively control and customize robot tasks when robots are working near people.

While natural language makes it easier for people to specify and modify robot tasks, one of the challenges is being able to react in real time to the full vocabulary people can use to describe tasks that a robot is capable of doing. In “Talking to Robots in Real Time”, we demonstrated a large-scale imitation learning framework for producing real-time, open-vocabulary, language-conditionable robots. With one policy we were able to address over 87,000 unique instructions, with an estimated average success rate of 93.5%. As part of this project, we released Language-Table, the largest available language-annotated robot dataset, which we hope will drive further research focused on real-time language-controllable robots.

Examples of long horizon goals reached under real time human language guidance.

We’re also excited about the potential for LLMs to write code that can control robot actions. Code-writing approaches, like in “Robots That Write Their Own Code”, show promise in increasing the complexity of tasks robots can complete by autonomously generating new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime.

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.

Turning robot learning into a scalable data problem

Large language and multimodal models help robots understand the context in which they’re operating, like what’s happening in a scene and what the robot is expected to do. But robots also need low-level physical skills to complete tasks in the physical world, like picking up and precisely placing objects.

While we often take these physical skills for granted, executing them hundreds of times every day without even thinking, they present significant challenges to robots. For example, to pick up an object, the robot needs to perceive and understand the environment, reason about the spatial relation and contact dynamics between its gripper and the object, actuate the high degrees-of-freedom arm precisely, and exert the right amount of force to stably grasp the object without breaking it. The difficulty of learning these low-level skills is known as Moravec’s paradox: reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources.

Inspired by the recent success of LLMs, which shows that the generalization and performance of large Transformer-based models scale with the amount of data, we are taking a data-driven approach, turning the problem of learning low-level physical skills into a scalable data problem. With Robotics Transformer-1 (RT-1), we trained a robot manipulation policy on a large-scale, real-world robotics dataset of 130k episodes that cover 700+ tasks using a fleet of 13 robots from Everyday Robots and showed the same trend for robotics — increasing the scale and diversity of data improves the model ability to generalize to new tasks, environments, and objects.

Example PaLM-SayCan-RT1 executions of long-horizon tasks in real kitchens.

Behind both language models and many of our robotics learning approaches, like RT-1, are Transformers, which allow models to make sense of Internet-scale data. Unlike LLMs, robotics is challenged by multimodal representations of constantly changing environments and limited compute. In 2020, we introduced Performers as an approach to make Transformers more computationally efficient, which has implications for many applications beyond robotics. In Performer-MPC, we applied this to introduce a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints from Model Predictive Control (MPC). We show a >40% improvement on the robot reaching its goal and a >65% improvement on social metrics when navigating around humans in comparison to a standard MPC policy. Performer-MPC provides 8 ms latency for the 8.3M parameter model, making on-robot deployment of Transformers practical.

Navigation robot maneuvering through highly constrained spaces using: Regular MPC, Explicit Policy, and Performer-MPC.

In the last year, our team has shown that data-driven approaches are generally applicable on different robotic platforms in diverse environments to learn a wide range of tasks, including mobile manipulation, navigation, locomotion and table tennis. This shows us a clear path forward for learning low-level robot skills: scalable data collection. Unlike video and text data that is abundant on the Internet, robotic data is extremely scarce and hard to acquire. Finding approaches to collect and efficiently use rich datasets representative of real-world interactions is the key for our data-driven approaches.

Simulation is a fast, safe, and easily parallelizable option, but it is difficult to replicate the full environment, especially physics and human-robot interactions, in simulation. In i-Sim2Real, we showed an approach to address the sim-to-real gap and learn to play table tennis with a human opponent by bootstrapping from a simple model of human behavior and alternating between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

Learning to play table tennis with a human opponent.

While simulation helps, collecting data in the real world is essential for fine-tuning simulation policies or adapting existing policies in new environments. While learning, robots are prone to failure, which can cause damage to itself and surroundings — especially in the early stages of learning where they are exploring how to interact with the world. We need to collect training data safely, even while the robot is learning, and enable the robot to autonomously recover from failure. In “Learning Locomotion Skills Safely in the Real World”, we introduced a safe RL framework that switches between a “learner policy” optimized to perform the desired task and a “safe recovery policy” that prevents the robot from unsafe states. In “Legged Robots that Keep on Learning”, we trained a reset policy so the robot can recover from failures, like learning to stand up by itself after falling.

Automatic reset policies enable the robot to continue learning in a lifelong fashion without human supervision.

While robot data is scarce, videos of people performing different tasks are abundant. Of course, robots aren’t built like people — so the idea of robotic learning from people raises the problem of transferring learning across different embodiments. In “Robot See, Robot Do”, we developed Cross-Embodiment Inverse Reinforcement Learning to learn new tasks by watching people. Instead of trying to replicate the task exactly as a person would, we learn the high-level task objective, and summarize that knowledge in the form of a reward function. This type of demonstration learning could allow robots to learn skills by watching videos readily available on the internet.

We’re also progressing towards making our learning algorithms more data efficient so that we’re not relying only on scaling data collection. We improved the efficiency of RL approaches by incorporating prior information, including predictive information, adversarial motion priors, and guide policies. Further improvements are gained by utilizing a novel structured dynamical systems architecture and combining RL with trajectory optimization, supported by novel solvers. These types of prior information helped alleviate the exploration challenges, served as good regularizers, and significantly reduced the amount of data required. Furthermore, our team has invested heavily in more data-efficient imitation learning. We showed that a simple imitation learning approach, BC-Z, can enable zero-shot generalization to new tasks that were not seen during training. We also introduced an iterative imitation learning algorithm, GoalsEye, which combined Learning from Play and Goal-Conditioned Behavior Cloning for high-speed and high-precision table tennis games. On the theoretical front, we investigated dynamical-systems stability for characterizing the sample complexity of imitation learning, and the role of capturing failure-and-recovery within demonstration data to better condition offline learning from smaller datasets.

Closing

Advances in large models across the field of AI have spurred a leap in capabilities for robot learning. This past year, we’ve seen the sense of context and sequencing of events captured in LLMs help solve long-horizon planning for robotics and make robots easier for people to interact with and task. We’ve also seen a scalable path to learning robust and generalizable robot behaviors by applying a transformer model architecture to robot learning. We continue to open source data sets, like “Scanned Objects: A Dataset of 3D-Scanned Common Household Items”, and models, like RT-1, in the spirit of participating in the broader research community. We’re excited about building on these research themes in the coming year to enable helpful robots.

Acknowledgements

We would like to thank everyone who supported our research. This includes the entire Robotics at Google team, and collaborators from Everyday Robots and Google Research. We also want to thank our external collaborators, including UC Berkeley, Stanford, Gatech, University of Washington, MIT, CMU and U Penn.

Top

Google Research, 2022 & beyond

This was the sixth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models Computer Vision Multimodal Models
Generative Models Responsible AI ML & Computer Systems
Efficient Deep Learning Algorithmic Advances Robotics
Health* General Science & Quantum Community Engagement
* Articles will be linked as they are released.

Read More

Google Research, 2022 & beyond: Algorithmic advances

Google Research, 2022 & beyond: Algorithmic advances

(This is Part 5 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Robust algorithm design is the backbone of systems across Google, particularly for our ML and AI models. Hence, developing algorithms with improved efficiency, performance and speed remains a high priority as it empowers services ranging from Search and Ads to Maps and YouTube. Google Research has been at the forefront of this effort, developing many innovations from privacy-safe recommendation systems to scalable solutions for large-scale ML. In 2022, we continued this journey, and advanced the state-of-the-art in several related areas. Here we highlight our progress in a subset of these, including scalability, privacy, market algorithms, and algorithmic foundations.

Scalable algorithms: Graphs, clustering, and optimization

As the need to handle large-scale datasets increases, scalability and reliability of complex algorithms that also exhibit improved explainability, robustness, and speed remain a high priority. We continued our efforts in developing new algorithms for handling large datasets in various areas, including unsupervised and semi-supervised learning, graph-based learning, clustering, and large-scale optimization.

An important component of such systems is to build a similarity graph — a nearest-neighbor graph that represents similarities between objects. For scalability and speed, this graph should be sparse without compromising quality. We proposed a 2-hop spanner technique, called STAR, as an efficient and distributed graph building strategy, and showed how it significantly decreases the number of similarity computations in theory and practice, building much sparser graphs while producing high-quality graph learning or clustering outputs. As an example, for graphs with 10T edges, we demonstrate ~100-fold improvements in pairwise similarity comparisons and significant running time speedups with negligible quality loss. We had previously applied this idea to develop massively parallel algorithms for metric, and minimum-size clustering. More broadly in the context of clustering, we developed the first linear-time hierarchical agglomerative clustering (HAC) algorithm as well as DBSCAN, the first parallel algorithm for HAC with logarithmic depth, which achieves 50x speedup on 100B-edge graphs. We also designed improved sublinear algorithms for different flavors of clustering problems such as geometric linkage clustering, constant-round correlation clustering, and fully dynamic k-clustering.

Inspired by the success of multi-core processing (e.g., GBBS), we embarked on a mission to develop graph mining algorithms that can handle graphs with 100B edges on a single multi-core machine. The big challenge here is to achieve fast (e.g., sublinear) parallel running time (i.e., depth). Following our previous work for community detection and correlation clustering, we developed an algorithm for HAC, called ParHAC, which has provable polylogarithmic depth and near-linear work and achieves a 50x speedup. As an example, it took ParHAC only ~10 minutes to find an approximate affinity hierarchy over a graph of over 100B edges, and ~3 hours to find the full HAC on a single machine. Following our previous work on distributed HAC, we use these multi-core algorithms as a subroutine within our distributed algorithms in order to handle tera-scale graphs.

We also had a number of interesting results on graph neural networks (GNN) in 2022. We provided a model-based taxonomy that unified many graph learning methods. In addition, we discovered insights for GNN models from their performance across thousands of graphs with varying structure (shown below). We also proposed a new hybrid architecture to overcome the depth requirements of existing GNNs for solving fundamental graph problems, such as shortest paths and the minimum spanning tree.

Relative performance results of three GNN variants (GCN, APPNP, FiLM) across 50,000 distinct node classification datasets in GraphWorld. We find that academic GNN benchmark datasets exist in regions where model rankings do not change. GraphWorld can discover previously unexplored graphs that reveal new insights about GNN architectures.

Furthermore, to bring some of these many advances to the broader community, we had three releases of our flagship modeling library for building graph neural networks in TensorFlow (TF-GNN). Highlights include a model library and model orchestration API to make it easy to compose GNN solutions. Following our NeurIPS’20 workshop on Mining and Learning with Graphs at Scale, we ran a workshop on graph-based learning at ICML’22, and a tutorial for GNNs in TensorFlow at NeurIPS’22.

In “Robust Routing Using Electrical Flows”, we presented a recent paper that proposed a Google Maps solution to efficiently compute alternate paths in road networks that are resistant to failures (e.g., closures, incidents). We demonstrate how it significantly outperforms the state-of-the-art plateau and penalty methods on real-world road networks.

Example of how we construct the electrical circuit corresponding to the road network. The current can be decomposed into three flows, i1, i2 and i3, each of which corresponds to a viable alternate path from Fremont, CA to San Rafael, CA.

On the optimization front, we open-sourced Vizier, our flagship blackbox optimization and hyperparameter tuning library at Google. We also developed new techniques for linear programming (LP) solvers that address scalability limits caused by their reliance on matrix factorizations, which restricts the opportunity for parallelism and distributed approaches. To this end, we open-sourced a primal-dual hybrid gradient (PDHG) solution for LP called primal-dual linear programming (PDLP), a new first-order solver for large-scale LP problems. PDLP has been used to solve real-world problems with as many as 12B non-zeros (and an internal distributed version scaled to 92B non-zeros). PDLP’s effectiveness is due to a combination of theoretical developments and algorithm engineering.

With OSS Vizier, multiple clients each send a “Suggest” request to the Service API, which produces Suggestions for the clients using Pythia policies. The clients evaluate these suggestions and return measurements. All transactions are stored to allow fault-tolerance.

Top

Privacy and federated learning

Respecting user privacy while providing high-quality services remains a top priority for all Google systems. Research in this area spans many products and uses principles from differential privacy (DP) and federated learning.

First of all, we have made a variety of algorithmic advances to address the problem of training large neural networks with DP. Building on our earlier work, which enabled us to launch a DP neural network based on the DP-FTRL algorithm, we developed the matrix factorization DP-FTRL approach. This work demonstrates that one can design a mathematical program to optimize over a large set of possible DP mechanisms to find those best suited for specific learning problems. We also establish margin guarantees that are independent of the input feature dimension for DP learning of neural networks and kernel-based methods. We further extend this concept to a broader range of ML tasks, matching baseline performance with 300x less computation. For fine-tuning of large models, we argued that once pre-trained, these models (even with DP) essentially operate over a low-dimensional subspace, hence circumventing the curse of dimensionality that DP imposes.

On the algorithmic front, for estimating the entropy of a high-dimensional distribution, we obtained local DP mechanisms (that work even when as little as one bit per sample is available) and efficient shuffle DP mechanisms. We proposed a more accurate method to simultaneously estimate the top-k most popular items in the database in a private manner, which we employed in the Plume library. Moreover, we showed a near-optimal approximation algorithm for DP clustering in the massively parallel computing (MPC) model, which further improves on our previous work for scalable and distributed settings.

Another exciting research direction is the intersection of privacy and streaming. We obtained a near-optimal approximation-space trade-off for the private frequency moments and a new algorithm for privately counting distinct elements in the sliding window streaming model. We also presented a general hybrid framework for studying adversarial streaming.

Addressing applications at the intersection of security and privacy, we developed new algorithms that are secure, private, and communication-efficient, for measuring cross-publisher reach and frequency. The World Federation of Advertisers has adopted these algorithms as part of their measurement system. In subsequent work, we developed new protocols that are secure and private for computing sparse histograms in the two-server model of DP. These protocols are efficient from both computation and communication points of view, are substantially better than what standard methods would yield, and combine tools and techniques from sketching, cryptography and multiparty computation, and DP.

While we have trained BERT and transformers with DP, understanding training example memorization in large language models (LLMs) is a heuristic way to evaluate their privacy. In particular, we investigated when and why LLMs forget (potentially memorized) training examples during training. Our findings suggest that earlier-seen examples may observe privacy benefits at the expense of examples seen later. We also quantified the degree to which LLMs emit memorized training data.

Top

Market algorithms and causal inference

We also continued our research in improving online marketplaces in 2022. For example, an important recent area in ad auction research is the study of auto-bidding online advertising where the majority of bidding happens via proxy bidders that optimize higher-level objectives on behalf of advertisers. The complex dynamics of users, advertisers, bidders, and ad platforms leads to non-trivial problems in this space. Following our earlier work in analyzing and improving mechanisms under auto-bidding auctions, we continued our research in improving online marketplaces in the context of automation while taking different aspects into consideration, such as user experience and advertiser budgets. Our findings suggest that properly incorporating ML advice and randomization techniques, even in non-truthful auctions, can robustly improve the overall welfare at equilibria among auto-bidding algorithms.

Structure of auto-bidding online ads system.

Beyond auto-bidding systems, we also studied auction improvements in complex environments, e.g., settings where buyers are represented by intermediaries, and with Rich Ads where each ad can be shown in one of several possible variants. We summarize our work in this area in a recent survey. Beyond auctions, we also investigate the use of contracts in multi-agent and adversarial settings.

Online stochastic optimization remains an important part of online advertising systems with application in optimal bidding and budget pacing. Building on our long-term research in online allocation, we recently blogged about dual mirror descent, a new algorithm for online allocation problems that is simple, robust, and flexible. This state-of-the-art algorithm is robust against a wide range of adversarial and stochastic input distributions and can optimize important objectives beyond economic efficiency, such as fairness. We also show that by tailoring dual mirror descent to the special structure of the increasingly popular return-on-spend constraints, we can optimize advertiser value. Dual mirror descent has a wide range of applications and has been used over time to help advertisers obtain more value through better algorithmic decision making.

An overview of the dual mirror descent algorithm.

Furthermore, following our recent work at the interplay of ML, mechanism design and markets, we investigated transformers for asymmetric auction design, designed utility-maximizing strategies for no-regret learning buyers, and developed new learning algorithms to bid or to price in auctions.

An overview of bipartite experimental design to reduce causal interactions between entities.

A critical component of any sophisticated online service is the ability to experimentally measure the response of users and other players to new interventions. A major challenge of estimating these causal effects accurately is handling complex interactions — or interference — between the control and treatment units of these experiments. We combined our graph clustering and causal inference expertise to expand the results of our previous work in this area, with improved results under a flexible response model and a new experimental design that is more effective at reducing these interactions when treatment assignments and metric measurements occur on the same side of a bipartite platform. We also showed how synthetic control and optimization techniques can be combined to design more powerful experiments, especially in small data regimes.

Top

Algorithmic foundations and theory

Finally, we continued our fundamental algorithmic research by tackling long-standing open problems. A surprisingly concise paper affirmatively resolved a four-decade old open question on whether there is a mechanism that guarantees a constant fraction of the gains-from-trade attainable whenever buyer’s value weakly exceeds seller’s cost. Another recent paper obtained the state-of-the-art approximation for the classic and highly-studied k-means problem. We also improved the best approximation for correlation clustering breaking the barrier approximation factor of 2. Finally, our work on dynamic data structures to solve min-cost and other network flow problems has contributed to a breakthrough line of work in adapting continuous optimization techniques to solve classic discrete optimization problems.

Top

Concluding thoughts

Designing effective algorithms and mechanisms is a critical component of many Google systems that need to handle tera-scale data robustly with critical privacy and safety considerations. Our approach is to develop algorithms with solid theoretical foundations that can be deployed effectively in our product systems. In addition, we are bringing many of these advances to the broader community by open-sourcing some of our most novel developments and by publishing the advanced algorithms behind them. In this post, we covered a subset of algorithmic advances in privacy, market algorithms, scalable algorithms, graph-based learning, and optimization. As we move toward an AI-first Google with further automation, developing robust, scalable, and privacy-safe ML algorithms remains a high priority. We are excited about developing new algorithms and deploying them more broadly.

Acknowledgements

This post summarizes research from a large number of teams and benefited from input from several researchers including Gagan Aggarwal, Amr Ahmed, David Applegate, Santiago Balseiro, Vincent Cohen-addad, Yuan Deng, Alessandro Epasto, Matthew Fahrbach, Badih Ghazi, Sreenivas Gollapudi, Rajesh Jayaram, Ravi Kumar, Sanjiv Kumar, Silvio Lattanzi, Kuba Lacki, Brendan McMahan, Aranyak Mehta, Bryan Perozzi, Daniel Ramage, Ananda Theertha Suresh, Andreas Terzis, Sergei Vassilvitskii, Di Wang, and Song Zuo. Special thanks to Ravi Kumar for his contributions to this post.

Google Research, 2022 & beyond

This was the fifth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models Computer Vision Multimodal Models
Generative Models Responsible AI ML & Computer Systems
Efficient Deep Learning Algorithmic Advances Robotics*
Health General Science & Quantum Community Engagement
* Articles will be linked as they are released.

Read More

Amplification at the Quantum limit

Amplification at the Quantum limit

The Google Quantum AI team is building quantum computers with superconducting microwave circuits, but much like a classical computer the superconducting processor at the heart of these computers is only part of the story. An entire technology stack of peripheral hardware is required to make the quantum computer work properly. In many cases these parts must be custom designed, requiring extensive research and development to reach the highest levels of performance.

In this post, we highlight one aspect of this supplemental hardware: our superconducting microwave amplifiers. In “Readout of a Quantum Processor with High Dynamic Range Josephson Parametric Amplifiers”, published in Applied Physics Letters, we describe how we increased the maximum output power of our superconducting microwave amplifiers by a factor of over 100x. We discuss how this work can pave the way for the operation of larger quantum processor chips with improved performance.

Why microwave amplifiers?

One of the challenges of operating a superconducting quantum processor is measuring the state of a qubit without disturbing its operation. Fundamentally, this comes down to a microwave engineering problem, where we need to be able to measure the energy inside the qubit resonator without exposing it to noisy or lossy wiring. This can be accomplished by adding an additional microwave resonator to the system that is coupled to the qubit, but far from the qubit’s resonance frequency. The resonator acts as a filter that isolates the qubit from the control lines but also picks up a state-dependent frequency shift from the qubit. Just like in the binary phase shift keying (BPSK) encoding technique, the digital state of the qubit (0 or 1) is translated into a phase for a probe tone (microwave signal) reflecting off of this auxiliary resonator. Measuring the phase of this probe tone allows us to infer the state of the qubit without directly interfacing with the qubit itself.

While this sounds simple, the qubit actually imposes a severe cap on how much power can be used for this probe tone. In normal operation, a qubit should be in the 0 state or the 1 state or some superposition of the two. A measurement pulse should collapse the qubit into one of these two states, but using too much power can push it into a higher excited state and corrupt the computation. A safe measurement power is typically around -125 dBm, which amounts to only a handful of microwave photons interacting with the processor during the measurement. Typically, small signals are measured using microwave amplifiers, which increase the signal level, but also add their own noise. How much noise is acceptable? If the measurement process takes too long, the qubit state can change due to energy loss in the circuit. This means that these very small signals must be measured in just a few hundred nanoseconds with very high (>99%) fidelity. We therefore cannot afford to average the signal over a longer time to reduce the noise. Unfortunately, even the best semiconductor low-noise amplifiers are still almost a factor of 10 too noisy.

The solution is to design our own custom amplifiers based on the same circuit elements as the qubits themselves. These amplifiers typically consist of Josephson junctions to provide a tunable inductance wired into a superconducting resonant circuit. By constructing a resonant circuit out of these elements, you can create a parametric amplifier where amplification is achieved by modulating the tunable inductance at twice the frequency you want to amplify. Additionally, because all of the wiring is made of lossless superconductors, these devices operate near the quantum limit of added noise, where the only noise in the signal is coming from amplification of the zero point quantum voltage fluctuations.

The one downside to these devices is that the Josephson junctions constrain the power of the signals we can measure. If the signal is too large, the drive current can approach the junction critical current and degrade the amplifier performance. Even if this limit was sufficient to measure a single qubit, our goal was to increase efficiency by measuring up to six qubits at a time using the same amplifier. Some groups get around this limit by making traveling wave amplifiers, where the signals are distributed across thousands of junctions. This increases the saturation power, but the amplifiers get very complicated to produce and take up a lot of space on the chip. Our goal was to create an amplifier that could handle as much power as a traveling wave amplifier but with the same simple and compact design we were used to.

Results

The critical current of each Josephson junction limits our amplifier’s power handling. However, increasing this critical current also changes the inductance and, thus, the operating frequency of the amplifier. To avoid these constraints, we replaced a standard 2-junction DC SQUID with a nonlinear tunable inductor made up of two RF-SQUID arrays in parallel, which we call a snake inductor. Each RF-SQUID consists of a Josephson junction and geometric inductances L1 and L2, and each array contains 20 RF-SQUIDs. In this case, each junction of a standard DC SQUID is replaced by one of these RF-SQUID arrays. While the critical current of each RF-SQUID is much higher, we chain them together to keep the inductance and operating frequency the same. While this is a relatively modest increase in device complexity, it enables us to increase the power handling of each amplifier by roughly a factor of 100x. It is also fully compatible with existing designs that use impedance matching circuits to provide large measurement bandwidth.

Circuit diagram of our superconducting microwave amplifier. A split bias coil allows both DC and RF modulation of the snake inductor, while a shunt capacitor sets the frequency range. The flow of current is illustrated in the animation where an applied current (blue) on the bias line causes a circulating current (red) in the snake. A tapered impedance transformer lowers the loaded Q of the device. Since the Q is defined as frequency divided by bandwidth, lowering the Q with a constant frequency increases the bandwidth of the amplifier. Example circuit parameters used for a real device are Cs=6.0 pF, L1=2.6 pH, L2=8.0 pH, Lb=30 pH, M=50 pH, Z0 = 50 Ohms, and Zfinal = 18 ohms. The device operation is illustrated with a small signal (magenta) reflecting off the input of the amplifier. When the large pump tone (blue) is applied to the bias port, it generates amplified versions of the signal (gold) and a secondary tone known as an idler (also gold).
Microscope image of the nonlinear resonator showing the resonant circuit that consists of a large parallel plate capacitor, nonlinear snake inductor, and a current bias transformer to tune the inductance.

We measure this performance improvement by measuring the saturation power of the amplifier, or the point at which the gain is compressed by 1 dB. We also measure this power value vs. frequency to see how it scales with amplifier gain and distance from the center of the amplifier bandwidth. Since the amplifier gain is symmetric about its center frequency we measure this in terms of absolute detuning, which is just the absolute value of the difference between the center frequency of the amplifier and the probe tone frequency.

Input and output saturation power (1-dB gain compression point), calibrated using a superconducting quantum processor vs. absolute detuning from the amplifier center frequency.

Conclusion and future directions

The new microwave amplifiers represent a big step forward for our qubit measurement system. They will allow us to measure more qubits using a single device, and enable techniques that require higher power for each measurement tone. However, there are still quite a few areas we would like to explore. For example, we are currently investigating the application of snake inductors in amplifiers with advanced impedance matching techniques, directional amplifiers, and non-reciprocal devices like microwave circulators.

Acknowledgements

We would like to thank the Quantum AI team for the infrastructure and support that enabled the creation and measurement of our microwave amplifier devices. Thanks to our cohort of talented Google Research Interns that contributed to the future work mentioned above: Andrea Iorio for developing algorithms that automatically tune amplifiers and provide a snapshot of the local parameter space, Ryan Kaufman for measuring a new class of amplifiers using multi-pole impedance matching networks, and Randy Kwende for designing and testing a range of parametric devices based on snake inductors. With their contributions, we are gaining a better understanding of our amplifiers and designing the next generation of parametrically-driven devices.

Read More

Unsupervised and semi-supervised anomaly detection with data-centric ML

Unsupervised and semi-supervised anomaly detection with data-centric ML

Anomaly detection (AD), the task of distinguishing anomalies from normal data, plays a vital role in many real-world applications, such as detecting faulty products from vision sensors in manufacturing, fraudulent behaviors in financial transactions, or network security threats. Depending on the availability of the type of data — negative (normal) vs. positive (anomalous) and the availability of their labels — the task of AD involves different challenges.

(a) Fully supervised anomaly detection, (b) normal-only anomaly detection, (c, d, e) semi-supervised anomaly detection, (f) unsupervised anomaly detection.

While most previous works were shown to be effective for cases with fully-labeled data (either (a) or (b) in the above figure), such settings are less common in practice because labels are particularly tedious to obtain. In most scenarios users have a limited labeling budget, and sometimes there aren’t even any labeled samples during training. Furthermore, even when labeled data are available, there could be biases in the way samples are labeled, causing distribution differences. Such real-world data challenges limit the achievable accuracy of prior methods in detecting anomalies.

This post covers two of our recent papers on AD, published in Transactions on Machine Learning Research (TMLR), that address the above challenges in unsupervised and semi-supervised settings. Using data-centric approaches, we show state-of-the-art results in both. In “Self-supervised, Refine, Repeat: Improving Unsupervised Anomaly Detection”, we propose a novel unsupervised AD framework that relies on the principles of self-supervised learning without labels and iterative data refinement based on the agreement of one-class classifier (OCC) outputs. In “SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch”, we propose a novel semi-supervised AD framework that yields robust performance even under distribution mismatch with limited labeled samples.

Unsupervised anomaly detection with SRR: Self-supervised, Refine, Repeat

Discovering a decision boundary for a one-class (normal) distribution (i.e., OCC training) is challenging in fully unsupervised settings as unlabeled training data include two classes (normal and abnormal). The challenge gets further exacerbated as the anomaly ratio gets higher for unlabeled data. To construct a robust OCC with unlabeled data, excluding likely-positive (anomalous) samples from the unlabeled data, the process referred to as data refinement, is critical. The refined data, with a lower anomaly ratio, are shown to yield superior anomaly detection models.

SRR first refines data from an unlabeled dataset, then iteratively trains deep representations using refined data while improving the refinement of unlabeled data by excluding likely-positive samples. For data refinement, an ensemble of OCCs is employed, each of which is trained on a disjoint subset of unlabeled training data. If there is consensus among all the OCCs in the ensemble, the data that are predicted to be negative (normal) are included in the refined data. Finally, the refined training data are used to train the final OCC to generate the anomaly predictions.

Training SRR with a data refinement module (OCCs ensemble), representation learner, and final OCC. (Green/red dots represent normal/abnormal samples, respectively).

SRR results

We conduct extensive experiments across various datasets from different domains, including semantic AD (CIFAR-10, Dog-vs-Cat), real-world manufacturing visual AD (MVTec), and real-world tabular AD benchmarks such as detecting medical (Thyroid) or network security (KDD 1999) anomalies. We consider methods with both shallow (e.g., OC-SVM) and deep (e.g., GOAD, CutPaste) models. Since the anomaly ratio of real-world data can vary, we evaluate models at different anomaly ratios of unlabeled training data and show that SRR significantly boosts AD performance. For example, SRR improves more than 15.0 average precision (AP) with a 10% anomaly ratio compared to a state-of-the-art one-class deep model on CIFAR-10. Similarly, on MVTec, SRR retains solid performance, dropping less than 1.0 AUC with a 10% anomaly ratio, while the best existing OCC drops more than 6.0 AUC. Lastly, on Thyroid (tabular data), SRR outperforms a state-of-the-art one-class classifier by 22.9 F1 score with a 2.5% anomaly ratio.

Across various domains, SRR (blue line) significantly boosts AD performance with various anomaly ratios in fully unsupervised settings.

SPADE: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling

Most semi-supervised learning methods (e.g., FixMatch, VIME) assume that the labeled and unlabeled data come from the same distributions. However, in practice, distribution mismatch commonly occurs, with labeled and unlabeled data coming from different distributions. One such case is positive and unlabeled (PU) or negative and unlabeled (NU) settings, where the distributions between labeled (either positive or negative) and unlabeled (both positive and negative) samples are different. Another cause of distribution shift is additional unlabeled data being gathered after labeling. For example, manufacturing processes may keep evolving, causing the corresponding defects to change and the defect types at labeling to differ from the defect types in unlabeled data. In addition, for applications like financial fraud detection and anti-money laundering, new anomalies can appear after the data labeling process, as criminal behavior may adapt. Lastly, labelers are more confident on easy samples when they label them; thus, easy/difficult samples are more likely to be included in the labeled/unlabeled data. For example, with some crowd-sourcing–based labeling, only the samples with some consensus on the labels (as a measure of confidence) are included in the labeled set.

Three common real-world scenarios with distribution mismatches (blue box: normal samples, red box: known/easy anomaly samples, yellow box: new/difficult anomaly samples).

Standard semi-supervised learning methods assume that labeled and unlabeled data come from the same distribution, so are sub-optimal for semi-supervised AD under distribution mismatch. SPADE utilizes an ensemble of OCCs to estimate the pseudo-labels of the unlabeled data — it does this independent of the given positive labeled data, thus reducing the dependency on the labels. This is especially beneficial when there is a distribution mismatch. In addition, SPADE employs partial matching to automatically select the critical hyper-parameters for pseudo-labeling without relying on labeled validation data, a crucial capability given limited labeled data.

Block diagram of SPADE with zoom in the detailed block diagram of the proposed pseudo-labelers.

SPADE results

We conduct extensive experiments to showcase the benefits of SPADE in various real-world settings of semi-supervised learning with distribution mismatch. We consider multiple AD datasets for image (including MVTec) and tabular (including Covertype, Thyroid) data.

SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios: (i) new-types of anomalies, (ii) easy-to-label samples, and (iii) positive-unlabeled examples. As shown below, with new-types of anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average.

AD performances with three different scenarios across various datasets (Covertype, MVTec, Thyroid) in terms of AUC. Some baselines are only applicable to some scenarios. More results with other baselines and datasets can be found in the paper.

We also evaluate SPADE on real-world financial fraud detection datasets: Kaggle credit card fraud and Xente fraud detection. For these, anomalies evolve (i.e., their distributions change over time) and to identify evolving anomalies, we need to keep labeling for new anomalies and retrain the AD model. However, labeling would be costly and time consuming. Even without additional labeling, SPADE can improve the AD performance using both labeled data and newly-gathered unlabeled data.

AD performances with time-varying distributions using two real-world fraud detection datasets with 10% labeling ratio. More baselines can be found in the paper.

As shown above, SPADE consistently outperforms alternatives on both datasets, taking advantage of the unlabeled data and showing robustness to evolving distributions.

Conclusions

AD has a wide range of use cases with significant importance in real-world applications, from detecting security threats in financial systems to identifying faulty behaviors of manufacturing machines.

One challenging and costly aspect of building an AD system is that anomalies are rare and not easily detectable by people. To this end, we have proposed SRR, a canonical AD framework to enable high performance AD without the need for manual labels for training. SRR can be flexibly integrated with any OCC, and applied on raw data or on trainable representations.

Semi-supervised AD is another highly-important challenge — in many scenarios, the distributions of labeled and unlabeled samples don’t match. SPADE introduces a robust pseudo-labeling mechanism using an ensemble of OCCs and a judicious way of combining supervised and self-supervised learning. In addition, SPADE introduces an efficient approach to pick critical hyperparameters without a validation set, a crucial component for data-efficient AD.

Overall, we demonstrate that SRR and SPADE consistently outperform the alternatives in various scenarios across multiple types of datasets.

Acknowledgements

We gratefully acknowledge the contributions of Kihyuk Sohn, Chun-Liang Li, Chen-Yu Lee, Kyle Ziegler, Nate Yoder, and Tomas Pfister.

Read More