(De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools


It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.  

Content moderation tools can be deployed to flag or filter such language in some contexts, but unfortunately, datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, especially hate speech. Specifically, the toxic examples in many existing hate speech datasets tend either to be too hard or too easy for tools to learn from—the too-easy examples contain slurs, profanity, and explicit mentions of minority identity groups; the too-hard examples involve obscure references or inside jokes within the hate speech community. Additionally, the neutral examples in these datasets tend not to contain group mentions. As a result, tools may flag any language that references a minority identity group as hate speech, even when that language is neutral. Alternatively, tools trained on this data fail to detect harmful language when it lacks known or explicit slurs, profanity, or explicit mentions of minority identity groups.  

Generating the kind of data needed to strengthen content moderation tools against the above failures and harms is challenging for numerous reasons. In particular, it's difficult to collect at scale toxic text that is implicit yet still learnable by existing machine learning architectures, as well as neutral text that mentions minority identity groups. Additionally, asking people to write such examples—particularly the toxic ones—can take a mental toll on those assigned the task.

Inspired by the ability of large language models to mimic the tone, style, and vocabulary of prompts they receive—whether toxic or neutral—we set out to create a dataset for training content moderation tools that can be used to better flag implicitly harmful language. In our paper “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection,” we collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups and used a large-scale language model to scale up and guide the generation process. The outcome is the largest publicly available implicit hate speech dataset to date: 274,000 examples comprising both neutral and toxic statements. We conducted a human study on the generated dataset to better understand different aspects of harm beyond the binary labels of toxic and neutral assigned by content moderation tools. To stress test existing content moderation tools across the minority identity groups studied in this work, we also propose an adversarial classifier-in-the-loop decoding approach. The dataset, two content moderation tools trained on the dataset, the prompts used as seed data, and the source code for our proposed adversarial decoding approach are available in the ToxiGen GitHub repo (please see footnote).

We’re presenting this work at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be presenting work that leverages the generative power of large language models and human expertise.

A horizontal chart comparing the proportion of minority identity group mentions in the prompts with the minority identity group mentions in the generated text for the 13 minority identity groups in this work: Black, Mexican, people with physical disabilities, LGBTQ+, people with cognitive disabilities, Chinese, Muslim, Jewish, Middle Eastern, Women, Asian, Native American, and Latino.
Figure 1: The ToxiGen dataset—an implicit hate speech dataset created by using a large-scale language model with both regular and adversarial decoding to scale up and guide the generation process—contains 274,000 examples comprising both neutral and toxic statements across 13 minority identity groups. As illustrated above, mentions of a specific minority identity group in the prompts and mentions of the same minority identity group in the corresponding generated text are proportional.

Demonstration-based prompting for building better datasets

Large Transformer-based language models don’t explicitly encode semantic information; nevertheless, these models can capture the statistical interactions of words in different contexts. Through experimenting with language generation using one of these large language models, we learned how careful prompt engineering strategies could be used to create the ToxiGen implicit hate speech dataset.

Our first experiments were to generate examples of hate speech and neutral speech related to the 13 minority identity groups in our work. We started by collecting implicit hate speech prompts from existing datasets and neutral prompts drawn from news articles, opinion pieces, podcast transcripts, and other similar public sources, then feeding them into the LLM to create a broader, deeper set of prompts. What we found was that the LLM could generate examples that were qualitatively different depending on the source material. When prompted with excerpts from different writers on these topics, the LLM consistently produced linguistically diverse outputs that were nonetheless similar in style and tone.

Furthermore, we found that through careful cultivation of prompt sets, we could generate a wide variety of text reflecting diverse opinions and thoughts on these topics that weren’t found in our original source materials. We could generate neutral statements about sensitive topics that mentioned the relevant minority identity groups, and we could consistently generate hate speech statements about these minority identity groups that didn’t contain slurs or profanity. And the more we experimented with the source material, the more interesting our dataset became. This is particularly exciting because we hope that other individuals and groups can use these tools to extend our dataset; different disciplinary experts could utilize the same strategies and collect even better prompt sets, resulting in even more subtle and rich examples of neutral speech and hate speech. 
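
To make the mechanics concrete, here is a minimal sketch of demonstration-based prompting, assuming a generic text-completion interface; the seed statements and the `llm.complete` call are illustrative placeholders, not items from the released ToxiGen prompt sets or our actual generation setup.

```python
import random

# Placeholder seed pool: hand-curated neutral (or implicitly toxic) statements
# about a single identity group. These are illustrative stand-ins only.
neutral_seeds = [
    "many members of this community run small businesses in the neighborhood",
    "the group has a long history of cultural traditions in the region",
    "students from this community volunteer at the local food bank every week",
]

def build_prompt(seed_pool, k=3):
    """Assemble a demonstration prompt from k randomly chosen seed statements.

    The language model tends to continue in the same style, tone, and topic
    as the demonstrations, which is what lets a small curated set scale up
    into many new, linguistically diverse statements.
    """
    demos = random.sample(seed_pool, k)
    return "- " + "\n- ".join(demos) + "\n-"

def generate_statement(llm, seed_pool):
    # `llm.complete` stands in for any text-completion API; only the text
    # produced after the final "-" is kept as a new candidate statement.
    prompt = build_prompt(seed_pool)
    return llm.complete(prompt, max_tokens=40, stop="\n").strip()
```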

We also found that the model often generated examples of speech that we ourselves had trouble labeling. In essence, we were using the LLM as a probe to explore the delicate boundaries between acceptable and offensive speech. As a result, our own understanding of the problem definition itself grew through our interactions with the model.  

The first 260,000 examples from our dataset were drawn from this experimental approach. 

Figure 2: Examples of statements generated by (De)ToxiGen that fool Google’s Perspective API, HateBERT, OpenAI content filter, AI2 Delphi, and RoBERTa. Five statements are neutral but mention minority identity groups, so the content moderation tools find them hateful. Five are toxic sentences, but the tools find them neutral. The proposed decoding approach, (De)ToxiGen (referred to as ALICE in the paper), can challenge these content moderation tools, allowing developers to increase their coverage by creating adversarial examples. 

(De)ToxiGen: An adversarial decoding approach for strengthening content moderation tools

While demonstration-based prompting can facilitate large-scale data generation, it doesn’t generate data targeted specifically to challenge a given content moderation tool, or content classifier. This is important because every content moderation tool has unique vulnerabilities depending on the type of data it has been trained on. To address this, we developed (De)ToxiGen (referred to as ALICE in the paper), an algorithmic mechanism that creates an adversarial set-up between an LLM and a given content moderation tool in which the content classifier is in the loop during decoding.  

The proposed approach can increase or decrease the likelihood that a generated statement is classified as hate speech while maintaining the coherence of the generated language. It can generate both false negatives and false positives for a given content moderation tool. For false negatives, toxic prompts are used to elicit toxic responses, and then the tool’s probability of the neutral class is maximized during decoding. Similarly, to generate false positives, neutral prompts are used to generate neutral responses, and then the probability of the toxic class is maximized during decoding. With this approach, we’re essentially trying to reveal weaknesses in a specific content moderation tool by guiding the LLM to produce statements that we know the tool will misidentify. The generated data can then be used to improve the performance and coverage of the targeted content moderation tool. Our ToxiGen dataset includes data generated by both demonstration-based prompting and our proposed adversarial decoding approach. Through empirical study on three existing human-written datasets, we found that starting with an existing content moderation tool and fine-tuning it on ToxiGen can improve the tool’s performance significantly, demonstrating the quality of the machine-generated data in ToxiGen.  
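
The exact formulation of (De)ToxiGen/ALICE is given in the paper; the sketch below only illustrates the general classifier-in-the-loop idea under simplifying assumptions (Hugging Face-style models, a shared tokenizer between the generator and the classifier, and a naive rescoring of each candidate token), so it should be read as a rough approximation rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_decode_step(lm, classifier, prefix_ids, target_class,
                            alpha=1.0, top_k=50):
    """Pick the next token by mixing LM fluency with the classifier's score.

    Rough sketch only. `lm` is a causal language model and `classifier` a
    sequence classifier; both are assumed to share a tokenizer. `target_class`
    is the class whose probability is pushed up during decoding: the neutral
    class when generating false negatives from toxic prompts, and the toxic
    class when generating false positives from neutral prompts.
    """
    next_logits = lm(prefix_ids).logits[:, -1, :]
    lm_logprobs = F.log_softmax(next_logits, dim=-1)
    topk_logprobs, topk_ids = lm_logprobs.topk(top_k, dim=-1)

    scores = []
    for logprob, token_id in zip(topk_logprobs[0], topk_ids[0]):
        candidate = torch.cat([prefix_ids, token_id.view(1, 1)], dim=1)
        cls_logprobs = F.log_softmax(classifier(candidate).logits, dim=-1)
        # Fluency term plus adversarial pressure toward the target class.
        scores.append(logprob + alpha * cls_logprobs[0, target_class])

    best = int(torch.stack(scores).argmax())
    return torch.cat([prefix_ids, topk_ids[0, best].view(1, 1)], dim=1)
```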

Human evaluation: Better understanding the data

Human language is complex, particularly when it comes to harmful statements. To better understand different aspects of the data in ToxiGen—its perceived harmfulness and intent and whether it presents as fact or opinion, for example—we conducted human evaluations on the data generated by both regular decoding (top-k), used in the demonstration-based prompting, and the proposed adversarial decoding. The human evaluation also allowed us to test the quality of the output of these methods and gauge how effective these methods were in guiding the generation of the data we sought. 

For the human evaluation, three annotators were used for each statement from a pool of 156 prequalified annotators with prior experience annotating toxic language. About 4,500 samples were randomly selected for each of the decoding methods with coverage across all 13 minority identity groups for each split. We found the following: 

  1. For both decoding methods, minority identity group mentions included in the prompt also exist in the generated statements. This means that both data generation methods reliably produce the data they were designed to produce—hateful and neutral statements with explicit reference to the specified minority identity group.
  2. In the neutral case, the label of the prompt matches the generated text more often than in the toxic case, as shown in Figure 3a. 
  3. The proposed decoding approach generates a higher percentage of adversarial text compared to regular decoding—that is, it produces data that is more likely to fool a given content moderation tool—as illustrated in Figure 3b. 
Two bar charts side by side. The one on the left, titled “Prompt-Response Matching,” shows that top-k decoding produces non-toxic responses 95.2 percent of the time when given a non-toxic prompt compared with 92.1 percent for (De)ToxiGen and that top-k decoding produces toxic responses 67.7 percent of the time when given a toxic prompt compared with 40.3 percent for (De)ToxiGen. The bar chart on the right, titled “Adversarial Power,” shows that statements generated by (De)ToxiGen fool HateBERT 26.4 percent of the time compared with 16.8 percent for statements generated via top-k decoding.
Figure 3a (left) and 3b (right): Human evaluations on the data generated by regular decoding (top-k) and the proposed adversarial decoding showed that the toxicity labels for the prompt and the generated response match more often for non-toxic prompts compared to toxic ones (left). It was also observed that (De)ToxiGen generates a higher percentage of adversarial text compared to regular decoding (right). 
  4. 90.5 percent of machine-generated examples were thought to be human-written by the majority of annotators.
  5. Perceived harmfulness with respect to human- or AI-authored text is similar. 

Looking ahead: Societal implications and opportunities

As advances continue to be made in large language models, we remain vigilant in our pursuit of AI systems that align with our commitment to technology that benefits society as a whole and empowers everyone to achieve more. We’re beginning to ask better questions to more deeply understand the risks associated with LLMs and build processes and methods for addressing them. Existing content moderation tools tend to be good only at flagging overtly inappropriate or harmful language. Our work aims to create data that can better target this challenge. While our work here specifically explores hate speech, our proposed methods could be applied to a variety of content moderation challenges, such as flagging potential misinformation. By releasing the source code and prompt seeds for this work, we hope to encourage the research community to contribute to it by, for example, adding prompt seeds and generating data for minority identity groups that aren’t covered in our dataset.

As with many technologies, the solutions we develop to make them stronger, more secure, and less vulnerable also have the potential to be used in unintended ways. While the methods described here may be used to generate inappropriate or harmful language, we believe that they provide far greater value in helping to combat such language, resulting in content moderation tools that can be used alongside human guidance to support fairer, safer, more reliable, and more inclusive AI systems.  

Considerations for responsible use

This dataset still fails to capture much of what constitutes problematic language, and its limitations should be acknowledged before it is used. Our annotations might not capture the full complexity of these issues, given that problematic language is context-dependent, dynamic, and can manifest in different forms and severities. Content moderation tools aren’t a silver bullet for addressing harmful online content. Problematic language is fundamentally a human-centric problem. It should be studied in conjunction with human experience, and tools to address it should be developed and deployed with human expertise and well-informed regulatory processes and policy. Multidisciplinary work is needed to better understand all aspects of this challenge.

Also, this dataset captures implicit toxicity (more precisely, hate speech) for only 13 minority identity groups, and because of its large scale it naturally has imperfections. Our goal in this project is to provide the community with a means to improve hate speech detection on implicit toxic language for the identified minority identity groups. The dataset and models trained on it have limitations that could be the subject of future research, such as covering additional minority identity groups, or combinations of them, that aren’t included in our work. Stronger content moderation tools and systems can contribute to mitigating fairness-related harms in AI systems. For example, systems that don’t over-flag neutral statements with minority identity group mentions can help ensure better representation of diverse perspectives and experiences, while systems that can better flag implicit hate speech can support more inclusive technology.

Acknowledgment 

This work was conducted by PhD students Thomas Hartvigsen and Saadia Gabriel during their internships at Microsoft Azure and Microsoft Research. Hamid Palangi, Dipankar Ray, Maarten Sap, and Ece Kamar served as advisors on the work. A special thanks to Misha Bilenko from Azure ML for making the compute resources available and to Microsoft Research for supporting our large-scale human study. 

Platform models—large-scale models trained on vast amounts of data—are making it easier and faster to develop AI systems. (De)ToxiGen and other tools and resources like it are being developed by researchers at Microsoft to help developers get the most out of these platform models while also understanding, measuring, and mitigating the risks they pose.

Please note: This research, the GitHub repository, and examples from our work included in this blog contain and discuss content that is offensive or upsetting. All materials are intended to support research that improves hate speech detection methods. Included examples of hate speech don’t represent how the authors or sponsors feel about any minority identity groups. Hate speech applies to a range of minority identity groups; for the purposes of this research, we focus on 13 of them (as shown in Figure 1). Content moderation tools are part of larger content moderation systems. These systems also include human expertise and thoughtful policy and regulatory development. Even the most robust content moderation tools and datasets require systems with human supervision. 


FLUTE: A scalable federated learning simulation platform

This diagram shows a payload exchange between a server, inside Worker 0, and clients that live inside Workers 2 and 3. First, the server pushes the central ML model plus the clients’ data to Workers 2 and 3. Then, each client trains the model with their local data. Finally, the clients send the pseudo-gradients of this new model back to the server for aggregation and the creation of a new global model.

Federated learning has become a major area of machine learning (ML) research in recent years due to its versatility in training complex models over massive amounts of data without the need to share that data with a centralized entity. However, despite this flexibility and the amount of research already conducted, it’s difficult to implement due to its many moving parts—a significant deviation from traditional ML pipelines.

The challenges in working with federated learning result from the diversity of local data and end-node hardware, privacy concerns, and optimization constraints. These challenges are compounded by the sheer volume of federated learning clients and their data, and they necessitate a wide skill set, significant interdisciplinary research effort, and major engineering resources to manage. In addition, federated learning applications often need to scale the learning process to millions of clients to simulate a real-world environment. All of these challenges underscore the need for a simulation platform, one that enables researchers and developers to perform proof-of-concept implementations and validate performance before building and deploying their ML models. 

A versatile framework for federated learning

Today, the Privacy in AI team at Microsoft Research is thrilled to introduce Federated Learning Utilities and Tools for Experimentation (FLUTE) as a framework for running large-scale offline federated learning simulations, which we discuss in detail in the paper, “FLUTE: A Scalable, Extensible Framework for High-Performance Federated Learning Simulations.” In creating FLUTE, our goal was to develop a high-performance simulation platform that enables quick prototyping of federated learning research and makes it easier to implement federated learning applications.

There has been a lot of research in the last few years directed at tackling the many challenges in working with federated learning, including setting up learning environments, providing privacy guarantees, implementing model-client updates, and lowering communication costs. FLUTE addresses many of these while providing enhanced customization and enabling new research on a realistic scale. It also allows developers and researchers to test and experiment with certain scenarios, such as data privacy, communication strategies, and scalability, before implementing their ML model in a production framework.

One of FLUTE’s main benefits is its native integration with Azure ML workspaces, leveraging the platform’s features to manage and track experiments, parameter sweeps, and model snapshots. Its distributed nature is based on Python and PyTorch, and its flexibly designed client-server architecture helps researchers and developers quickly prototype novel approaches to federated learning. FLUTE’s key innovation and technological differentiator, however, is the ease it provides in implementing new scenarios for experimentation in core areas of active research, all within a robust, high-performance simulator. 

FLUTE offers a platform where all clients are implemented as isolated object instances, as shown in Figure 1. The interface between the server and the remaining workers relies on messages that contain client IDs and training information, with MPI as the main communication protocol. Local data on each client stays within local storage boundaries and is never aggregated with other local sources. Clients only communicate gradients to the central server.

This diagram shows server-client communication under FLUTE’s architecture. Worker 0 acts as the server and contains the global model, client training data, the configuration, and the optimizer. Worker i receives a copy of the global model plus the task configuration. It also contains clients that are composed of the trainer and the optimizer. Each client sends the payload back to Worker 0.
Figure 1: FLUTE’s client-server architecture and workflow. First, the server pushes the initial global model to the clients and sends training information. Then, the clients train their instances of the global model with locally available data. Finally, all clients return the information to the server to aggregate the pseudo-gradients and produce a new global model that will be updated to the clients. This three-step process repeats for all rounds of training.
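
FLUTE’s own APIs handle the MPI messaging, client scheduling, and aggregation strategies, so the following is not FLUTE code; it’s a bare-bones PyTorch sketch of the three-step round in Figure 1, assuming a simple model with float-only parameters and a plain average of the pseudo-gradients.

```python
import copy
import torch
import torch.nn.functional as F

def simulate_round(global_model, client_loaders, lr=0.01, local_steps=10):
    """One simulated federated round: push the model, train locally, aggregate.

    Minimal illustration of the workflow in Figure 1; it ignores everything
    that makes FLUTE interesting at scale (worker distribution over MPI,
    privacy mechanisms, adaptive aggregation, and so on).
    """
    global_state = copy.deepcopy(global_model.state_dict())
    pseudo_grads = []

    for loader in client_loaders:                    # step 2: local training per client
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _, (x, y) in zip(range(local_steps), loader):
            opt.zero_grad()
            F.cross_entropy(local(x), y).backward()
            opt.step()
        # Pseudo-gradient: difference between global and locally trained weights.
        pseudo_grads.append({k: global_state[k] - v
                             for k, v in local.state_dict().items()})

    # Step 3: the server averages pseudo-gradients and updates the global model.
    new_state = {k: global_state[k] - torch.stack([g[k] for g in pseudo_grads]).mean(0)
                 for k in global_state}
    global_model.load_state_dict(new_state)
    return global_model
```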

The following features contribute to FLUTE’s versatile framework and enable experimentation with new federated learning approaches: 

  • Scalability: Scale is a critical factor in understanding practical metrics, such as convergence and privacy-utility tradeoffs. Researchers and developers can run large-scale experiments using tens of thousands of clients with a reasonable turnaround time. 
  • Flexibility: FLUTE supports diverse federated learning configurations, including standardized implementations such as DGA and FedAvg.
  • Versatility: FLUTE’s generic API helps researchers and developers easily implement new models, datasets, metrics, and experimentation features, while its open architecture helps them add new algorithms in such areas as optimization, privacy, and robustness.

Available as an open-source platform

As part of this announcement, we’re making FLUTE available as a versatile open-source platform for rapid prototyping and experimentation. It comes with a set of basic tools to help kickstart experiments. We hope researchers and developers take advantage of this framework by exploring new approaches to federated learning.

Looking ahead

FLUTE’s innovative framework offers a new paradigm for implementing federated learning algorithms at scale, and this is just the beginning. We’re making improvements with a view toward making FLUTE the standard federated learning simulation platform. Future releases will include algorithmic enhancements in optimization and support for additional communication protocols. We’re also adding features that make it easier to set up experiments with tailored features in new tasks, along with the ability to easily incorporate FLUTE as a library into Azure ML pipelines.

Additional resources 

Check out this video for a deep dive into the FLUTE architecture and a tutorial on how to use it. Our documentation also explains how to implement FLUTE.  

You can learn more about the FLUTE project by visiting our project page, and discover more about our current federated learning research as well as other projects related to privacy in AI on our group page.

Explore More

  • Download: FLUTE – Federated Learning Utilities and Tools for Experimentation (FLUTE) is a platform for conducting high-performance federated learning simulations.


Azure Quantum innovation: Efficient error correction of topological qubits with Floquet codes

Qubits arranged in a square array on a two-dimensional surface. Measurements are done on the qubits in a sequence of checks, shown as a repeating pattern of three steps. In each step, one measures a check on each pair of neighboring qubits, shown as a line connecting those qubits, with the lines moving in a repeating pattern over the three steps.
This graphic shows the repeating three-step sequence of checks used in Floquet codes. Each circle represents a qubit, and a line between a pair of circles indicates that that check is measured on that time step. The colors indicate the type of operator measured in each check, either XX, YY, or ZZ, so that the type of check measured also changes with time. Learn more about this sequence of checks in the section “Unlocking a new class of quantum codes” below. 

Technological innovation that enables scaling of quantum computing underpins the Microsoft Azure Quantum program. In March of this year, we announced our demonstration of the underlying physics required to create a topological qubit—qubits that are theorized to be inherently more stable than existing ones without sacrificing size or speed. However, our quest to deliver a general-purpose quantum computer capable of addressing industrial-scale problems will require innovation across every layer of the quantum stack, from materials at the nanoscale to algorithms and applications. At Azure Quantum, our full-stack approach and broad expertise across all areas of quantum computation allow us to drive innovation in this space through tight collaboration across theory, hardware, software, and systems teams. 

One of the greatest challenges in building a quantum computer is that quantum states are intrinsically fragile and are quickly destroyed when a qubit couples to its environment, leading to noise. A crucial technology to overcome this fragility, which is also used in classical digital computing, is error correction. By encoding the state of a single logical qubit into many physical qubits, quantum error correction (QEC) has the ability to detect and correct most errors that occur on the physical qubits. Indeed, such error correction needs to be at the heart of any scalable quantum system. Without it, no known qubit technology can protect quantum states sufficiently long enough to perform a calculation that can deliver real-world impact. However, quantum error correction also comes at a significant cost: depending on the quality of the physical qubits, error correction can increase the space requirements of a computation by a factor of several thousand and the time requirements more than tenfold. Therefore, any improvements on error correction have enormous positive ripple effects across the entire stack.

In this post, we’ll share some exciting implications from our recent innovations toward scale—specifically, how to perform quantum error correction in our topological quantum computation stack—published in the series of papers listed below. Topological qubits promise lower error rates than conventional qubits and as such can perform scalable quantum computation at lower overhead. On top of that, in these papers we introduce a new class of quantum error correction codes, called Floquet codes, which are particularly suited to topological qubits. Our new approaches culminate in an additional tenfold or more reduction in the overhead needed for error correction on topological qubits compared to the previous state of the art, opening a viable path toward scaling to a million qubits and beyond. 

Unlocking a new class of quantum codes 

To optimize performance on any quantum computing platform, the circuits must be adapted to the capabilities of the hardware. This is particularly true for error correction schemes, which must be tailor-made to exploit the strengths of a given hardware platform. Unlike most other qubits, our topological qubits employ a measurement-based scheme, where direct measurements between adjacent qubits are the native set of operations. While all quantum error correction schemes use frequent measurements to identify errors, the state-of-the-art schemes require complex multi-qubit measurements that can’t be implemented directly in the hardware and must be compiled into native operations at the expense of additional auxiliary qubits and additional timesteps. The outcomes of these measurements are used to infer the occurrence of errors without destroying the encoded quantum state. 

Our recent breakthroughs overcome this issue through a conceptually new perspective on quantum codes (put forward in “Dynamically Generated Logical Qubits” and “Boundaries for the Honeycomb code”), where the encoding of the quantum information is not static but rather allowed to periodically evolve in time. Many examples of physical systems are known where such periodic evolution allows new phenomena to occur (see, for example, the well-known Kapitza pendulum). The study of such systems falls under the term Floquet systems, which gives this new class of codes its name. 

These codes are built entirely from two-qubit measurements referred to as “check measurements.” Just like measurements in a conventional code, these are used to check for errors. The simplicity of these checks, however, means that each time we measure a check, we change the encoding of the quantum information, leading to the Floquet nature of the code. As a consequence, the outcomes of these measurements cannot be used directly to infer which errors have occurred, but rather the full history of measurement outcomes over time must be taken into account. 

The physical qubits are arranged in a lattice (such as that shown in Figure 1), represented as black dots on the vertices of this graph. Each check is associated with an edge of the graph, and one sequentially measures checks of different colors. The code state changes as the different checks are measured. There are several possible lattice arrangements of the qubits that allow for a natural implementation of a Floquet code. The lattices should have the following two properties: 1) each vertex should be attached to three edges and 2) using only three colors, it should be possible to color the plaquettes in such a way that no adjacent plaquettes have the same color (that is, the plaquettes should be “three-colorable”). While many such arrangements remain to be explored and the optimal choice will depend on details of the physical hardware, Figure 1 shows two possible Floquet-code arrangements. 
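
As a toy illustration of the repeating check sequence (not a simulation of an actual Floquet code, and not the lattices of Figure 1), the sketch below cycles through three sets of two-qubit checks and records the full history of outcomes, which is what the decoding step would consume.

```python
from itertools import cycle
import random

# Toy layout: each check is a pair of neighboring qubits plus the type of
# two-qubit operator measured on it. The three colored check sets of a real
# Floquet code would come from a three-colorable lattice like those in Figure 1.
CHECKS = {
    "XX": [(0, 1), (2, 3), (4, 5)],
    "YY": [(1, 2), (3, 4), (5, 0)],
    "ZZ": [(0, 4), (1, 3), (2, 5)],
}

def measure(pair, kind):
    # Stand-in for hardware or a stabilizer simulator returning a +/-1 outcome.
    return random.choice([+1, -1])

def run_schedule(rounds):
    """Measure the three check types in a repeating round-robin schedule.

    Because each round of checks changes how the logical information is
    encoded, errors must be inferred from the whole history of outcomes
    rather than from any single round.
    """
    history = []
    for step, kind in zip(range(rounds), cycle(["XX", "YY", "ZZ"])):
        outcomes = {pair: measure(pair, kind) for pair in CHECKS[kind]}
        history.append((step, kind, outcomes))
    return history
```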

Two different ways of tiling a surface.  In the 4.8.8 code configuration on the left, the surface is tiled with octagons and squares, and in the honeycomb code configuration it is tiled with hexagons.  Each shows a possible arrangement of qubits in a Floquet code, with the qubits at the vertices of the tiling. The tiling displays some more complicated features at the boundary, but in the middle it is a regular tiling.
Figure 1: Lattice of qubits used for two different Floquet codes, the 4.8.8 code (left) and the honeycomb code (right). The optimal choice of code depends on the level of noise present and on correlations in the noise. 

Error correction tailor-made for topological qubits 

In the realm of our measurement-based topological architecture, we have identified the two arrangements shown in Figure 1 as particularly appealing when combined with a particular design of topological qubit—a “tetron” qubit—which is also a scalable design. The connectivity of these two layouts can be naturally mapped onto the connectivity of an array of such tetrons, which is shown in Figure 2. Furthermore, the majority of the two-qubit check operators that are used to construct these codes are exactly those native operations between tetrons that can be implemented with minimal error, as shown in the lower panel of Figure 2. The details of these codes, their implementation with topological qubits, and numerical studies of their performance are discussed in “Performance of planar Floquet codes with Majorana-based qubits.”

Top panel: an array of qubits.  Each qubit is shown as a sideways “H,” with the long edges of the “H” being topological wires supporting Majorana modes, giving four Majorana modes on each qubit at the points of the “H.” The bottom panel shows different loops connecting different qubits to measure checks of the code.
Figure 2: Upper panel: Physical array of tetron qubits that can be used to implement either the honeycomb or 4.8.8 Floquet code. Lower panel: Mapping of measurement operations into physical interference loops that are used for two-qubit measurements. 

Our numerical simulations show that our Floquet codes and architecture implemented with topological “tetron” qubits help secure the path to a scalable quantum system in several ways. First, the very favorable threshold of these codes, which we estimate to be close to 1 percent, allows us to achieve quantum error correction earlier and demonstrate tangible steps on our journey toward quantum advantage. Second, in the longer run, we find that these codes reduce the overhead required for quantum error correction on topological qubits roughly tenfold compared to the previous state-of-the-art approach, which means that our scalable system can be built from fewer physical qubits and can run at a faster clock speed (see Figure 3 below).

A plot of the overhead due to error correction as a function of the performance of the physical qubits.  As the physical qubits are improved (lower noise, on the left side of the plot), the overhead is reduced. The plot shows that the Floquet codes outperform other codes by an order of magnitude.
Figure 3: Comparison of the spacetime overhead between the previous state-of-the-art (blue, dashed line) and the newly developed Floquet codes (black, solid line), both for an implementation on topological qubits. See Figure 8 in “Performance of planar Floquet codes with Majorana-based qubits” for more details. 

Approaching quantum computation from the unique topological perspective requires synchronized advancements across the entire Azure Quantum stack. Along with our recent demonstration of the building blocks for topological qubits, optimizing quantum error correction using Floquet codes represents a critical piece of the scientific foundation needed to achieve scaled quantum computation. These breakthroughs help establish a path and architecture for the industrial quantum machine.


MoLeR: Creating a path to more efficient drug design

Drug discovery has come a long way from its roots in serendipity. It is now an increasingly rational process, in which one important phase, called lead optimization, is the stepwise search for promising drug candidate compounds in the lab. In this phase, expert medicinal chemists work to improve “hit” molecules—compounds that demonstrate some promising properties, as well as some undesirable ones, in early screening. In subsequent testing, chemists try to adapt the structure of hit molecules to improve their biological efficacy and reduce potential side effects. This process combines knowledge, creativity, experience, and intuition, and often lasts for years. Over many decades, computational modelling techniques have been developed to help predict how the molecules will fare in the lab, so that costly and time-consuming experiments can focus on the most promising compounds.

Diagram illustrating the process of drug discovery. It uses icons for the various stages and arrows to show how drug discovery projects progress. The bottom section of the diagram shows the human-led approach, and the top section shows the computational modelling approach.
Figure 1: Classic human-led drug design (bottom) is an iterative process of proposing new compounds and testing them in vitro. As this process requires synthesis in the lab, it is very costly and time consuming. By using computational modelling (top), molecule design can be rapidly performed in silico, with only the most promising molecules promoted to be made in the lab and then eventually tested in vivo.

The Microsoft Generative Chemistry team is working with Novartis to improve these modelling techniques with a new model called MoLeR. 

“MoLeR illustrates how generative models based on deep learning can help transform the drug discovery process and enable our colleagues at Novartis to increase the efficiency in finding new compounds.”

Christopher Bishop, Technical Fellow and Laboratory Director, Microsoft Research Cambridge

We recently focused on predicting molecular properties using machine learning methods in the FS-Mol project. To further support the drug discovery process, we are also working on methods that can automatically design compounds that better fit project requirements than existing candidate compounds. This is an extremely difficult task, as only a few promising molecules exist in the vast and largely unexplored chemical space—estimated to contain up to 10⁶⁰ drug-like molecules. Just how big is that number? It would be enough molecules to reproduce the Earth billions of times. Finding them requires creativity and intuition that cannot be captured by fixed rules or hand-designed algorithms. This is why learning is crucial not only for the predictive task, as done in FS-Mol, but also for the generative task of coming up with new structures. 

In our earlier work, published at the 2018 Conference on Neural Information Processing Systems (NeurIPS), we described a generative model of molecules called CGVAE. While that model performed well on simple, synthetic tasks, we noted then that further improvements required the expertise of drug discovery specialists. In collaboration with experts at Novartis, we identified two issues limiting the applicability of the CGVAE model in real drug discovery projects: it cannot be naturally constrained to explore only molecules containing a particular substructure (called the scaffold), and it struggles to reproduce key structures, such as complex ring systems, due to its low-level, atom-by-atom generative procedure. To remove these limitations, we built MoLeR, which we describe in our new paper, “Learning to Extend Molecular Scaffolds with Structural Motifs,” published at the 2022 International Conference on Learning Representations (ICLR).

The MoLeR model

In the MoLeR model, we represent molecules as graphs, in which atoms appear as vertices that are connected by edges corresponding to the bonds. Our model is trained in the auto-encoder paradigm, meaning that it consists of an encoder—a graph neural network (GNN) that aims to compress an input molecule into a so-called latent code—and a decoder, which tries to reconstruct the original molecule from this code. As the decoder needs to decompress a short encoding into a graph of arbitrary size, we design the reconstruction process to be sequential. In each step, we extend a partially generated graph by adding new atoms or bonds. A crucial feature of our model is that the decoder makes predictions at each step solely based on a partial graph and a latent code, rather than on its earlier predictions. We also train MoLeR to construct the same molecule in a variety of different orders, as the construction order is an arbitrary choice. 
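
The released MoLeR code is the authoritative reference; the schematic module below only captures the property described above, namely that each decoding step conditions on a summary of the partial graph plus the latent code and not on earlier predictions. The GNN encoder and the graph data structure are left abstract.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """Schematic of a single MoLeR-style decoding step (not the released code).

    The partial molecule is summarized by a graph neural network into one
    vector; that vector is concatenated with the latent code, so each step
    depends only on (partial graph, latent code).
    """

    def __init__(self, gnn, latent_dim, hidden_dim, num_choices):
        super().__init__()
        self.gnn = gnn                           # any graph encoder returning a [batch, hidden_dim] summary
        self.pick_extension = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_choices),  # scores over motifs/atoms/bonds to add, plus "stop"
        )

    def forward(self, partial_graph, latent_code):
        graph_repr = self.gnn(partial_graph)     # summarize the partial molecule
        features = torch.cat([graph_repr, latent_code], dim=-1)
        return self.pick_extension(features)     # logits over the next extension
```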

Animation showing MoLeR decoding a latent code into a molecule step by step.
Figure 2: Given a latent code, that may either come from encoding a molecule or sampling from the prior distribution, MoLeR learns to decode it step-by-step. In each step, it extends a given partial molecule by adding atoms, bonds, or entire structural motifs. These choices are guided by graph neural networks (GNNs) trained on construction sequences for molecules in the training dataset. 

As we alluded to earlier, drug molecules are not random combinations of atoms. They tend to be composed of larger structural motifs, much like sentences in a natural language are compositions of words, and not random sequences of letters. Thus, unlike CGVAE, MoLeR first discovers these common building blocks from data, and is then trained to extend a partial molecule using entire motifs (rather than single atoms). Consequently, MoLeR not only needs fewer steps to construct drug-like molecules, but its generation procedure also occurs in steps that are more akin to the way chemists think about the construction of molecules. 
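
MoLeR builds its motif vocabulary with its own data-driven procedure described in the paper; purely as an illustration of breaking molecules into chemically sensible pieces and counting which ones recur, here is a small sketch that uses RDKit’s BRICS decomposition as a stand-in (it assumes RDKit is installed and will produce different fragments than MoLeR’s motifs).

```python
from collections import Counter
from rdkit import Chem
from rdkit.Chem import BRICS

def count_fragments(smiles_list):
    """Count recurring fragments across a set of molecules.

    Stand-in for MoLeR's motif-vocabulary construction: BRICS is used here
    only as an off-the-shelf way to cut molecules into sensible pieces.
    """
    counts = Counter()
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        counts.update(BRICS.BRICSDecompose(mol))
    return counts

# The most frequent fragments would then form a motif vocabulary for decoding.
vocab = count_fragments(["CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CCOC(=O)c1ccccc1N"])
print(vocab.most_common(5))
```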

Diagram with two parts (left and right), with an arrow pointing from left to right. The left part shows a molecule, while the right part shows the same molecule divided into chunks representing groups of atoms, which are formed by removing some of the bonds from the original molecule. Each chunk in the right part of the figure has a box around it.
Figure 3: Motif extraction strategy applied to Imatinib (a drug developed by Novartis, shown on the left) converts it into a collection of common building blocks and individual atoms (shown on the right, with motifs in red boxes and remaining atoms in blue ones). 

Drug-discovery projects often focus on a specific subset of the chemical space, by first defining a scaffold—a central part of the molecule that has already shown promising properties—and then exploring only those compounds that contain the scaffold as a subgraph. The design of MoLeR’s decoder allows us to seamlessly integrate an arbitrary scaffold by using it as an initial state in the decoding loop. As we randomize the generation order during training, MoLeR implicitly learns to complete arbitrary subgraphs, making it ideal for focused scaffold-based exploration. 

Diagram showing a 5x5 grid, with each cell depicting one molecule. The molecule in the middle has a box around it. All the molecules are different, but relatively similar, and all contain a particular substructure, which is marked in red.
Figure 4: Given a molecule (shown in the box in the center) containing a particular scaffold of interest (highlighted in red), MoLeR can traverse its scaffold-constrained latent space, and propose “neighbors” of the given molecule that have similar structure and properties. 

Optimization with MoLeR

Even after training our model as discussed above, MoLeR has no notion of “optimization” of molecules. However, like related approaches, we can perform optimization in the space of latent codes using an off-the-shelf black-box optimization algorithm. This was not possible with CGVAE, which used a much more complicated encoding of graphs. In our work, we opted for using Molecular Swarm Optimization (MSO), which shows state-of-the-art results for latent space optimization in other models, and indeed we found it to work very well for MoLeR. In particular, we evaluated optimization with MSO and MoLeR on new benchmark tasks that are similar to realistic drug discovery projects using large scaffolds and found this combination to outperform existing models. 
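
The combination we evaluate is MoLeR with MSO; as a schematic of the general recipe (optimize a property score by moving through latent space), here is a hedged sketch in which a simple stochastic hill climb stands in for the swarm optimizer, and `decode` and `score` are assumed callables.

```python
import numpy as np

def optimize_latent(decode, score, dim, iters=200, step=0.1, seed=0):
    """Black-box optimization over latent codes (sketch, not MSO).

    `decode(z)` maps a latent vector to a molecule and `score(mol)` is any
    property objective, such as a predicted activity; both are assumed
    callables. A stochastic hill climb stands in for the particle swarm
    used in the actual MoLeR + MSO pipeline.
    """
    rng = np.random.default_rng(seed)
    best_z = rng.normal(size=dim)
    best_score = score(decode(best_z))
    for _ in range(iters):
        candidate = best_z + step * rng.normal(size=dim)
        candidate_score = score(decode(candidate))
        if candidate_score > best_score:
            best_z, best_score = candidate, candidate_score
    return decode(best_z), best_score
```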

Outlook

We continue to work with Novartis to focus machine learning research on problems relevant to the real-world drug discovery process. The early results are substantially better than those of competing methods, including our earlier CGVAE model. With time, we hope MoLeR-generated compounds will reach the final stages of drug-discovery projects, eventually contributing to new useful drugs that benefit humanity. 


PPE: A fast and provably efficient RL algorithm for exogenous noise


Picture a person walking in a park by a pond. The surrounding environment contains a number of moving objects that change the quality of the environment: clouds moving to hide the sun, altering the quality of light; ducks gliding across the pond, causing its surface to ripple; people walking along a path, their images reflecting on the water. If we’re creating an AI model for navigating to a given goal, for example, a robot navigating to a specific location in a park to deliver a package, we want this model to recognize the robot and any obstacle in its way, but not the changes in its surrounding environment that occur independently of the agent, which we define as exogenous noise.

Although reinforcement learning (RL) has proven to be a successful paradigm for training AI models in navigation tasks, often used in gaming, existing RL methods are not yet robust enough to handle exogenous noise. While they may be able to heuristically solve certain problems, such as helping a robot navigate to a specific destination in a particular environment, there is no guarantee that they can solve problems in environments they have not seen.

In this post, we introduce Path Predictive Elimination (PPE), the first RL algorithm that can solve the problem of exogenous noise with a mathematical guarantee. Specifically, for any problem that satisfies certain assumptions, the algorithm succeeds in solving the problem using a small number of episodes. We discuss this algorithm in detail in our paper, “Provable RL with Exogenous Distractors via Multistep Inverse Dynamics.”

A view of a park with an agent walking along a trail. Sources of exogenous noise surround the agent, including ducks gliding on a pond, people in the background, and reflections on the water. 
Figure 1: A robot walking in a park to a specific destination. The environment has many sources of exogenous noise, such as people walking in the background as their reflections appear on the water and ducks gliding along the surface of the pond.

Real-world RL and exogenous noise

To understand how PPE works, it’s important to first discuss how a real-world RL agent (the decision-maker) operates. Agents have an action space with \(A\) actions and receive information about the world in the form of an observation. In our example, the robot is the agent, and its action space contains four actions: a step forward, backward, left, or right.

After an agent takes a single action, it gets a new observation—that is, it receives more information about its environment—along with a reward. If the robot observes the park through a camera, the observation takes the form of an image. When an agent has a task to solve, such as reaching a specific destination, it must take a sequence of actions, each resulting in a reward. Its goal is to maximize the sum of rewards. When the robot takes a step forward, the camera generates a new observation of the park, and it receives a reward for this action. It may get a reward of 1 for the first action that takes it toward its goal and 0 otherwise. 
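
The interaction loop just described is the standard one; a generic sketch (Gym-style conventions, simplified to a three-value step return) looks like this.

```python
def run_episode(env, policy, horizon):
    """Generic agent-environment loop for the setting described above.

    `env.reset()` and `env.step(action)` follow a Gym-style convention,
    simplified here to a three-value step return; the agent's objective is
    to maximize the sum of per-step rewards over the episode.
    """
    observation = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(observation)        # e.g., forward, backward, left, or right
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```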

Key challenges in real-world RL include how to handle complex observations and very large observation spaces. In our example, the robot in the park will have to work with an image that contains relevant information, such as the position of the destination, but this information is not directly accessible due to the exogenous noise and camera-generated image noise in the observation.

An image can be in a 500 x 500 x 3 pixel space, where each pixel takes 255 values. This would give us \(255^{500 \times 500 \times 3}\) different images, which is an extremely large number of possibilities. However, the environment is much simpler to describe than this number suggests. This means the observation in an RL environment is generated from a much more compact but hidden endogenous state. In our park example, the endogenous state contains the position of the agent, the destination, and any obstacles around the agent.
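
A quick back-of-the-envelope check of that count, in Python:

```python
import math

# Each of the 500 x 500 x 3 values can take 255 levels, so the number of
# distinct images is 255 ** (500 * 500 * 3). The number itself is far too
# large to print, but its size is easy to gauge from its digit count.
values = 500 * 500 * 3
digits = values * math.log10(255)
print(f"255^{values} has roughly {digits:,.0f} decimal digits")  # ~1.8 million digits
```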

In our paper, we assume that the endogenous state dynamics are near-deterministic. That is, taking a fixed action in an endogenous state always leads to the same next endogenous state in most cases. We also require that it is possible to extract the endogenous state from an observation. However, we make no assumptions about dynamics of exogenous noise or how observations are generated.

Most existing RL algorithms are either unable to solve problems containing complex observations or lack a mathematical guarantee for working on new, untried problems. This guarantee is desirable because the cost of failure in the real world can be potentially high. Many existing algorithms require an impractically large amount of data to succeed, requiring the agent to perform a large number of actions before it solves the task.

PPE takes an approach called hidden state decoding, where the agent learns a type of ML model called a decoder to extract the hidden endogenous state from an observation. It does this in a self-supervised manner, meaning it does not require a human to provide it with labels. For example, PPE can learn a decoder to extract the robot and any obstacle’s position in the park. PPE is the first provable algorithm that can extract the endogenous state and use it to perform RL efficiently.

Path Prediction and Elimination: An RL algorithm that is robust to exogenous noise

PPE is simple to implement and is fast to run. It works by learning a small set of paths that can take the agent to all possible endogenous states. The agent can technically consider all possible paths of length \(h\), enabling it to visit every endogenous state. However, as there are \(A^h\) possible paths of length \(h\), the number of paths will overwhelm the agent as \(h\) increases. The more paths the agent has to work with, the more data it needs to solve a given task. Ideally, if there are \(S\) endogenous states, we need just \(S\) paths, with only one unique path going to each endogenous state. PPE works by eliminating redundant paths that visit the same endogenous state by solving a novel self-supervised classification task.

PPE is similar in structure to the breadth-first search algorithm in that it runs a for-loop, where, in iteration \(h\) of the loop, the agent learns to visit all endogenous states that can be reached by taking \(h\) actions. At the start of iteration \(h\), the agent maintains a list of paths of length \(h\). This list has a path to visit every endogenous state that’s reachable after taking \(h\) actions. However, this list may also contain redundant paths, i.e., multiple paths that reach the same endogenous state. When this list is simply all paths of length 1, it corresponds to every action in the agent’s action space.

The top of Figure 2 shows the agent’s initial list of paths, which contains at least three paths: \(\pi_1\), \(\pi_2\), and \(\pi_3\). The first two paths reach the same destination, denoted by the endogenous state \(s_1\). In contrast, the last path \(\pi_3\) reaches a different endogenous state \(s_2\). Figure 2 shows a sampled observation (or image) for each endogenous state.

Because PPE wants to learn a small set of paths to visit all endogenous states, it seeks to eliminate the redundant paths by collecting a dataset of observations coupled with the path that was followed to observe them. In Figure 2, both \(\pi_1\) and \(\pi_2\) reach the same endogenous state, so one of them can be eliminated. This is done by randomly selecting a path in its list, following this path to the end, and saving the last observation. For example, our dataset can contain a tuple \((\pi_1, x)\), where \(\pi_1\) is the policy in our list and \(x\) is the image in the top right of Figure 2. PPE collects a dataset of many such tuples.

This animation shows an iteration for a PPE algorithm. At the start of iteration, the algorithm contains a list of paths to visit endogenous states, including three redundant paths, two of which visit the same endogenous state, while a third visits a different endogenous state. It also shows two sampled observations for these endogenous states. PPE eliminates the redundant path while keeping the other paths.
Figure 2: Execution of the PPE algorithm at a given for-loop iteration. For each iteration, PPE starts with a list of paths to visit endogenous states and then eliminates redundant paths—those that visit an endogenous state that can also be reached by an existing path. The extra path, \(\pi_2\), is eliminated because it reaches an endogenous state that can also be reached by an existing path \(\pi_1\).

PPE then solves a multiclass classification problem to predict the index of the path from the last observation. The index of a path is computed with respect to the original list. This classification problem can be solved with any appropriate model class, such as deep neural networks, using PyTorch, TensorFlow, or a library of your choice. If two different paths, \(\pi_1\) and \(\pi_2\), reach the same endogenous state, the learned classifier won’t be able to deterministically predict which path was used to visit observations from this state. That is, the learned classifier predicts a high probability for both paths given an observation from this endogenous state. PPE uses this confusion signal to eliminate one of these paths because both paths reach the same endogenous state. PPE also learns a decoder as a result of solving the classification problem described above, which maps an observation to the index of the leftover path with the highest probability under the learned classifier.

At the end of iteration \(h\) of the for-loop, PPE will have found a list of leftover paths that includes a unique path for every endogenous state that’s reachable after taking \(h\) actions. It then expands these leftover paths to create the list for the next iteration of the for-loop. For every path that’s left over, PPE creates \(A\) new paths by concatenating every action to the end of the path. The for-loop then continues with the next iteration.
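
Abstracting away the learning details, the loop just described has the following shape. This is an illustrative sketch rather than the paper’s algorithm statement, and `collect`, `train_classifier`, and `group_confused_paths` are hypothetical helpers standing in for rollout collection, the multiclass classifier, and the confusion-based elimination step.

```python
import random

def ppe_sketch(actions, horizon, collect, train_classifier, group_confused_paths,
               n_samples=1000):
    """High-level sketch of PPE's main for-loop.

    `collect(path)` rolls a path out in the environment and returns the final
    observation; `train_classifier(dataset, num_classes)` fits a multiclass
    model predicting which path produced an observation; `group_confused_paths`
    groups paths the classifier cannot tell apart (all hypothetical helpers).
    Paths in the same group are treated as reaching the same endogenous state,
    so only one representative per group is kept and extended.
    """
    paths = [[a] for a in actions]                 # all paths of length 1
    for _ in range(1, horizon):
        # 1. Collect (path index, final observation) pairs.
        dataset = []
        for _ in range(n_samples):
            i = random.randrange(len(paths))
            dataset.append((i, collect(paths[i])))

        # 2. Train the path classifier and find confusable (redundant) paths.
        classifier = train_classifier(dataset, num_classes=len(paths))
        groups = group_confused_paths(classifier, paths)

        # 3. Keep one path per endogenous state, then extend by every action.
        survivors = [group[0] for group in groups]
        paths = [path + [a] for path in survivors for a in actions]
    return paths
```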

Note that the above steps of PPE can be computed even in the absence of rewards. The output of these steps, namely the decoder and the learned leftover paths, can be cached and used to optimize any reward functions provided later. We discuss various strategies to optimize any given reward function in our paper, including both model-free and model-based approaches.

Proof, experiment, and code

The paper also provides a mathematical proof that PPE efficiently solves a large class of RL problems. Using a small amount of data, it can accurately explore, find a policy that achieves maximum sum of rewards, recover a decoder that maps the observation to its hidden endogenous state, and recover the dynamics of the endogenous state with a high probability. We describe various experiments where PPE successfully performs these tasks in line with its mathematical guarantee and outperforms various prior methods.

This is illustrated in Figure 3. It depicts a visual grid-world where the agent’s goal is to navigate to the slice of pizza on the other side of the pond, which is populated by two ducks that move independently of the agent’s actions and are the source of exogenous noise. The endogenous state will consist of the position of the agent. The figure shows what PPE is expected to do in this task. It will gradually learn longer paths that reach various endogenous states in the environment. It will also learn a decoder and use it to extract the dynamics of the latent endogenous state, shown on the right.

This animation shows an agent navigating in a grid-world task to reach a goal on the opposite side. Sources of exogenous noise appear between the agent and its goal. These change in position independent of the agent. The PPE learning paths of longer length explore the environment and finally reach the goal. On the right of the animation, we show the dynamics of the endogenous state that is being extracted by PPE. The dynamics are represented by green circles that denote endogenous states. Arrows between two circles shows whether it is possible for the agent to move between the corresponding endogenous states. The endogenous state in the dynamics corresponds to the position of the agent in the grid-world.
Figure 3: The area on the left shows a visual grid-world navigation task where an agent is trying to reach a slice of pizza. The motion of the ducks is a source of exogenous noise. PPE allows the agent to learn a small set of paths to visit every endogenous state. On the right, PPE also learns a decoder and uses it to extract the dynamics of the latent endogenous state. The circles denote an endogenous state and the arrows denote possible ways to navigate from one endogenous state to another.

The road ahead

While PPE is the first RL algorithm that offers a mathematical guarantee in the presence of exogenous noise, there is still work to do before we can solve every RL problem that includes exogenous noise. Some of the unanswered questions that we are pursuing include:

  1. How can we eliminate the assumption that PPE makes, that latent endogenous state dynamics are near-deterministic?
  2. Can we extend PPE to work in nonepisodic settings, where the agent generates a single long episode?
  3. How does PPE perform on real-world problems?
  4. Can we make PPE a truly online algorithm, eliminating the need to collect large datasets before it improves?

RL algorithms hold great promise for improving applications in a diverse range of fields, from robotics, gaming, and software debugging, to healthcare. However, exogenous noise presents a serious challenge in unlocking the full potential of RL agents in the real world. We’re hopeful that PPE will motivate further research in RL in the presence of exogenous noise.


Don’t let data drift derail edge compute machine learning models

Diagram showing Ekya’s architecture. Video data flows from a series of cameras into specialized, lightweight inference models and shared resource pools before reaching the edge.

Edge computing has come of age, with deployments enabling many applications that process data from IoT sensors and cameras. In 2017, we identified the symbiotic relationship between edge computing and video analytics in an article, noting that live video analytics is the “killer app” for edge computing. Edge devices come in various shapes and sizes but are inherently resource-constrained relative to the cloud. 

These resource constraints necessitate lightweight machine learning (ML) models at the edge. Using techniques for model specialization and compression, the community has obtained edge models whose compute and memory footprints are substantially lower (by 96x for object detector models). Such models are highly amenable to deployment at the edge. 

Smooth going so far, but the villain in the story is data drift! This is the phenomenon where the live data in the field diverges significantly from the initial training data. We achieved the phenomenally low compute footprints for edge models only because we specialized the models to be specific to the camera streams. But in the bargain, they lost their ability to generalize much beyond what they have seen during training. This lack of generality comes back to bite us when data drifts, and the accuracy of the models drops – by as much as 22% – when they are deployed in the field. 

Ekya is a solution, developed with collaborators at University of California, Berkeley and University of Chicago, that addresses the problem of data drift on the edge compute box. Instead of sending video data to the cloud for periodic retraining of models, which is costly in its bandwidth usage and can raise privacy questions, Ekya enables both retraining and inference to co-exist on the edge box. For more details, take a look at our paper: Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers, which has been published at NSDI 2022. We are excited to release the code for Ekya as well. 
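
Ekya’s contribution is the joint scheduling of retraining and inference on shared edge resources, which the paper and code cover in detail; the sketch below is only the naive skeleton of on-box continuous learning that such a scheduler would manage, with `model`, `retrain`, and `label_fn` as assumed callables (the last standing in for labels from a heavier “golden” model).

```python
import collections

def continuous_learning_loop(model, stream, retrain, label_fn,
                             window=1000, retrain_every=500):
    """Bare-bones continuous learning on an edge box (not Ekya's scheduler).

    Inference runs on every frame, while a sliding window of recent frames is
    periodically labeled by a heavier "golden" model (`label_fn`) and used to
    retrain the lightweight model so that it tracks data drift. Ekya's actual
    contribution is jointly scheduling these two jobs on shared edge GPUs so
    that neither starves the other.
    """
    buffer = collections.deque(maxlen=window)
    for i, frame in enumerate(stream):
        prediction = model.predict(frame)      # keep serving inference continuously
        buffer.append(frame)
        if i > 0 and i % retrain_every == 0:
            labeled = [(f, label_fn(f)) for f in buffer]
            model = retrain(model, labeled)    # re-specialize to recent data
        yield prediction
```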

Not only can you use the code to reproduce all the experiments in our paper, but we also hope that the code can help you easily build a continuous learning system for your edge deployment. Oh, and one more thing—we are also pointing to the raw video datasets released by the City of Bellevue. This includes 101 hours of video from five traffic intersections, all of which have also been labeled with our golden YOLOv3 model. We hope that the videos from the City of Bellevue, as well as the other datasets included in the repository, will aid in the building of new edge models and in improving our pre-trained specialized models to significantly advance the state of the art.

Please reach out to Ganesh Ananthanarayanan with any questions.

Explore More

  • Video: Video Analytics for Smart Cities – Microsoft Research has an ongoing pilot in Bellevue, Washington, for active traffic monitoring of traffic intersections live 24x7. This project focuses on video streams from cameras at traffic intersections. Traffic-related accidents are among the top 10 reasons […]
