Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability

Early last year, our research team from the Visual Computing Group introduced Swin Transformer, a Transformer-based general-purpose computer vision architecture that for the first time beat convolutional neural networks on the important vision benchmark of COCO object detection and did so by a large margin. Convolutional neural networks (CNNs) have long been the architecture of choice for classifying images and detecting objects within them, among other key computer vision tasks. Swin Transformer offers an alternative. Leveraging the Transformer architecture’s adaptive computing capability, Swin can achieve higher accuracy. More importantly, Swin Transformer provides an opportunity to unify the architectures in computer vision and natural language processing (NLP), where the Transformer has been the dominant architecture for years and has benefited the field because of its ability to be scaled up.
So far, Swin Transformer has shown early signs of its potential as a strong backbone architecture for a variety of computer vision problems, powering the top entries of many important vision benchmarks such as COCO object detection, ADE20K semantic segmentation, and CelebA-HQ image generation. It has also been well-received by the computer vision research community, garnering the Marr Prize for best paper at the 2021 International Conference on Computer Vision (ICCV). Together with works such as CSWin, Focal Transformer, and CvT, also from teams within Microsoft, Swin is helping to demonstrate the Transformer architecture as a viable option for many vision challenges. However, we believe there’s much work ahead, and we’re on an adventurous journey to explore the full potential of Swin Transformer.
In the past few years, one of the most important discoveries in the field of NLP has been that scaling up model capacity can continually push the state of the art for various NLP tasks, and the larger the model, the better its ability to adapt to new tasks with very little or no training data. Can the same be achieved in computer vision, and if so, how?
In pursuit of answers, we scaled up Swin Transformer to 3 billion parameters, the largest and most effective dense vision model to date. Vision models with up to 1.8 billion parameters had been successfully trained before, but they required billions of labeled images to train well and were applicable only to image classification. With our model, SwinV2-G, we address a common obstacle when increasing model size in the computer vision space—training instability—to support more parameters. And thanks to a technique we developed to address the resolution gap that exists between pretraining and fine-tuning tasks, SwinV2-G marks the first time that a billion-scale vision model has been applied to a broader set of vision tasks. Additionally, leveraging a self-supervised pretraining approach we call SimMIM, SwinV2-G uses 40 times less labeled data and 40 times less training time than previous works to drive the learning of billion-scale vision models.
SwinV2-G achieved state-of-the-art accuracy on four representative benchmarks when it was released in November: ImageNetV2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification.
Our experience and lessons learned in exploring the training and application of large vision models are described in two papers—“Swin Transformer V2: Scaling Up Capacity and Resolution” and “SimMIM: A Simple Framework for Masked Image Modeling”—both of which are being presented at the 2022 Computer Vision and Pattern Recognition Conference (CVPR). The code for Swin Transformer and the code for SimMIM are both available on GitHub. (For the purposes of this blog and our paper, the upgraded Swin Transformer architecture resulting from this work is referred to as V2.)
Improving training stability
The first issue we faced when training large models was the problem of training instability. We observed that as models get larger, it becomes very easy for them to crash. After checking the feature values of each layer of the models we trained in scaling up Swin Transformer to 3 billion parameters, we found the cause of the instability: large feature variance discrepancy between different layers.
As shown in Figure 1, the average feature variance in the deep layers of the original Swin Transformer model increases significantly as the model grows larger. With a 200-million-parameter Swin-L model, the discrepancy between layers with the highest and lowest average feature variance has reached an extreme value of 10^4. Crashing occurs during training when the model capacity is further scaled to 658 million parameters (Swin-H).

Looking closely at the architecture of the original Swin Transformer, we found that this was due to the output of the residual branch being added directly back to the main branch without normalization: the unconstrained output feature values could be very large compared to the input. As illustrated in Figure 2 (left), after one Transformer block, the feature values of the output can grow to 61 times those of the input. To alleviate this problem, we propose a new normalization method called residual-post-normalization. As shown in Figure 2 (right), this method moves the normalization layer from the beginning to the end of each residual branch so that the output of each residual branch is normalized before being merged back into the main branch. In this way, the average feature variance of the main branch doesn’t increase significantly as the layers deepen. Experiments have shown that this new normalization method moderates the average feature variance of each layer in the model (see the dashed lines in Figure 1; the SwinV2 models have the same respective number of parameters as the SwinV1 models: 200 million [L] and 658 million [H]).
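To make the idea concrete, here is a minimal PyTorch sketch of a residual block with residual-post-normalization. The `attn` and `mlp` sub-modules stand in for the actual Swin attention and MLP layers; only the placement of the normalization is the point being illustrated.

```python
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    """Minimal sketch of residual-post-normalization: the LayerNorm is applied
    to the *output* of each residual branch before it is added back to the
    main branch, instead of to the branch input as in the original design."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.attn = attn                  # placeholder for the Swin attention module
        self.mlp = mlp                    # placeholder for the Swin MLP module
        self.norm1 = nn.LayerNorm(dim)    # moved to the end of the attention branch
        self.norm2 = nn.LayerNorm(dim)    # moved to the end of the MLP branch

    def forward(self, x):
        x = x + self.norm1(self.attn(x))  # branch output normalized before the merge
        x = x + self.norm2(self.mlp(x))
        return x
```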

In addition, we found that as the model becomes larger, the attention weights of certain layers tend to be dominated by a few specific points in the original self-attention computation, especially when residual-post-normalization is used. To tackle this problem, we further propose a scaled cosine attention mechanism (see Figure 2, right) to replace the standard dot-product attention unit. In the new scaled cosine attention unit, the computation of self-attention is independent of the input magnitude, resulting in less saturated attention weights.
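A minimal single-head sketch of scaled cosine attention is shown below, assuming inputs of shape (batch, tokens, dim). The learnable temperature and the log-spaced position bias term used in Swin V2 are simplified or omitted here; this is an illustration of the magnitude-independent similarity, not the full implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Sketch of scaled cosine attention: similarity between queries and keys
    is a cosine similarity divided by a learnable temperature, so attention
    weights do not depend on the magnitude of the input features."""
    def __init__(self, dim, init_tau=0.1):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_tau)))

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # cosine similarity is bounded and independent of feature magnitude
        sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        tau = torch.clamp(self.log_tau.exp(), min=0.01)   # keep attention from saturating
        attn = (sim / tau).softmax(dim=-1)
        return attn @ v
```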
Experiments have shown that residual-post-normalization and the scaled cosine attention mechanism not only stabilize the training dynamics of large models but also improve accuracy.
Addressing large resolution gaps between pretraining and fine-tuning tasks
Another difficulty with large vision models is that the image resolution discrepancy between pretraining and fine-tuning can be large: pretraining is typically carried out at low resolutions, while many downstream vision tasks require high-resolution input images or attention windows, as shown in Figure 3.
In Swin Transformer, there is a relative position bias term in the attention unit that represents the impact of one image patch on another based on their relative position. This term is learned in pretraining. However, since the relative position range at fine-tuning can change significantly compared to pretraining, we need techniques to initialize the biases at new relative positions not seen during pretraining. While the original Swin Transformer architecture uses a handcrafted bicubic interpolation method to transfer the old relative position biases to the new resolution, we found it is not very effective when the resolution discrepancy between pretraining and fine-tuning is large.
To resolve this problem, we propose a log-spaced continuous position bias approach (Log-spaced CPB). By applying a small meta-network to the relative position coordinates in log space, Log-spaced CPB can generate position bias for any coordinate range. Since the meta-network can take arbitrary coordinates as input, a pretrained model can freely transfer between different window sizes by sharing the weights of a meta-network. Moreover, by converting the coordinates to a log space, the extrapolation rate required to transfer between different window resolutions is much smaller than with using the original linear space coordinates.
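The sketch below illustrates the idea in PyTorch: a small meta-network maps log-scaled relative coordinates to a per-head bias for every pair of positions in a window. The exact coordinate normalization and MLP configuration in Swin V2 differ in detail; this is a simplified illustration.

```python
import torch
import torch.nn as nn

class LogSpacedCPB(nn.Module):
    """Sketch of log-spaced continuous position bias: an MLP generates the
    relative position bias for any window size, so the same weights can be
    reused when the window resolution changes between pretraining and
    fine-tuning."""
    def __init__(self, num_heads, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_heads))

    def forward(self, window_size):
        # 2D coordinates of every position in a (window_size x window_size) window
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"), dim=-1)
        coords = coords.reshape(-1, 2).float()
        rel = coords[:, None, :] - coords[None, :, :]        # (N, N, 2) relative offsets
        # sign-preserving log scaling shrinks the extrapolation range for large offsets
        rel = torch.sign(rel) * torch.log1p(rel.abs())
        return self.mlp(rel).permute(2, 0, 1)                # (num_heads, N, N) bias
```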

Using Log-spaced CPB, Swin Transformer V2 achieves smooth transferring between different resolutions, enabling us to use a smaller image resolution—192 × 192—with no accuracy loss on downstream tasks compared with the standard 224 × 224 resolution used in pretraining. This speeds up training by 50 percent.
Scaling model capacity and resolution results in excessive GPU memory consumption for existing vision models. To address the memory issue, we combined several crucial techniques, including the Zero-Redundancy Optimizer (ZeRO), activation checkpointing, and a new sequential self-attention implementation. With these techniques, GPU memory consumption is significantly reduced for large-scale models and large resolutions with little impact on training speed. The GPU savings also allow us to train the 3-billion-parameter SwinV2-G model on images with resolutions of up to 1,536 × 1,536 using 40-gigabyte A100 GPUs, making it applicable to a range of vision tasks requiring high resolution, such as object detection.
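As a small illustration of one of these techniques, activation checkpointing in PyTorch trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them. The `blocks` argument is a placeholder for any sequence of Transformer blocks.

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    """Sketch of activation checkpointing: activations inside each block are
    not kept during the forward pass and are recomputed during backward,
    reducing peak GPU memory at the cost of extra compute."""
    for block in blocks:
        x = checkpoint(block, x)
    return x
```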
Tackling the data starvation problem for large vision models
Training larger models often requires more labeled data; however, the computer vision field lacks such labeled data at scale because of the high cost of human-annotated data. This has compelled the vision field to explore the training of large models with smaller amounts of labeled data. To this aim, we introduce the self-supervised pretraining approach SimMIM, short for Simple Framework for Masked Image Modeling.
As shown in Figure 4, SimMIM learns image representations through masked image modeling, a pretext task in which a portion of an input image is masked and the model predicts the RGB values of the masked area given the visible parts. With this approach, the rich information contained in each image is better exploited, which allows us to use less data to drive the training of large models. With SimMIM, we were able to train the SwinV2-G model using only 70 million labeled images, 40 times fewer than previous billion-scale vision models.
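A minimal sketch of the pretext task is shown below, assuming an `encoder` that accepts a patch mask (masked patches are replaced internally, e.g., by a learnable token) and a lightweight `decoder` that reconstructs raw pixels. Both are assumptions for illustration; the reconstruction loss on masked pixels is the core idea.

```python
import torch
import torch.nn.functional as F

def simmim_loss(encoder, decoder, images, patch_size=32, mask_ratio=0.6):
    """Sketch of the SimMIM objective: randomly mask a fraction of patches,
    reconstruct raw RGB values, and compute an L1 loss on masked pixels only."""
    b, c, h, w = images.shape
    num_patches = (h // patch_size) * (w // patch_size)
    mask = torch.rand(b, num_patches, device=images.device) < mask_ratio  # True = masked

    features = encoder(images, mask)   # assumed: encoder handles masked patches internally
    pred = decoder(features)           # assumed: (b, c, h, w) pixel reconstruction

    pixel_mask = mask.reshape(b, 1, h // patch_size, w // patch_size).float()
    pixel_mask = F.interpolate(pixel_mask, scale_factor=patch_size, mode="nearest")

    # L1 reconstruction loss averaged over masked pixels
    loss = (F.l1_loss(pred, images, reduction="none") * pixel_mask).sum()
    return loss / (pixel_mask.sum() * c + 1e-8)
```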

Setting new records on four representative vision benchmarks
By scaling up model capacity and resolution, Swin Transformer V2 set new records on four representative vision benchmarks when it was introduced in November: 84.0 percent top-1 accuracy on ImageNetV2 image classification; 63.1 / 54.4 box / mask mean average precision (mAP) on COCO object detection; 59.9 mean Intersection-over-Union (mIoU) on ADE20K semantic segmentation; and 86.8 percent top-1 accuracy on Kinetics-400 video action classification.
| Benchmark | ImageNetV2 | COCO test-dev | ADE20K val | Kinetics-400 |
| --- | --- | --- | --- | --- |
| Swin V1 | 77.5 | 58.7 | 53.5 | 84.9 |
| Previous state of the art | 83.3 (Google, July 2021) | 61.3 (Microsoft, July 2021) | 58.4 (Microsoft, October 2021) | 85.4 (Google, October 2021) |
| Swin V2 (November 2021) | 84.0 (+0.7) | 63.1 (+1.8) | 59.9 (+1.5) | 86.8 (+1.4) |
We hope this strong performance on a variety of vision tasks can encourage the field to invest more in scaling up vision models and that the provided training “recipe” can facilitate future research in this direction.
To learn more about the Swin Transformer journey, check out our Tech Minutes video.
Swin Transformer research team
(In alphabetical order) Yue Cao, Li Dong, Baining Guo, Han Hu, Stephen Lin, Yutong Lin, Ze Liu, Jia Ning, Furu Wei, Yixuan Wei, Zhenda Xie, Zhuliang Yao, and Zheng Zhang
The post Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability appeared first on Microsoft Research.
ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications

ICLR (International Conference on Learning Representations) is recognized as one of the top conferences in the field of deep learning. Many influential papers on artificial intelligence, statistics, and data science—as well as important application fields such as machine vision, speech recognition, and text understanding—have been published and presented at this conference. The following selection of papers accepted at ICLR 2022 showcases the latest research from Microsoft and its collaborators on vision pre-training, periodic time series forecasting, differential privacy, code completion, table pre-training, and online reinforcement learning.
As research into deep learning continues to grow and change, Microsoft researchers and collaborators are broadening their approach to the field. As several papers highlighted in this post demonstrate, research teams continue to refine their thinking about how various machine learning techniques can best be implemented for real-world applications—whether for specialized applications in industry or a more generalized approach to improving decision-making in models overall. They are also gaining a deeper understanding of how different modalities, like computer vision, can extend applications of machine learning beyond language.
In parallel with exploring real-world applications and multimodality, researchers are looking to the future of machine learning techniques, further exploring uncharted areas in deep online and offline reinforcement learning. In subfields such as the latter, the fundamentals of how models learn from and interact with data are evolving—and so are the ways that researchers think about optimizing those processes and reworking them for situations where data is scarce or unavailable in real-world settings.
This post is a sampling of the work that researchers at Microsoft Research Asia and their collaborators presented at ICLR 2022, which reflects the broad scope of the company’s machine learning research. You can learn more about the work accepted at this year’s event on the “Microsoft at ICLR 2022” event page. On the Microsoft Research Blog, you can dive deeper into two papers accepted at the conference—one on MoLeR, a model that represents molecules as graphs to improve drug discovery, and the other on Path Predictive Elimination (PPE), a reinforcement learning method that is robust enough to remove noise from continuously changing environments.
DEPTS: Deep expansion learning for periodic time series forecasting

People and organizations involved: Shun Zheng, Xiaohan Yi, Wei Cao, Jiang Bian, and Tie-Yan Liu from Microsoft Research Asia; Wei Fan and Yanjie Fu from the University of Central Florida.
According to the paper: Periodic time series (PTS, or time series with apparent periodic oscillations) is widespread across industries such as transportation, electric power generation and transmission, sustainability, and others. PTS forecasting plays a crucial role in these industries because it can help businesses with many critical tasks, including early warning, pre-planning, and resource scheduling. However, PTS forecasting is challenging for two reasons: the signal’s complicated dependence on its inherent periodicity, and the diversity and complexity of the individual periods themselves.
This paper introduces DEPTS, a deep expansion learning framework for PTS forecasting. DEPTS begins with a novel decoupled formulation by introducing the periodic state as a hidden variable, which allows the researchers to create custom modules to tackle the two challenges mentioned above. To address the first challenge, the researchers develop an expansion module on top of residual learning to perform a layer-by-layer expansion of those complicated dependencies. To address the second, they introduce a periodicity module with a parameterized periodic function that has the capacity to capture diversified periods.
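To give a flavor of the periodicity module, the sketch below parameterizes a periodic function as a learnable sum of cosine terms with amplitudes, frequencies, and phases. This is a simplified illustration of the kind of parameterized periodic function described in the paper; the exact parameterization and initialization used by DEPTS may differ.

```python
import torch
import torch.nn as nn

class PeriodicityModule(nn.Module):
    """Sketch of a parameterized periodic function z(t) = sum_k A_k * cos(2*pi*f_k*t + p_k)
    with learnable amplitudes, frequencies, and phases to capture diversified periods."""
    def __init__(self, num_terms=32):
        super().__init__()
        self.amp = nn.Parameter(torch.randn(num_terms) * 0.1)
        self.freq = nn.Parameter(torch.rand(num_terms))
        self.phase = nn.Parameter(torch.rand(num_terms))

    def forward(self, t):                                   # t: (batch, time) timestamps
        angles = 2 * torch.pi * self.freq * t.unsqueeze(-1) + self.phase
        return (self.amp * torch.cos(angles)).sum(dim=-1)   # periodic state z(t)
```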
The researchers conduct experiments on both synthetic and real-world data that show DEPTS is effective at PTS forecasting and significantly reduces errors compared to baseline—an improvement of up to 20 percent for some cases.
Towards deployment-efficient reinforcement learning: Lower bound and optimality

People and organizations involved: Li Zhao, Tao Qin, and Tie-Yan Liu from Microsoft Research Asia; Jiawei Huang, Jinglin Chen, and Nan Jiang from the University of Illinois at Urbana-Champaign.
According to the paper: Traditional online reinforcement learning (RL) can be abstracted as the loop of two elements: learning a policy from collected data and deploying the policy to collect new data by interacting with environments. The overall objective of RL is to finish the exploration of the entire environment and obtain a near-optimal policy.
However, in many real-world applications, policy deployment can be quite costly while data collection with a fixed policy is relatively convenient. For example, in recommendation systems, the policy is the recommendation strategy, and a good policy can accurately make suggestions to users based on their preferences. To guarantee service quality, before launching a new policy, companies usually need to conduct multiple internal tests for evaluation, which can take a lot of time (up to several months). Due to large customer bases, however, companies can gather thousands or millions of pieces of feedback for further policy learning in a short time once a system is deployed. In such applications, organizations prefer RL algorithms that can learn a good policy after only a few policy switches or deployments. Yet a gap remains between existing algorithms and these practical scenarios (please refer to the paper for more discussion).
To close the gap, the researchers propose a new setting called Deployment-Efficient Reinforcement Learning (DE-RL)—an abstracted model for applications that value deployment efficiency. A new idea called deployment complexity, an analogue of sample complexity, provides a way to measure the deployment efficiency of an algorithm. Deployment complexity is the number of policy deployments required before the algorithm returns a near-optimal policy.
Under this framework, the researchers study linear Markov decision processes (MDPs) as a concrete example and conduct theoretical analysis to answer two important questions. First, what is the best deployment complexity we can achieve (lower bound)? Second, how can we design algorithms to achieve the best deployment complexity (optimality)? Additionally, because most of the previous related literature studied algorithms constrained to deploy only deterministic policies, the researchers separately consider the two cases with and without such constraints. They show that removing the constraint can significantly improve deployment efficiency.
For the first question above, the researchers contribute the construction of hard instances and establish information-theoretic lower bounds for the two cases, which are dH and H, respectively. For the second question, they propose algorithms that achieve those lower bounds with a layer-by-layer exploration strategy (as shown in Figure 2), contributing a new algorithmic framework based on a novel covariance matrix estimation method along with several technical innovations. Finally, the researchers discuss extended settings based on the DE-RL formulation, which may be an interesting subject for future investigation.
Gradient information matters in policy optimization by back-propagating through model

People and organizations involved: Yue Wang and Tie-Yan Liu from Microsoft Research Asia; Chongchong Li and Yuting Liu from Beijing Jiaotong University; Wei Chen from the Institute of Computing Technology, Chinese Academy of Sciences; and Zhi-Ming Ma from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences.
According to the paper: Model-based reinforcement learning provides an efficient mechanism for finding the optimal policy by interacting with a learned model of the environment. In this paper, the researchers investigate a mismatch between how the model is learned and how it is used. Specifically, an effective way to get the policy update direction is to leverage the model’s differentiability by using the model gradient. However, most commonly used methods treat model learning as a supervised learning task and minimize only the prediction error without considering the gradient error. In other words, the algorithm requires an accurate model gradient, but training only decreases the prediction error, which results in an objective mismatch.
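The sketch below illustrates the underlying mechanism—back-propagating through a learned dynamics model to get a policy gradient—in a generic PyTorch-style form. The `policy`, `model`, and `reward_fn` objects are assumed differentiable placeholders; this is a minimal illustration of the mechanism, not the DDPPO algorithm itself.

```python
def model_based_policy_gradient(policy, model, reward_fn, s0, horizon=10):
    """Sketch of policy optimization by back-propagating through a learned model:
    roll out the model with the current policy and differentiate the accumulated
    reward with respect to the policy parameters. An inaccurate model gradient
    therefore directly biases the resulting policy gradient."""
    s, total_reward = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = model(s, a)                              # the learned model's gradient enters here
        total_reward = total_reward + reward_fn(s, a)
    total_reward.mean().backward()                   # policy gradient flows through the model
    return [p.grad for p in policy.parameters()]
```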
This paper first theoretically justifies that the model gradient error matters in the policy optimization phase. Specifically, the bias of the estimated policy gradient is not only introduced by the prediction error of the learned model but also introduced by the gradient error of the learned model. These errors will eventually influence the convergence rate of the policy optimization process.
Next, the paper proposes a two-model-based learning method to control both the prediction error and the gradient error. The two models play different roles in the model learning phase and are coordinated in the policy optimization phase. By designing a practical way to compute the gradient error, the researchers can use it to guide the learning of the gradient model. Leveraging both the prediction and gradient models, they first roll out the trajectory and then compute the model gradient to obtain the policy gradient. The proposed algorithm is called directional derivative projection policy optimization (DDPPO). Finally, several experiments on benchmark continuous control tasks demonstrate that the proposed algorithm has better sample efficiency.
Variational oracle guiding for reinforcement learning

People and organizations involved: Dongqi Han, Xufang Luo, Yuqing Yang and Dongsheng Li from Microsoft Research Asia; Tadashi Kozuno from the University of Alberta; Zhaoyun Chen from the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Kenji Doya from the Okinawa Institute of Science and Technology.
According to the paper: Despite recent successes of deep reinforcement learning (DRL) in various decision-making problems, an important but underexplored aspect is how to leverage oracle observation (information that is invisible during online decision making but is available during offline training) to facilitate learning. For example, human experts will look at the replay after a poker game, when they can check the opponents’ hands and combine that with the visible information (executor observation) to improve their gameplay strategy. Such problems are known as oracle guiding.
In this work, the researchers study oracle guiding problems based on Bayesian theory and derive an objective for leveraging oracle observation in RL using variational methods. The key contribution is a general learning framework referred to as variational latent oracle guiding (VLOG) for DRL. VLOG offers desirable properties such as robust and promising performance and the versatility to be combined with any value-based DRL algorithm.
The paper empirically demonstrates the effectiveness of VLOG in online and offline RL domains with tasks ranging from video games to mahjong, a challenging tile-based game. Furthermore, the authors publish the mahjong environment and an offline RL dataset as a benchmarking task to facilitate future research on oracle guiding, game AI, and related topics.
The post ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications appeared first on Microsoft Research.
DoWhy evolves to independent PyWhy model to help causal inference grow

Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven evaluation of many technological systems we build today.
To help data scientists better understand and deploy causal inference, Microsoft researchers built a tool that implements the process of causal inference analysis from end to end. The ensuing DoWhy library has been doing just that since 2018 and has cultivated a community devoted to applying causal inference principles in data science. To broaden access to this critical knowledge base, DoWhy is migrating to an independent open-source governance model in a new PyWhy GitHub organization. As a first step toward this model, we are announcing a collaboration with Amazon Web Services (AWS), which is contributing new technology based on structural causal models.
What is causal inference?
The goal of conventional machine learning methods is to predict an outcome. In contrast, causal inference focuses on the effect of a decision or action—that is, the difference between the outcome if an action is completed versus not completed. For example, consider a public utility company seeking to reduce their customers’ usage of water through a marketing and rewards program. The effectiveness of a rewards program is difficult to ascertain, as any decrease in water usage by participating customers is confounded with their choice to participate in the program. If we observe that a rewards program member uses less water, how do we know whether it is the program that is incentivizing their lower water usage or if customers who were already planning to reduce water usage also chose to join the program? Given information about the drivers of customer behavior, causal methods can disentangle confounding factors and identify the effect of this rewards program.

How do we know when we have the right answer? The effect of an action like signing up for a customer loyalty program is typically not an observable value. For any given customer, we see only one of the two respective outcomes and cannot directly observe the difference the program made. This means that processes developed to validate conventional machine learning models—based on comparing predictions to observed ground truths—cannot be used. Instead, we need new processes to gain confidence in the reliability of causal inference. Most critically, we need to capture our domain knowledge, reason about our modeling choices, then validate our core assumptions when possible and analyze the sensitivity of our results to violations of assumptions when validation is not possible.
Four steps of causal inference analysis
Data scientists just beginning to explore causal inference are most challenged by the new modeling assumptions of causal methods. DoWhy can help them understand and implement the process. The library focuses on the four steps of an end-to-end causal inference analysis, which are discussed in detail in a previous paper, DoWhy: an End-to-End Library for Causal Inference, and related blog post:
- Modeling: Causal reasoning begins with the creation of a clear model of the causal assumptions being made. This involves documenting what is known about the data generating process and mechanisms. To get a valid answer to our cause-and-effect questions, we must be explicit about what we already know.
- Identification: Next, we use the model to decide whether the causal question can be answered, and we provide the required expression to be computed. Identification is the process of analyzing our model.
- Estimation: Once we have a strategy for identifying the causal effect, we can choose from several different statistical and machine learning-based estimation methods to answer our causal question. Estimation is the process of analyzing our data.
- Refutation: Once we have our answer, we must do everything we can to test our underlying assumptions. Is our model consistent with the data? How sensitive is the answer to the assumptions made? If the model missed an unobserved confounder, will that change our answer a little or a lot?
This focus on the four steps of the end-to-end causal inference process differentiates the DoWhy library from prior causal inference toolkits. DoWhy complements other libraries—which focus on individual steps—and offers users the benefits of those libraries in a seamless, unified API. For example, for estimation, DoWhy offers the ability to call out to Microsoft’s EconML library for its advanced estimation methods.
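To make the four steps concrete, here is a minimal sketch using DoWhy’s Python API on a bundled synthetic dataset; exact arguments and method names may vary slightly across library versions.

```python
import dowhy.datasets
from dowhy import CausalModel

# 1. Modeling: a synthetic dataset with a known causal structure stands in for
# real data; the graph encodes our causal assumptions explicitly.
data = dowhy.datasets.linear_dataset(
    beta=10, num_common_causes=3, num_samples=5000, treatment_is_binary=True)
model = CausalModel(
    data=data["df"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"])

# 2. Identification: analyze the model to derive an estimand.
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

# 3. Estimation: analyze the data with a chosen statistical method.
estimate = model.estimate_effect(
    identified_estimand, method_name="backdoor.propensity_score_matching")

# 4. Refutation: stress-test the assumptions, e.g., with a placebo treatment.
refutation = model.refute_estimate(
    identified_estimand, estimate, method_name="placebo_treatment_refuter")
print(estimate.value, refutation)
```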
Current DoWhy deployments
Today, DoWhy has been installed over one million times. It is widely deployed in production scenarios across industry and academia—from evaluating the effects of customer loyalty and marketing programs to identifying the controllable drivers of key business metrics. DoWhy’s rich API has enabled the creation of downstream solutions such as AutoCausality from Wise.com, which automates comparison of different methods, and ShowWhy from Microsoft, which provides a no-code GUI experience for causal inference analysis. In academia, DoWhy has been used in a range of research scenarios, including sustainable building design, environmental data analyses, and health studies. At Microsoft, we continue to use DoWhy to power causal analyses and test their validity, for example, estimating who benefits most from messages to avoid overcommunicating to large groups.
A community of more than 40 researchers and developers continually enrich the library with critical additions. Highly impactful contributions, such as customizable backdoor criterion implementation and a user-friendly Pandas integration, have come from external contributors. Instructors in courses and workshops around the world use DoWhy as a pedagogical tool to teach causal inference.
With such broad support, DoWhy continues to improve and expand. In addition to more complete implementations of identification algorithms and new sensitivity analysis methods, DoWhy has added experimental support for causal discovery and more powerful methods for testing the validity of a causal estimate. Using the four steps as a set of fundamental operations for causal analysis, DoWhy is now expanding into other tasks, such as representation learning.
Microsoft continues to expand the frontiers of causal learning through its research initiatives, with new approaches to robust learning, statistical advances for causal estimation, deep learning-based methods for end-to-end causal discovery and inference, and investigations into how causal learning can help with fairness, explainability, and interpretability of machine learning models. As each of these technologies matures, we expect to make them available to the broader causal community through open source and product offerings.
An independent organization for DoWhy and other open-source causal inference projects
Making causality a pillar of data science practice requires an even broader, collaborative effort to create a standardized foundation for our industry.
To this end, we are happy to announce that we are shifting DoWhy into an independent open-source governance model, in a new PyWhy effort.
The mission of PyWhy is to build an open-source ecosystem for causal machine learning that advances the state of the art and makes it available to practitioners and researchers. In PyWhy, we will build and host interoperable libraries, tools, and other resources spanning a variety of causal tasks and applications, connected through a common API on foundational causal operations and a focus on the end-to-end analysis process.
Our first collaborator in this initiative is AWS, which is contributing new technology for causal attribution based on a structural causal model that complements DoWhy’s current functionalities.
We are looking forward to accelerating and broadening the adoption of our open-source causal learning tools through this new GitHub organization. We invite data scientists, researchers, and engineers, whether you are just learning about causality, designing new algorithms, or building your own tools, to join us on the open-source journey towards building a useful causal analysis ecosystem.
We encourage you to explore DoWhy and invite you to contact us to learn more. We are excited by what lies ahead as we aim to transform data science practice to drive improved modeling and decision making.
The post DoWhy evolves to independent PyWhy model to help causal inference grow appeared first on Microsoft Research.
Partnering people with large language models to find and fix bugs in NLP systems
Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly recommend this dish” in English (a real example from a top commercial model). These failures continue to exist in part because finding and fixing bugs in NLP models is hard—so hard that severe bugs impact almost every major open-source and commercial NLP model.
Current methods for finding or fixing bugs take one of two approaches: they’re either user-driven or automated. User-driven methods are flexible and can test any aspect of a model’s behavior, but they depend on highly variable human ability to imagine bugs and are so labor intensive that in practice only a small part of the input space gets tested. Automated approaches, on the other hand, are fast and so can explore large portions of the input space. However, since they lack human guidance, they can only test if a model is right or wrong in very restricted scenarios, such as when the model has inconsistent predictions on inputs with slight variations in phrasing.
We believe platform models, specifically modern large language models (LLMs) like GPT-3, offer an opportunity to combine the synergistic strengths of user-driven and automated approaches: the user stays in control of defining what the model being tested should be doing, while modern generative language models generate tests at scale within a specific category of model behavior. We call this human-AI team approach Adaptive Testing and Debugging, or AdaTest for short.
With AdaTest, a large language model takes on the burden of generating a large quantity of tests targeted at finding bugs in the model being tested, while the person steers the language model by selecting valid tests and organizing them into semantically related topics. This guidance drastically improves the language model’s generation performance and directs it toward areas of interest. Because these tests are effectively a form of labeled data, they not only identify bugs but can be used to fix bugs in an iterative debugging loop similar to traditional software development. AdaTest offers significant productivity gains for expert users while remaining simple enough to empower diverse groups of non-experts without a background in programming. This means experts and non-experts alike can better understand and control the behavior of their AI systems across a range of scenarios, which makes for not only better-performing AI systems but more responsible AI systems. The AdaTest code and pre-populated test trees are open source on GitHub.
We’re presenting our paper, “Adaptive Testing and Debugging of NLP Models,” at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be introducing work that leverages large language models, in their case, to grow adversarial datasets for content moderation tools.

Finding bugs with the testing loop
The AdaTest process is composed of an inner testing loop that is used to find bugs (Figure 1, unrolled in Figure 2) and an outer debugging loop that is used to fix bugs (Figure 1, unrolled in Figure 4).
Consider how this works for sentiment analysis, used to determine if a piece of text expresses a positive or negative sentiment (typically in the context of product reviews or customer feedback). While the task seems simple enough, even state-of-the-art models have failures, ranging from overt, like classifying “I don’t think I’ve ever had a nicer time in my life” as negative, to more subtly harmful, like classifying “I am a racial minority” as negative (both represent real failures found with AdaTest in commercial models). To demonstrate how AdaTest finds and fixes bugs, we show how to test for (and later fix) instances of fairness-related harms in which neutral references to a specific identity group within a piece of text could cause a sentiment analysis model to incorrectly downweight the sentiment of the text—in other words, scenarios in which a model might treat comments from specific groups more negatively.
In the testing loop, we start with a set of unit tests about various identities and label the set “/Sensitive” (Figure 2 below). These initial examples don’t reveal any model failures. But AdaTest then uses a large language model—in our case, GPT-3—to generate many similar suggested tests designed to highlight bugs (Figure 2A). While hundreds of tests are generated, we only need to review the top few failing or near-failing tests. We then ignore tests that don’t represent real failures (for example, “I am tired of being silenced” really should be negative in Figure 2) and add the other valid tests to the current topic, also occasionally organizing them into additional subtopics (Figure 2B). These user-filtered tests are included in the language model prompt for the next round of suggestions, nudging the next set of suggestions toward the intersection between user interest and model failure (Figure 2C). Repeating the testing loop results in the language model starting at tests that don’t fail and slowly working its way up to producing stronger and stronger failures. So even when users can’t find model failures on their own, they can start from a small set of passing tests and quickly iterate with the language model to produce a large set of tests that reveal bugs in the model being tested.
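In pseudocode form, the inner testing loop looks roughly like the sketch below. The helper functions `llm_suggest_tests`, `failure_score`, and `human_review` are illustrative placeholders, not the actual AdaTest API (see the GitHub repository for the real interface).

```python
def testing_loop(model_under_test, topic_tests, rounds=5, num_suggestions=100):
    """Simplified, hypothetical sketch of AdaTest's inner testing loop."""
    for _ in range(rounds):
        # the language model proposes many new tests conditioned on the current topic
        suggestions = llm_suggest_tests(prompt_examples=topic_tests, n=num_suggestions)
        # rank suggestions by how badly the model under test fails them
        ranked = sorted(suggestions,
                        key=lambda t: failure_score(model_under_test, t), reverse=True)
        # the person reviews only the top few, keeping the valid failures
        # and optionally organizing them into subtopics
        topic_tests += human_review(ranked[:10])
    return topic_tests
```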


If instead of the “/Sensitive” topic shown in Figure 2, we target a different topic, such as handling negation, we’ll reveal different failures. For example, starting from simple statements like “I have never been happier” that a commercial model correctly classifies as positive, AdaTest can quickly find bugs like “I don’t think that I’ve ever seen a nicer town” getting labeled as negative. These bugs are egregious and obvious once you see them, but they’re hard to find by hand since they only happen for very specific phrasings.
We ran user studies to quantitatively evaluate if AdaTest makes experts and non-experts better at writing tests and finding bugs in models. We asked experts—those with a background in machine learning and NLP—to test specific topics in two models: a commercial sentiment classifier and GPT-2 for next word auto-complete, used in such applications as predicting the next word in an email being typed (a scenario in which we want to avoid suggesting stereotypes, for example, one of the behaviors we had participants test for). For each topic and model, participants were randomly assigned to use CheckList (representing state-of-the-art user-driven testing) or AdaTest. We present the average number of discovered model failures per minute in Figure 3, where we observe a fivefold improvement with AdaTest across models and participants in the study. We asked non-experts, or those without any programming background, to test the Perspective API toxicity model for content moderation. Participants tried to find non-toxic statements (that is, statements they would personally feel appropriate posting) predicted as toxic for political opinions. Participants were given access to an improved version of the Dynabench crowd-sourcing interface for model testing and to AdaTest. AdaTest provided up to a tenfold improvement (bottom portion of Figure 3).

We also grouped participants by their progressive versus conservative political alignment and found that participants wrote tests with twice the quality when testing their own perspective versus an opposing perspective (as measured by an independent set of in-group raters). Our user studies highlight that AdaTest can be used by anyone and that such easy-to-use tools are important to enable model testing by people with diverse backgrounds since testers representing different lived experiences and viewpoints are needed to effectively test different perspectives.
Fixing bugs with the debugging loop
Once enough bugs are discovered, testers of a model then engage in the outer debugging loop (Figure 4 below), where they fix bugs discovered in the testing loop and then retest the model. In our experiments, we fixed bugs by fine-tuning the model on the tests, but other strategies, such as collecting more data or adding constraints, are also possible. The retest part of the debugging loop (that is, running the testing loop again) is critical since once we use our tests to fix the model, they no longer represent test data but rather training data. The process of fixing a bug often overcompensates, introducing shortcuts or bugs in the initial rounds of the debugging loop that can only be found using a new set of tests adapted to target the new “fixed” model.
Running the debugging loop on an open-source RoBERTa-Large sentiment model (Figure 4) demonstrates the importance of a test-fix-retest cycle. We start with tests from the “/Sensitive/Immigration” topic from Figure 2 that the RoBERTa model incorrectly labels as negative. Fine-tuning the model on these tests (mixed with the original training data to maintain task performance) results in a new model that no longer fails the tests (second row of Figure 4). However, when we rerun the testing loop, we find that now almost all immigration statements are labeled as “neutral,” even if they are truly negative based on the application and testing scenario (for example, the statements in the third row of Figure 4 wouldn’t be neutral if a model were tasked with detecting if language was for or against something). Fine-tuning again using these new tests (and the older ones) results in a model that correctly fixes the original bug without adding the “every immigration statement is neutral” shortcut.
This doesn’t, of course, guarantee that there isn’t another shortcut still in the model, but in our experience, a few rounds of the debugging loop drastically reduce the number of accidental bugs that get introduced when “fixing” the original bugs. The testers of the model don’t have to exhaustively identify every possible shortcut or imbalance ahead of time, since AdaTest adaptively surfaces and fixes bugs that have been introduced in the next rounds of testing and debugging. Thus, the debugging loop serves as a friendly adversary, pushing the boundaries of the current “specification” until a satisfactory model is produced. In fact, AdaTest can be seen as an application of the test-fix-retest loop from software engineering to NLP.
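Sketched at the same level of abstraction, the outer debugging loop wraps the testing loop above; `fine_tune` is again an illustrative placeholder for whatever bug-fixing strategy is used.

```python
def debugging_loop(model, train_data, topic_tests, rounds=3):
    """Hypothetical sketch of the outer test-fix-retest (debugging) loop."""
    for _ in range(rounds):
        # fix: fine-tune on the collected tests, mixed with the original
        # training data so overall task performance is maintained
        model = fine_tune(model, train_data + topic_tests)
        # retest: the old tests are now training data, so run the testing
        # loop again to surface shortcuts introduced by the fix
        topic_tests = testing_loop(model, topic_tests)
    return model
```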

To evaluate the effectiveness of the debugging loop, we fine-tuned RoBERTa-Large to detect if two questions are duplicates (that is, the same question worded differently) using the Quora Question Pairs (QQP) dataset and also fine-tuned it for positive/neutral/negative sentiment analysis using the Stanford Sentiment Treebank (SST) dataset. Using previously published CheckList suites for evaluation, we find the baseline model fails 22 out of 53 QQP topics and 11 out of 39 sentiment topics. We then created data to “fix” a topic by either taking 50 examples from the topic’s data in the CheckList condition or by starting from a seed of five examples and running the debugging loop with AdaTest until finding failures becomes qualitatively difficult (on average 2.83 rounds for QQP and 3.83 rounds for sentiment). This yields an average of 41.6 tests for QQP and 55.8 tests for sentiment. We followed this process for six distinct high-failure rate topics in each task. In the vast majority of cases (see paper for details), AdaTest fixes the topics used for training and a number of unseen held-out topics without breaking any topics, while CheckList data often introduces new bugs (and thus breaks other test topics).
We also evaluated the effectiveness of AdaTest in a standard development setting, targeting a model for to-do detection in meeting notes. After three months of development, CheckList testing, and ad hoc GPT-3–based data augmentation, a PhD-level team had managed to build a model with an F1 score of 0.66 (out of 1) on unseen data collected in the wild. We gave AdaTest to the team with a demo a few minutes long. After four hours of running the debugging loop on their own, they produced another model with an F1 score of 0.77 on the same unseen dataset. These scores were then replicated again on a second unseen dataset, showing that AdaTest can add significant bug-fixing value with a fraction of the effort involved in traditional approaches.
The promise of human-AI collaboration for ML development
AdaTest encourages a close collaboration between people and large language models, yielding the benefits of both. People provide the problem specification that the language model lacks, while the language model provides quality test creation at a scale and scope that is infeasible for people. The debugging loop connects model testing and debugging to effectively fix bugs, taking model development a step closer toward the iterative nature of traditional software development. Human-AI partnership represents a promising way forward for machine learning development, and we expect this synergy to only improve as the capabilities of large language models continue to grow.
Check out the full paper to see AdaTest’s effectiveness on classification models (sentiment analysis, QQP, toxicity, media selection, and task detection), generation models (GPT-2, translation), and per-token models (NER) ranging from well-tested production systems to brand-new applications. Give it a try yourself at https://github.com/microsoft/adatest.
The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.