Introducing the Microsoft Climate Research Initiative

Addressing and mitigating the effects of climate change requires a collective effort, bringing our strengths to bear across industry, government, academia, and civil society. As we continue to explore the role of technology to advance the art of the possible, we are launching the Microsoft Climate Research Initiative (MCRI). This community of multi-disciplinary researchers is working together to accelerate cutting-edge research and transformative innovation in climate science and technology.

MCRI enables us to bring Microsoft’s research skills and compute capacities to deep and continuous collaboration with domain experts. For the kickoff of this initiative, we are focusing on three critical areas in climate research where computational advances can drive key scientific transformations: overcoming constraints to decarbonization, reducing uncertainties in carbon accounting, and assessing climate risks in more detail.

Through these collaborative research projects, we hope to develop and sustain a highly engaged research ecosystem comprising a diversity of perspectives. Researchers will offer transdisciplinary and diverse expertise, particularly in areas beyond traditional computer science, such as environmental science, chemistry, and a variety of engineering disciplines. All results of this initiative are expected to be made public and freely available to spark even broader research and progress on these important climate issues.  

“As researchers, we’re excited to work together on projects specifically selected for their potential impact on global climate challenges. With Microsoft’s computational capabilities and the domain expertise from our collaborators, our complementary strengths can accelerate progress in incredible ways.”  

– Karin Strauss, Microsoft

Microsoft researchers will be working with collaborators globally to co-investigate priority climate-related topics and bring innovative, world-class research to influential journals and venues.

Phase one collaborations

Carbon accounting  

Real-time Monitoring of Carbon Control Progress from CO2 and Air Pollutant Observations with a Physically informed Transformer-based Neural Network 

Jia Xing, Tsinghua University; Siwei Li, Wuhan University; Shuxin Zheng, Chang Liu, Shun Zheng, and Wei Cao, Microsoft 

Understanding the change in CO2 emissions from measurements of CO2 concentrations, such as those made by satellites, is very useful for tracking the real-time progress of carbon reduction actions. However, current CO2 observations are relatively limited, and numerical model-based methods are computationally inefficient. The proposed study aims to develop a novel method that combines atmospheric numerical modeling and machine learning to infer CO2 emissions from satellite observations and ground monitor sensor data.

AI based Near-real-time Global Carbon Budget (ANGCB) 

Zhu Liu, Tsinghua University; Biqing Zhu and Philippe Ciais, LSCE; Steven J. Davis, UC Irvine; Wei Cao and Jiang Bian, Microsoft

Mitigation of climate change will depend upon a carbon emission trajectory that successfully achieves carbon neutrality by 2050. To that end, a global carbon budget assessment is essential. The AI-based, near-real-time Global Carbon Budget (ANGCB) project aims to provide the world’s first global carbon budget assessment based on Artificial Intelligence (AI) and other data science technologies.

Carbon reduction and removal  

Computational Discovery of Novel Metal–Organic Frameworks for Carbon Capture 

Jeffrey Long, UC Berkeley; Xiang Fu, Jake Smith, Bichlien Nguyen, Karin Strauss, Tian Xie, Daniel Zuegner, and Chi Chen, Microsoft

Removing CO2 from the environment is expected to be an integral component of keeping temperature rise below 1.5°C. However, today this is an inefficient and expensive undertaking. This project will apply generative machine learning to the design of new metal–organic frameworks (MOFs) to optimize for low-cost removal of CO2 from air and other dilute gas streams. 

An Assessment of Liquid Metal Catalyzed CO2 Reduction 

Michael D. Dickey, North Carolina State; Kourosh Kalantar-Zadeh, University of New South Wales; Kali Frost, Bichlien Nguyen, Karin Strauss, and Jake Smith, Microsoft

The CO2 reduction process can be used to convert captured carbon into a storable form as well as to manufacture sustainable fuels and materials with lower environmental impacts. This project will evaluate liquid metal-based reduction processes, identifying advantages, pinch-points, and opportunities for improvement needed to reach industrially relevant scales. It will lay the foundation for improving catalysts and addressing scaling bottlenecks.

Computational Design and Characterization of Organic Electrolytes for Flow Battery and Carbon Capture Applications 

David Kwabi, Anne McNeil, and Bryan Goldsmith, University of Michigan; Bichlien Nguyen, Karin Strauss, Jake Smith, Ziheng Lu, Yingce Xia, and Kali Frost, Microsoft

Energy storage is essential to enable 100% zero-carbon electricity generation. This work will use generative machine learning models and quantum mechanical modeling to drive the discovery and optimization of a new class of organic molecules for energy-efficient electrochemical energy storage and carbon capture.  

Property Prediction of Recyclable Polymers 

Aniruddh Vashisth, University of Washington; Bichlien Nguyen, Karin Strauss, Jake Smith, Kali Frost, Shuxin Zheng, and Ziheng Lu, Microsoft

Despite encouraging progress in recycling, many plastic polymers end up as one-time-use materials. The plastics that compose printed circuit boards (PCBs), ubiquitous in modern devices, are among those most difficult to recycle. Vitrimers, a new class of polymers that can be recycled multiple times without significant changes in material properties, present a promising alternative. This project will leverage advances in machine learning to select vitrimer formulations that withstand the requirements imposed by their use in PCBs.

Accelerated Green Cement Materials Discovery 

Eleftheria Roumeli, University of Washington; Kristen Severson, Yuan-Jyue Chen, Bichlien Nguyen, and Jake Smith, Microsoft

The concrete industry is a major contributor to greenhouse gas emissions, the majority of which can be attributed to cement. The discovery of alternative cements is a promising avenue for decreasing the environmental impacts of the industry. This project will employ machine learning methods to accelerate mechanical property optimization of “green” cements that meet application quality constraints while minimizing carbon footprint.

Environmental resilience

Causal Inference to Understand the Impact of Humanitarian Interventions on Food Security in Africa 

Gustau Camps-Valls, Universitat de Valencia; Ted Shepherd, University of Reading; Alberto Arribas Herranz, Emre Kiciman, and Lester Mackey, Microsoft

The Causal4Africa project will investigate the problem of food security in Africa from a novel causal inference standpoint. The project will illustrate the usefulness of causal discovery and estimation of effects from observational data by intervention analysis. Ambitiously, it will improve the usefulness of causal ML approaches for climate risk assessment by enabling the interpretation and evaluation of the likelihood and potential consequences of specific interventions.  

Improving Subseasonal Forecasting with Machine Learning 

Judah Cohen, Verisk; Dara Entekhabi and Sonja Totz, MIT; Lester Mackey, Alberto Arribas Herranz, and Bora Ozaltun, Microsoft

Water and fire managers rely on subseasonal forecasts two to six weeks in advance to allocate water, manage wildfires, and prepare for droughts and other weather extremes. However, skillful forecasts for the subseasonal regime are lacking due to a complex dependence on local weather, global climate variables, and the chaotic nature of weather. To address this need, this project will use machine learning to adaptively correct the biases in traditional physics-based forecasts and adaptively combine the forecasts of disparate models. 


The post Introducing the Microsoft Climate Research Initiative appeared first on Microsoft Research.

GODEL: Combining goal-oriented dialog with real-world conversations

Diagram showing GODEL’s architecture. The environment of the dialog system consists of both structured and unstructured content, which it uses to retrieve information. This source content, which we term “grounding,” is updated and repeatedly used by GODEL to produce a new response after each user input.

They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation further? In addition to providing on-topic expertise, such as recommending a restaurant, it could engage in a conversation about the history of the neighborhood or a recent sports game, and then bring the conversation back on track. What if the agent’s responses continually reflect the latest world events? And what if it could do all of this without the need for any additional work by the designer?   

With GODEL, this may not be far off. GODEL stands for Grounded Open Dialogue Language Model, and it ushers in a new class of pretrained language models that enable both task-oriented and social conversation and are evaluated by the usefulness of their responses.  

Pretrained language models are among the engines that power conversational AI, the technology that underlies these dialog agents. They can either be task-oriented (“give me a job, and I’ll do it”) or engage in a conversation without a specified outcome, known as open-domain or chit-chat. GODEL combines both these capabilities, giving dialog agents the ability to generate responses based not just on the context of the conversation, but also on external information, content that was not part of the dataset when the model was trained. This includes both structured content, such as information stored in databases, and unstructured content, such as restaurant reviews, Wikipedia articles, and other publicly available material found on the web. This explains how a simple task-based query about restaurant recommendations can evolve into a dialog about ingredients, food, and even cooking techniques—the kind of winding path that real-world conversations take.  

In 2019, the Deep Learning and Natural Language Processing groups at Microsoft Research released DialoGPT, the first large-scale pretrained language model designed specifically for dialog. This helped make conversational AI more accessible and easier to work with, and it enabled the research community to make considerable progress in this area. With GODEL, our goal is to help further this progress by empowering researchers and developers to create dialog agents that are unrestricted in the types of queries they can respond to and the sources of information they can draw from. We also worked to ensure those responses are useful to the person making the query.    

In our paper, “GODEL: Large-Scale Pre-training for Goal-Directed Dialog,” we describe the technical details underlying GODEL, and we have made the code available on GitHub.

A grounded model

One of GODEL’s key features is the flexibility it provides users in defining their model’s grounding: the sources from which their dialog agents retrieve information. This flexibility underpins GODEL’s versatility in diverse conversational settings. If someone were to inquire about a local restaurant, for example, GODEL would be able to provide specific and accurate responses even though that venue may not have been included in the data used to train it. Responses would vary depending on whether the grounding information is empty, a snippet of a document, a search result (unstructured text), or information drawn from a database about the restaurant (structured text). However, each response would be appropriate and useful.

In addition to specificity, grounded generation helps keep models up to date, as the grounded text can incorporate information that may not have been available at the time the model was trained. For example, if a model were developed before the 2022 Winter Olympics, GODEL would be able to provide details on those games and a list of winners even though all the data available to train it predates that event.
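As a rough illustration of grounded generation, the sketch below assembles a single text input from an instruction, the dialog history, and optional grounding text. The function name `build_grounded_input` and the `[CONTEXT]`, `[KNOWLEDGE]`, and `EOS` markers are assumptions chosen for illustration, and the review text is made up; the released code defines the actual input format.

```python
def build_grounded_input(instruction, dialog_turns, grounding=""):
    """Concatenate instruction, dialog history, and optional grounding text.

    The marker strings are illustrative assumptions, not a confirmed format.
    """
    history = " EOS ".join(dialog_turns)
    parts = [instruction, "[CONTEXT]", history]
    if grounding:
        # Grounding can be a document snippet, a search result, or a
        # serialized database record; leaving it empty is also valid.
        parts += ["[KNOWLEDGE]", grounding]
    return " ".join(parts)


turns = ["Is Lucky Star good for groups?"]
# Made-up review snippet, standing in for text retrieved at query time.
review = "Lucky Star seats large parties and the food arrives quickly."
print(build_grounded_input("Respond using the knowledge given.", turns, review))
print(build_grounded_input("Respond to the user.", turns))
```

Because the grounding is supplied at query time rather than learned during training, swapping in a newer snippet changes what the model can talk about without retraining it.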

Broad application of GODEL

Another main feature of GODEL is its wide range of dialog applications. While its predecessor, DialoGPT, and other prior pretrained models for dialog have mostly focused on social bots, GODEL can be applied to a variety of dialogs, including those that are task-oriented, question-answering, and grounded chit-chat. In the same conversation, GODEL can produce reasonable responses for a variety of query types, including general questions or requests for specific actions.  

In addition, GODEL’s responses have been evaluated for their helpfulness. In our paper, we show that evaluation is done more reliably on datasets that are goal-directed, and that people generally agree on which responses are better when asked to judge their utility towards achieving certain goals. Equipped with this robust evaluation setup, we compared our model against several strong baselines and state-of-the-art approaches and show that GODEL is superior in terms of both human and automatic evaluation, as indicated in Figure 1. The paper describes extensive experiments against other state-of-the-art pretrained language models and demonstrates that performance gains are even larger in these cases. 

Two bar graphs showing that GODEL outperforms the baseline, in terms of both human and automated dialog evaluation. For human evaluation, GODEL received much higher human ratings (47, 41, and 27), while the human ratings for the best baseline were low (30, 22, and 17). For automatic evaluation, differences are smaller yet still statistically significant.
Figure 1: These charts illustrate GODEL’s performance against T5, a pretrained model that performed best in our evaluation. They compare the aggregate performance of models fine-tuned from GODEL against that of models fine-tuned from T5. They show that GODEL performs much better in human evaluations and makes appreciable gains in the automatic evaluation. The test set for these experiments combines a variety of dialog genres, including task-oriented dialog, conversational question-answering, and grounded chit-chat.

The following examples illustrate different dialog scenarios where GODEL uses a variety of sources to respond to identical user queries. 

  • This example illustrates how GODEL responds in an open-ended scenario in which the user asks a question that is completely unrelated to the initial question. Despite the lack of relevance, GODEL responds appropriately while trying to bring the conversation back on track. 

    Figure showing how GODEL responds to a user who just changed the topic, demonstrating that it can bring the conversation back on track. While the initial query is about a restaurant, the user suddenly mentions a series of tornadoes that have recently affected the area. GODEL uses grounding from a recent news article to provide information about the tornadoes, as requested by the user. Finally, it asks the user if there is anything else it can help with.

  • This example illustrates how GODEL responds in a task-oriented setting in which the model is connected to the components of a traditional goal-oriented dialog system, such as a database. In this case, the relevant environment contains structured information, a database returning two restaurants relevant to the current conversation.  

    Figure showing how GODEL responds appropriately to a user's request for a restaurant reservation. The user expresses a preference for a restaurant named Lucky Star, and GODEL extracts information from a database about that restaurant and retrieves relevant information, such as a reference number, to generate a response that flows naturally with the rest of the conversation.

  • This example illustrates how GODEL responds in a task-oriented setting in which traditional components of task-oriented dialog systems are not available. In this case, GODEL retrieves a restaurant review via a search engine. The response reflects both the context of the conversation and a snippet of the retrieved text, a restaurant review.  

    Figure showing how GODEL responds appropriately to a user's request for information about a specific restaurant. The user asks whether a given restaurant is good for groups, and GODEL uses text originating from restaurant reviews to infer that the restaurant is indeed good for groups. Also, GODEL provides additional information to address a concern with larger groups—that food is typically served quickly.

  • This example illustrates how GODEL responds in a question-answering scenario, where the user asks a general question and the context provides the dialog agent with the words it needs to search for the relevant information on the web. 

    Figure showing how GODEL responds appropriately when asked to give an example of a popular Chinese dish. GODEL uses grounding originating from search results to respond to the question while focusing on the most relevant information of the retrieved document.

GODEL available as open source

To advance research, we believe it is crucial to make code and models publicly available, and we have released GODEL as fully open source. We have made three versions of GODEL available: base, large, and extra-large. We are also including the code needed to retrain all pretrained models and to fine-tune models for specific tasks using the following datasets: CoQA, intended for conversational question-answering; Wizard of Wikipedia and Wizard of the Internet, aimed at information-seeking chats; and MultiWOZ, for task-completion dialogs.

We hope GODEL helps numerous academic research teams advance the field of conversational AI with innovative dialog models while eliminating the need for significant GPU resources. We plan to continuously improve GODEL and make more models available to the research community. Please visit our project page to learn more about the GODEL project and new releases.

Acknowledgements

We would like to thank our colleagues at Microsoft Research who contributed to this work and blog post: Bill Dolan, Pengcheng He, Elnaz Nouri, and Clarisse Simoes Ribeiro.

The post GODEL: Combining goal-oriented dialog with real-world conversations appeared first on Microsoft Research.

Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability

On the left, a diagram with three layers, each of which contains a half-transparent image processed from the same image. The processed images are partitioned into several grids, and each grid contains 4 x 4 image patches. From bottom to top, the number of grids in each layer is 4 x 4, 2 x 2, and 1 x 1, respectively. The layers are labeled “4x,” “8x,” and “16x,” respectively, from bottom to top. An arrow joining the three layers points upward to the words “Segmentation” and “Detection” and an ellipsis. Another arrow points from the top layer to the word “classification.” On the right, a bar chart with a blue bar labeled “Swin V1” and an orange bar labeled “Swin V2.” The orange bar is much taller and labeled “3 billion (1,536 x 1,536 resolution)”; the blue bar is labeled “197 million.” An arrow labeled “15x” points upward from the blue bar, indicating the orange bar is 15 times higher than the blue one.
Swin Transformer, a Transformer-based general-purpose vision architecture, was further evolved to address challenges specific to large vision models. As a result, Swin Transformer is capable of training with images at higher resolutions, which allows for greater task applicability (left), and scaling models up to 3 billion parameters (right).

Early last year, our research team from the Visual Computing Group introduced Swin Transformer, a Transformer-based general-purpose computer vision architecture that for the first time beat convolutional neural networks on the important vision benchmark of COCO object detection and did so by a large margin. Convolutional neural networks (CNNs) have long been the architecture of choice for classifying images and detecting objects within them, among other key computer vision tasks. Swin Transformer offers an alternative. Leveraging the Transformer architecture’s adaptive computing capability, Swin can achieve higher accuracy. More importantly, Swin Transformer provides an opportunity to unify the architectures in computer vision and natural language processing (NLP), where the Transformer has been the dominant architecture for years and has benefited the field because of its ability to be scaled up.

So far, Swin Transformer has shown early signs of its potential as a strong backbone architecture for a variety of computer vision problems, powering the top entries of many important vision benchmarks such as COCO object detection, ADE20K semantic segmentation, and CelebA-HQ image generation. It has also been well-received by the computer vision research community, garnering the Marr Prize for best paper at the 2021 International Conference on Computer Vision (ICCV). Together with works such as CSWin, Focal Transformer, and CvT, also from teams within Microsoft, Swin is helping to demonstrate the Transformer architecture as a viable option for many vision challenges. However, we believe there’s much work ahead, and we’re on an adventurous journey to explore the full potential of Swin Transformer.

In the past few years, one of the most important discoveries in the field of NLP has been that scaling up model capacity can continually push the state of the art for various NLP tasks, and the larger the model, the better its ability to adapt to new tasks with very little or no training data. Can the same be achieved in computer vision, and if so, how?

In pursuit of answers, we scaled up Swin Transformer to 3 billion parameters, the largest and most effective dense vision model to date. Vision models with up to 1.8 billion parameters have been trained successfully before. However, these models require billions of labeled images to be trained well and are applicable only to image classification. With our model, SwinV2-G, we address training instability, a common obstacle when increasing model size in computer vision, to support more parameters. Thanks to a technique we developed to address the resolution gap that exists between pretraining and fine-tuning tasks, SwinV2-G also marks the first time that a billion-scale vision model has been applied to a broader set of vision tasks. Additionally, leveraging a self-supervised pretraining approach we call SimMIM, SwinV2-G uses 40 times less labeled data and 40 times less training time than previous works to drive the learning of billion-scale vision models.

SwinV2-G achieved state-of-the-art accuracy on four representative benchmarks when it was released in November: ImageNetV2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification.

Our experience and lessons learned in exploring the training and application of large vision models are described in two papers—“Swin Transformer V2: Scaling Up Capacity and Resolution” and “SimMIM: A Simple Framework for Masked Image Modeling”—both of which are being presented at the 2022 Computer Vision and Pattern Recognition Conference (CVPR). The code for Swin Transformer and the code for SimMIM are both available on GitHub. (For the purposes of this blog and our paper, the upgraded Swin Transformer architecture resulting from this work is referred to as V2.)

Improving training stability

The first issue we faced when training large models was the problem of training instability. We observed that as models get larger, it becomes very easy for them to crash. After checking the feature values of each layer of the models we trained in scaling up Swin Transformer to 3 billion parameters, we found the cause of the instability: large feature variance discrepancy between different layers.

As shown in Figure 1, the average feature variance in the deep layers of the original Swin Transformer model increases significantly as the model grows larger. With a 200-million-parameter Swin-L model, the discrepancy between layers with the highest and lowest average feature variance has reached an extreme value of 10^4. Crashing occurs during training when the model capacity is further scaled to 658 million parameters (Swin-H).

Figure 1: The average feature variance of Swin V1 models (solid lines) and Swin V2 models (dashed lines) per layer (x-axis). The discrepancy between layers with the highest and lowest average feature variance is very large for Swin V1 models while much milder for Swin V2 models.

Looking closely at the architecture of the original Swin Transformer, we found that this was due to the output of the residual branch being added directly back to the main branch without normalization. In other words, the unconstrained output feature values could be very large compared to the input. As illustrated in Figure 2 (left), after one Transformer block, the feature values of the output can grow to 61 times those of the input. To alleviate this problem, we propose a new normalization method called residual-post-normalization. As shown in Figure 2 (right), this method moves the normalization layer from the beginning to the end of each residual branch so that the output of each residual branch is normalized before being merged back into the main branch. In this way, the average feature variance of the main branch doesn’t increase significantly as the layers deepen. Experiments have shown that this new normalization method moderates the average feature variance of each layer in the model (see the dashed lines in Figure 1; the SwinV2 models have the same respective number of parameters as the SwinV1 models: 200 million [L] and 658 million [H]).

The left is a block diagram labeled “Swin V1 (pre-norm + linear attention)” with input and output labeled “x superscript (ell minus 1)” and “x superscript (ell),” respectively. There are four boxes between the input and output; they’re labeled, from bottom to top, “Layer Norm,” “Linear Attention,” “Layer Norm,” and “MLP.” Upward arrows connect the boxes. On top of every two boxes is a plus symbol with two inputs: one the output arrow of the preceding box and the other an arrow from under the preceding two boxes, indicating a skip connection. There are five numbers listed vertically, from bottom to top: 1, 10, 11, 50, and 61. An arrow points right from the block diagram to a second block diagram, labeled “Swin V2 (res-post-norm + cosine attention).” The block diagram is similar to the left one with the following differences: the labels of each box are different; from bottom to top, they’re “Cosine Attention,” “Layer Norm,” “MLP,” and “Layer Norm.” The “Cosine Attention” and “Layer Norm” boxes are in red. There are five different numbers listed vertically (from bottom to top): 1, 1, 2, 1, and 3.
Figure 2: To accommodate larger vision models, the normalization configuration of the original Swin Transformer was adjusted. The original architecture (left) uses the pre-norm configuration, in which normalization occurs at the beginning of each residual branch. This configuration results in an increase in the feature values (from 1 to 61), leading to crashing during training. In Swin V2 (right), two changes were made: firstly, normalization is moved to the end of the residual branch in a new method called residual-post-normalization, which sees a much milder increase in value (from 1 to 3). Secondly, the linear dot-product function in the attention unit is replaced with a cosine function.
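The difference between the two normalization placements can be seen in a toy sketch. It is not the Swin implementation: the sublayer here is a plain scaling by an assumed `gain` (a stand-in for an attention or MLP branch whose output is large), and all function names are hypothetical. With pre-norm, each block adds an unnormalized branch output to the main branch; with residual-post-normalization, the branch output is normalized first, so each block adds roughly unit-scale values.

```python
import math


def std(v):
    """Population standard deviation of a list of floats."""
    mean = sum(v) / len(v)
    return math.sqrt(sum((a - mean) ** 2 for a in v) / len(v))


def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def sublayer(v, gain):
    """Toy stand-in for attention/MLP: scales its input by `gain`."""
    return [gain * a for a in v]


def pre_norm_block(x, gain):
    # Normalize first; the branch output is added back unnormalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x), gain))]


def res_post_norm_block(x, gain):
    # Normalize the branch output before merging it into the main branch.
    return [a + b for a, b in zip(x, layer_norm(sublayer(x, gain)))]


def run(block, x, depth, gain):
    """Apply `block` `depth` times and return the final main-branch values."""
    for _ in range(depth):
        x = block(x, gain)
    return x


x0 = [1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.0, -3.0]
print("pre-norm std:     ", std(run(pre_norm_block, x0, 4, 5.0)))
print("res-post-norm std:", std(run(res_post_norm_block, x0, 4, 5.0)))
```

In this toy setting the main-branch standard deviation grows by `gain` per block under pre-norm but by only about 1 per block under residual-post-normalization, regardless of how large the branch output is.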

We also found that as the model becomes larger, the attention weights of certain layers tend to be dominated by a few specific points in the original self-attention computation, especially when residual-post-normalization is used. To tackle this problem, our team further proposes the scaled cosine attention mechanism (see Figure 2, right) to replace the common dot-product linear attention unit. In the new scaled cosine attention unit, the computation of self-attention is independent of the input magnitude, resulting in less saturated attention weights.

Experiments have shown that residual-post-normalization and the scaled cosine attention mechanism not only stabilize the training dynamics of large models but also improve accuracy.
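A small sketch can show why cosine similarity yields less saturated attention weights. It compares softmax weights from a scaled dot product with weights from cosine similarity divided by a temperature `tau`. The fixed `tau=0.1`, the function names, and the omission of the relative position bias are simplifying assumptions for illustration; in the actual model the scaling is described as part of a learned attention unit.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_product_weights(q, keys):
    """Attention weights from the usual scaled dot product."""
    scale = math.sqrt(len(q))
    return softmax([dot(q, k) / scale for k in keys])


def cosine_weights(q, keys, tau=0.1):
    """Attention weights from cosine similarity divided by a temperature."""
    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    return softmax([cos(q, k) / tau for k in keys])


q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
big_q = [100.0 * v for v in q]  # same direction, much larger magnitude

print("dot, small q:   ", dot_product_weights(q, keys))
print("dot, large q:   ", dot_product_weights(big_q, keys))
print("cosine, small q:", cosine_weights(q, keys))
print("cosine, large q:", cosine_weights(big_q, keys))
```

Scaling the query leaves the cosine-based weights unchanged, while the dot-product weights collapse onto a single key as the magnitude grows.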

Addressing large resolution gaps between pretraining and fine-tuning tasks

Another difficulty with large vision models is that the image resolution discrepancy between pretraining and fine-tuning can be large: pretraining is typically carried out at low resolutions, while many downstream vision tasks require high-resolution input images or attention windows, as shown in Figure 3.

In Swin Transformer, there’s a relative position bias term in the attention unit that represents the impact of one image patch on another based on the relative position between them. This term is learned in pretraining. However, since the relative position range at fine-tuning changes significantly compared to that in pretraining, we need techniques to initialize the biases at new relative positions not seen in pretraining. While the original Swin Transformer architecture uses a handcrafted bicubic interpolation method to transfer the old relative position biases to the new resolution, we find it’s not that effective when the resolution discrepancy between pretraining and fine-tuning is very large.

To resolve this problem, we propose a log-spaced continuous position bias approach (Log-spaced CPB). By applying a small meta-network to the relative position coordinates in log space, Log-spaced CPB can generate position bias for any coordinate range. Since the meta-network can take arbitrary coordinates as input, a pretrained model can freely transfer between different window sizes by sharing the weights of the meta-network. Moreover, by converting the coordinates to log space, the extrapolation ratio required to transfer between different window resolutions is much smaller than when using the original linear-space coordinates.
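The effect of the log-space conversion on extrapolation can be checked with a little arithmetic. The sketch below assumes the transform sign(d) · log(1 + |d|) for a relative offset d and compares how far beyond the pretraining coordinate range a model must extrapolate when the attention window grows from 8 × 8 to 16 × 16. The window sizes and function names are illustrative assumptions, and the meta-network that maps coordinates to bias values is not shown.

```python
import math


def log_spaced(delta):
    """Map a signed relative offset to log space: sign(d) * log(1 + |d|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


def extrapolation_ratio(old_window, new_window, transform=lambda d: d):
    """Fraction of the old coordinate range that must be extrapolated.

    The largest relative offset along one axis of a window of size w is w - 1.
    """
    old_max = transform(old_window - 1)
    new_max = transform(new_window - 1)
    return (new_max - old_max) / old_max


print("linear coordinates:    ", extrapolation_ratio(8, 16))
print("log-spaced coordinates:", extrapolation_ratio(8, 16, log_spaced))
```

In linear coordinates the range grows from 7 to 15, so the model must extrapolate by about 1.14 times the range it saw in pretraining; in log-spaced coordinates the same change requires only about 0.33 times.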

On the left, a tightly cropped image of a dog with an indicated resolution of 224 x 224 labeled
Figure 3: In computer vision, many downstream tasks, such as object detection (right), require high-resolution input, but pretraining tasks, such as image classification (left), are generally done at low resolutions, creating another challenge in training and applying large-scale vision models.

Using Log-spaced CPB, Swin Transformer V2 transfers smoothly between different resolutions, enabling us to pretrain at a smaller image resolution of 192 × 192 with no accuracy loss on downstream tasks compared with the standard 224 × 224 resolution. This speeds up training by 50 percent.

Scaling model capacity and resolution results in excessive GPU memory consumption for existing vision models. To address the memory issue, we combined several crucial techniques, including Zero-Redundancy Optimizer (ZeRO), activation checkpointing, and a new sequential self-attention implementation. With these techniques, GPU memory consumption is significantly reduced for large-scale models and large resolutions with little impact on training speed. The GPU memory savings also allow us to train the 3-billion-parameter SwinV2-G model on images with resolutions of up to 1,536 × 1,536 using the 40-gigabyte A100 GPU, making it applicable to a range of vision tasks requiring high resolution, such as object detection.

Tackling the data starvation problem for large vision models

Training larger models often requires more labeled data; however, the computer vision field lacks such labeled data at scale because of the high cost of human-annotated data. This has compelled the vision field to explore the training of large models with smaller amounts of labeled data. To this aim, we introduce the self-supervised pretraining approach SimMIM, short for Simple Framework for Masked Image Modeling.

As shown in Figure 4, SimMIM learns image representations through masked image modeling, a pretext task in which a portion of an input image is masked and the model then predicts the RGB values of the masked area from the visible parts. This approach better exploits the rich information contained in each image, which allows us to use less data to drive the training of large models. With SimMIM, we were able to train the SwinV2-G model by using only 70 million labeled images, which is 40 times fewer than were used by previous billion-scale vision models.
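The pretext task can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the SimMIM implementation: the patch size, the mask ratio, the function names, and the constant "prediction" that stands in for the encoder and prediction head are all assumptions.

```python
import random

def mask_patches(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    return set(rng.sample(range(num_patches), num_masked))

def masked_l1_loss(target_patches, predicted_patches, masked):
    """Mean absolute error over the pixel values of masked patches only."""
    total, count = 0.0, 0
    for idx in masked:
        for target, predicted in zip(target_patches[idx], predicted_patches[idx]):
            total += abs(target - predicted)
            count += 1
    return total / count

rng = random.Random(0)
# Nine hypothetical patches (a 3 x 3 grid), each flattened to four values in [0, 1).
patches = [[rng.random() for _ in range(4)] for _ in range(9)]
masked = mask_patches(len(patches), mask_ratio=5 / 9, rng=rng)

# Stand-in for encoder + prediction head: predict 0.5 for every value.
predictions = [[0.5] * 4 for _ in patches]
loss = masked_l1_loss(patches, predictions, masked)
```

Because the loss is computed only on the masked patches, the supervision signal comes from the image itself rather than from a human-provided label.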

An image of a building surrounded by a playground and trees is partitioned into a 3 x 3 grid. The grid is unfolded into its nine individual patches and five of them are replaced by mask patches; an arrow labeled “mask” points right from the grid to the patches. An arrow points upward from the unfolded patches to two boxes: the first is labeled “Encoder (e.g., ViT, Swin)” and the second is labeled “One-layer Prediction Head.” On top of the second box are two rows of the original patches that were replaced by mask patches, with double-headed arrows in between the top patches and bottom patches across the row and an “ell one loss” indicator at left.
Figure 4: In the proposed self-supervised pretraining method SimMIM, models are tasked with predicting the RGB of hidden portions of an input image based on the visible portions. The method significantly reduces the number of labeled images required in large model training. With SimMIM, a 3-billion-parameter Swin V2 model was trained by using 40 times less labeled data than that used in previous billion-scale vision models.

Setting new records on four representative vision benchmarks

By scaling up model capacity and resolution, Swin Transformer V2 set new records on four representative vision benchmarks when it was introduced in November: 84.0 percent top-1 accuracy on ImageNetV2 image classification; 63.1 / 54.4 box / mask mean average precision (mAP) on COCO object detection; 59.9 mean Intersection-over-Union (mIoU) on ADE20K semantic segmentation; and 86.8 percent top-1 accuracy on Kinetics-400 video action classification.

Benchmark | ImageNetV2 (top-1 accuracy, %) | COCO test-dev (box mAP) | ADE20K val (mIoU) | Kinetics-400 (top-1 accuracy, %)
Swin V1 | 77.5 | 58.7 | 53.5 | 84.9
Previous state of the art | 83.3 (Google, July 2021) | 61.3 (Microsoft, July 2021) | 58.4 (Microsoft, October 2021) | 85.4 (Google, October 2021)
Swin V2 (November 2021) | 84.0 (+0.7) | 63.1 (+1.8) | 59.9 (+1.5) | 86.8 (+1.4)
Table 1: Swin Transformer (V1) was modified to address the challenges of pretraining and applying large vision models. The adapted architecture (V2) achieved state of the art on several representative benchmarks when it was introduced.

We hope this strong performance on a variety of vision tasks can encourage the field to invest more in scaling up vision models and that the provided training “recipe” can facilitate future research in this direction.

To learn more about the Swin Transformer journey, check out our Tech Minutes video.

Swin Transformer research team

(In alphabetical order) Yue Cao, Li Dong, Baining Guo, Han Hu, Stephen Lin, Yutong Lin, Ze Liu, Jia Ning, Furu Wei, Yixuan Wei, Zhenda Xie, Zhuliang Yao, and Zheng Zhang

The post Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability appeared first on Microsoft Research.

ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications

A lightbulb-shaped image, set against a background that represents computer circuitry, is divided into seven sections, with a small research-related icon in each section. Moving clockwise from upper right to upper left, the seven icons are a set of scales, a padlock, a badge, two hands clasped in a handshake, a human brain, a group of three people, and a human eye.

ICLR (International Conference on Learning Representations) is recognized as one of the top conferences in the field of deep learning. Many influential papers on artificial intelligence, statistics, and data science—as well as important application fields such as machine vision, speech recognition, and text understanding—have been published and presented at this conference. The following selection of papers accepted at ICLR 2022 showcases the latest research from Microsoft and its collaborators on vision pre-training, periodic time series forecasting, differential privacy, code completion, table pre-training, and online reinforcement learning. 

As research into deep learning continues to grow and change, Microsoft researchers and collaborators are broadening their approach to the field. As several papers highlighted in this post demonstrate, research teams continue to refine their thinking about how various machine learning techniques can best be implemented for real-world applications—whether for specialized applications in industry or a more generalized approach to improving decision-making in models overall. They are also gaining a deeper understanding of how different modalities, like computer vision, can extend applications of machine learning beyond language. 

In parallel with exploring real-world applications and multimodality, researchers are looking to the future of machine learning techniques, further exploring uncharted areas in deep online and offline reinforcement learning. In subfields such as the latter, the fundamentals of how models learn from and interact with data are evolving—and so are the ways that researchers think about optimizing those processes and reworking them for situations where data is scarce or unavailable in real-world settings.

This post is a sampling of the work that researchers at Microsoft Research Asia and their collaborators presented at ICLR 2022, which reflects the broad scope of the company’s machine learning research. You can learn more about the work accepted at this year’s event on the “Microsoft at ICLR 2022” event page. On the Microsoft Research Blog, you can dive deeper into two papers accepted at the conference—one on MoLeR, a model that represents molecules as graphs to improve drug discovery, and the other on Path Predictive Elimination (PPE), a reinforcement learning method that is robust enough to remove noise from continuously changing environments.

DEPTS: Deep expansion learning for periodic time series forecasting

In the image on the right, the overall data flows for DEPTS is visualized. The middle image shows how the researchers plot the integral structure of three layer-by-layer expansion branches in the expansion module. On the left, the detailed residual connections within a single layer are depicted.
Figure 1: The image on the right shows the overall data flows for DEPTS. The middle image shows how researchers plot the integral structure of three layer-by-layer expansion branches in the expansion module. The image on the left depicts the detailed residual connections within a single layer.

People and organizations involved: Shun Zheng, Xiaohan Yi, Wei Cao, Jian Bian, and Tie-Yan Liu from Microsoft Research Asia; Wei Fan and Yanjie Fu from the University of Central Florida. 

According to the paper: Periodic time series (PTS, or time series with apparent periodic oscillations) is widespread across industries such as transportation, electric power generation and transmission, sustainability, and others. PTS forecasting plays a crucial role in these industries because it can help businesses with many critical tasks, including early warning, pre-planning, and resource scheduling. However, PTS forecasting performance can be affected by dependencies on its inherent periodic nature and the complexity of individual periods. 

This paper introduces DEPTS, a deep expansion learning framework for PTS forecasting. DEPTS begins with a novel decoupled formulation by introducing the periodic state as a hidden variable, which allows the researchers to create custom modules to tackle the two challenges mentioned above. To address the first challenge, the researchers develop an expansion module on top of residual learning to perform a layer-by-layer expansion of those complicated dependencies. To address the second, they introduce a periodicity module with a parameterized periodic function that has the capacity to capture diversified periods. 
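As a rough illustration of what a parameterized periodic function can look like, the sketch below uses a constant offset plus a sum of cosines, each with its own amplitude, frequency, and phase. This functional form and the parameter values are assumptions made for the example; in a trained model the parameters would be learned from data rather than picked by hand.

```python
import math

def periodic_state(t, offset, components):
    """offset + sum of amplitude * cos(2 * pi * frequency * t + phase)."""
    return offset + sum(
        amplitude * math.cos(2 * math.pi * frequency * t + phase)
        for amplitude, frequency, phase in components
    )

# Hypothetical daily series with a weekly cycle and a yearly cycle.
weekly = (3.0, 1 / 7, 0.0)
yearly = (10.0, 1 / 365, math.pi / 2)

two_weeks = [periodic_state(t, 50.0, [weekly, yearly]) for t in range(14)]

# With only the weekly component, the state repeats every seven steps.
start = periodic_state(0, 50.0, [weekly])
one_week_later = periodic_state(7, 50.0, [weekly])
```

Mixing components with different frequencies is one simple way to represent the "diversified periods" mentioned above.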

The researchers conduct experiments on both synthetic and real-world data that show DEPTS is effective at PTS forecasting and significantly reduces errors compared with baselines, with improvements of up to 20 percent in some cases.

Towards deployment-efficient reinforcement learning: Lower bound and optimality

An animation shows an initial state, S zero. After a few deployments, Layer one shows S one/one, S two/one, and S three/one. After a few deployments, Layer two shows S  one/two, S two/two, and S three/two. After a few deployments, Layer three shows S  one/three, S two/three, and S three/three. After each deployment, the layers move from a known state to an unknown state.
Figure 2: This shows a high-level visualization of our algorithms: a layer-by-layer strategy (with a three-layer tabular MDP as an example).

People and organizations involved: Li Zhao, Tao Qin, and Tie-Yan Liu from Microsoft Research Asia; Jiawei Huang, Jinglin Chen, and Nan Jiang from the University of Illinois at Urbana-Champaign. 

According to the paper: Traditional online reinforcement learning (RL) can be abstracted as the loop of two elements: learning a policy from collected data and deploying the policy to collect new data by interacting with environments. The overall objective of RL is to finish the exploration of the entire environment and obtain a near-optimal policy. 

However, in many real-world applications, policy deployment can be quite costly while data collection with a fixed policy is relatively convenient. For example, in recommendation systems, the policy is the recommendation strategy, and a good policy can accurately make suggestions to users based on their preferences. To guarantee the service quality, before launching a new policy, companies usually need to conduct multiple internal tests for evaluation, which require a lot of time (up to several months). Due to large customer bases, however, companies can gather thousands or millions of pieces of feedback for further policy learning in a short time once a system is deployed. In those applications, organizations prefer RL algorithms that can learn a good policy after only a few switches or deployments. Yet a gap remains between existing algorithms and the practical scenarios above (please refer to the paper for more discussion). 

To close the gap, the researchers propose a new setting called Deployment-Efficient Reinforcement Learning (DE-RL)—an abstracted model for applications that value deployment efficiency. A new idea called deployment complexity, an analogue of sample complexity, provides a way to measure the deployment efficiency of an algorithm. Deployment complexity is the number of policy deployments required before the algorithm returns a near-optimal policy. 
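The notion of deployment complexity can be sketched as a counter around the learn-and-deploy loop. Everything below is a toy under stated assumptions: the callback names, the three-layer environment, and the rule that each deployment reveals exactly one new layer are hypothetical and only loosely mirror the layer-by-layer picture in Figure 2.

```python
def count_deployments(learn, deploy, is_near_optimal, max_deployments):
    """Alternate learning and deployment; return (policy, deployments used)."""
    data = []
    policy = learn(data)
    used = 0
    while not is_near_optimal(policy) and used < max_deployments:
        data.extend(deploy(policy))  # one costly deployment, many cheap samples
        used += 1
        policy = learn(data)
    return policy, used

H = 3  # number of layers in a hypothetical layered environment

def learn(data):
    # Toy "policy": how many layers have been explored so far.
    return len({layer for layer, _ in data})

def deploy(policy):
    # A fixed policy gathers a large batch from the next unexplored layer.
    return [(policy, sample) for sample in range(1000)]

def is_near_optimal(policy):
    return policy == H

policy, deployments = count_deployments(learn, deploy, is_near_optimal, 10)
print(deployments)  # 3
```

The batch size per deployment does not affect the count; only the number of policy switches does, which is the quantity this setting asks algorithms to minimize.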

Under this framework, researchers study linear Markov decision processes (MDPs) as a concrete example and conduct theoretical analysis to answer two important questions. First, what is the best deployment complexity we can achieve (lower bound)? Second, how can we design algorithms to achieve the best deployment complexity (optimality)? Additionally, because most of the previous related literature studied only algorithms constrained to deploy deterministic policies, these researchers separately consider the two cases with and without such constraints. They show that removing such constraints can significantly improve deployment efficiency. 

For the first question above, the researchers construct hard instances and establish information-theoretic lower bounds for the two cases, which are dH and H, respectively. For the second question, they propose algorithms that achieve those lower bounds with a layer-by-layer exploration strategy (as shown in Figure 2), contributing a new algorithm framework based on a novel covariance matrix estimation method and several technical innovations. Finally, the researchers discuss extended settings based on the formulation of DE-RL, which may be an interesting subject for future investigation. 

Gradient information matters in policy optimization by back-propagating through model

This image illustrates a problem inherent in model-based reinforcement learning and how it can be solved. The diagram on the left shows the mismatch between learning and using the model. The diagram on the right shows how the proposed two-model-based learning method can control both the prediction and gradient errors, separating the different roles of the two models at the model learning phase and coordinating them at the policy optimization phase.
Figure 3: (a) This shows the mismatch between learning and using the model. Here, the model means the transition and reward function. (b) This illustrates the DDPPO algorithm. DDPPO constructs the prediction model and the gradient model separately, trains each with its own loss, and then uses each where appropriate.

People and organizations involved: Yue Wang and Tie-Yan Liu from Microsoft Research Asia; Chongchong Li and Yuting Liu from Beijing Jiaotong University; Wei Chen from the Institute of Computing Technology, Chinese Academy of Sciences; Zhi-Ming Ma from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences. 

According to the paper: Model-based reinforcement learning provides an efficient mechanism to find the optimal policy by interacting with the learned environment. In this paper, researchers investigate a mismatch between how the model is learned and how it is used. Specifically, an effective way to get the policy update direction is to leverage the model’s differentiability by using the model gradient. However, most commonly used methods treat model learning as a supervised learning task and minimize only the prediction error, without considering the gradient error. In other words, the algorithm requires an accurate model gradient, but the model is trained only to decrease the prediction error, which results in an objective mismatch. 
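A small numerical example shows why a low prediction error does not guarantee an accurate gradient. The functions below are a constructed illustration and do not come from the paper: the "true" dynamics are the identity, and the hypothetical learned model adds a tiny but rapidly oscillating term.

```python
import math

EPS = 0.01  # amplitude of the learned model's deviation from the truth

def true_model(x):
    return x

def learned_model(x):
    # Stays within EPS of the true model everywhere.
    return x + EPS * math.sin(x / EPS)

def learned_model_gradient(x):
    # Exact derivative of learned_model; the true gradient is 1 everywhere.
    return 1.0 + math.cos(x / EPS)

xs = [i / 1000 for i in range(1001)]
prediction_error = max(abs(learned_model(x) - true_model(x)) for x in xs)
gradient_error = max(abs(learned_model_gradient(x) - 1.0) for x in xs)

print(f"max prediction error: {prediction_error:.4f}")
print(f"max gradient error:   {gradient_error:.4f}")
```

The prediction error never exceeds 0.01, yet the gradient error reaches 1.0, so a policy update that back-propagates through this model could be pointed in a badly wrong direction even though the model looks accurate under a supervised loss.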

This paper first theoretically justifies that the model gradient error matters in the policy optimization phase. Specifically, the bias of the estimated policy gradient is not only introduced by the prediction error of the learned model but also introduced by the gradient error of the learned model. These errors will eventually influence the convergence rate of the policy optimization process. 

Next, the paper proposes a two-model-based learning method to control both the prediction and the gradient error. The paper separates the different roles of these two models at the model learning phase and coordinates them at the policy optimization phase. By designing a practical way to compute the gradient error, the paper can use it to guide the gradient model learning. By leveraging both the prediction and gradient models, we can first roll out the trajectory and then compute the model gradient to get the policy gradient. The proposed algorithm is called directional derivative projection policy optimization (DDPPO). Finally, several experiments in benchmark continuous control tasks demonstrate that the proposed algorithm has better sample efficiency.

Variational oracle guiding for reinforcement learning

The image shows two diagrams to illustrate variational latent oracle guiding, abbreviated as VLOG, for deep reinforcement learning. The first diagram shows VLOG during learning, the second during execution. Both diagrams use Q-learning as the example.
Figure 4: Diagrams of VLOG during learning and execution, using Q-learning as an example. Left: During learning when oracle observations are available. A Bayesian latent variable z is estimated using the executor observation (prior) and the oracle observation (posterior), respectively. The whole model is trained by maximizing the VLOG variational lower bound, which is the RL objective function of the posterior model minus the KL-divergence between the posterior and prior z. Right: During execution, when only executor observations are available. 

People and organizations involved: Dongqi Han, Xufang Luo, Yuqing Yang and Dongsheng Li from Microsoft Research Asia; Tadashi Kozuno from the University of Alberta; Zhaoyun Chen from the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Kenji Doya from the Okinawa Institute of Science and Technology. 

According to the paper: Despite recent successes of deep reinforcement learning (DRL) in various decision-making problems, an important but underexplored aspect is how to leverage oracle observation (the information that is invisible during online decision making but is available during offline training) to facilitate learning. For example, human experts will look at the replay after a poker game, so they can check the opponents’ hands and use the visible information (executor observation) to improve their gameplay strategy. Such problems are known as oracle guiding. 

In this work, the researchers study the oracle guiding problems based on Bayesian theory and derive an objective to leverage oracle observation in RL using variational methods. The key contribution is a general learning framework for DRL, referred to as variational latent oracle guiding (VLOG). VLOG offers desirable properties, such as robust and promising performance and the versatility to be combined with any value-based DRL algorithm. 
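The Figure 4 caption describes the training objective as the RL objective of the posterior model minus the KL divergence between the posterior and prior over z. The sketch below writes that out under two assumptions that are not stated in this post: both distributions are diagonal Gaussians, and the KL term carries a weight `beta` (set to 1 by default). The function names are hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )

def variational_lower_bound(rl_objective, posterior, prior, beta=1.0):
    """RL objective of the posterior model minus the weighted KL term."""
    mu_q, sigma_q = posterior  # estimated from the oracle observation
    mu_p, sigma_p = prior      # estimated from the executor observation
    return rl_objective - beta * kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)

# Hypothetical one-dimensional latent: the posterior mean is one unit away from the prior mean.
posterior = ([1.0], [1.0])
prior = ([0.0], [1.0])
bound = variational_lower_bound(2.0, posterior, prior)
```

Maximizing this quantity pulls the prior, which is all that is available at execution time, toward the better-informed posterior while still optimizing the RL objective.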

The paper empirically demonstrates the effectiveness of VLOG in online and offline RL domains with tasks ranging from video games to mahjong, a challenging tile-based game. Furthermore, the authors publish the mahjong environment and an offline RL dataset as a benchmarking task to facilitate future research on oracle guiding, game AI, and related topics. 

The post ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications appeared first on Microsoft Research.

DoWhy evolves to independent PyWhy model to help causal inference grow

A flowchart showing the DoWhy library process. Input Data and Domain Knowledge are injected into the library, where they go through four steps: Model causal mechanisms; Identify target estimands; Estimate causal effect; and Refute estimate. The process produces the output labelled Causal effect.

Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven evaluation of many technological systems we build today. 

To help data scientists better understand and deploy causal inference, Microsoft researchers built a tool that implements the process of causal inference analysis from end to end. The ensuing DoWhy library has been doing just that since 2018 and has cultivated a community devoted to applying causal inference principles in data science. To broaden access to this critical knowledge base, DoWhy is migrating to an independent open-source governance model in a new PyWhy GitHub organization. As a first step toward this model, we are announcing a collaboration with Amazon Web Services (AWS), which is contributing new technology based on structural causal models. 

What is causal inference?

The goal of conventional machine learning methods is to predict an outcome. In contrast, causal inference focuses on the effect of a decision or action—that is, the difference between the outcome if an action is completed versus not completed. For example, consider a public utility company seeking to reduce their customers’ usage of water through a marketing and rewards program. The effectiveness of a rewards program is difficult to ascertain, as any decrease in water usage by participating customers is confounded with their choice to participate in the program. If we observe that a rewards program member uses less water, how do we know whether it is the program that is incentivizing their lower water usage or if customers who were already planning to reduce water usage also chose to join the program? Given information about the drivers of customer behavior, causal methods can disentangle confounding factors and identify the effect of this rewards program. 

Figure 1: A public utility introduces a program that rewards water usage reduction. Are people who sign up using less water than they would have otherwise? 
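To see how confounding distorts a naive comparison, consider a made-up population of 1,000 customers. The counts and average usage figures below are invented for illustration; "planned" marks customers who already intended to cut their water usage, and the program is assumed to lower usage by 5 units in both groups.

```python
# (planned_to_reduce, joined_program): (number of customers, mean water usage)
cells = {
    (1, 1): (400, 75.0),
    (1, 0): (100, 80.0),
    (0, 1): (100, 95.0),
    (0, 0): (400, 100.0),
}

def mean_usage(joined):
    """Average usage over everyone with the given membership status."""
    rows = [(n, usage) for (_, j), (n, usage) in cells.items() if j == joined]
    return sum(n * usage for n, usage in rows) / sum(n for n, _ in rows)

# Naive comparison: members versus non-members, ignoring prior intent.
naive_effect = mean_usage(1) - mean_usage(0)

# Adjusted comparison: compare within each intent group, then weight by group size.
total = sum(n for n, _ in cells.values())
adjusted_effect = 0.0
for planned in (0, 1):
    group_size = cells[(planned, 1)][0] + cells[(planned, 0)][0]
    difference = cells[(planned, 1)][1] - cells[(planned, 0)][1]
    adjusted_effect += (group_size / total) * difference

print(naive_effect)     # -17.0
print(adjusted_effect)  # -5.0
```

The naive comparison credits the program with a 17-unit drop, more than three times the 5-unit effect built into the example, because members were disproportionately people who planned to cut back anyway.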

How do we know when we have the right answer? The effect of an action like signing up for a customer loyalty program is typically not an observable value. For any given customer, we see only one of the two respective outcomes and cannot directly observe the difference the program made. This means that processes developed to validate conventional machine learning models—based on comparing predictions to observed, ground truths—cannot be used. Instead, we need new processes to gain confidence in the reliability of causal inference. Most critically, we need to capture our domain knowledge, reason about our modeling choices, then validate our core assumptions when possible and analyze the sensitivity of our results to violations of assumptions when validation is not possible. 

Four steps of causal inference analysis

Data scientists just beginning to explore causal inference are most challenged by the new modeling assumptions of causal methods. DoWhy can help them understand and implement the process. The library focuses on the four steps of an end-to-end causal inference analysis, which are discussed in detail in a previous paper, DoWhy: an End-to-End Library for Causal Inference, and related blog post.

  1. Modeling: Causal reasoning begins with the creation of a clear model of the causal assumptions being made. This involves documenting what is known about the data generating process and mechanisms. To get a valid answer to our cause-and-effect questions, we must be explicit about what we already know. 
  2. Identification: Next, we use the model to decide whether the causal question can be answered, and we provide the required expression to be computed. Identification is the process of analyzing our model. 
  3. Estimation: Once we have a strategy for identifying the causal effect, we can choose from several different statistical and machine learning-based estimation methods to answer our causal question. Estimation is the process of analyzing our data. 
  4. Refutation: Once we have our answer, we must do everything we can to test our underlying assumptions. Is our model consistent with the data? How sensitive is the answer to the assumptions made? If the model missed an unobserved confounder, will that change our answer a little or a lot? 
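The four steps can be walked through in a self-contained sketch. This is plain Python, not the DoWhy API: the graph, the variable names, the simulated data, the shared-parents shortcut for identification, and the placebo check are all simplifying assumptions chosen to keep the example short.

```python
import random

# Step 1 (modeling): a hypothetical causal graph as child -> parents.
GRAPH = {
    "planned_cut": [],
    "joined": ["planned_cut"],
    "usage": ["planned_cut", "joined"],
}

def identify_adjustment_set(graph, treatment, outcome):
    """Step 2 (identification): for this simple graph, adjusting for the
    shared parents of treatment and outcome blocks the backdoor path."""
    return sorted(set(graph[treatment]) & set(graph[outcome]))

def estimate_effect(rows, treatment, outcome, adjust_for):
    """Step 3 (estimation): stratify on the adjustment set and average the
    treated-minus-untreated difference, weighted by stratum size."""
    strata = {}
    for row in rows:
        key = tuple(row[name] for name in adjust_for)
        strata.setdefault(key, {0: [], 1: []})[row[treatment]].append(row[outcome])
    total, effect = 0, 0.0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            continue  # no overlap in this stratum, so it cannot be compared
        size = len(groups[0]) + len(groups[1])
        diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        effect += size * diff
        total += size
    return effect / total

def refute_with_placebo(rows, treatment, outcome, adjust_for, rng):
    """Step 4 (refutation): swap in a random placebo treatment. A sound
    analysis should then report an effect close to zero."""
    placebo_rows = [dict(row, **{treatment: rng.randint(0, 1)}) for row in rows]
    return estimate_effect(placebo_rows, treatment, outcome, adjust_for)

def simulate(n, rng):
    """Simulated customers; the true program effect is set to -5 units."""
    rows = []
    for _ in range(n):
        planned = 1 if rng.random() < 0.5 else 0
        joined = 1 if rng.random() < (0.8 if planned else 0.2) else 0
        usage = 100 - 20 * planned - 5 * joined + rng.gauss(0, 5)
        rows.append({"planned_cut": planned, "joined": joined, "usage": usage})
    return rows

rng = random.Random(0)
rows = simulate(20000, rng)
adjust_for = identify_adjustment_set(GRAPH, "joined", "usage")
estimate = estimate_effect(rows, "joined", "usage", adjust_for)
placebo = refute_with_placebo(rows, "joined", "usage", adjust_for, rng)
print(adjust_for, round(estimate, 2), round(placebo, 2))
```

Keeping the steps as separate functions mirrors the separation the list describes: the graph encodes assumptions, identification inspects only the graph, estimation inspects only the data, and refutation stress-tests the result.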

This focus on the four steps of the end-to-end causal inference process differentiates the DoWhy library from prior causal inference toolkits. DoWhy complements other libraries—which focus on individual steps—and offers users the benefits of those libraries in a seamless, unified API. For example, for estimation, DoWhy offers the ability to call out to Microsoft’s EconML library for its advanced estimation methods. 

Current DoWhy deployments

Today, DoWhy has been installed over one million times. It is widely deployed in production scenarios across industry and academia—from evaluating the effects of customer loyalty and marketing programs to identifying the controllable drivers of key business metrics. DoWhy’s rich API has enabled the creation of downstream solutions such as AutoCausality from Wise.com, which automates comparison of different methods, and ShowWhy from Microsoft, which provides a no-code GUI experience for causal inference analysis. In academia, DoWhy has been used in a range of research scenarios, including sustainable building design, environmental data analyses, and health studies. At Microsoft, we continue to use DoWhy to power causal analyses and test their validity, for example, estimating who benefits most from messages to avoid overcommunicating to large groups. 

A community of more than 40 researchers and developers continually enriches the library with critical additions. Highly impactful contributions, such as customizable backdoor criterion implementation and a user-friendly Pandas integration, have come from external contributors. Instructors in courses and workshops around the world use DoWhy as a pedagogical tool to teach causal inference. 

With such broad support, DoWhy continues to improve and expand. In addition to more complete implementations of identification algorithms and new sensitivity analysis methods, DoWhy has added experimental support for causal discovery and more powerful methods for testing the validity of a causal estimate. Using the four steps as a set of fundamental operations for causal analysis, DoWhy is now expanding into other tasks, such as representation learning. 

Microsoft continues to expand the frontiers of causal learning through its research initiatives, with new approaches to robust learning, statistical advances for causal estimation, deep learning-based methods for end-to-end causal discovery and inference, and investigations into how causal learning can help with fairness, explainability, and interpretability of machine learning models. As each of these technologies matures, we expect to make them available to the broader causal community through open source and product offerings. 

An independent organization for DoWhy and other open-source causal inference projects

Making causality a pillar of data science practice requires an even broader, collaborative effort to create a standardized foundation for our industry. 

To this end, we are happy to announce that we are shifting DoWhy into an independent open-source governance model, in a new PyWhy effort. 

The mission of PyWhy is to build an open-source ecosystem for causal machine learning that advances the state of the art and makes it available to practitioners and researchers. In PyWhy, we will build and host interoperable libraries, tools, and other resources spanning a variety of causal tasks and applications, connected through a common API on foundational causal operations and a focus on the end-to-end analysis process.

Our first collaborator in this initiative is AWS, which is contributing new technology for causal attribution based on a structural causal model that complements DoWhy’s current functionalities. 

We are looking forward to accelerating and broadening adoption of our open-source causal learning tools through this new GitHub organization. We invite data scientists, researchers, and engineers, whether you are just learning about causality or already designing new algorithms or even building your own tools, to join us on the open-source journey towards building a useful causal analysis ecosystem. 

We encourage you to explore DoWhy and invite you to contact us to learn more. We are excited by what lies ahead as we aim to transform data science practice to drive improved modeling and decision making. 

The post DoWhy evolves to independent PyWhy model to help causal inference grow appeared first on Microsoft Research.
