DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization

Three bar plots. The first shows that XTC-BERT is 32 times smaller than BERT, with two dots marking their accuracy: 83.95 for BERT and 83.44 for XTC-BERT. The second shows that INT8 inference with ZeroQuant is 2.6 times faster than the FP16 PyTorch baseline and that ZeroQuant reduces the number of GPUs needed for inference from two to one, for a combined 5.2 times efficiency improvement; ZeroQuant reaches 50.4 accuracy compared to 50.5 for the PyTorch baseline. The third shows that ZeroQuant is more than 5000 times cheaper than the baseline at compressing a model, with an accuracy of 42.26 compared to 42.35 for the baseline.

Large-scale models are revolutionizing deep learning and AI research, driving major improvements in language understanding, creative text generation, multilingual translation, and much more. But despite their remarkable capabilities, the models’ large size creates latency and cost constraints that hinder the deployment of applications on top of them. In particular, increased inference time and memory consumption inhibit deployment of models in latency-sensitive and resource-constrained applications on both server and client devices.

To address these deployment challenges, the DeepSpeed team, as part of Microsoft’s AI at Scale initiative, has been exploring innovations in system optimization and model compression. On the former, we released the DeepSpeed inference system, which consists of a diverse set of optimizations, such as highly optimized CUDA kernels and inference-adapted parallelism to accelerate model inference speed, as well as ZeRO-Inference, which breaks the GPU memory wall and fits large models across heterogeneous memories to address hardware accessibility limitations. These optimizations improve inference system efficiency while preserving the model size, the amount of computation, and model accuracy: the total work remains the same, but the processing capability and speed are higher. On the latter, emerging compression algorithms show great potential in reducing model size and inference computation. These algorithms use condensed formats to represent, store, communicate, and compute DNN models, reducing the total work needed for inference with little or no loss in accuracy.

System optimizations and model compression are highly complementary, and they can be synergistically combined to provide a multiplicative reduction in inference latency and cost. Motivated by combining the best of both worlds, we are proud to announce DeepSpeed Compression—a composable library that combines novel compression technologies and highly efficient system optimizations to make DL model size smaller and inference speed faster, all at much lower compression cost.

Challenges of compressing large deep learning models

Although there have been numerous efforts to compress model sizes and reduce inference computation, applying existing compression techniques to large scale models still has many challenges in practice:

Complex pipeline for achieving high compression ratio. Various strategies have been proposed to overcome optimization difficulty and accuracy degradation when compressing large models. However, there is no systematic study of best practices for extreme compression, such as aggressive quantization and layer reduction. This leaves the underlying question unanswered: do we really need those ad-hoc tricks to recover the accuracy loss, or do simpler yet more effective methods exist?

High compression cost. Existing methods for compressing large models incur high training costs. For example, popular compression methods such as quantization-aware training (QAT) and multi-stage distillation lead to long training times and large hardware resource requirements as models grow to multi-billion parameters and beyond, making compressing these models costly and difficult. For instance, the 20B GPT-NeoX model was pre-trained on 96 NVIDIA A100 GPUs over three months. Performing QAT with even 10% of the training samples would still require large amounts of computational resources, which many practitioners cannot afford.

Lack of tailored system optimizations for compressed models. To maximize the benefits of compressed models, specialized system optimizations are often required, e.g., quantized and sparsified models need optimized low-bit arithmetic computation and sparse matrix multiplication to boost the inference speed on commodity hardware. Existing methods often focus on reducing theoretical computation overhead but miss the opportunities to offer the best inference latency reduction via tailored system optimizations for the compressed models.

Limited composability. Existing methods have limited composability from two aspects. First, there is limited composability among multiple compression methods. Although well-performing compression solutions have been proposed independently, combining multiple methods together for the best outcome is still a laborious process, requiring building a complex compression pipeline. Second, there is a lack of composability between compression techniques and system optimizations. As we just mentioned, compressed models require specialized system optimizations to maximize latency and cost reduction. However, few existing methods take an end-to-end approach of composing compressions with system optimizations, as it requires significant efforts to bring modeling, algorithm, and system areas of deep learning to work synergistically together.

DeepSpeed Compression overcomes these challenges by offering novel state-of-the-art compression techniques, such as XTC for 32x smaller model size and ZeroQuant for 5000x lower compression cost. It also takes an end-to-end approach to improve the computation efficiency of compressed models via a highly optimized inference engine. Furthermore, our library has multiple built-in state-of-the-art compression methods and supports synergistic composition of these methods together with the system optimizations, offering the best of both worlds while providing a seamless and easy-to-use pipeline for efficient DL model inference. Each of these features is explained further below.

Smaller model size: 32x smaller transformer models via simple yet effective binarized extreme compression

Reducing the size of large models is critical when deploying them on both servers and client devices. In DeepSpeed Compression, we provide extreme compression techniques to reduce model size by 32x with almost no accuracy loss or to achieve 50x model size reduction while retaining 97% of the accuracy. We do this through two main techniques: extreme quantization and layer reduction. Extreme quantization via ternarization/binarization reduces the model size significantly but is considered a particularly challenging task due to the large quantization error, which degrades model performance. To improve the accuracy of binarized/ternarized models, existing methods often adopt complicated and computationally expensive compression pipelines, such as multi-stage distillation. However, it remains unclear how different components in extreme quantization affect the resulting performance. To tease apart their effects, we perform a systematic study on the impact of various techniques currently used for extreme compression.

In this process, we have identified several best practices for extreme compression:

  1. Longer training with learning rate decay is highly preferred for closing the accuracy gap of extreme quantization;
  2. Single-stage knowledge distillation with a larger training budget is sufficient to match or even exceed the accuracy of multi-stage approaches;
  3. Skipping data augmentation hurts downstream task performance across various compression settings, especially on smaller tasks;
  4. Lightweight layer reduction matches or even exceeds expensive pre-training distillation for task-specific compression.

Based on these findings, we greatly simplify the procedure of extreme compression and propose a new extreme compression technique, XTC, that compresses a model to its limit with lightweight layer reduction and robust binarization. XTC produces models with little loss in accuracy yet up to 50x model size reduction, as shown in Figure 1. XTC reduces the model size by 32x with almost no loss in the average score on the GLUE tasks via a simple yet effective binarization technique. By combining extreme quantization and lightweight layer reduction, we can further improve the binarized model, achieving 50x model size reduction while retaining 97% of the accuracy. Given that transformers are becoming the standard architecture choice for AI, we believe the investigation and the proposed solution could be highly impactful to power large-scale models on resource-constrained devices. If you are interested in XTC, you can also find more details in our technical report “Extreme Compression for Pre-trained Transformers Made Simple and Efficient.”
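To make the binarization idea concrete, below is a minimal PyTorch sketch of the kind of scaled weight binarization and straight-through gradient estimation that binary transformer methods build on. It is an illustration under simplifying assumptions, not the exact XTC recipe; names such as `BinaryLinear` are ours.

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    # Scaled binarization: each weight becomes +/- alpha, where alpha is the
    # per-row mean absolute value (a common scaling choice in binary networks).
    alpha = w.abs().mean(dim=1, keepdim=True)
    return alpha * torch.sign(w)

class BinaryLinear(torch.nn.Linear):
    # Hypothetical layer for illustration: binarized weights in the forward
    # pass, with a straight-through estimator so gradients still update the
    # latent full-precision weights during (distillation-based) training.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_bin = binarize(self.weight)
        w_ste = self.weight + (w_bin - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)

layer = BinaryLinear(768, 768)
out = layer(torch.randn(4, 768))  # forward pass uses 1-bit (scaled) weights
```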

A Pareto frontier plot showing multiple compression methods, including XTC-BERT, BinaryBERT with TWN, BinaryBERT with BWN, TernaryBERT, TernaryTinyBERT, 3-bit BERT, 3-bit TinyBERT, 8-bit BERT, 8-bit TinyBERT, and the original BERT teacher. The x-axis shows the model size, and the y-axis shows the GLUE score. Different settings of the proposed XTC-BERT sit in the top-left corner of the plot, advancing the Pareto frontier with smaller model sizes and better GLUE scores.
Figure 1: Comparison between XTC and other state-of-the-art compression results on BERT

Lower compression cost: Quantizing models with >5000x compression cost reduction and no training data

Large-scale transformer models with hundreds of billions of parameters are usually challenging to quantize due to the lack of training resources and/or data access. To resolve those issues, we propose a method called ZeroQuant, which quantizes large-scale models with little or no fine-tuning cost on limited resources. Under the hood, ZeroQuant contains two major parts: 1) a hardware friendly fine-grained quantization scheme that allows us to quantize weights and activations into low-bit values with minimal errors while still empowering fast inference speed on commodity hardware with low quantization/dequantization cost; and 2) a layer-by-layer knowledge distillation pipeline, which fine-tunes the quantized model to close the accuracy gap from low-precision (e.g., INT4) quantization.
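The sketch below illustrates what such a fine-grained scheme can look like: group-wise symmetric quantization for weights (one scale per group) and dynamic token-wise quantization for activations (one scale per token). This is a conceptual illustration of the idea, not the exact ZeroQuant kernels; the group size and rounding details are simplifying assumptions.

```python
import torch

def quantize_weights_groupwise(w: torch.Tensor, num_groups: int, bits: int = 8):
    # Split the weight matrix into groups, each with its own symmetric scale,
    # which keeps quantization error low at negligible runtime cost.
    # Assumes w.numel() is divisible by num_groups.
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(num_groups, -1)
    scales = groups.abs().max(dim=1, keepdim=True).values.clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales  # INT8 values plus one FP scale per group

def quantize_activations_tokenwise(x: torch.Tensor, bits: int = 8):
    # Dynamic per-token scales for a [tokens, hidden] activation matrix,
    # computed on the fly so no calibration data is needed.
    qmax = 2 ** (bits - 1) - 1
    scales = x.abs().max(dim=-1, keepdim=True).values.clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(x / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

wq, w_scales = quantize_weights_groupwise(torch.randn(1024, 1024), num_groups=128)
aq, a_scales = quantize_activations_tokenwise(torch.randn(16, 1024))
```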

The benefits of ZeroQuant are threefold: First, unlike previous quantization-aware training that requires expensive retraining and parameter tuning, ZeroQuant enables quantizing BERT and GPT-style models from FP32/FP16 into INT8 weights and activations while retaining accuracy, without incurring any retraining cost, as shown in Figure 2. Second, by loading only one layer at a time for low-precision (e.g., INT4) quantization, the maximum memory footprint required to quantize the model depends solely on the size of an individual layer instead of the entire model, allowing one to quantize gigantic models with as little as one GPU. Third, our quantization method is data-free, which means that it does not require the original training data of the model to obtain a quantized model. This is especially useful when the data is not available due to privacy-related reasons, for example.
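A rough sketch of this layer-by-layer idea is shown below: each quantized layer is tuned to mimic the output of its full-precision counterpart on a small stream of hidden states, one layer at a time, so peak memory scales with a single layer rather than the whole model. The loop structure, optimizer settings, and the `quantize_layer` helper are illustrative assumptions, not the published ZeroQuant implementation.

```python
import copy
import torch

def quantize_layer(layer: torch.nn.Module) -> torch.nn.Module:
    # Stand-in for a real low-bit (e.g., INT4) conversion of a single layer.
    return copy.deepcopy(layer)

def layer_by_layer_distill(teacher_layers, hidden_state_batches, steps=100):
    students = []
    for teacher in teacher_layers:                    # only one layer in focus at a time
        student = quantize_layer(teacher)
        opt = torch.optim.Adam(student.parameters(), lr=1e-5)
        for _ in range(steps):
            x = next(hidden_state_batches)            # hidden states feeding this layer
            with torch.no_grad():
                target = teacher(x)                   # the original layer acts as the teacher
            loss = torch.nn.functional.mse_loss(student(x), target)
            opt.zero_grad(); loss.backward(); opt.step()
        students.append(student)
    return students
```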

A graph demonstrating two ways to convert an FP16 or FP32 model to INT8. On top, it shows quantization aware training with a crying emoji, which needs training data and training GPUs to perform a retraining-evaluation-parameter tuning loop in order to get the INT8 model. At the bottom, it shows ZeroQuant (with a smile emoji) which does not require training data and GPUs, and it can directly convert the FP16 or FP32 model to INT8.
Figure 2: Comparison between ZeroQuant and standard Quantization Aware Training. ZeroQuant can significantly reduce training resources and time cost, without requiring the original training data.

We demonstrated the scalability of ZeroQuant on a GPT-3-style model with 1.3B parameters (GPT-3-1.3B) and one of the largest open-source language models, GPT-NeoX (20B). Particularly, thanks to the fine-grained quantization scheme, ZeroQuant can convert GPT-3-1.3B (trained on 128 NVIDIA A100 GPUs for five days) and GPT-NeoX (trained on 96 A100 GPUs for three months) to INT8 without any retraining or training data while delivering comparable accuracy. Furthermore, with the lightweight layer-by-layer knowledge distillation, ZeroQuant can quantize GPT-3-1.3B with mixed INT4/INT8 precision in three hours on a single GPU, which leads to a 5000x compression cost reduction compared to quantization-aware training. To find more details about ZeroQuant, refer to “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers.”

Faster inference speed: Latency reduction via highly optimized DeepSpeed Inference system

System optimizations play a key role in efficiently utilizing the available hardware resources and unleashing their full capability through inference optimization libraries like ONNX Runtime and DeepSpeed. We build our work on top of DeepSpeed Inference, which provides high-performance model serving with inference-optimized kernels, parallelism, and memory optimizations, covering a wide variety of models for both latency-sensitive and throughput-oriented applications. Besides leveraging these, we also extend the inference capability to support models in compressed formats. For example, we developed variations of efficient low-bit computation such as INT8 GeMM kernels. These kernels load INT8 parameters and activations from GPU device memory into registers and use a customized INT8 GeMM implemented on top of CUTLASS, tuned for different batch sizes, to deliver faster GeMM computation. The kernels also fuse quantization and dequantization operations before and after the GeMM, further reducing kernel invocation overhead and improving memory bandwidth utilization.

Furthermore, our inference engine supports many-GPU transformer layers for serving transformer models across GPUs using inference-adapted parallelism strategies. For compressed models that have a smaller memory footprint, the inference engine can automatically shrink the number of GPUs required to serve a model, leading to reduced cross-GPU communication and hardware cost. For example, DeepSpeed Compression leverages INT8 for GPT-NeoX (20B) and reduces the GPU requirement of serving the model from two to one, reducing latency from 65ms to 25ms and achieving a 5.2x cost reduction. As shown in Figure 3, DeepSpeed INT8 kernels can boost performance by up to 2x compared to our own FP16 kernels, and they achieve 2.8-5.2x latency cost reduction compared to the baseline FP16 in PyTorch, significantly reducing the latency and cost of large-scale model inference.
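To show only the arithmetic that such a fused INT8 GeMM performs (quantize, multiply with INT32 accumulation, then fold both scales into one dequantization step), here is a NumPy reference sketch; the actual DeepSpeed kernels implement this on the GPU with CUTLASS and fuse the steps into a single launch.

```python
import numpy as np

def int8_gemm_reference(x_fp, w_fp):
    # Reference math for a fused INT8 GeMM: per-tensor symmetric quantization,
    # INT8 multiply with INT32 accumulation, and a single fused dequantization.
    qmax = 127
    sx = np.abs(x_fp).max() / qmax            # activation scale
    sw = np.abs(w_fp).max() / qmax            # weight scale
    xq = np.clip(np.round(x_fp / sx), -128, 127).astype(np.int8)
    wq = np.clip(np.round(w_fp / sw), -128, 127).astype(np.int8)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)   # integer GeMM
    return acc.astype(np.float32) * (sx * sw)         # dequantize the output once

y = int8_gemm_reference(np.random.randn(8, 1024), np.random.randn(1024, 1024))
```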

A bar plot comparing the inference efficiency of three methods (PyTorch FP16, DeepSpeed Inference FP16, and DeepSpeed Inference INT8) on four models (GPT-2 XL with 1.5 billion parameters, GPT-Neo with 2.7 billion parameters, GPT-J with 6 billion parameters, and GPT-NeoX with 20 billion parameters). Overall, it shows that DeepSpeed Inference FP16 is 2-4 times more efficient than PyTorch FP16, and DeepSpeed Inference INT8 is 3-5 times more efficient than PyTorch FP16.
Figure 3: Inference efficiency improvements for large-scale transformer models with publicly available checkpoints using optimized DeepSpeed Inference engine. Inference efficiency is calculated by the inference latency speedup divided by the hardware cost reduction rate. DeepSpeed Inference achieves 2.8-4.8x latency reduction and up to 5.2x inference efficiency improvements.

A library that synergistically composes compression algorithms and system optimizations

DeepSpeed Compression proposes a seamless pipeline to address the compression composability challenges, as shown in Figure 4. The core piece of DeepSpeed Compression is a component called compression composer, which includes several significant features:

A graph about the DeepSpeed Compression library. It has two levels. On top, it shows a trained model (with certain accuracy, latency, and model size) getting passed into the compression composer and becoming the compressed model. Then the compressed model gets into the optimized inference engine to get the final faster or smaller model depending on the requirement on accuracy and latency. At the bottom, it shows how the compression composer makes compression decisions based on the network architecture of the model and composition of compression techniques, including distillation, pruning, and quantization; and it shows optimized inference engine improving inference performance by parallelization and efficient kernels.
Figure 4: The DeepSpeed Compression library
  1. It offers multiple cutting-edge compression methods, as shown in Table 1, including extreme quantization, head/row/channel pruning, and knowledge distillation, that can effectively reduce model size and inference cost. The list will expand as we continually integrate more state-of-the-art compression methods.
Category          Methods                                   Targets
Quantization      INT8/INT4                                 Activations
                  INT8/INT4/Ternary/Binary                  Weights
Sparsification    Head pruning                              Attention heads (Transformer)
                  Sparse/Row pruning                        Weights
                  Channel pruning                           Conv2D weights
Layer Reduction   Arbitrary subset of network layers        Layers
Distillation      Output logits, feature map, attn. map     Layers

Table 1: Compression techniques supported in DeepSpeed Compression composer.
  2. It offers an easy-to-use API that automatically takes care of the complexities of assembling different compression techniques to deliver the compound benefits of multiple compression methods. For example, XTC requires composition of lightweight layer reduction, binarization, and knowledge distillation. However, composing them together is non-trivial. With our compression composer, applying extreme compression is as easy as adding two new API calls to enable compression and clean the compressed model.
  3. It is designed in a modular way so that it will be easy for users to add new compression schemes. For example, additional compression methods can be added through custom compression layers and, by registering them with the compression composer, the new methods can be composed with existing methods that are already managed by the composer.
  4. It seamlessly works with the existing DeepSpeed library. This has two benefits. First, DeepSpeed Compression can be specified and enabled the same way as DeepSpeed training and inference via a JSON file, where enabling different combinations of compression techniques only requires a few lines of modification to the JSON file. Second, once the compression schemes have been configured, the compression composer automatically modifies the model layers and training to enable the compression process and does not require additional changes from the user to the model structure or the training procedure. A minimal sketch of this composition pattern is shown after this list.
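The self-contained Python sketch below illustrates the composition pattern described above: compression methods are registered by name and applied to a model according to a configuration object. The function and key names here are placeholders of our own; the actual DeepSpeed Compression API and JSON schema are documented in the DeepSpeed tutorials.

```python
import torch

COMPRESSION_REGISTRY = {}

def register(name):
    # New compression schemes plug in by registering themselves by name.
    def wrap(fn):
        COMPRESSION_REGISTRY[name] = fn
        return fn
    return wrap

@register("layer_reduction")
def layer_reduction(model, cfg):
    # Keep only the first cfg["keep_layers"] layers of a Sequential model.
    return model[: cfg["keep_layers"]]

@register("binarization")
def binarization(model, cfg):
    for p in model.parameters():
        p.data = p.data.abs().mean() * p.data.sign()   # crude 1-bit (scaled) weights
    return model

def init_compression(model, config):
    # "Enable compression": apply every configured method in order.
    for name, cfg in config.items():
        model = COMPRESSION_REGISTRY[name](model, cfg)
    return model

# Toy model with four layers and a toy composed configuration.
model = torch.nn.Sequential(*[torch.nn.Linear(8, 8) for _ in range(4)])
config = {"layer_reduction": {"keep_layers": 2}, "binarization": {}}
model = init_compression(model, config)   # 2 layers left, binarized weights
```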

After the DNN model has been compressed, DeepSpeed Compression replaces the compressed layers with highly optimized kernels in the DeepSpeed Inference engine to maximize hardware efficiency. Together, the compression composer and inference engine achieve the best of both worlds of compression and system optimization, delivering a compound effect of inference cost reduction.

Use Cases of DeepSpeed Compression

Although we started DeepSpeed Compression quite recently, we have successfully leveraged it to optimize several large-scale open-source models and Microsoft production workloads. It delivers significant latency and cost reductions and is widely applicable across various NLP and CV tasks.

We applied INT8 quantization from DeepSpeed Compression to optimize two large-scale open-source GPT-3-style models, GPT-J (6B) and GPT-NeoX (20B), on the Azure AI platform. As shown in Figure 5, our quantized models achieve similar accuracy to the original models on 19 zero-shot evaluation tasks and WikiText, while achieving 3.67x and 5.2x inference cost savings, respectively, compared with the PyTorch FP16 baseline on ND A100 v4 Azure instances. Importantly, we quantized these models without requiring any training data, expensive compression time, or large GPU resources, bringing huge cost savings compared with QAT!

Figure 5: DeepSpeed Compression results of GPT-J (6B)/GPT-NeoX (20B) using ZeroQuant. Left table shows the results of model quality and inference latency for the FP16 baseline and ZeroQuant; Right figures show the compression cost comparison between Quantization-aware Training (QAT, estimated) and ZeroQuant for INT8 quantization.

Beyond open-source models, DeepSpeed Compression has also demonstrated its effectiveness to optimize production workloads in Microsoft:

  • It reduces the Microsoft Turing Image Super Resolution (T-ISR) model size by 3.1x together with a 1.85x latency reduction by composing different compression schemes like pruning and distillation with efficient system optimizations. The model has also been deployed in Bing Maps and Microsoft Edge, where it automatically derives high-resolution images from lower-resolution images, which can be seen in this blog post.
  • It also successfully compresses the Microsoft Relevance Fusion models—Transformer-based ranking models used in Bing’s core search stack. Without DeepSpeed Compression, it took three days to quantize the model using QAT. With DeepSpeed Compression, we can quantize the model in a few minutes with improved accuracy and reduced latency compared to QAT.

DeepSpeed Compression release plan

DeepSpeed Compression is still at an early stage and under active development, but we’d like to share the results and tools with DeepSpeed users as soon as possible. In this first release, we open-source the core DeepSpeed Compression components, including the compression composer, which supports various compression methods, including INT8/INT4/ternary/binary quantization, lightweight layer reduction, pre-training and task-specific knowledge distillation, head pruning, row pruning, and channel pruning, for compressing both NLP and computer vision models. Together with the compression composer, we are releasing the two novel technologies introduced in this blog, XTC and ZeroQuant, as part of the library.

We hope you will try DeepSpeed Compression. Please find the code, tutorials, and documentation on the DeepSpeed GitHub and website. We highly value your feedback and comments, so let us know what you think and how we can improve. As for next steps, we plan to extend our offerings with more compression methods, broader coverage of specialized kernels for compressed models, and an optimization module that automatically finds the best compression schemes. We believe that our composable library and new innovations will help close the gap between what is possible in AI and what is deployable, as well as make DL inference faster, cheaper, and simpler.

Acknowledgement

We are a group of system and modeling researchers—Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Conglong Li, Reza Yazdani Aminabadi, Elton Zheng, Samyam Rajbhandari, Ammar Ahmad Awan, Jeff Rasley, Cheng Li, Olatunji Ruwase, Shaden Smith, Du Li, Michael Wyatt, Arash Bakhtiari, Guanhua Wang, Connor Holmes, Sam Ade Jacobs, Martin Cai, Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop.


Confidential Containers: Verifiably secure computation in the cloud


For many organizations, trusting their data to the cloud requires having a complete understanding of and control over the environment in which that data resides and how it’s being processed. Microsoft understands this, and we are committed to building a trustworthy cloud—one in which security, privacy, and transparency are built into its core. A key part of this vision is confidential computing—a set of hardware and software capabilities that give data owners visibility into the data environment and verifiable security protection of their data in use. 

The Confidential Computing team at Microsoft Research is collaborating with hardware developers to create trusted execution environments (TEEs), where data stays encrypted not just when stored (encryption at rest) and in transit, but also during use. This work underpins the Azure confidential cloud platform, where users can upload encrypted code and data and get encrypted results back with strong privacy. 

At Microsoft Build 2022, the company announced serverless confidential containers with lift-and-shift support, the next step in the evolution of confidential computing. This service builds on the Confidential Containers work conducted at Microsoft Research. Confidential Containers offers a verifiably secure container environment in Azure where users can confirm that the software performing computations on their data is exactly the software they expect to be running, that it will do what they want it to do with their data, and that they can trust the results it returns. Confidential Containers enables users to take existing container workloads, and with a small amount of configuration, use them in a confidential environment.

Smaller trusted computing base 

Confidential Containers decreases the size of the trusted computing base (TCB)—the totality of elements in a computing environment that must be trusted not to violate the confidentiality of computation. The TCB can include software, hardware, and human administrators, among other things. By removing elements from the TCB, the components that can be compromised are reduced, decreasing the attack surface. Confidential Containers removes Microsoft administrators from the TCB, minimizing it as much as possible while still enabling customers to run existing workloads without modifying them.

This reduced TCB provides an option for organizations that currently run computations on their data on premises because they are concerned about the security of their data in the cloud. Even though setting up a computation environment in the cloud offers flexibility, data can be exposed to anyone who operates the servers on which the system runs. With Confidential Containers, the individuals who can access the data can be tightly controlled. This can be a single designated employee of the organization that owns the data or the business partner that is processing the data. It is never a Microsoft employee or another third party. 

Encrypted, policy-constrained computing environment 

A secure hardware environment enables data protection in use. Confidential Containers runs on AMD processors backed by AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP), which provides a TEE. This hardware-enforced security boundary provides a shield so that nothing outside the encrypted memory space can read the data.

Users of Confidential Containers create a policy defining precisely what can run in the confidential container environment and how. The AMD SEV-SNP hardware produces an attestation report, which provides a succinct representation of everything in the confidential environment, including information about the code that will be enforcing the policy. Users can request this attestation report any time before providing the container with a key to unlock the encrypted dataset for processing. 
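Conceptually, the data owner’s side of this flow can be as simple as the following sketch: verify the hardware-signed attestation report, check that the measured policy matches the expected one, and only then release the key that unlocks the dataset. This is a schematic illustration only; the real SEV-SNP report format and the Azure attestation and key-release services have their own APIs, which are not shown here.

```python
import hashlib
import secrets
from typing import Optional

# The policy hash the data owner expects to see in the attestation report
# (derived from the exact container images and configuration they approved).
EXPECTED_POLICY_HASH = hashlib.sha256(b"approved-container-policy-v1").hexdigest()

def attestation_is_valid(report: dict, hardware_signature_ok: bool) -> bool:
    # In the real flow the report is signed by the AMD SEV-SNP hardware;
    # here "signature verified" is modeled as a boolean input.
    return hardware_signature_ok and report.get("policy_hash") == EXPECTED_POLICY_HASH

def release_dataset_key(report: dict, hardware_signature_ok: bool) -> Optional[bytes]:
    if not attestation_is_valid(report, hardware_signature_ok):
        return None                      # refuse: this is not the environment we trust
    return secrets.token_bytes(32)       # stand-in for the dataset decryption key

key = release_dataset_key({"policy_hash": EXPECTED_POLICY_HASH}, hardware_signature_ok=True)
```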


Sensitive data handling in the cloud 

Before the development of HTTPS, businesses could not securely run a storefront on the public web because communication over the internet was not secure. In the same way, today individuals and organizations cannot run containerized computation over sensitive data in the public cloud. Confidential Containers addresses this need. 

This is a game-changer for organizations that must comply with local and international regulations on how sensitive data is handled. For example, healthcare organizations that store encrypted patient information in the cloud are required by HIPAA regulations to download that data to perform computations on premises. This multistep process entails decrypting the data once it has been downloaded to an organization’s servers, performing the required computations, and then re-encrypting the data before re-uploading it to the cloud. It also requires ensuring that the on-premises environment contains the security architecture necessary to comply with HIPAA and other regulations. 

Because Confidential Containers provides advanced security safeguards for data in use in Azure, organizations no longer need to perform these time-consuming steps. This also means they no longer need to maintain servers on premises. Moreover, Azure users can define even stricter policies for their container environment in the cloud than they have in place in their on-premises environment.

Secure multiparty computations 

Another benefit of Confidential Containers is they enable secure multiparty computations. A single organization can securely process multiple datasets that contain sensitive information, or multiple organizations with datasets that must remain secure can share those datasets with the assurance that their data will not leak. Organizations can perform computations on multiple datasets, such as for training a machine learning model, and gain better results than they would if performing computations on a single dataset, all without knowing what is in those datasets. 

Easy deployment and lift-and-shift of Linux containers 

Creating a confidential container is straightforward for Azure users who are currently using or getting ready to use containers, requiring a small amount of configuration to move existing workloads. Linux users can easily lift-and-shift their Linux containers to Confidential Containers on Azure. 

Unlimited potential with Confidential Containers 

We believe that in the future, all computing in the cloud will be confidential, and we’re excited to share Confidential Containers—a technology that plays a role in making this happen. The capabilities it provides will have implications that we have yet to imagine. We’re particularly excited by the potential of multiparty computations. The ability to perform computations in a protected environment on multiple datasets brings limitless possibilities, unlocking great value to Azure users. 

Confidential Containers is currently available for limited preview and will be available for public preview later this year. Sign up for the Confidential Containers preview. 


AI4Science to empower the fifth paradigm of scientific discovery

Christopher Bishop, Distinguished Scientist, Managing Director, Microsoft Research Cambridge Lab

Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Could this capability represent the dawn of a new paradigm of scientific discovery?

Jim Gray, a Turing Award winner, and former Microsoft Technical Fellow, characterised the historical evolution of scientific discovery through four paradigms. With origins dating back thousands of years, the first paradigm was purely empirical and based on direct observation of natural phenomena. While many regularities were apparent in these observations, there was no systematic way to capture or express them. The second paradigm was characterised by theoretical models of nature, such as Newton’s laws of motion in the seventeenth century, or Maxwell’s equations of electrodynamics in the nineteenth century. Derived by induction from empirical observation, such equations allowed generalization to a much broader range of situations than those observed directly. While these equations could be solved analytically for simple scenarios, it was not until the development of digital computers in the twentieth century that they could be solved in more general cases, leading to a third paradigm based on numerical computation. By the dawn of the twenty-first century computation was again transforming science, this time through the ability to collect, store and process large volumes of data, leading to the fourth paradigm of data-intensive scientific discovery. Machine learning forms an increasingly important component of the fourth paradigm, allowing the modelling and analysis of large volumes of experimental scientific data. These four paradigms are complementary and coexist. 

The pioneering quantum physicist Paul Dirac commented in 1929 that “The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” For example, Schrödinger’s equation describes the behaviour of molecules and materials at the subatomic level with exquisite precision, and yet numerical solution with high accuracy is only possible for very small systems consisting of a handful of atoms. Scaling to larger systems requires increasingly drastic approximations leading to a challenging trade-off between scale and accuracy. Even so, quantum chemistry calculations are already of such high practical value that they form one of the largest supercomputer workloads. 

However, over the last year or two, we have seen the emergence of a new way to exploit deep learning, as a powerful tool to address this speed-versus-accuracy trade-off for scientific discovery. This is a very different use of machine learning from the modelling of data that characterizes the fourth paradigm, because the data that is used to train the neural networks itself comes from numerical solution of the fundamental equations of science rather than from empirical observation. We can view the numerical solutions of scientific equations as simulators of the natural world that can be used, at high computational cost, to compute quantities of interest in applications such as forecasting the weather, modelling the collision of galaxies, optimizing the design of fusion reactors, or calculating the binding affinities of candidate drug molecules to a target protein. From a machine learning perspective, however, the intermediate details of the simulation can be viewed as training data which can be used to train deep learning emulators. Such data is perfectly labelled, and the quantity of data is limited only by computational budget. Once trained, the emulator can perform new calculations with high efficiency, achieving significant improvements in speed, sometimes by several orders of magnitude. 
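As a toy illustration of that workflow (and nothing more), the sketch below stands in an inexpensive analytic function for a costly numerical solver, uses it to generate perfectly labelled training data, and fits a small neural emulator that can then make near-instant predictions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_simulator(x: np.ndarray) -> np.ndarray:
    # Placeholder for a costly numerical solve (e.g., a PDE or quantum
    # chemistry calculation); here it is just an analytic toy function.
    return np.sin(3.0 * x[:, 0]) * np.exp(-x[:, 1] ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(5000, 2))   # sampled simulator inputs
y_train = expensive_simulator(X_train)             # labels come from simulation, not experiment

emulator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000)
emulator.fit(X_train, y_train)

X_new = rng.uniform(-1.0, 1.0, size=(10, 2))
y_fast = emulator.predict(X_new)                   # near-instant once trained
```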

This ‘fifth paradigm’ of scientific discovery represents one of the most exciting frontiers for machine learning as well as for the natural sciences. While there is a long way to go before these emulators are sufficiently fast, robust, and general-purpose to become mainstream, the potential for real-world impact is clear. For example, the number of small-molecule drug candidates alone is estimated at 10^60, while the total number of stable materials is thought to be around 10^180 (roughly the square of the number of atoms in the known universe). Finding more efficient ways to explore these vast spaces would transform our ability to discover new substances such as better drugs to treat disease, improved substrates for capturing atmospheric carbon dioxide, better materials for batteries, new electrodes for fuel cells to power the hydrogen economy, and myriad others.

“AI4Science is an effort deeply rooted in Microsoft’s mission, applying the full breadth of our AI capabilities to develop new tools for scientific discovery so that we and others in the scientific community can confront some of humanity’s most important challenges. Microsoft Research has a 30+ year legacy of curiosity and discovery, and I believe that the AI4Science team – spanning geographies and scientific fields – has the potential to yield extraordinary contributions to that legacy.”

– Kevin Scott, Executive Vice President and Chief Technology Officer, Microsoft

I’m delighted to announce today that I will be leading a new global team in Microsoft Research, spanning the UK, China and the Netherlands, to focus on bringing this fifth paradigm to reality. Our AI4Science team encompasses world experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines who are working together to tackle some of the most pressing challenges in this field.

An example project is Graphormer, led by my colleague Tie-Yan Liu in our China team. This is a deep learning package that allows researchers and developers to train custom models for molecular modelling tasks, such as materials science or drug discovery. Recently, Graphormer won the Open Catalyst Challenge, a molecular dynamics competition that aims to model catalyst-adsorbate reaction systems with AI and includes more than 0.66 million catalyst-adsorbate relaxation systems (144 million structure-energy frames) simulated by density functional theory (DFT) software. Another project, from our team in Cambridge in collaboration with Novartis, is Generative Chemistry, where together we are empowering scientists with AI to speed up the discovery and development of breakthrough medicines.

As Iya Khalil, Global Head of the AI Innovation Lab at Novartis, recently noted, the work is no longer science fiction but science-in-action:

“Not only can AI learn from our past experiments, but, with each new iteration of designing and testing in the lab, the machine learning algorithms can identify new patterns and help guide the early drug discovery and development process. Hopefully in doing this we can augment our human scientists’ expertise so they can design better molecules faster.”

The team has since used the platform to generate several promising early-stage molecules which have been synthesised for further exploration.

Alongside our teams in China and the UK, we have been growing a team in the Netherlands, including hiring the world-renowned machine learning expert, Max Welling. I am also excited to be able to announce today that our brand-new Lab in Amsterdam will be housed in Matrix One, which is currently under construction on the Amsterdam Science Park. This purpose-built space is in close proximity to the University of Amsterdam and the Vrije Universiteit Amsterdam, and we will maintain strong affiliations with both institutions through the co-supervision of PhD students.

Matrix One building in Amsterdam

It is with pride and excitement that we take this next step to come together as a cross-geographical team and follow in the footsteps of pioneers before us, to contribute to this next paradigm of scientific discovery, and in doing so impact many important societal challenges. If you share our excitement and ambition, and would like to join us, I encourage you to look at our open positions or get in touch to talk to anyone on the team.


Introducing the Microsoft Climate Research Initiative


Addressing and mitigating the effects of climate change requires a collective effort, bringing our strengths to bear across industry, government, academia, and civil society. As we continue to explore the role of technology to advance the art of the possible, we are launching the Microsoft Climate Research Initiative (MCRI). This community of multi-disciplinary researchers is working together to accelerate cutting-edge research and transformative innovation in climate science and technology.

MCRI enables us to bring Microsoft’s research skills and compute capacities to deep and continuous collaboration with domain experts. For the kickoff of this initiative, we are focusing on three critical areas in climate research where computational advances can drive key scientific transformations: Overcoming constraints to decarbonization, reducing uncertainties in carbon accounting, and assessing climate risks in more detail.

Through these collaborative research projects, we hope to develop and sustain a highly engaged research ecosystem comprising a diversity of perspectives. Researchers will offer transdisciplinary and diverse expertise, particularly in areas beyond traditional computer science, such as environmental science, chemistry, and a variety of engineering disciplines. All results of this initiative are expected to be made public and freely available to spark even broader research and progress on these important climate issues.  

“As researchers, we’re excited to work together on projects specifically selected for their potential impact on global climate challenges. With Microsoft’s computational capabilities and the domain expertise from our collaborators, our complementary strengths can accelerate progress in incredible ways.”  

– Karin Strauss, Microsoft

Microsoft researchers will be working with collaborators globally to co-investigate priority climate-related topics and bring innovative, world-class research to influential journals and venues.

Phase one collaborations

Carbon accounting  

Real-time Monitoring of Carbon Control Progress from CO2 and Air Pollutant Observations with a Physically informed Transformer-based Neural Network 

Jia Xing, Tsinghua University; Siwei Li, Wuhan University; Shuxin Zheng, Chang Liu, Shun Zheng, and Wei Cao, Microsoft 

Understanding the change in CO2 emissions from measurements of CO2 concentrations, such as those made by satellites, is very useful for tracking the real-time progress of carbon-reduction actions. However, current CO2 observations are relatively limited, and numerical model-based methods have very low calculation efficiency. The proposed study aims to develop a novel method that combines atmospheric numerical modeling and machine learning to infer CO2 emissions from satellite observations and ground monitoring sensor data.

AI based Near-real-time Global Carbon Budget (ANGCB) 

Zhu Liu, Tsinghua University; Biqing Zhu and Philippe Ciais, LSCE; Steven J. Davis, UC Irvine; Wei Cao and Jiang Bian, Microsoft

Mitigation of climate change will depend upon a carbon emission trajectory that successfully achieves carbon neutrality by 2050. To that end, a global carbon budget assessment is essential. The AI-based, near-real-time Global Carbon Budget (ANGCB) project aims to provide the world’s first global carbon budget assessment based on Artificial Intelligence (AI) and other data science technologies.

Carbon reduction and removal  

Computational Discovery of Novel Metal–Organic Frameworks for Carbon Capture 

Jeffrey Long, UC Berkeley; Xiang Fu, Jake Smith, Bichlien Nguyen, Karin Strauss, Tian Xie, Daniel Zuegner, and Chi Chen, Microsoft

Removing CO2 from the environment is expected to be an integral component of keeping temperature rise below 1.5°C. However, today this is an inefficient and expensive undertaking. This project will apply generative machine learning to the design of new metal–organic frameworks (MOFs) to optimize for low-cost removal of CO2 from air and other dilute gas streams. 

An Assessment of Liquid Metal Catalyzed CO2 Reduction 

Michael D. Dickey, North Carolina State; Kourosh Kalantar-Zadeh, University of New South Wales; Kali Frost, Bichlien Nguyen, Karin Strauss, and Jake Smith, Microsoft

The CO2 reduction process can be used to convert captured carbon into a storable form as well as to manufacture sustainable fuels and materials with lower environmental impacts. This project will evaluate liquid metal-based reduction processes, identifying advantages, pinch-points, and opportunities for improvement needed to reach industrial-relevant scales. It will lay the foundation for improving catalysts and address scaling bottlenecks.  

Computational Design and Characterization of Organic Electrolytes for Flow Battery and Carbon Capture Applications 

David Kwabi, Anne McNeil, and Bryan Goldsmith, University of Michigan; Bichlien Nguyen, Karin Strauss, Jake Smith, Ziheng Lu, Yingce Xia, and Kali Frost, Microsoft

Energy storage is essential to enable 100% zero-carbon electricity generation. This work will use generative machine learning models and quantum mechanical modeling to drive the discovery and optimization of a new class of organic molecules for energy-efficient electrochemical energy storage and carbon capture.  

Property Prediction of Recyclable Polymers 

Aniruddh Vashisth, University of Washington; Bichlien Nguyen, Karin Strauss, Jake Smith, Kali Frost, Shuxin Zheng, and Ziheng Lu, Microsoft

Despite encouraging progress in recycling, many plastic polymers often end up being one-time-use materials. The plastics that compose printed circuit boards (PCBs), ubiquitous in every modern device, are amongst those most difficult to recycle. Vitrimers, a new class of polymers that can be recycled multiple times without significant changes in material properties, present a promising alternative. This project will leverage advances in machine learning to select vitrimer formulations that withstand the requirements imposed by their use in PCBs. 

Accelerated Green Cement Materials Discovery 

Eleftheria Roumeli, University of Washington; Kristen Severson, Yuan-Jyue Chen, Bichlien Nguyen, and Jake Smith, Microsoft

The concrete industry is a major contributor to greenhouse gas emissions, the majority of which can be attributed to cement. The discovery of alternative cements is a promising avenue for decreasing the environmental impacts of the industry. This project will employ machine learning methods to accelerate mechanical property optimization of “green” cements that meet application quality constraints while minimizing carbon footprint.

Environmental resilience

Causal Inference to Understand the Impact of Humanitarian Interventions on Food Security in Africa 

Gustau Camps-Valls, Universitat de Valencia; Ted Shepherd, University of Reading; Alberto Arribas Herranz, Emre Kiciman, and Lester Mackey, Microsoft

The Causal4Africa project will investigate the problem of food security in Africa from a novel causal inference standpoint. The project will illustrate the usefulness of causal discovery and estimation of effects from observational data by intervention analysis. Ambitiously, it will improve the usefulness of causal ML approaches for climate risk assessment by enabling the interpretation and evaluation of the likelihood and potential consequences of specific interventions.  

Improving Subseasonal Forecasting with Machine Learning 

Judah Cohen, Verisk; Dara Entekhabi and Sonja Totz, MIT; Lester Mackey, Alberto Arribas Herranz, and Bora Ozaltun, Microsoft

Water and fire managers rely on subseasonal forecasts two to six weeks in advance to allocate water, manage wildfires, and prepare for droughts and other weather extremes. However, skillful forecasts for the subseasonal regime are lacking due to a complex dependence on local weather, global climate variables, and the chaotic nature of weather. To address this need, this project will use machine learning to adaptively correct the biases in traditional physics-based forecasts and adaptively combine the forecasts of disparate models. 


Get progress updates and new resources for accelerating sustainability science research >


GODEL: Combining goal-oriented dialog with real-world conversations

Diagram showing GODEL’s architecture. The environment of the dialog system consists of both structured and unstructured content, which it uses to retrieve information. This source content, which we term “grounding,” is updated and repeatedly used by GODEL to produce a new response after each user input.

They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation further? In addition to providing on-topic expertise, such as recommending a restaurant, it could engage in a conversation about the history of the neighborhood or a recent sports game, and then bring the conversation back on track. What if the agent’s responses continually reflect the latest world events? And what if it could do all of this without the need for any additional work by the designer?   

With GODEL, this may not be far off. GODEL stands for Grounded Open Dialogue Language Model, and it ushers in a new class of pretrained language models that enable both task-oriented and social conversation and are evaluated by the usefulness of their responses.  

Pretrained language models are among the engines that power conversational AI, the technology that underlies these dialog agents. They can either be task-oriented (“give me a job, and I’ll do it”) or engage in a conversation without a specified outcome, known as open-domain or chit-chat. GODEL combines both these capabilities, giving dialog agents the ability to generate responses based not just on the context of the conversation, but also on external information, content that was not part of the dataset when the model was trained. This includes both structured content, such as information stored in databases, and unstructured content, such as restaurant reviews, Wikipedia articles, and other publicly available material found on the web. This explains how a simple task-based query about restaurant recommendations can evolve into a dialog about ingredients, food, and even cooking techniques—the kind of winding path that real-world conversations take.  

In 2019, the Deep Learning and Natural Language Processing groups at Microsoft Research released DialoGPT, the first large-scale pretrained language model designed specifically for dialog. This helped make conversational AI more accessible and easier to work with, and it enabled the research community to make considerable progress in this area. With GODEL, our goal is to help further this progress by empowering researchers and developers to create dialog agents that are unrestricted in the types of queries they can respond to and the sources of information they can draw from. We also worked to ensure those responses are useful to the person making the query.    

In our paper, “GODEL: Large-Scale Pre-training for Goal-Directed Dialog,” we describe the technical details underlying GODEL, and we have made the code available on GitHub.

A grounded model

One of GODEL’s key features is the flexibility it provides users in defining their model’s grounding—the sources from which their dialog agents retrieve information. This flexibility informs GODEL’s versatility in diverse conversational settings. If someone were to inquire about a local restaurant, for example, GODEL would be able to provide specific and accurate responses even though that venue may not have been included in the data used to train it. Responses would vary depending on whether the grounding information is empty, a snippet of a document, a search result (unstructured text), or information drawn from a database about the restaurant (structured text). However, each response would be appropriate and useful. 

In addition to specificity, grounded generation helps keep models up to date, as the grounded text can incorporate information that may not have been available at the time the model was trained. For example, if a model were developed before the 2022 Winter Olympics, GODEL would be able to provide details on those games and a list of winners even though all the data available to train it predates that event.
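For readers who want to try grounded generation with the released checkpoints, the sketch below follows the usage pattern published on the model card at the time of writing; the checkpoint name, the [CONTEXT]/[KNOWLEDGE] prompt format, and the “EOS” turn separator are all taken as assumptions from that card and may change.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint name and prompt format assumed from the public model card.
name = "microsoft/GODEL-v1_1-base-seq2seq"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

def respond(instruction: str, knowledge: str, dialog_turns: list[str]) -> str:
    if knowledge:
        knowledge = "[KNOWLEDGE] " + knowledge       # grounding text, if any
    context = " EOS ".join(dialog_turns)             # dialog history
    query = f"{instruction} [CONTEXT] {context} {knowledge}"
    input_ids = tokenizer(query, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(respond(
    instruction="Instruction: given a dialog context and related knowledge, reply helpfully.",
    knowledge="Lucky Star serves Sichuan food and is popular with large groups.",
    dialog_turns=["Is Lucky Star a good spot for a group dinner?"],
))
```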

Broad application of GODEL

Another main feature of GODEL is its wide range of dialog applications. While its predecessor, DialoGPT, and other prior pretrained models for dialog have mostly focused on social bots, GODEL can be applied to a variety of dialogs, including those that are task-oriented, question-answering, and grounded chit-chat. In the same conversation, GODEL can produce reasonable responses for a variety of query types, including general questions or requests for specific actions.  

In addition, GODEL’s responses have been evaluated for their helpfulness. In our paper, we show that evaluation is done more reliably on datasets that are goal-directed, and that people generally agree on which responses are better when asked to judge their utility towards achieving certain goals. Equipped with this robust evaluation setup, we compared our model against several strong baselines and state-of-the-art approaches and show that GODEL is superior in terms of both human and automatic evaluation, as indicated in Figure 1. The paper describes extensive experiments against other state-of-the-art pretrained language models and demonstrates that performance gains are even larger in these cases. 

Two bar graphs showing that GODEL outperforms the baseline, in terms of both human and automated dialog evaluation. For human evaluation, GODEL received much higher human ratings (47, 41, and 27), while the human ratings for the best baseline were low (30, 22, and 17). For automatic evaluation, differences are smaller yet still statistically significant.
Figure 1: These charts illustrate GODEL’s performance against T5, a pretrained model that performed best in our evaluation. They compare the aggregate performance of models fine-tuned from GODEL against that of models fine-tuned from T5. They show that GODEL performs much better in human evaluations and makes appreciable gains in the automatic evaluation. The test set for these experiments combines a variety of dialog genres, including task-oriented dialog, conversational question-answering, and grounded chit-chat.

The following examples illustrate different dialog scenarios where GODEL uses a variety of sources to respond to identical user queries. 

  • This example illustrates how GODEL responds in an open-ended scenario in which the user asks a question that is completely unrelated to the initial question. Despite the lack of relevance, GODEL responds appropriately while trying to bring the conversation back on track. 

    Figure showing how GODEL responds to a user who just changed the topic, demonstrating that it can bring the conversation back on track. While the initial query is about a restaurant, the user suddenly mentions a series of tornadoes that have recently affected the area. GODEL uses grounding from a recent news article to provide information about the tornadoes, as requested by the user. Finally, it asks the user if there is anything else it can help with.

  • This example illustrates how GODEL responds in a task-oriented setting in which the model is connected to the components of a traditional goal-oriented dialog systems, such as a database. In this case, the relevant environment contains structured information, a database returning two restaurants relevant to the current conversation.  

    Figure showing how GODEL responds appropriately to a user's request for a restaurant reservation. The user expresses a preference for a restaurant named Lucky Star, and GODEL extracts information from a database about that restaurant and retrieves relevant information, such as a reference number, to generate a response that flows naturally with the rest of the conversation.

  • This example illustrates how GODEL responds in a task-oriented setting in which traditional components of task-oriented dialog systems are not available. In this case, GODEL retrieves a restaurant review via a search engine. The response reflects both the context of the conversation and a snippet of the retrieved text, a restaurant review.  

    Figure showing how GODEL responds appropriately to a user's request for information about a specific restaurant. The user asks whether a given restaurant is good for groups, and GODEL uses text originating from restaurant reviews to infer that the restaurant is indeed good for groups. Also, GODEL provides additional information to address a concern with larger groups—that food is typically served quickly.

  •  This example illustrates how GODEL responds in a question-answering scenario, where the user asks a general question and the context provides the dialog agent with the words it needs to search for the relevant information on the web. 

    Figure showing how GODEL responds appropriately when asked to give an example of a popular Chinese dish. GODEL uses grounding originating from search results to respond to the question while focusing on the most relevant information of the retrieved document.

GODEL available as open source

To advance research, we believe it is crucial to make code and models publicly available, and we have released GODEL as fully open source. We have made three versions of GODEL available: base, large, and extra-large. We are also including the code needed to retrain all pretrained models and to fine-tune models for specific tasks: the CoQA dataset, intended for conversational question-answering; the Wizard of Wikipedia and Wizard of the Internet datasets, aimed at information-seeking chats; and the MultiWOZ dataset, for task-completion dialogs.

We hope GODEL helps numerous academic research teams advance the field of conversational AI with innovative dialog models while eliminating the need for significant GPU resources. We plan to continuously improve GODEL and make more models available to the research community. Please visit our project page to learn more about the GODEL project and new releases.

Acknowledgements

We would like to thank our fellow colleagues at Microsoft Research who contributed to this work and blog post: Bill Dolan, Pengcheng He, Elnaz Nouri, Clarisse Simoes Ribeiro. 

The post GODEL: Combining goal-oriented dialog with real-world conversations appeared first on Microsoft Research.

Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability

On the left, a diagram with three layers, each of which contains a half-transparent image processed from the same image. The processed images are partitioned into several grids, and each grid contains 4 x 4 image patches. From bottom to top, the number of grids in each layer are 4 x 4, 2 x 2, and 1 x 1, respectively. The layers are labeled “4x,” “8x,” and “16x,” respectively, from bottom to top. An arrow joining the three layers points upward to the words “Segmentation” and “Detection” and an ellipsis. Another arrow points from the top layer to the word “classification.” On the right, a bar chart with a blue bar labeled “Swin V1” and an orange bar labeled “Swin V2.” The orange bar is much taller and labeled “3 billion (1,536 x 1,536 resolution)”; the blue bar is labeled “197 million.” An arrow labeled “15x” points upward from the blue bar, indicating the orange bar is 15 times higher than the blue one.
Swin Transformer, a Transformer-based general-purpose vision architecture, was further evolved to address challenges specific to large vision models. As a result, Swin Transformer is capable of training with images at higher resolutions, which allows for greater task applicability (left), and scaling models up to 3 billion parameters (right).

Early last year, our research team from the Visual Computing Group introduced Swin Transformer, a Transformer-based general-purpose computer vision architecture that for the first time beat convolutional neural networks on the important vision benchmark of COCO object detection and did so by a large margin. Convolutional neural networks (CNNs) have long been the architecture of choice for classifying images and detecting objects within them, among other key computer vision tasks. Swin Transformer offers an alternative. Leveraging the Transformer architecture’s adaptive computing capability, Swin can achieve higher accuracy. More importantly, Swin Transformer provides an opportunity to unify the architectures in computer vision and natural language processing (NLP), where the Transformer has been the dominant architecture for years and has benefited the field because of its ability to be scaled up.

So far, Swin Transformer has shown early signs of its potential as a strong backbone architecture for a variety of computer vision problems, powering the top entries of many important vision benchmarks such as COCO object detection, ADE20K semantic segmentation, and CelebA-HQ image generation. It has also been well-received by the computer vision research community, garnering the Marr Prize for best paper at the 2021 International Conference on Computer Vision (ICCV). Together with works such as CSWin, Focal Transformer, and CvT, also from teams within Microsoft, Swin is helping to demonstrate the Transformer architecture as a viable option for many vision challenges. However, we believe there’s much work ahead, and we’re on an adventurous journey to explore the full potential of Swin Transformer.

In the past few years, one of the most important discoveries in the field of NLP has been that scaling up model capacity can continually push the state of the art for various NLP tasks, and the larger the model, the better its ability to adapt to new tasks with very little or no training data. Can the same be achieved in computer vision, and if so, how?

In pursuit of answers, we scaled up Swin Transformer to 3 billion parameters, the largest and most effective dense vision model to date. Vision models with up to 1.8 billion parameters have been trained successfully before, but they require billions of labeled images to train well and are applicable only to image classification. With our model, SwinV2-G, we address a common obstacle to increasing model size in computer vision—training instability—to support more parameters. And thanks to a technique we developed to address the resolution gap between pretraining and fine-tuning tasks, SwinV2-G marks the first time that a billion-scale vision model has been applied to a broader set of vision tasks. Additionally, leveraging a self-supervised pretraining approach we call SimMIM, SwinV2-G uses 40 times less labeled data and 40 times less training time than previous works to drive the learning of billion-scale vision models.

SwinV2-G achieved state-of-the-art accuracy on four representative benchmarks when it was released in November: ImageNetV2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification.

Our experience and lessons learned in exploring the training and application of large vision models are described in two papers—“Swin Transformer V2: Scaling Up Capacity and Resolution” and “SimMIM: A Simple Framework for Masked Image Modeling”—both of which are being presented at the 2022 Computer Vision and Pattern Recognition Conference (CVPR). The code for Swin Transformer and the code for SimMIM are both available on GitHub. (For the purposes of this blog and our paper, the upgraded Swin Transformer architecture resulting from this work is referred to as V2.)

Improving training stability

The first issue we faced when training large models was the problem of training instability. We observed that as models get larger, it becomes very easy for them to crash. After checking the feature values of each layer of the models we trained in scaling up Swin Transformer to 3 billion parameters, we found the cause of the instability: large feature variance discrepancy between different layers.

As shown in Figure 1, the average feature variance in the deep layers of the original Swin Transformer model increases significantly as the model grows larger. With a 200-million-parameter Swin-L model, the discrepancy between layers with the highest and lowest average feature variance has reached an extreme value of 10^4. Crashing occurs during training when the model capacity is further scaled to 658 million parameters (Swin-H).

A line graph plotting the average feature variance at each layer for Swin V1 models (solid lines) and Swin V2 models (dashed lines). The Swin V1 curves grow sharply with depth, while the Swin V2 curves remain much flatter.
Figure 1: The average feature variance of Swin V1 models (solid lines) and Swin V2 models (dashed lines) per layer (x-axis). The discrepancy between layers with the highest and lowest average feature variance is very large for Swin V1 models while much milder for Swin V2 models.

Looking closely at the architecture of the original Swin Transformer, we found that this was due to the output of the residual branch being added directly back to the main branch without normalization. In other words, the unconstrained output feature values could be very large compared to the input. As illustrated in Figure 2 (left), after one Transformer block, the feature values of the output can increase to 61 times larger than that of the input. To alleviate this problem, we propose a new normalization method called residual-post-normalization. As shown in Figure 2 (right), this method moves the normalization layer from the beginning to the end of each residual branch so that the output of each residual branch is normalized before being merged back into the main branch. In this way, the average feature variance of the main branch doesn’t increase significantly as the layers deepen. Experiments have shown that this new normalization method moderates the average feature variance of each layer in the model (see the dashed lines in Figure 1; the SwinV2 models have the same respective number of parameters as the SwinV1 models: 200 million [L] and 658 million [H]).

The left is a block diagram labeled “Swin V1 (pre-norm + linear attention)” with input and output labeled “x superscript (ell minus 1)” and “x superscript (ell),” respectively. There are four boxes between the input and output; they’re labeled, from bottom to top, “Layer Norm,” “Linear Attention,” “Layer Norm,” and “MLP.” Upward arrows connect the boxes. On top of every two boxes is a plus symbol with two inputs: one the output arrow of the preceding box and the other an arrow from under the preceding two boxes, indicating a skip connection. There are five numbers listed vertically, from bottom to top: 1, 10, 11, 50, and 61. An arrow points right from the block diagram to a second block diagram, labeled “Swin V2 (res-post-norm + cosine attention).” The block diagram is similar to the left one with the following differences: the labels of each box are different; from bottom to top, they’re “Cosine Attention,” “Layer Norm,” “MLP,” and “Layer Norm.” The “Cosine Attention” and “Layer Norm” boxes are in red. There are five different numbers listed vertically (from bottom to top): 1, 1, 2, 1, and 3.
Figure 2: To accommodate larger vision models, the normalization configuration of the original Swin Transformer was adjusted. The original architecture (left) uses the pre-norm configuration, in which normalization occurs at the beginning of each residual branch. This configuration results in an increase in the feature values (from 1 to 61), leading to crashing during training. In Swin V2 (right), two changes were made: firstly, normalization is moved to the end of the residual branch in a new method called residual-post-normalization, which sees a much milder increase in value (from 1 to 3). Secondly, the linear dot-product function in the attention unit is replaced with a cosine function.
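To make the residual-post-normalization idea concrete, here is a minimal PyTorch-style sketch of a Transformer block using this configuration. The module and argument names are ours, and the attention and MLP submodules are left abstract; this is an illustration of the pattern, not the exact Swin V2 implementation.

```python
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    """Transformer block with residual-post-normalization (illustrative sketch).

    Unlike the usual pre-norm design, LayerNorm is applied to the output of
    each residual branch before it is added back to the main branch, so the
    feature variance of the main branch does not grow with depth.
    """
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.attn = attn                  # any attention module: (B, N, C) -> (B, N, C)
        self.mlp = mlp                    # any feed-forward module with the same shape
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # attention branch: normalize the branch output, then add it back
        x = x + self.norm1(self.attn(x))
        # MLP branch: same pattern
        x = x + self.norm2(self.mlp(x))
        return x
```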

In addition, we also found that as the model becomes larger, the attention weights of certain layers tend to be dominated by a few specific points in the original self-attention computation, especially when residual-post-normalization is used. To tackle this problem, our team further proposes the scaled cosine attention mechanism (see Figure 2 right) to replace the common dot-product linear attention unit. In the new scaled cosine attention unit, the computation of self-attention is independent of the input magnitude, resulting in less saturated attention weights.
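The sketch below, in simplified single-head form, shows one way scaled cosine attention can be implemented: queries and keys are L2-normalized so the attention logits are cosine similarities, divided by a learnable temperature. The relative position bias term is omitted, and the minimum temperature value is our assumption for numerical stability rather than a detail taken from this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Single-head scaled cosine attention (illustrative sketch).

    Attention logits are cosine similarities between queries and keys divided
    by a learnable temperature tau, so they are independent of the input
    magnitude and less prone to saturation than dot-product logits.
    """
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.log_tau = nn.Parameter(torch.zeros(1))    # temperature, stored in log space

    def forward(self, x):                              # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q, dim=-1)                     # cosine similarity is the
        k = F.normalize(k, dim=-1)                     # dot product of unit vectors
        tau = torch.clamp(self.log_tau.exp(), min=0.01)
        attn = ((q @ k.transpose(-2, -1)) / tau).softmax(dim=-1)
        return self.proj(attn @ v)
```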

Experiments have shown that residual-post-normalization and the scaled cosine attention mechanism not only stabilize the training dynamics of large models but also improve accuracy.

Addressing large resolution gaps between pretraining and fine-tuning tasks

Another difficulty with large vision models is that the image resolution discrepancy between pretraining and fine-tuning can be large: pretraining is typically carried out at low resolutions, while many downstream vision tasks require high-resolution input images or attention windows, as shown in Figure 3.

In Swin Transformer, the attention unit includes a relative position bias term that represents the impact of one image patch on another based on their relative position. This term is learned during pretraining. However, because the range of relative positions at fine-tuning differs significantly from that in pretraining, we need a way to initialize biases at relative positions not seen during pretraining. The original Swin Transformer architecture uses handcrafted bicubic interpolation to transfer the old relative position biases to the new resolution, but we found this approach is not very effective when the resolution discrepancy between pretraining and fine-tuning is large.

To resolve this problem, we propose a log-spaced continuous position bias approach (Log-spaced CPB). By applying a small meta-network to relative position coordinates in log space, Log-spaced CPB can generate position biases for any coordinate range. Because the meta-network accepts arbitrary coordinates as input, a pretrained model can transfer freely between different window sizes by sharing the weights of the meta-network. Moreover, converting the coordinates to log space greatly reduces the extrapolation ratio required to transfer between different window resolutions compared with using the original linear-space coordinates.
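A minimal sketch of this idea follows: a two-layer meta-network maps log-spaced 2-D relative coordinates to one bias value per attention head, for every pair of positions in a window. The hidden size, the exact log transform, and the absence of coordinate normalization are assumptions made for illustration, not details confirmed by this post.

```python
import torch
import torch.nn as nn

class LogSpacedCPB(nn.Module):
    """Log-spaced continuous position bias (illustrative sketch).

    A small meta-network maps log-spaced relative coordinates to a bias per
    attention head, so a model pretrained with one window size can be reused
    at another window size simply by re-evaluating the meta-network on the
    new coordinate grid.
    """
    def __init__(self, num_heads, hidden_dim=512):
        super().__init__()
        self.meta = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_heads),
        )

    def forward(self, window_size):
        # 2-D coordinates of every position in a window_size x window_size window
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"
        ), dim=-1).reshape(-1, 2).float()                    # (W*W, 2)
        rel = coords[:, None, :] - coords[None, :, :]        # (W*W, W*W, 2)
        rel = torch.sign(rel) * torch.log1p(rel.abs())       # log-spaced coordinates
        bias = self.meta(rel)                                # (W*W, W*W, num_heads)
        return bias.permute(2, 0, 1)                         # (num_heads, W*W, W*W)
```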

On the left, a tightly cropped image of a dog with an indicated resolution of 224 x 224, representing the low-resolution input typically used in pretraining tasks such as image classification. On the right, a higher-resolution image representing the input required by downstream tasks such as object detection.
Figure 3: In computer vision, many downstream tasks, such as object detection (right), require high-resolution input, but pretraining tasks, such as image classification (left), are generally done at low resolutions, creating another challenge in training and applying large-scale vision models.

Using Log-spaced CPB, Swin Transformer V2 achieves smooth transferring between different resolutions, enabling us to use a smaller image resolution—192 × 192—with no accuracy loss on downstream tasks compared with the standard 224 × 224 resolution used in pretraining. This speeds up training by 50 percent.

Scaling model capacity and resolution results in excessive GPU memory consumption for existing vision models. To address the memory issue, we combined several crucial techniques, including the Zero-Redundancy Optimizer (ZeRO), activation checkpointing, and a new sequential self-attention implementation. With these techniques, GPU memory consumption is significantly reduced for large-scale models and large resolutions with little impact on training speed. These memory savings also allow us to train the 3-billion-parameter SwinV2-G model on images with resolutions of up to 1,536 × 1,536 using the 40-gigabyte A100 GPU, making it applicable to a range of vision tasks requiring high resolution, such as object detection.
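Of the memory-saving techniques listed above, activation checkpointing is the easiest to illustrate in isolation. The sketch below uses PyTorch's built-in torch.utils.checkpoint to recompute each block's intermediate activations during the backward pass instead of storing them; it is a generic illustration, not the training setup used for SwinV2-G.

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    """Run a stack of Transformer blocks with activation checkpointing.

    Intermediate activations inside each block are discarded during the
    forward pass and recomputed during backprop, trading extra compute for a
    large reduction in GPU memory. The input must require gradients for the
    recomputation to participate in the backward pass.
    """
    for block in blocks:
        x = checkpoint(block, x)
    return x
```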

Tackling the data starvation problem for large vision models

Training larger models often requires more labeled data; however, the computer vision field lacks labeled data at this scale because human annotation is costly. This has compelled the field to explore training large models with smaller amounts of labeled data. To this end, we introduce the self-supervised pretraining approach SimMIM, short for Simple Framework for Masked Image Modeling.

As shown in Figure 4, SimMIM learns image representations through masked image modeling, a pretext task in which a portion of an input image is masked and the model predicts the RGB values of the masked area from the remaining visible parts. This approach better exploits the rich information contained in each image, allowing us to use less data to drive the training of large models. With SimMIM, we were able to train the SwinV2-G model using only 70 million labeled images, 40 times fewer than previous billion-scale vision models.

An image of a building surrounded by a playground and trees is partitioned into a 3 x 3 grid. The grid is unfolded into its nine individual patches and five of them are replaced by mask patches; an arrow labeled “mask” points right from the grid to the patches. An arrow points upward from the unfolded patches to two boxes: the first is labeled “Encoder (e.g., ViT, Swin)” and the second is labeled “One-layer Prediction Head.” On top of the second box are two rows of the original patches that were replaced by mask patches, with double-headed arrows in between the top patches and bottom patches across the row and an “ell one loss” indicator at left.
Figure 4: In the proposed self-supervised pretraining method SimMIM, models are tasked with predicting the RGB of hidden portions of an input image based on the visible portions. The method significantly reduces the number of labeled images required in large model training. With SimMIM, a 3-billion-parameter Swin V2 model was trained by using 40 times less labeled data than that used in previous billion-scale vision models.
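The sketch below captures the essence of this objective: randomly mask a fraction of patches, feed the corrupted image to an encoder, predict raw RGB values with a lightweight head, and apply an L1 loss only on the masked pixels. For brevity it corrupts the image by zeroing pixels rather than by replacing patch embeddings with a learned mask token, and the patch size and mask ratio are illustrative values, not the settings used for SwinV2-G.

```python
import torch
import torch.nn.functional as F

def simmim_loss(encoder, head, images, patch_size=32, mask_ratio=0.6):
    """Simplified masked-image-modeling objective in the spirit of SimMIM.

    encoder: maps a (B, 3, H, W) image to a feature map
    head:    maps that feature map to a (B, 3, H, W) RGB prediction
    """
    B, C, H, W = images.shape
    ph, pw = H // patch_size, W // patch_size

    # per-image random patch mask: 1 = masked
    mask = (torch.rand(B, ph, pw, device=images.device) < mask_ratio).float()
    pixel_mask = mask.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    pixel_mask = pixel_mask.unsqueeze(1)                     # (B, 1, H, W)

    corrupted = images * (1.0 - pixel_mask)                  # hide the masked pixels
    pred = head(encoder(corrupted))                          # predict raw RGB values

    # L1 loss restricted to the masked area
    loss = (F.l1_loss(pred, images, reduction="none") * pixel_mask).sum()
    return loss / (pixel_mask.sum() * C + 1e-8)
```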

Setting new records on four representative vision benchmarks

By scaling up model capacity and resolution, Swin Transformer V2 set new records on four representative vision benchmarks when it was introduced in November: 84.0 percent top-1 accuracy on ImageNetV2 image classification; 63.1 / 54.4 box / mask mean average precision (mAP) on COCO object detection; 59.9 mean Intersection-over-Union (mIoU) on ADE20K semantic segmentation; and 86.8 percent top-1 accuracy on Kinetics-400 video action classification.

Benchmark | ImageNetV2 | COCO test-dev | ADE20K val | Kinetics-400
Swin V1 | 77.5 | 58.7 | 53.5 | 84.9
Previous state of the art | 83.3 (Google, July 2021) | 61.3 (Microsoft, July 2021) | 58.4 (Microsoft, October 2021) | 85.4 (Google, October 2021)
Swin V2 (November 2021) | 84.0 (+0.7) | 63.1 (+1.8) | 59.9 (+1.5) | 86.8 (+1.4)
Table 1: Swin Transformer (V1) was modified to address the challenges of pretraining and applying large vision models. The adapted architecture (V2) achieved state of the art on several representative benchmarks when it was introduced.

We hope this strong performance on a variety of vision tasks can encourage the field to invest more in scaling up vision models and that the provided training “recipe” can facilitate future research in this direction.

To learn more about the Swin Transformer journey, check out our Tech Minutes video.

Swin Transformer research team

(In alphabetical order) Yue Cao, Li Dong, Baining Guo, Han Hu, Stephen Lin, Yutong Lin, Ze Liu, Jia Ning, Furu Wei, Yixuan Wei, Zhenda Xie, Zhuliang Yao, and Zheng Zhang

The post Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability appeared first on Microsoft Research.

ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications

A lightbulb-shaped image, set against a background that represents computer circuitry, is divided into seven sections, with a small research-related icon in each section. Moving clockwise from upper right to upper left, the seven icons are a set of scales, a padlock, a badge, two hands clasped in a handshake, a human brain, a group of three people, and a human eye.

ICLR (International Conference on Learning Representations) is recognized as one of the top conferences in the field of deep learning. Many influential papers on artificial intelligence, statistics, and data science—as well as important application fields such as machine vision, speech recognition, and text understanding—have been published and presented at this conference. The following selection of papers accepted at ICLR 2022 showcases the latest research from Microsoft and its collaborators on vision pre-training, periodic time series forecasting, differential privacy, code completion, table pre-training, and online reinforcement learning. 

As research into deep learning continues to grow and change, Microsoft researchers and collaborators are broadening their approach to the field. As several papers highlighted in this post demonstrate, research teams continue to refine their thinking about how various machine learning techniques can best be implemented for real-world applications—whether for specialized applications in industry or a more generalized approach to improving decision-making in models overall. They are also gaining a deeper understanding of how different modalities, like computer vision, can extend applications of machine learning beyond language. 

In parallel with exploring real-world applications and multimodality, researchers are looking to the future of machine learning techniques, further exploring uncharted areas in deep online and offline reinforcement learning. In subfields such as the latter, the fundamentals of how models learn from and interact with data are evolving—and so are the ways that researchers think about optimizing those processes and reworking them for situations where data is scarce or unavailable in real-world settings.

This post is a sampling of the work that researchers at Microsoft Research Asia and their collaborators presented at ICLR 2022, which reflects the broad scope of the company’s machine learning research. You can learn more about the work accepted at this year’s event on the “Microsoft at ICLR 2022” event page. On the Microsoft Research Blog, you can dive deeper into two papers accepted at the conference—one on MoLeR, a model that represents molecules as graphs to improve drug discovery, and the other on Path Predictive Elimination (PPE), a reinforcement learning method that is robust enough to remove noise from continuously changing environments.

DEPTS: Deep expansion learning for periodic time series forecasting

In the image on the right, the overall data flows for DEPTS are visualized. The middle image shows how the researchers plot the integral structure of three layer-by-layer expansion branches in the expansion module. On the left, the detailed residual connections within a single layer are depicted.
Figure 1: The image on the right shows the overall data flows for DEPTS. The middle image shows how researchers plot the integral structure of three layer-by-layer expansion branches in the expansion module. The image on the left depicts the detailed residual connections within a single layer.

People and organizations involved: Shun Zheng, Xiaohan Yi, Wei Cao, Jiang Bian, and Tie-Yan Liu from Microsoft Research Asia; Wei Fan and Yanjie Fu from the University of Central Florida. 

According to the paper: Periodic time series (PTS, or time series with apparent periodic oscillations) is widespread across industries such as transportation, electric power generation and transmission, sustainability, and others. PTS forecasting plays a crucial role in these industries because it supports many critical tasks, including early warning, pre-planning, and resource scheduling. However, PTS forecasting faces two challenges: the complicated dependencies of the series on its inherent periodicity and the complex, diverse composition of individual periods. 

This paper introduces DEPTS, a deep expansion learning framework for PTS forecasting. DEPTS begins with a novel decoupled formulation by introducing the periodic state as a hidden variable, which allows the researchers to create custom modules to tackle the two challenges mentioned above. To address the first challenge, the researchers develop an expansion module on top of residual learning to perform a layer-by-layer expansion of those complicated dependencies. To address the second, they introduce a periodicity module with a parameterized periodic function that has the capacity to capture diversified periods. 
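As a rough illustration of what a parameterized periodic function can look like, the sketch below models the periodic state as a sum of cosine terms with learnable amplitudes, frequencies, and phases. The number of terms and the exact parameterization are our assumptions; the sketch stands in for the periodicity module described in the paper rather than reproducing it.

```python
import math
import torch
import torch.nn as nn

class PeriodicityModule(nn.Module):
    """Parameterized periodic function (illustrative sketch of the DEPTS idea).

    The periodic state at time t is modeled as a sum of K cosine terms with
    learnable amplitudes, frequencies, and phases.
    """
    def __init__(self, num_terms=8):
        super().__init__()
        self.amplitude = nn.Parameter(0.1 * torch.randn(num_terms))
        self.frequency = nn.Parameter(torch.rand(num_terms))
        self.phase = nn.Parameter(torch.rand(num_terms))

    def forward(self, t):                        # t: (batch, length) time indices
        t = t.unsqueeze(-1)                      # (batch, length, 1)
        waves = self.amplitude * torch.cos(
            2.0 * math.pi * self.frequency * t + self.phase
        )                                        # (batch, length, K)
        return waves.sum(dim=-1)                 # periodic state for each time step
```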

The researchers conduct experiments on both synthetic and real-world data that show DEPTS is effective at PTS forecasting and significantly reduces errors compared to baselines—an improvement of up to 20 percent in some cases.

Towards deployment-efficient reinforcement learning: Lower bound and optimality

An animation shows an initial state, S zero. After a few deployments, Layer one shows S one/one, S two/one, and S three/one. After a few deployments, Layer two shows S one/two, S two/two, and S three/two. After a few deployments, Layer three shows S one/three, S two/three, and S three/three. After each deployment, the layers move from a known state to an unknown state.
Figure 2: This shows a high-level visualization of our algorithms: a layer-by-layer strategy (with a three-layer tabular MDP as an example).

People and organizations involved: Li Zhao, Tao Qin, and Tie-Yan Liu from Microsoft Research Asia; Jiawei Huang, Jinglin Chen, and Nan Jiang from the University of Illinois at Urbana-Champaign. 

According to the paper: Traditional online reinforcement learning (RL) can be abstracted as the loop of two elements: learning a policy from collected data and deploying the policy to collect new data by interacting with environments. The overall objective of RL is to finish the exploration of the entire environment and obtain a near-optimal policy. 

However, in many real-world applications, policy deployment can be quite costly while data collection with a fixed policy is relatively convenient. For example, in recommendation systems, the policy is the recommendation strategy, and a good policy can accurately make suggestions to users based on their preferences. To guarantee service quality, before launching a new policy, companies usually need to conduct multiple internal tests for evaluation, which require a lot of time (up to several months). Due to large customer bases, however, companies can gather thousands or millions of pieces of feedback for further policy learning in a short time once a system is deployed. In those applications, organizations prefer RL algorithms that can learn a good policy after only a few policy switches or deployments. Yet, a gap still remains between existing algorithms and the practical scenarios above (please refer to the paper for more discussion). 

To close the gap, the researchers propose a new setting called Deployment-Efficient Reinforcement Learning (DE-RL)—an abstracted model for applications that value deployment efficiency. A new idea called deployment complexity, an analogue of sample complexity, provides a way to measure the deployment efficiency of an algorithm. Deployment complexity is the number of policy deployments required before the algorithm returns a near-optimal policy. 
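For concreteness, this definition can be stated roughly as follows; the notation (ε for near-optimality and δ for the failure probability) is ours, not taken from the post.

```latex
% Deployment complexity (informal): the smallest number of deployments K such
% that, with probability at least 1 - \delta, the policy returned after the
% K-th deployment is \varepsilon-optimal.
\[
  K^{\ast} \;=\; \min \Bigl\{ K \;:\; \Pr\bigl[\, V^{\pi_K} \ge V^{\ast} - \varepsilon \,\bigr] \ge 1 - \delta \Bigr\}
\]
```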

Under this framework, the researchers study linear Markov decision processes (MDPs) as a concrete example and conduct theoretical analysis to answer two important questions. First, what is the best deployment complexity we can achieve (lower bound)? Second, how can we design algorithms that achieve this best deployment complexity (optimality)? Additionally, because most of the previous related literature studied only algorithms constrained to deploy deterministic policies, the researchers separately consider the two cases with and without such constraints. They show that removing the constraint can significantly improve deployment efficiency. 

For the first question, the researchers construct hard instances and establish information-theoretic lower bounds for the two cases, which are on the order of dH and H, respectively, where d is the feature dimension of the linear MDP and H is the planning horizon. For the second question, they propose algorithms that achieve these lower bounds with a layer-by-layer exploration strategy (as shown in Figure 2), contributing a new algorithmic framework based on a novel covariance matrix estimation method along with several technical innovations. Finally, the researchers discuss extended settings based on the formulation of DE-RL, which may be an interesting subject for future investigation. 

Gradient information matters in policy optimization by back-propagating through model

This image illustrates a problem inherent in model-based reinforcement learning and how it can be solved. The diagram on the left shows the mismatch between learning and using the model. The diagram on the right shows how the proposed two-model-based learning method can control both the prediction and gradient errors, separating the different roles of the two models at the model learning phase and coordinating them at the policy optimization phase.
Figure 3: (a) The mismatch between learning the model and using it; here, the model refers to the transition and reward functions. (b) An illustration of the DDPPO algorithm, which constructs the prediction model and the gradient model separately, trains them with different losses, and then uses each model appropriately.

People and organizations involved: Yue Wang and Tie-Yan Liu from Microsoft Research Asia; Chongchong Li and Yuting Liu from Beijing Jiaotong University; Wei Chen from the Institute of Computing Technology, Chinese Academy of Sciences; and Zhi-Ming Ma from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences. 

According to the paper: Model-based reinforcement learning provides an efficient mechanism for finding the optimal policy by interacting with the learned environment. In this paper, the researchers investigate a mismatch between how the model is learned and how it is used. Specifically, an effective way to obtain the policy update direction is to leverage the model’s differentiability by using the model gradient. However, most commonly used methods treat model learning as a supervised learning task and minimize only the prediction error, without considering the gradient error. In other words, the algorithm requires an accurate model gradient, but training only decreases the prediction error, resulting in an objective mismatch. 

The paper first theoretically justifies that the model gradient error matters in the policy optimization phase. Specifically, the bias of the estimated policy gradient is introduced not only by the prediction error of the learned model but also by its gradient error. Both errors ultimately influence the convergence rate of the policy optimization process. 

Next, the paper proposes a two-model-based learning method that controls both the prediction error and the gradient error. The two models play different roles during model learning and are coordinated during policy optimization. The researchers design a practical way to compute the gradient error and use it to guide the learning of the gradient model. Leveraging both the prediction and gradient models, the method first rolls out a trajectory and then computes the model gradient to obtain the policy gradient. The proposed algorithm is called directional derivative projection policy optimization (DDPPO). Finally, several experiments on benchmark continuous control tasks demonstrate that the proposed algorithm has better sample efficiency.
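To make explicit why the model gradient, and not just its predictions, shapes the policy update, here is a generic sketch of policy optimization by back-propagating through a learned differentiable model. It illustrates the general setting the paper analyzes, not the DDPPO implementation itself, and all names are placeholders.

```python
def model_based_policy_update(policy, model, reward_fn, s0, horizon=10):
    """Roll out a differentiable learned model and back-propagate the return.

    Because the rollout is differentiable, the policy gradient flows through
    the model's gradient, so errors in the model gradient directly bias the
    policy update even when the model's predictions are accurate.
    """
    s, total_reward = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = model(s, a)                   # differentiable learned transition
        total_reward = total_reward + reward_fn(s, a)
    loss = -total_reward.mean()           # maximize the model-based return
    loss.backward()                       # gradients pass through the model
    return loss
```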

Variational oracle guiding for reinforcement learning

The image shows two diagrams to illustrate variational latent oracle guiding, abbreviated as VLOG, for deep reinforcement learning. The first diagram shows VLOG during learning, the second during execution. Both diagrams use Q-learning as the example.
Figure 4: Diagrams of VLOG during learning and execution, using Q-learning as an example. Left: During learning when oracle observations are available. A Bayesian latent variable z is estimated using the executor observation (prior) and the oracle observation (posterior), respectively. The whole model is trained by maximizing the VLOG variational lower bound, which is the RL objective function of the posterior model minus the KL-divergence between the posterior and prior z. Right: During execution, when only executor observations are available. 

People and organizations involved: Dongqi Han, Xufang Luo, Yuqing Yang, and Dongsheng Li from Microsoft Research Asia; Tadashi Kozuno from the University of Alberta; Zhaoyun Chen from the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; and Kenji Doya from the Okinawa Institute of Science and Technology. 

According to the paper: Despite recent successes of deep reinforcement learning (DRL) in various decision-making problems, an important but underexplored aspect is how to leverage oracle observation (the information that is invisible during online decision making but is available during offline training) to facilitate learning. For example, human experts will look at the replay after a poker game, so they can check the opponents’ hands and use the visible information (the executor observation) to improve their gameplay strategy. Such problems are known as oracle guiding. 

In this work, the researchers study oracle guiding problems from a Bayesian perspective and derive an objective for leveraging oracle observations in RL using variational methods. The key contribution is a general learning framework called variational latent oracle guiding (VLOG) for DRL. VLOG offers desirable properties, including robust and promising performance and the versatility to be incorporated into any value-based DRL algorithm. 
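Based on the caption of Figure 4, the training objective can be written roughly as follows; the notation is ours and the exact form in the paper may differ.

```latex
% VLOG variational lower bound (rough form): the RL objective of the posterior
% model minus the KL divergence between the posterior and the prior over the
% latent variable z.
\[
  \mathcal{L}_{\mathrm{VLOG}}
  \;=\;
  \mathbb{E}_{z \sim q(z \mid o_{\mathrm{oracle}})}\bigl[\, J_{\mathrm{RL}}(z) \,\bigr]
  \;-\;
  D_{\mathrm{KL}}\bigl( q(z \mid o_{\mathrm{oracle}}) \,\big\|\, p(z \mid o_{\mathrm{exec}}) \bigr)
\]
```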

The paper empirically demonstrates the effectiveness of VLOG in online and offline RL domains with tasks ranging from video games to mahjong, a challenging tile-based game. Furthermore, the authors release the mahjong environment and an offline RL dataset as a benchmark to facilitate future research on oracle guiding, game AI, and related topics. 

The post ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications appeared first on Microsoft Research.
