What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Indeed, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in myriads of other ways many people take for granted? Bipedalism provides unparalleled versatility in an environment designed for and by humans. By mixing and matching a wide range of basic motor skills, from walking to jumping to balancing on one foot, people routinely dance, play soccer, carry heavy objects, and perform other complex high-level motions. If robots are ever to reach their full potential as an assistive technology, mastery of diverse bipedal motion is a requirement, not a luxury. However, even the simplest of these skills can require a fine orchestration of dozens of joints. Sophisticated engineering can rein in some of this complexity, but endowing bipedal robots with the generality to cope with our messy, weakly structured world, or a metaverse that takes after it, requires learning. Training AI agents with humanoid morphology to match human performance across the entire diversity of human motion is one of the biggest challenges of artificial physical intelligence. Due to the vagaries of experimentation on physical robots, research in this direction is currently done mostly in simulation.
Unfortunately, it involves computationally intensive methods, effectively restricting participation to research institutions with large compute budgets. In an effort to level the playing field and make this critical research area more inclusive, Microsoft Research’s Robot Learning group is releasing MoCapAct, a large library of pre-trained humanoid control models along with enriched data for training new ones. This will enable advanced research on artificial humanoid control at a fraction of the compute resources currently required.
The reason why humanoid control research has been so computationally demanding is subtle and, at first glance, paradoxical. The prominent avenue for learning locomotive skills is based on using motion capture (MoCap) data. MoCap is an animation technique that has been widely used in the entertainment industry for decades. It involves recording the motion of several keypoints on a human actor’s body, such as their elbows, shoulders, and knees, while the actor is performing a task of interest, such as jogging. Thus, a MoCap clip can be thought of as a very concise and precise summary of an activity’s video clip. Thanks to this, useful information can be extracted from MoCap clips with much less computation than from the much more high-dimensional, ambiguous training data in other major areas of machine learning, which comes in the form of videos, images, and text. On top of this, MoCap data is widely available. Repositories such as the CMU Motion Capture Dataset contain hours of clips for just about any common motion of a human body, with visualizations of several examples shown below. Why, then, is it so hard to make physical and simulated humanoid robots mimic a person’s movements?
The caveat is that MoCap clips don’t contain all the information necessary to imitate the demonstrated motions on a physical robot or in a simulation that models physical forces. They only show us what a motion skill looks like, not the underlying muscle activity that produced the actor’s motion. Even if MoCap systems recorded these signals, they wouldn’t be of much help: simulated humanoids and real robots typically use motors instead of muscles, which is a dramatically different form of articulation. Nonetheless, actuation in artificial humanoids is also driven by a type of control signal. MoCap clips are a valuable aid in computing these control signals, if combined with additional learning and optimization methods that use MoCap data as guidance. The computational bottleneck that our MoCapAct release aims to remove is created exactly by these methods, collectively known as reinforcement learning (RL). In simulation, where much of AI locomotion research is currently focused, RL can recover the sequence of control inputs that takes a humanoid agent through the sequence of poses from a given MoCap clip. What results is a locomotion behavior that is indistinguishable from the clip’s. The availability of control policies for individual basic behaviors learned from separate MoCap clips can open the door to fascinating locomotion research, e.g., in methods for combining these behaviors into a single “multi-skilled” neural network and training higher-level locomotion capabilities by switching among them. However, with thousands of basic locomotion skills to learn, RL’s expensive trial-and-error approach creates a massive barrier to entry on this research path. It is this scalability issue that our dataset release aims to address.
Figure 1: The MoCapAct dataset consists of policies that track individual MoCap clips and data from these agents.
Our MoCapAct dataset, designed to be compatible with the highly popular dm_control humanoid simulation environment and the extensive CMU Motion Capture Dataset, serves the research community in two ways:
For each of over 2500 MoCap clip snippets from the CMU Motion Capture Dataset, it provides an RL-trained “expert” control policy (represented as a PyTorch model) that enables dm_control’s simulated humanoid to faithfully recreate the skill depicted in that clip snippet, as shown in these videos of the experts’ behaviors:
Training this model zoo has taken the equivalent of 50 years over many GPU-equipped Azure NC6v2 virtual machines (excluding hyperparameter tuning and other required experiments) – a testament to the computational hurdle MoCapAct removes for other researchers.
For each of the trained skill policies above, MoCapAct supplies a set of recorded trajectories generated by executing that skill’s control policy on the dm_control’s humanoid agent. These trajectories can be thought of as MoCap clips of the trained experts but, in a crucial difference from the original MoCap data, they contain both low-level sensory measurements (e.g., touch measurements) and control signals for the humanoid agent. Unlike typical MoCap data, these trajectories are suitable for learning to match and improve on skill experts via direct imitation – a much more efficient class of techniques than RL.
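To make this intended use concrete, here is a minimal behavioral-cloning sketch in PyTorch. The HDF5 key names ("observations", "actions") and file layout are illustrative assumptions, not the exact MoCapAct schema; consult the dataset documentation for the real format.

```python
# A minimal behavioral-cloning sketch on MoCapAct-style rollout data. The HDF5
# key names ("observations", "actions") are illustrative assumptions; consult
# the MoCapAct documentation for the actual file layout.
import h5py
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def load_rollouts(path):
    """Load flattened (observation, action) pairs from a rollout file."""
    with h5py.File(path, "r") as f:
        obs = np.asarray(f["observations"], dtype=np.float32)
        act = np.asarray(f["actions"], dtype=np.float32)
    return torch.from_numpy(obs), torch.from_numpy(act)

class BCPolicy(nn.Module):
    """Small MLP mapping proprioceptive observations to motor actions."""
    def __init__(self, obs_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # dm_control actions lie in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

def train_bc(obs, act, epochs=10, batch_size=256, lr=3e-4):
    """Supervised imitation: regress the expert's actions from its observations."""
    policy = BCPolicy(obs.shape[1], act.shape[1])
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(obs, act), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for ob, ac in loader:
            loss = nn.functional.mse_loss(policy(ob), ac)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```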
We give two examples of how we used the MoCapAct dataset.
First, we train a hierarchical policy based on neural probabilistic motor primitives. To achieve this, we combine thousands of MoCapAct’s clip-specialized policies into a single policy that is capable of executing many different skills. This agent has a high-level component that takes MoCap frames as input and outputs a learned skill. The low-level component takes the learned skill and sensory measurements from the humanoid as input and outputs the motor action.
Figure 2: The hierarchical policy consists of a high-level policy and low-level policy. The high-level policy maps the given MoCap frames to a learned skill. The low-level policy takes the skill and the humanoid observation and outputs an action that best realizes the skill.
This hierarchical structure offers an appealing benefit. If we keep only the low-level component, we can control the humanoid by feeding it different skills (e.g., a “walk” skill instead of the corresponding motor actions). Therefore, we can re-use the low-level policy to efficiently learn new tasks.
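A rough PyTorch sketch of this two-level structure follows; the layer sizes and skill-vector dimensionality are illustrative choices, not the exact MoCapAct architecture.

```python
# A rough PyTorch sketch of the two-level structure; layer sizes and the
# skill-vector dimensionality are illustrative, not the exact MoCapAct setup.
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Encodes a window of upcoming MoCap reference frames into a skill vector."""
    def __init__(self, mocap_dim, skill_dim=60, hidden=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(mocap_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, skill_dim),
        )

    def forward(self, mocap_frames):
        return self.encoder(mocap_frames)

class LowLevelPolicy(nn.Module):
    """Maps (skill vector, proprioceptive observation) to a motor action."""
    def __init__(self, skill_dim, obs_dim, act_dim, hidden=1024):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(skill_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, skill, obs):
        return self.decoder(torch.cat([skill, obs], dim=-1))

# Clip tracking: skill = high_level(mocap_frames); action = low_level(skill, obs).
# Skill reuse: freeze low_level and train a new task policy that outputs `skill`
# directly, e.g., to steer the humanoid toward a target.
```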
Figure 3: We can replace the high-level policy with a task policy that is trained to output skills required to achieve some new task, such as running to a target.
In light of that, we replace the high-level policy with a task policy that is then trained to steer the low-level policy towards achieving some task. As an example, we train a task policy to have the humanoid reach a target. Notice that the humanoid uses many low-level skills, like running, turning, and side-stepping.
Figure 4: Our GPT model takes in a sequence of observations from the humanoid (called the “context”) and outputs an action that it thinks best continues the observed motion.
Our second example centers on motion completion, which is inspired by the task of sentence completion. Here, we use the GPT architecture, which accepts a sequence of sensory measurements (the “context”) and outputs a motor action. We train a control policy to take one second of sensory measurements from the dataset and output the corresponding motor actions from the specialized expert. Then, before executing the policy on our humanoid, we first generate a “prompt” (red humanoid in the videos) by executing a specialized expert for one second. Afterwards, we let the policy control the humanoid (bronze humanoid in the videos); at each time step, it takes the previous second of sensory measurements and predicts the next motor action. We find that this policy can reliably repeat the underlying motion of the clip, which is demonstrated in the first two videos. On other MoCap clips, we find that the policy can deviate from the underlying clip in a plausible way, such as in the third video, where the humanoid transitions from side-stepping to walking backwards.
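The control loop behind this motion-completion setup can be sketched as follows; the model size, the 30-steps-per-second context length, and the environment interface are illustrative assumptions rather than the exact setup from the paper.

```python
# A sketch of motion completion with a rolling one-second context. The model
# size, the 30-steps-per-second context length, and the environment interface
# are illustrative assumptions; positional encodings are omitted for brevity.
from collections import deque
import torch
import torch.nn as nn

class MotionGPT(nn.Module):
    """Causal transformer: a window of observations -> the next motor action."""
    def __init__(self, obs_dim, act_dim, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, obs_window):                         # (batch, T, obs_dim)
        T = obs_window.shape[1]
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(self.embed(obs_window), mask=mask)
        return torch.tanh(self.head(h[:, -1]))             # action for the latest step

def run_motion_completion(env, policy, prompt_obs, context_len=30, steps=500):
    """Roll out the policy, always conditioning on the last `context_len` observations."""
    context = deque(prompt_obs, maxlen=context_len)         # the one-second "prompt"
    for _ in range(steps):
        window = torch.stack(list(context)).unsqueeze(0)    # (1, T, obs_dim)
        with torch.no_grad():
            action = policy(window).squeeze(0)
        obs = env.step(action)                              # hypothetical env interface
        context.append(obs)
```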
On top of the dataset, we also release the code used to generate the policies and results. We hope the community can build off of our dataset and work to do incredible research in the control of humanoid robots.
Our paper is available here. You can read more at our website.
The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.
Large-scale models are revolutionizing deep learning and AI research, driving major improvements in language understanding, creative text generation, multilingual translation, and much more. But despite their remarkable capabilities, the models’ large size creates latency and cost constraints that hinder the deployment of applications on top of them. In particular, increased inference time and memory consumption inhibit deployment of models on latency-sensitive and resource-constrained applications on both server and client devices.

To address these deployment challenges, the DeepSpeed team, as part of Microsoft’s AI at Scale initiative, has been exploring innovations in system optimization and model compression. On the former, we released the DeepSpeed inference system, which consists of a diverse set of optimizations, such as highly optimized CUDA kernels and inference-adapted parallelism to accelerate model inference speed, as well as ZeRO-Inference, which breaks the GPU memory wall and fits large models across heterogeneous memories to address hardware accessibility limitations. These optimizations improve inference system efficiency while preserving the model size, the amount of computation, and model accuracy: the total work remains the same, but the processing capability and speed are higher. On the latter, emerging compression algorithms show great potential in reducing model size and inference computation. These algorithms use condensed formats to represent, store, communicate, and compute DNN models, reducing the total work needed for inference with little or no loss in accuracy.

System optimizations and model compression are very much complementary, and they can be synergistically combined to provide a multiplicative reduction in inference latency and cost. Motivated by combining the best of both worlds, we are proud to announce DeepSpeed Compression—a composable library that combines novel compression technologies and highly efficient system optimizations to make DL model size smaller and inference speed faster, all with much lower compression cost.
Challenges of compressing large deep learning models
Although there have been numerous efforts to compress model sizes and reduce inference computation, applying existing compression techniques to large scale models still has many challenges in practice:
Complex pipeline for achieving high compression ratio. Various strategies have been proposed to overcome optimization difficulty and accuracy degradation when compressing large models. However, no systematic study on best practices for extreme compression exists, such as using aggressive quantization methods and layer reduction. This leaves the underlying question unanswered: do we really need those ad-hoc tricks to recover the accuracy loss or do simpler yet more effective methods exist?
High compression cost. Existing methods for compressing large models incur high training costs. For example, popular compression methods such as quantization-aware training (QAT) and multi-stage distillation lead to long training times and large hardware resource requirements as models grow to multi-billion-parameter scale and beyond, making compressing these models costly and difficult. For example, the 20B GPT-NeoX model was pre-trained using 96 NVIDIA A100 GPUs over three months. Performing QAT even with 10% of the training samples would still require large amounts of computational resources, which many practitioners cannot afford.
Lack of tailored system optimizations for compressed models. To maximize the benefits of compressed models, specialized system optimizations are often required, e.g., quantized and sparsified models need optimized low-bit arithmetic computation and sparse matrix multiplication to boost the inference speed on commodity hardware. Existing methods often focus on reducing theoretical computation overhead but miss the opportunities to offer the best inference latency reduction via tailored system optimizations for the compressed models.
Limited composability. Existing methods have limited composability from two aspects. First, there is limited composability among multiple compression methods. Although well-performing compression solutions have been proposed independently, combining multiple methods together for the best outcome is still a laborious process, requiring building a complex compression pipeline. Second, there is a lack of composability between compression techniques and system optimizations. As we just mentioned, compressed models require specialized system optimizations to maximize latency and cost reduction. However, few existing methods take an end-to-end approach of composing compressions with system optimizations, as it requires significant efforts to bring modeling, algorithm, and system areas of deep learning to work synergistically together.
DeepSpeed Compression overcomes these challenges by offering novel state-of-the-art compression techniques, such as XTC for 32x smaller model size and ZeroQuant for a more than 5000x reduction in compression cost. It also takes an end-to-end approach to improve the computation efficiency of compressed models via a highly optimized inference engine. Furthermore, our library has multiple built-in state-of-the-art compression methods and supports synergistic composition of these methods together with the system optimizations, offering the best of both worlds while allowing a seamless and easy-to-use pipeline for efficient DL model inference. Each of these features is explained further below.
Smaller model size: 32x smaller transformer models via simple yet effective binarized extreme compression
Reducing the size of large models is critical when deploying them on both servers and client devices. In DeepSpeed Compression, we provide extreme compression techniques to reduce model size by 32x with almost no accuracy loss or to achieve 50x model size reduction while retaining 97% of the accuracy. We do this through two main techniques: extreme quantization and layer reduction. Extreme quantization via ternarization/binarization reduces the model size significantly but is considered a particularly challenging task due to the large quantization error resulting in model performance degradation. To improve the accuracy of binarized/ternarized models, existing methods often adopt complicated and computationally expensive compression pipelines, such as multi-stage distillation. However, it remains unclear how different components in extreme quantization affect the resulting performance. To tease apart their effects, we perform a systematic study on the impacts of various techniques currently used for extreme compression.
In this process, we have identified several best practices for extreme compression:
Longer training with learning rate decay is highly preferred for closing the accuracy gap of extreme quantization;
Single-stage knowledge distillation with a larger training budget is sufficient to match or even exceed the accuracy of multi-stage approaches;
Training without data augmentation hurts downstream performance for various compression tasks, especially on tasks with smaller datasets;
Lightweight layer reduction matches or even exceeds expensive pre-training distillation for task-specific compression.
Based on these findings, we greatly simplify the procedure of extreme compression and propose a new extreme compression technique, XTC, that compresses a model to its limit with lightweight layer reduction and robust binarization. XTC produces models with little loss in accuracy yet up to 50x model size reduction, as shown in Figure 1. XTC reduces the model size by 32x with almost no loss in the average score on the GLUE tasks via a simple yet effective binarization technique. By combining extreme quantization and lightweight layer reduction, we can further improve the binarized model, achieving a 50x model size reduction while retaining 97% of the accuracy. Given that transformers are becoming the standard architecture choice for AI, we believe the investigation and the proposed solution could be highly impactful in powering large-scale models on resource-constrained devices. If you are interested in XTC, you can also find more details in our technical report “Extreme Compression for Pre-trained Transformers Made Simple and Efficient.”
Figure 1: Comparison between XTC and other state-of-the-art compression results on BERT
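As a concrete illustration of the binarization at the heart of this recipe, here is a minimal straight-through-estimator linear layer in PyTorch. It is a sketch of the general technique only; XTC’s full pipeline additionally uses lightweight layer reduction and knowledge distillation.

```python
# A minimal sketch of binarization with a straight-through estimator: the
# forward pass uses alpha * sign(W), while gradients flow to the latent
# full-precision weights.
import torch
import torch.nn as nn

class BinarizedLinear(nn.Linear):
    def forward(self, x):
        # One scale per output row: the classic |W|-mean binary approximation.
        alpha = self.weight.abs().mean(dim=1, keepdim=True)
        w_bin = alpha * torch.sign(self.weight)
        # Straight-through estimator: forward sees w_bin, backward sees identity.
        w_ste = self.weight + (w_bin - self.weight).detach()
        return nn.functional.linear(x, w_ste, self.bias)

layer = BinarizedLinear(768, 768)       # drop-in replacement for nn.Linear
y = layer(torch.randn(4, 768))
```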
Lower compression cost: Quantizing models with >5000x compression cost reduction and no training data
Large-scale transformer models with hundreds of billions of parameters are usually challenging to quantize due to the lack of training resources and/or data access. To resolve those issues, we propose a method called ZeroQuant, which quantizes large-scale models with little or no fine-tuning cost on limited resources. Under the hood, ZeroQuant contains two major parts: 1) a hardware-friendly fine-grained quantization scheme that allows us to quantize weights and activations into low-bit values with minimal errors while still empowering fast inference speed on commodity hardware with low quantization/dequantization cost; and 2) a layer-by-layer knowledge distillation pipeline, which fine-tunes the quantized model to close the accuracy gap from low-precision (e.g., INT4) quantization.
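The following PyTorch snippet illustrates what fine-grained quantization means in practice: one scale per small group of weights and one scale per token for activations. The group size and symmetric rounding are our own illustrative choices, not the exact ZeroQuant recipe.

```python
# Illustration of fine-grained quantization: one scale per small group of
# weights and one scale per token for activations.
import torch

def quantize_groupwise(weight, n_bits=8, group_size=128):
    """Symmetric fake-quantization with one scale per group of weights."""
    qmax = 2 ** (n_bits - 1) - 1
    w = weight.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape_as(weight)          # dequantized, for error inspection

def quantize_per_token(activations, n_bits=8):
    """Symmetric fake-quantization with one scale per token (row)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = activations.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(activations / scale), -qmax - 1, qmax) * scale

w = torch.randn(768, 768)
print((quantize_groupwise(w) - w).abs().mean())    # small error thanks to fine-grained scales
```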
The benefits of ZeroQuant are threefold: First, unlike previous quantization-aware training that requires expensive retraining and parameter tuning, ZeroQuant enables quantizing BERT and GPT-style models from FP32/FP16 into INT8 weights and activations while retaining accuracy and without incurring any retraining cost, as shown in Figure 2. Second, by loading only one layer for low-precision (e.g., INT4) quantization at a time, the maximum memory footprint required to quantize the model depends solely on the size of individual layers rather than the entire model, allowing one to quantize gigantic models with as little as one GPU. Third, our quantization method is data-free, which means that it does not require the original training data of the model to obtain a quantized model. This is especially useful when the data is not available due to privacy related reasons, for example.
Figure 2: Comparison between ZeroQuant and standard Quantization Aware Training. ZeroQuant can significantly reduce training resources and time cost, without requiring the original training data.
We demonstrated the scalability of ZeroQuant on a GPT-3-style model with 1.3B parameters (GPT-3-1.3B) and one of the largest open-source language models, GPT-NeoX (20B). Particularly, thanks to the fine-grained quantization scheme, ZeroQuant can convert GPT-3-1.3B (trained on 128 NVIDIA A100 GPUs for five days) and GPT-NeoX (trained on 96 A100 GPUs for three months) to INT8 without any retraining or training data while delivering comparable accuracy. Furthermore, with the lightweight layer-by-layer knowledge distillation, ZeroQuant can quantize GPT-3-1.3B with mixed INT4/INT8 precision in three hours on a single GPU, which leads to a 5000x compression cost reduction compared to quantization-aware training. To find more details about ZeroQuant, refer to “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”.
Faster inference speed: Latency reduction via highly optimized DeepSpeed Inference system
System optimizations play a key role in efficiently utilizing the available hardware resources and unleashing their full capability through inference optimization libraries like ONNX runtime and DeepSpeed. We build our work on top of DeepSpeed inference, which provides high-performance model serving with inference optimized kernels, parallelism, and memory optimizations, covering a wide variety of models for both latency sensitive and throughput-oriented applications. Besides leveraging these, we also extend the inference capability to support models in compressed formats. For example, we developed variations of efficient low-bit computation such as INT8 GeMM kernels. These kernels load INT8 parameters and activations from GPU device memory to the registers and use the customized INT8 GeMM implemented on top of CUTLASS tuned for different batch sizes to deliver faster GeMM computation. The kernels also fuse quantization and dequantization operations before and after GeMM, further reducing the kernel invocation overhead and improving the memory bandwidth utilization. Furthermore, our inference engine supports many-GPU transformer layers for serving transformer models across GPUs using inference-adapted parallelism strategies. For compressed models that have a smaller memory footprint, the inference engine can automatically shrink the number of GPUs required to serve a model, leading to reduced cross-GPU communication and hardware cost. For example, DeepSpeed compression leverages INT8 for GPT-NeoX (20B) and reduces the GPU requirement of serving the model from two to one, reducing latency from 65ms to 25ms, and achieving a 5.2x cost reduction. As shown in Figure 3, DeepSpeed INT8 kernels can boost performance by up to 2x compared to our own FP16 kernels, and they achieve 2.8-5.2x latency cost reduction compared to the baseline FP16 in PyTorch, significantly reducing the latency and cost of large-scale model inference.
Figure 3: Inference efficiency improvements for large-scale transformer models with publicly available checkpoints using optimized DeepSpeed Inference engine. Inference efficiency is calculated by the inference latency speedup divided by the hardware cost reduction rate. DeepSpeed Inference achieves 2.8-4.8x latency reduction and up to 5.2x inference efficiency improvements.
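For intuition, the arithmetic these fused kernels implement can be mimicked in a few lines of plain PyTorch. This is a CPU-side illustration only; the production path is custom CUTLASS-based CUDA code, and the scale choices below are our own simplifications.

```python
# A CPU-side illustration of the arithmetic behind fused INT8 GeMM: int8
# operands, integer accumulation, and dequantization folded into the epilogue.
import torch

def int8_linear(x, w_int8, w_scale, x_scale):
    """y ≈ x @ W^T, computed as (x_q @ w_q^T) * (x_scale * w_scale)."""
    x_q = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int32)
    acc = x_q @ w_int8.to(torch.int32).T             # integer accumulation
    return acc.float() * (x_scale * w_scale)         # fused dequantization

w = torch.randn(1024, 1024)
w_scale = w.abs().max() / 127
w_int8 = torch.clamp(torch.round(w / w_scale), -128, 127).to(torch.int8)
x = torch.randn(8, 1024)
y = int8_linear(x, w_int8, w_scale, x_scale=x.abs().max() / 127)
print((y - x @ w.T).abs().mean())                    # quantization error of the approximation
```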
A library that synergistically composes compression algorithms and system optimizations
DeepSpeed Compression proposes a seamless pipeline to address the compression composability challenges, as shown in Figure 4. The core piece of DeepSpeed Compression is a component called compression composer, which includes several significant features:
Figure 4: The DeepSpeed Compression library
It offers multiple cutting-edge compression methods, as shown in Table 1, including extreme quantization, head/row/channel pruning, and knowledge distillation, that can effectively reduce model size and inference cost. The list will expand as we continually integrate more state-of-the-art compression methods.
Category        | Methods                                | Targets
Quantization    | INT8/INT4                              | Activations
Quantization    | INT8/INT4/Ternary/Binary               | Weights
Sparsification  | Head pruning                           | Attention head (Transformer)
Sparsification  | Sparse/Row pruning                     | Weights
Sparsification  | Channel pruning                        | Conv2D weights
Layer Reduction | Arbitrary subset of network layers     | Layers
Distillation    | Output logits, feature map, attn. map  | Layers
Table 1: Compression techniques supported in DeepSpeed Compression composer.
It offers an easy-to-use API that automatically takes care of the complexities of assembling different compression techniques to deliver the compound benefits of multiple compression methods. For example, XTC requires composition of lightweight layer reduction, binarization, and knowledge distillation. However, composing them together is non-trivial. With our compression composer, applying extreme compression is as easy as adding two new API calls to enable compression and clean the compressed model (see the sketch after this list).
It is designed in a modular way so that it will be easy for users to add new compression schemes. For example, additional compression methods can be added through custom compression layers and, by registering them with the compression composer, the new methods can be composed with existing methods that are already managed by the composer.
It seamlessly works with the existing DeepSpeed library. This has two benefits. First, DeepSpeed Compression can be specified and enabled the same way as DeepSpeed training and inference via a JSON file, where enabling different combinations of compression techniques requires only a few lines of modification in the JSON file. Second, once the compression schemes have been configured, the compression composer automatically modifies the model layers and training to enable the compression process and does not require additional changes from the user to the model structure or the training procedure.
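A sketch of the two-call workflow mentioned above is shown below. The module path and function names (init_compression, redundancy_clean) follow our reading of the DeepSpeed Compression tutorial and should be treated as assumptions to verify against the official documentation; the JSON file is the same configuration used to select the compression methods.

```python
# Hypothetical sketch of the two-call compression workflow; verify the exact
# API (module path, signatures) against the DeepSpeed Compression tutorial.
import torch.nn as nn
from deepspeed.compression.compress import init_compression, redundancy_clean

model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))  # stand-in for a real transformer

# 1) Enable compression: layers selected in ds_config.json are swapped for
#    compression-aware counterparts (quantized, pruned, binarized, ...).
model = init_compression(model, "ds_config.json")

# ... run the usual fine-tuning / knowledge-distillation loop here ...

# 2) Clean the compressed model: fold the learned compression decisions into a
#    slimmer network ready for the DeepSpeed Inference engine.
model = redundancy_clean(model, "ds_config.json")
```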
After the DNN model has been compressed, DeepSpeed Compression replaces the compressed layers with highly optimized kernels in the DeepSpeed Inference engine to maximize hardware efficiency. Together, the compression composer and inference engine achieve the best of both worlds of compression and system optimization, delivering a compound effect of inference cost reduction.
Use Cases of DeepSpeed Compression
Although we started DeepSpeed Compression quite recently, we have successfully leveraged it to optimize several large-scale open-source models and Microsoft production workloads. It delivers significant latency and cost reductions across a wide variety of NLP and CV tasks.
We applied INT8 quantization of DeepSpeed Compression to optimize two large-scale open-source models in GPT-3 style: GPT-J (6B) and GPT-NeoX (20B) on the Azure AI platform. As shown in Figure 5, our quantized models achieve similar accuracy as the original models on 19 zero-shot evaluation tasks and WikiText, while achieving 3.67x and 5.2x inference cost savings, respectively, compared with PyTorch FP16 baseline on ND A100 v4 Azure instances. Very importantly, we quantize these models without requiring any training data, expensive compression time or GPU resources, bringing huge training cost savings compared with QAT!
Figure 5: DeepSpeed Compression results of GPT-J (6B)/GPT-NeoX (20B) using ZeroQuant. Left table shows the results of model quality and inference latency for the FP16 baseline and ZeroQuant; Right figures show the compression cost comparison between Quantization-aware Training (QAT, estimated) and ZeroQuant for INT8 quantization.
Beyond open-source models, DeepSpeed Compression has also demonstrated its effectiveness to optimize production workloads in Microsoft:
It reduces the size of the Microsoft Turing Image Super Resolution (T-ISR) model by 3.1x along with a 1.85x latency reduction by composing different compression schemes, such as pruning and distillation, with efficient system optimizations. The model has been deployed in Bing Maps and Microsoft Edge, where it automatically derives high-resolution images from lower-resolution images, as described in this blog post.
It also successfully compresses the Microsoft Relevance Fusion models—a Transformer-based ranking model used in Bing’s core search stack. Without DeepSpeed Compression, it took three days to quantize the model using QAT. With DeepSpeed Compression, we can quantize the model in a few minutes with improved accuracy and reduced latency compared to QAT.
DeepSpeed Compression release plan
DeepSpeed Compression is still at an early stage and under active development, but we’d like to share the results and tools with DeepSpeed users as soon as possible. In this first release, we open-source the core DeepSpeed Compression components, including the compression composer, which supports various compression methods consisting of INT8/INT4/Ternary/Binary quantization, lightweight layer reduction, pretraining and task-specific knowledge distillation, head pruning, row pruning, and channel pruning, for compressing both NLP and computer vision models. Together with the compression composer, we are releasing the two novel technologies XTC and ZeroQuant introduced in this blog as part of the library.
We hope you will try DeepSpeed Compression. Please find the code, tutorials, and documentation at the DeepSpeed GitHub and website. We highly value your feedback and comments, so let us know what you think and how we can improve. As for the next steps, we plan to extend our offerings with more compression methods, extended coverage of specialized kernels for compressed models, and an optimization module that automatically finds the best compression schemes. We believe that our composable library and new innovations will help close the gap between what is possible in AI and what is deployable, as well as making DL inference faster, cheaper, and simpler.
We are a group of system and modeling researchers—Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Conglong Li, Reza Yazdani Aminabadi, Elton Zheng, Samyam Rajbhandari, Ammar Ahmad Awan, Jeff Rasley, Cheng Li, Olatunji Ruwase, Shaden Smith, Du Li, Michael Wyatt, Arash Bakhtiari, Guanhua Wang, Connor Holmes, Sam Ade Jacobs, Martin Cai, Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop.
For many organizations, trusting their data to the cloud requires having a complete understanding of and control over the environment in which that data resides and how it’s being processed. Microsoft understands this, and we are committed to building a trustworthy cloud—one in which security, privacy, and transparency are built into its core. A key part of this vision is confidential computing—a set of hardware and software capabilities that give data owners visibility into the data environment and verifiable security protection of their data in use.
The Confidential Computing team at Microsoft Research is collaborating with hardware developers to create trusted execution environments (TEEs), where data stays encrypted not just when stored (encryption at rest) and in transit, but also during use. This work underpins the Azure confidential cloud platform, where users can upload encrypted code and data and get encrypted results back with strong privacy.
At Microsoft Build 2022, the company announced serverless confidential containers with lift-and-shift support, the next step in the evolution of confidential computing. This service builds on the Confidential Containers work conducted at Microsoft Research. Confidential Containers offers a verifiably secure container environment in Azure where users can confirm that the software performing computations on their data is exactly the software they expect to be running, that it will do what they want it to do with their data, and that they can trust the results it returns. Confidential Containers enables users to take existing container workloads, and with a small amount of configuration, use them in a confidential environment.
Smaller trusted computing base
Confidential Containers decreases the size of the trusted computing base (TCB)—the totality of elements in a computing environment that must be trusted not to violate the confidentiality of computation. The TCB can include software, hardware, and human administrators, among other things. By removing elements from the TCB, the components that can be compromised are reduced, decreasing the attack surface. Confidential Containers removes Microsoft administrators from the TCB, minimizing it as much as possible while still enabling customers to run existing workloads without modifying them.
This reduced TCB provides an option for organizations that currently run computations on their data on premises because they are concerned about the security of their data in the cloud. Even though setting up a computation environment in the cloud offers flexibility, data can be exposed to anyone who operates the servers on which the system runs. With Confidential Containers, the individuals who can access the data can be tightly controlled. This can be a single designated employee of the organization that owns the data or the business partner that is processing the data. It is never a Microsoft employee or another third party.
A secure hardware environment enables data protection in use. Confidential Containers runs on AMD processors backed by AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP), which provides a TEE. This hardware-enforced security boundary provides a shield so that nothing outside the encrypted memory space can read the data.
Users of Confidential Containers create a policy defining precisely what can run in the confidential container environment and how. The AMD SEV-SNP hardware produces an attestation report, which provides a succinct representation of everything in the confidential environment, including information about the code that will be enforcing the policy. Users can request this attestation report any time before providing the container with a key to unlock the encrypted dataset for processing.
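The resulting trust handshake can be summarized in pseudocode. Every name below is hypothetical and stands in for the real attestation and key-management services; it sketches the flow only, not an Azure SDK API.

```python
# Hypothetical pseudocode for the attestation-gated key release described
# above; none of these names are real Azure SDK APIs.
from dataclasses import dataclass

@dataclass
class AttestationReport:
    policy_hash: str       # measurement of the policy and the code enforcing it
    platform_ok: bool      # SEV-SNP signature and firmware checks passed

def verify_report(report: AttestationReport, expected_policy_hash: str) -> bool:
    """Data owner's check: is this exactly the environment I authorized?"""
    return report.platform_ok and report.policy_hash == expected_policy_hash

def release_key_if_trusted(report: AttestationReport,
                           expected_policy_hash: str,
                           data_key: bytes) -> bytes:
    """Hand the dataset decryption key only to a verified container environment."""
    if not verify_report(report, expected_policy_hash):
        raise PermissionError("container environment does not match the policy")
    return data_key  # in practice, delivered to the TEE over a secure channel
```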
Sensitive data handling in the cloud
Before the development of HTTPS, businesses could not securely run a storefront on the public web because communication over the internet was not secure. In the same way, today individuals and organizations cannot run containerized computation over sensitive data in the public cloud. Confidential Containers addresses this need.
This is a game-changer for organizations that must comply with local and international regulations on how sensitive data is handled. For example, healthcare organizations that store encrypted patient information in the cloud are required by HIPAA regulations to download that data to perform computations on premises. This multistep process entails decrypting the data once it has been downloaded to an organization’s servers, performing the required computations, and then re-encrypting the data before re-uploading it to the cloud. It also requires ensuring that the on-premises environment contains the security architecture necessary to comply with HIPAA and other regulations.
Because Confidential Containers provides advanced security safeguards for data in use in Azure, organizations no longer need to perform these time-consuming steps. This also means they no longer need to maintain servers on premises. Moreover, Azure users can define even stricter policies for their container environment in the cloud than they have in place in their on-premises environment.
Secure multiparty computations
Another benefit of Confidential Containers is they enable secure multiparty computations. A single organization can securely process multiple datasets that contain sensitive information, or multiple organizations with datasets that must remain secure can share those datasets with the assurance that their data will not leak. Organizations can perform computations on multiple datasets, such as for training a machine learning model, and gain better results than they would if performing computations on a single dataset, all without knowing what is in those datasets.
Easy deployment and lift-and-shift of Linux containers
Creating a confidential container is straightforward for Azure users who are currently using or getting ready to use containers, requiring a small amount of configuration to move existing workloads. Linux users can easily lift-and-shift their Linux containers to Confidential Containers on Azure.
Unlimited potential with Confidential Containers
We believe that in the future, all computing in the cloud will be confidential, and we’re excited to share Confidential Containers—a technology that plays a role in making this happen. The capabilities it provides will have implications that we have yet to imagine. We’re particularly excited by the potential of multiparty computations. The ability to perform computations in a protected environment on multiple datasets brings limitless possibilities, unlocking great value to Azure users.
Confidential Containers is currently available for limited preview and will be available for public preview later this year. Sign up for the Confidential Containers preview.
Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Could this capability represent the dawn of a new paradigm of scientific discovery?
Jim Gray, a Turing Award winner, and former Microsoft Technical Fellow, characterised the historical evolution of scientific discovery through four paradigms. With origins dating back thousands of years, the first paradigm was purely empirical and based on direct observation of natural phenomena. While many regularities were apparent in these observations, there was no systematic way to capture or express them. The second paradigm was characterised by theoretical models of nature, such as Newton’s laws of motion in the seventeenth century, or Maxwell’s equations of electrodynamics in the nineteenth century. Derived by induction from empirical observation, such equations allowed generalization to a much broader range of situations than those observed directly. While these equations could be solved analytically for simple scenarios, it was not until the development of digital computers in the twentieth century that they could be solved in more general cases, leading to a third paradigm based on numerical computation. By the dawn of the twenty-first century computation was again transforming science, this time through the ability to collect, store and process large volumes of data, leading to the fourth paradigm of data-intensive scientific discovery. Machine learning forms an increasingly important component of the fourth paradigm, allowing the modelling and analysis of large volumes of experimental scientific data. These four paradigms are complementary and coexist.
The pioneering quantum physicist Paul Dirac commented in 1929 that “The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” For example, Schrödinger’s equation describes the behaviour of molecules and materials at the subatomic level with exquisite precision, and yet numerical solution with high accuracy is only possible for very small systems consisting of a handful of atoms. Scaling to larger systems requires increasingly drastic approximations leading to a challenging trade-off between scale and accuracy. Even so, quantum chemistry calculations are already of such high practical value that they form one of the largest supercomputer workloads.
However, over the last year or two, we have seen the emergence of a new way to exploit deep learning, as a powerful tool to address this speed-versus-accuracy trade-off for scientific discovery. This is a very different use of machine learning from the modelling of data that characterizes the fourth paradigm, because the data that is used to train the neural networks itself comes from numerical solution of the fundamental equations of science rather than from empirical observation. We can view the numerical solutions of scientific equations as simulators of the natural world that can be used, at high computational cost, to compute quantities of interest in applications such as forecasting the weather, modelling the collision of galaxies, optimizing the design of fusion reactors, or calculating the binding affinities of candidate drug molecules to a target protein. From a machine learning perspective, however, the intermediate details of the simulation can be viewed as training data which can be used to train deep learning emulators. Such data is perfectly labelled, and the quantity of data is limited only by computational budget. Once trained, the emulator can perform new calculations with high efficiency, achieving significant improvements in speed, sometimes by several orders of magnitude.
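A toy example of this loop is sketched below, with a deliberately slow reference solver standing in for an expensive scientific code; the damped-pendulum system and network sizes are arbitrary illustrations, not a real scientific workload.

```python
# A toy version of the emulator loop: an (intentionally slow) reference solver
# generates perfectly labelled data, and a small neural network is trained to
# reproduce its answers.
import numpy as np
import torch
import torch.nn as nn

def simulate(theta0, omega0, t_end=3.0, dt=1e-3):
    """Expensive reference solver: fine-grained explicit integration of a damped pendulum."""
    theta, omega = theta0, omega0
    for _ in range(int(t_end / dt)):
        omega += (-0.2 * omega - 9.81 * np.sin(theta)) * dt
        theta += omega * dt
    return theta  # final angle

# 1) Generate labelled training data from the simulator.
rng = np.random.default_rng(0)
inputs = rng.uniform(-1.0, 1.0, size=(1000, 2)).astype(np.float32)
labels = np.array([simulate(a, b) for a, b in inputs], dtype=np.float32)[:, None]

# 2) Train a cheap neural emulator on the simulator's outputs.
emulator = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                         nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(emulator.parameters(), lr=1e-3)
x, y = torch.from_numpy(inputs), torch.from_numpy(labels)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(emulator(x), y)
    loss.backward()
    opt.step()

# 3) The trained emulator answers new queries with a single forward pass,
#    orders of magnitude faster than re-running the solver.
```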
This ‘fifth paradigm’ of scientific discovery represents one of the most exciting frontiers for machine learning as well as for the natural sciences. While there is a long way to go before these emulators are sufficiently fast, robust, and general-purpose to become mainstream, the potential for real-world impact is clear. For example, the number of small-molecule drug candidates alone is estimated at 10^60, while the total number of stable materials is thought to be around 10^180 (roughly the square of the number of atoms in the known universe). Finding more efficient ways to explore these vast spaces would transform our ability to discover new substances such as better drugs to treat disease, improved substrates for capturing atmospheric carbon dioxide, better materials for batteries, new electrodes for fuel cells to power the hydrogen economy, and myriad others.
AI4Science is an effort deeply rooted in Microsoft’s mission, applying the full breadth of our AI capabilities to develop new tools for scientific discovery so that we and others in the scientific community can confront some of humanity’s most important challenges. Microsoft Research has a 30+ year legacy of curiosity and discovery, and I believe that the AI4Science team – spanning geographies and scientific fields – has the potential to yield extraordinary contributions to that legacy.
Kevin Scott, Executive Vice President and Chief Technology Officer, Microsoft
I’m delighted to announce today that I will be leading a new global team in Microsoft Research, spanning the UK, China and the Netherlands, to focus on bringing this fifth paradigm to reality. Our AI4Science team encompasses world experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines who are working together to tackle some of the most pressing challenges in this field.
An example project is Graphormer, led by my colleague Tie-Yan Liu in our China team. This is a deep learning package that allows researchers and developers to train custom models for molecule modelling tasks, such as materials science or drug discovery. Recently, Graphormer won the Open Catalyst Challenge, a molecular dynamics competition that aims to model the catalyst-adsorbate reaction system with AI and provides more than 0.66 million catalyst-adsorbate relaxation systems (144 million structure-energy frames) simulated with density functional theory (DFT) software. Another project, from our team in Cambridge, in collaboration with Novartis, is Generative Chemistry, where together we are empowering scientists with AI to speed up the discovery and development of breakthrough medicines.
“Not only can AI learn from our past experiments, but, with each new iteration of designing and testing in the lab, the machine learning algorithms can identify new patterns and help guide the early drug discovery and development process. Hopefully in doing this we can augment our human scientists’ expertise so they can design better molecules faster.”
The team has since used the platform to generate several promising early-stage molecules which have been synthesised for further exploration.
Alongside our teams in China and the UK, we have been growing a team in the Netherlands, including hiring the world-renowned machine learning expert, Max Welling. I am also excited to be able to announce today that our brand-new Lab in Amsterdam will be housed in Matrix One, which is currently under construction on the Amsterdam Science Park. This purpose-built space is in close proximity to the University of Amsterdam and the Vrije Universiteit Amsterdam, and we will maintain strong affiliations with both institutions through the co-supervision of PhD students.
Matrix One building in Amsterdam
It is with pride and excitement that we take this next step to come together as a cross-geographical team and follow in the footsteps of pioneers before us, to contribute to this next paradigm of scientific discovery, and in doing so impact many important societal challenges. If you share our excitement and ambition, and would like to join us, I encourage you to look at our open positions or get in touch to talk to anyone on the team.
Addressing and mitigating the effects of climate change requires a collective effort, bringing our strengths to bear across industry, government, academia, and civil society. As we continue to explore the role of technology to advance the art of the possible, we are launching the Microsoft Climate Research Initiative (MCRI). This community of multi-disciplinary researchers is working together to accelerate cutting-edge research and transformative innovation in climate science and technology.
MCRI enables us to bring Microsoft’s research skills and compute capacities to deep and continuous collaboration with domain experts. For the kickoff of this initiative, we are focusing on three critical areas in climate research where computational advances can drive key scientific transformations: Overcoming constraints to decarbonization, reducing uncertainties in carbon accounting, and assessing climate risks in more detail.
Through these collaborative research projects, we hope to develop and sustain a highly engaged research ecosystem comprising a diversity of perspectives. Researchers will offer transdisciplinary and diverse expertise, particularly in areas beyond traditional computer science, such as environmental science, chemistry, and a variety of engineering disciplines. All results of this initiative are expected to be made public and freely available to spark even broader research and progress on these important climate issues.
“As researchers, we’re excited to work together on projects specifically selected for their potential impact on global climate challenges. With Microsoft’s computational capabilities and the domain expertise from our collaborators, our complementary strengths can accelerate progress in incredible ways.”
– Karin Strauss, Microsoft
Microsoft researchers will be working with collaborators globally to co-investigate priority climate-related topics and bring innovative, world-class research to influential journals and venues.
Phase one collaborations
Carbon accounting
Real-time Monitoring of Carbon Control Progress from CO2 and Air Pollutant Observations with a Physically informed Transformer-based Neural Network
Understanding the change in CO2 emissions from measurements of CO2 concentrations, such as those made by satellites, is very useful for tracking the real-time progress of carbon reduction actions. Current CO2 observations are relatively limited, and numerical model-based methods have very low computational efficiency. The proposed study aims to develop a novel method that combines atmospheric numerical modeling and machine learning to infer CO2 emissions from satellite observations and ground monitoring sensor data.
AI based Near-real-time Global Carbon Budget (ANGCB)
Zhu Liu, Tsinghua University; Biqing Zhu and Philippe Ciais, LSCE; Steven J. Davis, UC Irvine; Wei Cao and Jiang Bian, Microsoft
Mitigation of climate change will depend upon a carbon emission trajectory that successfully achieves carbon neutrality by 2050. To that end, a global carbon budget assessment is essential. The AI-based, near-real-time Global Carbon Budget (ANGCB) project aims to provide the world’s first global carbon budget assessment based on Artificial Intelligence (AI) and other data science technologies.
Carbon reduction and removal
Computational Discovery of Novel Metal–Organic Frameworks for Carbon Capture
Removing CO2 from the environment is expected to be an integral component of keeping temperature rise below 1.5°C. However, today this is an inefficient and expensive undertaking. This project will apply generative machine learning to the design of new metal–organic frameworks (MOFs) to optimize for low-cost removal of CO2 from air and other dilute gas streams.
An Assessment of Liquid Metal Catalyzed CO2 Reduction
The CO2 reduction process can be used to convert captured carbon into a storable form as well as to manufacture sustainable fuels and materials with lower environmental impacts. This project will evaluate liquid metal-based reduction processes, identifying advantages, pinch-points, and opportunities for improvement needed to reach industrial-relevant scales. It will lay the foundation for improving catalysts and address scaling bottlenecks.
Computational Design and Characterization of Organic Electrolytes for Flow Battery and Carbon Capture Applications
Energy storage is essential to enable 100% zero-carbon electricity generation. This work will use generative machine learning models and quantum mechanical modeling to drive the discovery and optimization of a new class of organic molecules for energy-efficient electrochemical energy storage and carbon capture.
Despite encouraging progress in recycling, many plastic polymers often end up being one-time-use materials. The plastics that compose printed circuit boards (PCBs), ubiquitous in every modern device, are amongst those most difficult to recycle. Vitrimers, a new class of polymers that can be recycled multiple times without significant changes in material properties, present a promising alternative. This project will leverage advances in machine learning to select vitrimer formulations that withstand the requirements imposed by their use in PCBs.
The concrete industry is a major contributor to greenhouse gas emissions, the majority of which can be attributed to cement. The discovery of alternative cements is a promising avenue for decreasing the environmental impacts of the industry. This project will employ machine learning methods to accelerate mechanical property optimization of “green” cements that meet application quality constraints while minimizing carbon footprint.
Environmental resilience
Causal Inference to Understand the Impact of Humanitarian Interventions on Food Security in Africa
The Causal4Africa project will investigate the problem of food security in Africa from a novel causal inference standpoint. The project will illustrate the usefulness of causal discovery and estimation of effects from observational data by intervention analysis. Ambitiously, it will improve the usefulness of causal ML approaches for climate risk assessment by enabling the interpretation and evaluation of the likelihood and potential consequences of specific interventions.
Improving Subseasonal Forecasting with Machine Learning
Water and fire managers rely on subseasonal forecasts two to six weeks in advance to allocate water, manage wildfires, and prepare for droughts and other weather extremes. However, skillful forecasts for the subseasonal regime are lacking due to a complex dependence on local weather, global climate variables, and the chaotic nature of weather. To address this need, this project will use machine learning to adaptively correct the biases in traditional physics-based forecasts and adaptively combine the forecasts of disparate models.
They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation further? In addition to providing on-topic expertise, such as recommending a restaurant, it could engage in a conversation about the history of the neighborhood or a recent sports game, and then bring the conversation back on track. What if the agent’s responses continually reflect the latest world events? And what if it could do all of this without the need for any additional work by the designer?
With GODEL, this may not be far off. GODEL stands for Grounded Open Dialogue Language Model, and it ushers in a new class of pretrained language models that enable both task-oriented and social conversation and are evaluated by the usefulness of their responses.
Pretrained language models are among the engines that power conversational AI, the technology that underlies these dialog agents. They can either be task-oriented (“give me a job, and I’ll do it”) or engage in a conversation without a specified outcome, known as open-domain or chit-chat. GODEL combines both these capabilities, giving dialog agents the ability to generate responses based not just on the context of the conversation, but also on external information, content that was not part of the dataset when the model was trained. This includes both structured content, such as information stored in databases, and unstructured content, such as restaurant reviews, Wikipedia articles, and other publicly available material found on the web. This explains how a simple task-based query about restaurant recommendations can evolve into a dialog about ingredients, food, and even cooking techniques—the kind of winding path that real-world conversations take.
In 2019, the Deep Learning and Natural Language Processing groups at Microsoft Research released DialoGPT, the first large-scale pretrained language model designed specifically for dialog. This helped make conversational AI more accessible and easier to work with, and it enabled the research community to make considerable progress in this area. With GODEL, our goal is to help further this progress by empowering researchers and developers to create dialog agents that are unrestricted in the types of queries they can respond to and the sources of information they can draw from. We also worked to ensure those responses are useful to the person making the query.
One of GODEL’s key features is the flexibility it provides users in defining their model’s grounding—the sources from which their dialog agents retrieve information. This flexibility informs GODEL’s versatility in diverse conversational settings. If someone were to inquire about a local restaurant for example, GODEL would be able to provide specific and accurate responses even though that venue may not have been included in the data used to train it. Responses would vary depending on whether the grounding information is empty, a snippet of a document, a search result (unstructured text), or information drawn from a database about the restaurant (structured text). However, each response would be appropriate and useful.
In addition to specificity, grounded generation helps keep models up to date, as the grounded text can incorporate information that may not have been available at the time the model was trained. For example, if a model were developed before the 2022 Winter Olympics, GODEL would be able to provide details on those games and a list of winners even though all the data available to train it predates that event.
Broad application of GODEL
Another main feature of GODEL is its wide range of dialog applications. While its predecessor, DialoGPT, and other prior pretrained models for dialog have mostly focused on social bots, GODEL can be applied to a variety of dialogs, including those that are task-oriented, question-answering, and grounded chit-chat. In the same conversation, GODEL can produce reasonable responses for a variety of query types, including general questions or requests for specific actions.
In addition, GODEL’s responses have been evaluated for their helpfulness. In our paper, we show that evaluation is done more reliably on datasets that are goal-directed, and that people generally agree on which responses are better when asked to judge their utility towards achieving certain goals. Equipped with this robust evaluation setup, we compared our model against several strong baselines and state-of-the-art approaches and show that GODEL is superior in terms of both human and automatic evaluation, as indicated in Figure 1. The paper describes extensive experiments against other state-of-the-art pretrained language models and demonstrates that performance gains are even larger in these cases.
Figure 1: These charts illustrate GODEL’s performance against T5, a pretrained model that performed best in our evaluation. They compare the aggregate performance of models fine-tuned from GODEL against that of models fine-tuned from T5. They show that GODEL performs much better in human evaluations and makes appreciable gains in the automatic evaluation. The test set for these experiments combines a variety of dialog genres, including task-oriented dialog, conversational question-answering, and grounded chit-chat.
The following examples illustrate different dialog scenarios where GODEL uses a variety of sources to respond to identical user queries.
This example illustrates how GODEL responds in an open-ended scenario in which the user asks a question that is completely unrelated to the initial question. Despite the lack of relevance, GODEL responds appropriately while trying to bring the conversation back on track.
This example illustrates how GODEL responds in a task-oriented setting in which the model is connected to the components of a traditional goal-oriented dialog system, such as a database. In this case, the relevant environment contains structured information: a database returning two restaurants relevant to the current conversation.
This example illustrates how GODEL responds in a task-oriented setting in which traditional components of task-oriented dialog systems are not available. In this case, GODEL retrieves a restaurant review via a search engine. The response reflects both the context of the conversation and a snippet of the retrieved text, a restaurant review.
This example illustrates how GODEL responds in a question-answering scenario, where the user asks a general question and the context provides the dialog agent with the words it needs to search for the relevant information on the web.
GODEL available as open source
To advance research, we believe it is crucial to make code and models publicly available, and we have released GODEL as fully open source. We have made three versions of GODEL available: base, large, and extra-large. We are also including the code needed to retrain all pretrained models and to fine-tune models for specific tasks: the CoQA dataset, intended for conversational question-answering; the Wizard of Wikipedia and Wizard of the Internet datasets, aimed at information-seeking chats; and MultiWOZ, intended for task-completion dialogs.
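For example, assuming the released checkpoints are published as Hugging Face seq2seq models, a grounded query might look like the sketch below. The model ID and the Instruction/[CONTEXT]/[KNOWLEDGE] prompt format follow our reading of the public release and should be checked against the GODEL repository; the knowledge snippet is invented for illustration.

```python
# Illustrative use of a released GODEL checkpoint via Hugging Face transformers.
# The checkpoint ID and prompt format are assumptions to verify against the
# GODEL repository; the knowledge text is invented for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "microsoft/GODEL-v1_1-base-seq2seq"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

instruction = "Instruction: given a dialog context and related knowledge, reply with a helpful answer."
knowledge = "[KNOWLEDGE] The Golden Wok is open until 10pm and serves vegetarian dim sum."
dialog = " EOS ".join([
    "Can you recommend a restaurant nearby?",
    "Do they have vegetarian options?",
])
prompt = f"{instruction} [CONTEXT] {dialog} {knowledge}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```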
We hope GODEL helps numerous academic research teams advance the field of conversational AI with innovative dialog models while eliminating the need for significant GPU resources. We plan to continuously improve GODEL and make more models available to the research community. Please visit our project page to learn more about the GODEL project and new releases.
Acknowledgements
We would like to thank our fellow colleagues at Microsoft Research who contributed to this work and blog post: Bill Dolan, Pengcheng He, Elnaz Nouri, Clarisse Simoes Ribeiro.