REACT — A synergistic cloud-edge fusion architecture

This research paper was accepted by the eighth ACM/IEEE Conference on Internet of Things Design and Implementation (IoTDI), a premier venue on IoT. The paper describes a framework that uses large, high-accuracy deep neural network (DNN) models running in the cloud to improve the accuracy of lightweight models running on edge devices.

Leveraging the cloud and edge concurrently

The internet is evolving towards an edge-computing architecture to support latency-sensitive DNN workloads in the emerging Internet of Things and mobile computing applications domains. However, unlike cloud environments, the edge has limited computing resources and cannot run large, high accuracy DNN models. As a result, past work has focused on offloading some of the computation to the cloud to get around this limitation. However, this comes at the cost of increased latency.

For example, in edge video analytics use cases, such as road traffic monitoring, drone surveillance, and driver-assist technology, one can transmit occasional frames to the cloud to perform object detection—a task ideally suited to models hosted on powerful GPUs. The edge, meanwhile, interpolates the intermediate frames through object tracking—a relatively inexpensive computation performed using general-purpose CPUs, a low-powered edge GPU, or other edge accelerators (e.g., the Intel Movidius Neural Compute Stick). However, for most real-time applications, processing data in the cloud alone is infeasible due to strict latency constraints.

In our research paper, REACT: Streaming Video Analytics On The Edge With Asynchronous Cloud Support, we propose and demonstrate a novel architecture that leverages both the edge and the cloud concurrently to perform redundant computations at both ends. This helps retain the low latency of the edge while boosting accuracy with the power of the cloud. Our key technical contribution is in fusing the cloud inputs, which are received asynchronously, into the stream of computation at the edge, thereby improving the quality of detection without sacrificing latency.

Fusing edge and cloud detections

Figure 1(a): Orange and green boxes indicate detections from the edge and the cloud, respectively. Edge detections are available immediately, while cloud detections arrive after a delay; tracking performance degrades with every frame, indicated by the fading shades of blue.
Figure 1(b): REACT uses asynchronous cloud detections to correct the box labels and detect more objects.

We illustrate our fusion approach in REACT for object detection in videos. Figure 1(b) shows the result of object detection with a lightweight edge model. It suffers from both missed objects (e.g., the cars in Frame 1 are not detected) and misclassified objects (e.g., the van on the right of the frame has been misclassified as a car).

To address the challenges of limited edge computation capacity and the drop in accuracy from using edge models, we follow a two-pronged approach. First, since consecutive video frames are spatiotemporally correlated, it suffices to run edge object detection only once every few frames. As illustrated in Figure 1(a), edge detection runs every fifth frame, and for the intermediate frames we employ a comparatively lightweight object tracking operation. Second, to improve the accuracy of inference, select frames are asynchronously transmitted to the cloud for inference. Depending on network delay and the availability of cloud resources, cloud detections reach the edge device only after a few frames. Next, the cloud detections—which may include previously undetected objects—are merged with the current frame. To do this, we feed the cloud detection, which was made on an old frame, into another instance of the object tracker to “fast forward” it to the current time. The newly detected objects can then be merged into the current frame so long as the scene does not change abruptly. Figure 1(b) shows a visual result of our approach on a dashcam video dataset.
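
To make the control flow concrete, here is a minimal sketch of such a loop. It is an illustration rather than REACT's implementation: edge_detect, cloud_detect_async (returning a future), Tracker, and fuse are assumed helper interfaces, the recent frames are assumed to be buffered for random access, and the detection periods follow the example above.

```python
EDGE_EVERY, CLOUD_EVERY = 5, 30   # example detection periods

def react_loop(frames):
    tracker, pending = None, []               # pending: (frame_idx, future)
    for i, frame in enumerate(frames):
        if i % EDGE_EVERY == 0:
            detections = edge_detect(frame)   # fresh edge detections
            tracker = Tracker(frame, detections)
        else:
            detections = tracker.update(frame)        # cheap interpolation
        if i % CLOUD_EVERY == 0:
            pending.append((i, cloud_detect_async(frame)))    # non-blocking
        for j, fut in [p for p in pending if p[1].done()]:
            stale = fut.result()              # cloud result for old frame j
            ff = Tracker(frames[j], stale)    # second tracker instance
            for f in frames[j + 1 : i + 1]:   # "fast forward" the stale
                stale = ff.update(f)          # detections to the current time
            detections = fuse(detections, stale, new_source="cloud")
            pending.remove((j, fut))
        yield detections
```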

Here’s a more detailed description of how REACT combines the edge and cloud detections. Each detection contains objects represented by a ⟨class_label, bounding_box, confidence_score⟩ tuple. Whenever we receive a new detection (either edge or cloud), we purge from the current list the objects that were previously obtained from the same detection source. Then we form a zero matrix of size (c, n), where c and n are the numbers of objects in the current list and in the new detection, respectively. We populate each matrix cell with the Intersection over Union (IoU) value between the corresponding current and new detections, provided it is greater than 0.5. We then perform a linear sum assignment, which matches pairs of objects so as to maximize the total overlap. For matched objects, we update the confidence values, bounding box, and class label based on the new detection’s source. Specifically, our analysis reveals that edge detection models localize objects well but often assign class labels incorrectly, whereas cloud detections have higher localization error but lower class label error. Finally, new objects (unmatched ones) are added to the list of current objects with the returned confidence values, bounding boxes, and class labels. REACT’s fusion algorithm must thus consider multiple cases—such as misaligned bounding boxes and class label mismatches—to consolidate the edge and cloud detections into a single list.
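
Under these assumptions, the matching step could look like the sketch below, using SciPy's linear sum assignment. The dictionary-based detection representation and the iou helper are illustrative, not REACT's actual data structures.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse(current, new, new_source, iou_min=0.5):
    # purge objects previously obtained from the same detection source
    current = [d for d in current if d["source"] != new_source]
    m = np.zeros((len(current), len(new)))
    for i, c in enumerate(current):
        for j, n in enumerate(new):
            v = iou(c["box"], n["box"])       # assumed IoU helper
            if v > iou_min:
                m[i, j] = v                   # only confident overlaps count
    rows, cols = linear_sum_assignment(-m)    # maximize total IoU
    matched = set()
    for i, j in zip(rows, cols):
        if m[i, j] > 0:                       # a genuine match: merge the pair
            matched.add(j)
            if new_source == "cloud":
                current[i]["label"] = new[j]["label"]  # cloud labels are better
            else:
                current[i]["box"] = new[j]["box"]      # edge boxes are better
            current[i]["score"] = new[j]["score"]
    # unmatched new objects are genuinely new: add them to the list
    current += [dict(new[j], source=new_source)
                for j in range(len(new)) if j not in matched]
    return current
```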

Detector     | Backbone     | Where | #params
Faster R-CNN | ResNet50-FPN | Cloud | 41.5M
RetinaNet    | ResNet50-FPN | Cloud | 36.1M
CenterNet    | DLA34        | Cloud | 20.1M
TinyYOLOv3   | DN19         | Edge  | 8.7M
SSD          | MobileNetV2  | Edge  | 3.4M

Table 1: Models used in our evaluation

In our experiments, we used state-of-the-art computer vision models for object detection at both the edge and the cloud (see Table 1). We measured performance with mAP@0.5 (mean average precision at 0.5 IoU), a metric popular in the computer vision community for object detection. To evaluate the efficacy of REACT, we looked at two datasets:

  1. VisDrone: a drone-based surveillance dataset
  2. D2-City: a dashcam-based driver-assist dataset

Based on our evaluation, we observed that REACT outperforms baseline algorithms by as much as 50%. Also, we noted that edge and cloud models can complement each other, and overall performance improves due to our edge-cloud fusion algorithm.

As already noted, the object detector runs only once every few frames, and lightweight object tracking is performed on the intermediate frames. Running detection redundantly at both the edge and the cloud allows an application developer to flexibly trade off the frequency of edge versus cloud executions while achieving the same accuracy, as shown in Figure 2. For example, if the edge device experiences thermal throttling, we can pick a lower edge detection frequency (say, once every 20 frames) and complement it with cloud detection once every 30 frames to get an mAP@0.5 of around 22.8. However, if there are fewer constraints at the edge, we can increase the edge detection frequency to once every five frames and reduce cloud detections to once every 120 frames for similar performance (mAP@0.5 of 22.7). This provides a playground for fine-grained programmatic control.
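
As a toy illustration of such programmatic control, an application could pick the cheapest frequency pair that meets its accuracy target. The two operating points below are the ones quoted above; the per-detection cost weights are hypothetical.

```python
# (edge_period, cloud_period) -> measured mAP@0.5, e.g., from profiling runs
MAP_TABLE = {(20, 30): 22.8, (5, 120): 22.7}

def pick_config(target_map, edge_cost=1.0, cloud_cost=5.0):
    feasible = [(e, c) for (e, c), m in MAP_TABLE.items() if m >= target_map]
    # amortized per-frame cost falls as the detection period grows
    return min(feasible,
               key=lambda ec: edge_cost / ec[0] + cloud_cost / ec[1],
               default=None)

print(pick_config(22.5))   # -> (20, 30) under these cost weights
```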

Figure 2: mAP@0.5 values for varying cloud and edge detection frequencies on the D2-City dataset. Similar shading corresponds to similar mAP@0.5, highlighting the trade-off: to maintain accuracy, one can increase cloud detection frequency while reducing edge frequency, and vice versa.

Further, one can amortize the cost of cloud resources over multiple edge devices by having them share the same cloud-hosted model. Specifically, if an application can tolerate a median latency of up to 500 ms, a single V100 GPU can support over 60 concurrent devices (Figure 3).

Figure 3: Median (50th percentile) response time vs. the number of edge devices concurrently sharing a cloud GPU, for four GPU types. Response time rises slowly until a knee point, after which the increase is much faster.

Conclusion

REACT represents a new paradigm of edge + cloud computing that leverages the resources of each to improve accuracy without sacrificing latency. As we have shown above, the choice between offloading and on-device inference is not binary, and redundant execution at cloud and edge locations complements each other when carefully employed. While we have focused on object detection, we believe this approach could be employed in other contexts, such as human pose estimation and instance and semantic segmentation, to get the “best of both worlds.”

Achieving Zero-COGS with Microsoft Editor Neural Grammar Checker

Microsoft Editor provides AI-powered writing assistance to millions of users around the world. One of its features that writers of all levels and domains rely on is the grammar checker, which detects grammar errors in a user’s writing and offers suggested corrections and explanations of the detected errors.

The technology behind the grammar checker has evolved significantly since the 1970s, when the first-generation tool was based on simple pattern matching. A major breakthrough occurred in 1997, when Microsoft Word 97 introduced a grammar checker that relied on a full-fledged natural language processing system (Heidorn, 2000), enabling more sophisticated and accurate error detection and correction. Another major breakthrough occurred in 2020, when Microsoft launched a neural grammar checker that leveraged deep neural networks with a novel fluency boost learning and inference mechanism, achieving state-of-the-art results on both the CoNLL-2014 and JFLEG benchmark datasets [1,2]. In 2022, Microsoft released a highly optimized version of the Microsoft Editor neural grammar checker on expanded endpoints in Word Win32, Word Online, Outlook Online, and the Editor Browser Extension.

In this blog post, we will describe how we optimized the Editor neural grammar checker model using the Aggressive Decoding algorithm pioneered by Microsoft Research (MSR) and accelerated it with the high-performance ONNX Runtime (ORT). With the Aggressive Decoding algorithm and ORT optimizations, the server model achieved a ~200% increase in inference speed while saving two-thirds of the cost, with no loss of model prediction quality compared to the previous production model.

But we did not stop there. We also implemented EdgeFormer, MSR’s cutting-edge on-device seq2seq modeling technology, to obtain a lightweight generative language model with competitive performance that can be run on a user’s device, allowing us to achieve the ultimate zero-cost-of-goods-sold (COGS) goal.

Shipping a client model offers three other key benefits in addition to achieving zero-COGS:

  1. Increased privacy. A client model that runs locally on the user’s device does not need to send any personal data to a remote server.
  2. Increased availability. A client model operates offline without relying on network connectivity, bandwidth, or server capacity.
  3. Reduced cost and increased scalability. Shipping a client model to a user’s device removes all the computation that a server would be required to execute, which allows us to ship to more customers.

Additionally, we leveraged GPT-3.5 (the most advanced AI model at the time) to generate high-quality training data and to identify and remove low-quality training examples, boosting model performance.

Innovation: Aggressive Decoding

Behind the AI-powered grammar checker in Microsoft Editor is the transformer model, enhanced by cutting-edge research innovations[1,2,3] from MSR for grammar correction. As with most seq2seq tasks, we used autoregressive decoding for high-quality grammar correction. However, conventional autoregressive decoding is very inefficient as it cannot fully utilize modern computing devices (CPUs, GPUs) due to its low computational parallelism, which results in high model serving costs and prevents us from scaling quickly to more (web/desktop) endpoints.

To address the challenge of serving cost reduction, we adopted the latest decoding innovation, Aggressive Decoding [4], published by MSR researchers Tao Ge and Furu Wei at ACL 2021. Unlike previous methods that speed up inference at the cost of a drop in prediction quality, Aggressive Decoding is the first efficient decoding algorithm for lossless speedup of seq2seq tasks, such as grammar checking and sentence rewriting. It works for tasks whose inputs and targeted outputs are highly similar: it uses the input as a draft of the targeted output and verifies the draft in parallel, instead of decoding sequentially, one by one, as in conventional autoregressive decoding. As a result, it can substantially speed up the decoding process without sacrificing quality by better utilizing the parallel computing capabilities of modern computing devices, such as PCs with graphics processing units (GPUs), enabling the service to handle trillions of requests per year.

Figure: Aggressive Decoding speculatively uses the input text as the draft output and verifies the draft in parallel, achieving the same result with much less time cost.

The figure above shows how Aggressive Decoding works. If we find a bifurcation during Aggressive Decoding, we discard all the predictions after the bifurcation and re-decode them using conventional one-by-one autoregressive decoding. If, during one-by-one re-decoding, we find a suffix match between the output and the input (highlighted with the blue dotted lines in the figure), we switch back to Aggressive Decoding by copying the input tokens that follow the matched suffix (highlighted with the orange dashed lines) to the decoder input, assuming the output will continue to match the input. In this way, Aggressive Decoding guarantees that the generated tokens are identical to those of autoregressive greedy decoding but with far fewer decoding steps, significantly improving decoding efficiency.
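
The loop below is a schematic sketch of this verify-and-re-anchor mechanism, not the production implementation. It assumes a predict_all function that, in a single parallel decoder pass over a prefix (with conditioning on the source handled inside the function), returns the greedy prediction at every position; end-of-sentence details and batching are simplified.

```python
def aggressive_decode(predict_all, src, max_len=128, k=2, eos=2):
    # predict_all(tokens) -> list p of len(tokens)+1 greedy predictions,
    # where p[j] is the model's choice for token j given tokens[:j].
    out, draft = [], list(src)            # speculate: output equals the input
    while len(out) < max_len:
        p = predict_all(out + draft)      # one parallel verification pass
        n_ok = 0                          # verify draft tokens left to right
        while n_ok < len(draft) and p[len(out) + n_ok] == draft[n_ok]:
            n_ok += 1
        out += draft[:n_ok]               # accept the verified prefix
        nxt = p[len(out)]                 # first token the model disagrees on
        if nxt == eos:
            break
        out.append(nxt)                   # one conventional decoding step
        # re-anchor: if the last k output tokens occur in the source,
        # speculate that the output copies the source from there on
        draft = []
        for s in range(len(src) - k, -1, -1):
            if src[s:s + k] == out[-k:]:
                draft = list(src[s + k:])
                break
    return out
```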

Offline evaluations

We test Aggressive Decoding in grammar correction and other text rewriting tasks, such as text simplification, with a 6+6 standard transformer as well as a transformer with deep encoder and shallow decoder. All results confirm that Aggressive Decoding can introduce a significant speedup without quality loss.

Model | CoNLL14 F0.5 / speedup | NLCC-18 F0.5 / speedup | Wikilarge SARI / BLEU / speedup
6+6 Transformer (beam=1) | 61.3 / 1 | 29.4 / 1 | 36.1 / 90.7 / 1
6+6 Transformer (AD) | 61.3 / 6.8 | 29.4 / 7.7 | 36.1 / 90.7 / 8

Model | CoNLL14 F0.5 / speedup
12+2 Transformer (beam=1) | 66.4 / 1
12+2 Transformer (AD) | 66.4 / 4.2

And it can work even better on more powerful computing devices that excel at parallel computing (e.g., A100):

Figure: Speedup from Aggressive Decoding on different computing devices. Aggressive Decoding yields larger speedups on more advanced devices (i.e., V100 and A100 with fp16), demonstrating its potential on even more powerful future hardware (e.g., H100 with fp8).

Online evaluation

We ran an A/B experiment between a Marian server model and an equal-sized server model with Aggressive Decoding using ONNX Runtime. The latter shows more than 2x improvement at p50 and more than 3x improvement at p95 and p99 over the Marian runtime with conventional autoregressive decoding on CPU, as shown in the graph below. Moreover, it offers better latency stability than autoregressive decoding, whose latency varies drastically (approximately in proportion to the sentence length), since Aggressive Decoding completes in only a few steps of parallel computing regardless of sentence length. This substantial inference speedup resulted in a two-thirds COGS reduction in the production endpoints.

Figure: Latency comparison between the Marian server model and the ONNX server model with Aggressive Decoding across fifteen regions, showing a 2x improvement at the 50th percentile and 3x improvements at the 95th and 99th percentiles.

Both offline and online evaluations confirm that Aggressive Decoding achieves a significant COGS reduction without any loss of model prediction quality. Based on these results, we generalized Aggressive Decoding [5] to more general seq2seq tasks. Its high efficiency with lossless quality makes Aggressive Decoding likely to become the de facto decoding standard for seq2seq tasks and to play a vital role in reducing the cost of seq2seq model deployment.

Accelerate Grammar Checker with ONNX Runtime

ONNX Runtime is a high-performance inference engine, developed by Microsoft, that runs AI models across various hardware targets. A wide range of ML-powered Microsoft products leverage ONNX Runtime for inferencing acceleration. To further reduce inferencing latency, the PyTorch grammar checker with Aggressive Decoding was exported to ONNX format using the PyTorch-ONNX exporter, then inferenced with ONNX Runtime, which enables transformer optimizations and quantization for CPU performance acceleration as well as model size reduction. A number of techniques are enabled in this end-to-end solution to run the advanced grammar checker model efficiently.

PyTorch provides a built-in function to export a PyTorch model to ONNX format with ease. To support the unique architecture of the grammar checker model, we enabled export of complex nested control flows to ONNX in the exporter. During this effort, we also extended the official ONNX specification’s sequence type and operators to represent more complex scenarios (i.e., the autoregressive search algorithm). This eliminates the need to separately export the model’s encoder and decoder components and stitch them together later with an additional sequence generation implementation for production. With sequence type and operator support in the PyTorch-ONNX exporter and ONNX Runtime, we were able to export a single ONNX graph including the encoder, decoder, and sequence generation, which brings both efficient computation and simpler inference logic. Furthermore, the shape and type inference component of the PyTorch-ONNX exporter was enhanced to produce a valid ONNX model under stricter ONNX shape and type constraints.
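
For readers unfamiliar with this path, the generic export-and-run flow looks roughly like the following sketch. The model class, file names, and shapes are illustrative rather than the production Editor model, and exporting a forward pass that contains generation loops relies on the control-flow support described above.

```python
import torch
import onnxruntime as ort

model = MySeq2SeqWithGeneration()        # assumed module: forward(input_ids) -> output_ids
model.eval()
dummy = torch.randint(0, 1000, (1, 16))  # (batch, seq) of token ids

torch.onnx.export(
    model, (dummy,), "grammar_checker.onnx",
    input_names=["input_ids"], output_names=["output_ids"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "output_ids": {0: "batch", 1: "out_seq"}},
    opset_version=16,
)

sess = ort.InferenceSession("grammar_checker.onnx")
(output_ids,) = sess.run(None, {"input_ids": dummy.numpy()})
```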

The innovative Aggressive Decoding algorithm introduced in the grammar checker model was originally implemented in Fairseq. To make it ONNX compatible, we reimplemented the Aggressive Decoding algorithm in HuggingFace for easy exporting. When diving into the implementation, we identified certain components that are not directly supported in the ONNX standard operator set (e.g., the bifurcation detector). There are two approaches for exporting unsupported operators to ONNX and running them with ONNX Runtime: either create a graph composing several standard ONNX operators that have equivalent semantics, or implement a custom operator in ONNX Runtime with a more efficient implementation. The ONNX Runtime custom operator capability allows users to implement their own operators to run within ONNX Runtime with more flexibility. This is a tradeoff between implementation cost and performance. Considering the complexity of these components, the composition of standard ONNX operators might become a performance bottleneck. Hence, we introduced custom operators in ONNX Runtime to represent these components.

ONNX Runtime enables transformer optimizations and quantization, showing very promising performance gains on both CPU and GPU. We further enhanced encoder attention fusion and decoder reshape fusion for the grammar checker model. Another big challenge in supporting this model is its multiple model subgraphs, so we implemented subgraph fusion in the ONNX Runtime transformers optimizer and quantization tool. ONNX Runtime quantization was applied to the whole model, further improving throughput and latency.
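
As a rough illustration of that tooling (the file names are placeholders, and the model_type, head, and hidden-size settings are generic examples rather than the Editor model's actual configuration), the ORT transformers optimizer is driven along these lines:

```python
from onnxruntime.transformers import optimizer

# fuse attention and other transformer patterns into optimized ORT operators
opt_model = optimizer.optimize_model(
    "grammar_checker.onnx",
    model_type="bert",      # built-in fusion patterns; pick the closest one
    num_heads=8,
    hidden_size=512,
)
opt_model.save_model_to_file("grammar_checker.opt.onnx")
```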

Quality Enhancement by GPT-3.5 LLMs

To further improve the precision and recall of the models in production, we employ the powerful GPT-3.5 as the teacher model. Specifically, the GPT-3.5 model works in the following two ways to help improve the result:

  • Training data augmentation: We fine-tune the GPT-3.5 model and use it to generate labels for massive unannotated texts. The annotations obtained are verified to be of high quality and can be used as augmented training data to enhance the performance of our model.
  • Training data cleaning: We leverage the powerful zero/few-shot capability of GPT-3.5 to distinguish between high-quality and low-quality training examples. The annotations of the identified low-quality examples are then regenerated by the GPT-3.5 model, resulting in a cleaner and higher-quality training set, which directly enhances the performance of our model.

EdgeFormer: Cost-effective parameterization for on-device seq2seq modeling

In recent years, the computational power of client devices has greatly increased, allowing for the use of deep neural networks to achieve the ultimate zero-COGS goal. However, running generative language models on these devices still poses a significant challenge, as the memory efficiency of these models must be strictly controlled. The traditional methods of compression used for neural networks in natural language understanding are often not applicable when it comes to generative language models.

Figure: Differences between a server model and a client model.

To ship a client grammar model, the model should be highly efficient (e.g., within 100ms latency), which has already been solved by Aggressive Decoding, mentioned earlier. Moreover, the client model must be memory-efficient (e.g., within a 50MB RAM footprint), which is the main bottleneck for a powerful (generative) transformer model (usually over 50 million parameters) to run on a client device.

To address this challenge, we introduce EdgeFormer[6], a cutting-edge on-device seq2seq modeling technology for obtaining lightweight generative language models with competitive performance that can be easily run on a user’s computer.

Figure: Latency and memory budgets for shipping the client DNN grammar checker. Aggressive Decoding addresses the latency challenge, while the memory challenge is resolved by another innovation, EdgeFormer.

The main idea of EdgeFormer is two principles we proposed for cost-effective parameterization:

  • Encoder-favored parameterization
  • Load-balanced parameterization

Figure: The left panel shows parameters’ load in different network architectures; the right panel shows that either underusing or overusing a parameter is undesirable, suggesting we balance the load across parameters.

We designed EdgeFormer with the above principles of cost-effective parameterization, allowing each parameter to be utilized to its maximum potential, which achieves competitive results despite the stringent computational and memory constraints of client devices.
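
As a toy illustration of the two principles (a sketch of the idea in PyTorch, not EdgeFormer's actual architecture): spend most of the parameter budget on a deep encoder, and reuse a single decoder layer for several steps so each decoder parameter carries a balanced share of the load.

```python
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, d=512, enc_layers=12, dec_steps=2):
        super().__init__()
        # encoder-favored: a deep encoder holds most of the parameters
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True),
            num_layers=enc_layers)
        # load-balanced: one decoder layer, applied repeatedly (weight sharing)
        self.dec_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.dec_steps = dec_steps

    def forward(self, src_emb, tgt_emb):
        memory = self.encoder(src_emb)
        x = tgt_emb
        for _ in range(self.dec_steps):
            x = self.dec_layer(x, memory)
        return x
```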

Based on EdgeFormer, we further propose EdgeLM, the pretrained version of EdgeFormer, which is the first publicly available pretrained on-device seq2seq model that can be easily fine-tuned for seq2seq tasks with strong results. EdgeLM serves as the foundation model of the grammar client model to realize the zero-COGS goal, achieving over 5x model size compression with minimal quality loss compared to the server model.

Inference cost reduction to empower client-device deployment

Model deployment on client devices has strict requirements on hardware usage, such as memory and disk size, to avoid interference with other user applications. ONNX Runtime is well suited to on-device deployment thanks to its lightweight engine and comprehensive client-inference-focused solutions, such as ONNX Runtime quantization and ONNX Runtime extensions. In addition, to maintain service quality while meeting shipping requirements, MSR introduced a series of optimization techniques, including system-aware model optimization, model metadata simplification, and deferred parameter loading, as well as a customized quantization strategy. On top of the EdgeFormer model, these system optimizations further reduce the memory cost by 2.7x, without sacrificing model performance.

We will elaborate on each one in the following sections: 

System-aware model optimization. As the model is represented as a dataflow graph, the major memory cost for this model comes from the many subgraphs generated. As shown in the figure below, each branch in the PyTorch code is mapped to a subgraph, so we optimize the model implementation to reduce the use of branch instructions. In particular, we use greedy search as the decoder search algorithm, since beam search contains more branch instructions. This method reduces memory cost by 38%.

Mapping of a PyTorch model to an ONNX model graph: each branch of a control flow (e.g., an if/else inside a loop) in the PyTorch model becomes a subgraph, with its own metadata, in the ONNX dataflow graph.

Model metadata simplification. Also shown in the figure above, the model contains a lot of metadata that consumes memory, such as the node name and type, input and output, and parameters. To reduce the cost, we simplify the metadata to keep only the basic required information for inference. For example, the node name is simplified from a long string to an index. Besides that, we optimize the model graph implementation in ONNX Runtime to keep just one copy of the metadata, rather than duplicating all the available metadata each time a subgraph is generated.

Deferred weight loading in ONNX Runtime. Model files include both the model graph and the weights, which by default are loaded into memory together during model initialization. This increases memory usage, as shown in the figure below, because the weights are copied repeatedly during model graph parsing and conversion. To avoid this, we save the model graph and the weights separately. During initialization in ONNX Runtime, only the graph is loaded into memory for parsing and conversion. The weights, on the other hand, remain on disk, with only a pointer kept in memory through file mapping; the actual loading of weights into memory is deferred until model inference. This technique reduces the peak memory cost by 50%.

Deferred weight loading by file mapping during model initialization. By default, each step of model graph parsing and conversion (FlatBuffer, TensorProto, OrtValue) copies the weights, adding roughly three times the weight size to peak memory; with file mapping, weights are loaded only when inference starts, adding just one weight size.

ONNX Runtime quantization and ONNX Runtime extensions. Quantization is a well-known model compression technique that brings both performance acceleration and model size reduction at some cost in model accuracy. ONNX Runtime quantization offers diverse tuning knobs that allowed us to apply a customized quantization strategy: post-training, dynamic, UINT8, per-channel, all-operator quantization, chosen for minimum accuracy impact on this model. Onnxruntime-extensions provides a set of ONNX Runtime custom operators to support the common pre- and post-processing operators for vision, text, and natural language processing models. With it, the pre- and post-processing for this model, including tokenization and string manipulation, can be integrated into one self-contained ONNX model file, leading to improved performance, simplified deployment, reduced memory usage, and better portability.
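
In ONNX Runtime's quantization API, such a strategy corresponds roughly to the call below; the file names are placeholders for illustration.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# post-training dynamic quantization with UINT8 weights, per channel
quantize_dynamic(
    model_input="grammar_checker.opt.onnx",
    model_output="grammar_checker.int8.onnx",
    weight_type=QuantType.QUInt8,   # UINT8 weights
    per_channel=True,               # per-channel scales
)
```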

Conclusion

In this blog post, we have presented how we leveraged the cutting-edge research innovations from MSR and ONNX Runtime to optimize the server grammar checker model and achieve the ultimate zero-COGS goal with the client grammar checker model. The server model has achieved ~200% increase in inference speed while saving two-thirds of the cost, with no loss of model prediction quality. The client model has achieved over 5x model size compression with minimal quality loss compared to the server model. These optimizations have enabled us to scale quickly to more web and desktop endpoints and provide AI-powered writing assistance to millions of users around the world.

The innovation shared in this blog post is just the first milestone in our long-term continuous effort of COGS reduction for generative AI models. Our proposed approach is not limited to accelerating the neural grammar checker; it can be easily generalized and applied more broadly to scenarios such as abstractive summarization, translation, or search engines to accelerate large language models for COGS reduction[5,8], which is critical not only for Microsoft but also for the entire industry in the artificial general intelligence (AGI) era.

References

[1] Tao Ge, Furu Wei, Ming Zhou: Fluency Boost Learning and Inference for Neural Grammatical Error Correction. In ACL 2018.

[2] Tao Ge, Furu Wei, Ming Zhou: Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study. https://arxiv.org/abs/1807.01270

[3] Xin Sun, Tao Ge, Shuming Ma, Jingjing Li, Furu Wei, Houfeng Wang: A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-lingual Language Model. In IJCAI 2022.

[4] Xin Sun, Tao Ge, Furu Wei, Houfeng Wang: Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding. In ACL 2021.

[5] Tao Ge, Heming Xia, Xin Sun, Si-Qing Chen, Furu Wei: Lossless Acceleration for Seq2seq Generation with Aggressive Decoding. https://arxiv.org/pdf/2205.10350.pdf

[6] Tao Ge, Si-Qing Chen, Furu Wei: EdgeFormer: A Parameter-efficient Transformer for On-device Seq2seq Generation. In EMNLP 2022.

[7] Heidorn, George. “Intelligent Writing Assistance.” Handbook of Natural Language Processing. Robert Dale, Hermann L. Moisl, and H. L. Somers, editors. New York: Marcel Dekker, 2000: 181-207.

[8] Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei: Inference with Reference: Lossless Acceleration of Large Language Models. https://arxiv.org/abs/2304.04487

Large-language models for automatic cloud incident management

This research was accepted by the IEEE/ACM International Conference on Software Engineering (ICSE), which is a forum for researchers, practitioners, and educators to gather, present, and discuss the most recent innovations, trends, experiences, and issues in the field of software engineering.

The Microsoft 365 Systems Innovation research group has a paper accepted at the 45th International Conference on Software Engineering (ICSE), widely recognized as one of the most prestigious research conferences on software engineering. This paper, Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models, focuses on using state-of-the-art large language models (LLMs) to help generate recommendations for cloud incident root cause analysis and mitigation plans. With a rigorous study on real production incidents and analysis of several LLMs in different settings using semantic and lexical metrics as well as human evaluation, the research shows the efficacy and future potential of using AI for resolving cloud incidents.

Challenges of building reliable cloud services

Building highly reliable hyperscale cloud services such as Microsoft 365 (M365), which supports the productivity of hundreds of thousands of organizations, is very challenging. This includes the challenge of quickly detecting incidents, then performing root cause analysis and mitigation.

Our recent research starts with understanding the fundamentals of production incidents: we analyze the life cycle of incidents, then determine the common root causes, mitigations, and engineering efforts for resolution. In a previous paper: How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service, which won a Best Paper award at SoCC’22, we provide a comprehensive, multi-dimensional empirical study of production incidents from Microsoft Teams. From this study, we envision that automation should support incident diagnosis and help identify the root cause and mitigation steps to quickly resolve an incident and minimize customer impact. We should also leverage past lessons to build resilience for future incidents. We posit that adopting AIOps and using state-of-the-art AI/ML technologies can help achieve both goals, as we show in the ICSE paper.

Adapting large-language models for automated incident management

Recent breakthroughs in AI have enabled LLMs to develop a rich understanding of natural language. They can understand and reason over large volumes of data and complete a diverse set of tasks, such as code completion, translation, and Q&A. Given the complexities of incident management, we sought to evaluate the effectiveness of LLMs in analyzing the root cause of production incidents and generating mitigation steps.

Figure 1: Leveraging GPT-3.x for root cause analysis and mitigation: the incident’s title and summary are fed to GPT-3.x models, which generate root cause and mitigation recommendations.

In our recently published ICSE paper, we demonstrated the usefulness of LLMs for production incident diagnosis for the first time. When an incident ticket is created, the author specifies a title and describes any relevant details, such as error messages, anomalous behavior, and other information that might help with resolution. We used the title and the summary of a given incident as the input to LLMs and generated root cause and mitigation steps, as shown in Figure 1.
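
A minimal sketch of this input construction follows. The prompt template and the llm_complete callable are illustrative placeholders, as the paper's exact prompt format and serving setup are not reproduced here; the second call reflects the variant, discussed below, in which the generated root cause is added to the input when producing the mitigation plan.

```python
def build_prompt(title, summary, task):
    return (f"Incident title: {title}\n"
            f"Incident summary: {summary}\n"
            f"Recommend the most likely {task} for this incident:\n")

def recommend(llm_complete, incident):
    root_cause = llm_complete(
        build_prompt(incident["title"], incident["summary"], "root cause"))
    mitigation = llm_complete(
        build_prompt(incident["title"],
                     incident["summary"] + f"\nRoot cause: {root_cause}",
                     "mitigation steps"))
    return root_cause, mitigation
```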

We did a rigorous study on more than 40,000 incidents generated from more than 1000 services and compared several LLMs in zero-shot, fine-tuned, and multi-task settings. We find that fine-tuning the GPT-3 and GPT-3.5 models significantly improves the effectiveness of LLMs for incident data.

Effectiveness of GPT-3.x models at finding root causes

Model | BLEU-4 Top1/Top5 | ROUGE-L Top1/Top5 | METEOR Top1/Top5 | BERTScore Top1/Top5 | BLEURT Top1/Top5 | NUBIA Top1/Top5
RoBERTa | 4.21 / NA | 12.83 / NA | 9.89 / NA | 85.38 / NA | 35.66 / NA | 33.94 / NA
CodeBERT | 3.38 / NA | 10.17 / NA | 6.58 / NA | 84.88 / NA | 33.19 / NA | 39.05 / NA
Curie | 3.40 / 6.29 | 19.04 / 15.44 | 7.21 / 13.65 | 84.90 / 86.36 | 32.62 / 40.08 | 33.52 / 49.76
Codex | 3.44 / 6.25 | 8.98 / 15.51 | 7.33 / 13.82 | 84.85 / 86.33 | 32.50 / 40.11 | 33.64 / 49.77
Davinci | 3.34 / 5.94 | 8.53 / 15.10 | 6.67 / 12.95 | 83.13 / 84.41 | 31.06 / 38.61 | 35.28 / 50.79
Davinci-002 | 4.24 / 7.15 | 11.43 / 17.2 | 10.42 / 16.8 | 85.42 / 86.78 | 36.77 / 42.87 | 32.3 / 51.34
%gain for Davinci-002 | 23.26 / 13.67 | 26.44 / 10.90 | 42.16 / 21.56 | 0.61 / 0.49 | 12.72 / 6.88 | -8.45 / 1.08

Table 1: Lexical and semantic performance of different LLMs (Top1 / Top5 recommendations)

In our offline evaluation, we compared the performance of GPT-3.5 against three GPT-3 models by computing several semantic and lexical metrics (which measure text similarity) between the generated recommendations and the ground-truth root cause or mitigation steps recorded in the incident management (IcM) portal. The average gains of GPT-3.5 for the different tasks were as follows:

  1. For root cause and mitigation recommendation tasks, Davinci-002 (GPT-3.5) provided at least 15.38% and 11.9% gains over all the GPT-3 models, respectively, as shown in Table 1.
  2. When we generated mitigation plans by adding the root cause as input to the model, the GPT-3.5 model provided at least an 11.16% gain over the GPT-3 models.
  3. LLMs performed better on machine reported incidents (MRIs) as opposed to customer reported incidents (CRIs), due to the repetitive nature of the MRIs.
  4. Fine-tuning LLMs with incident data improved performance significantly. A fine-tuned GPT-3.5 model improved the average lexical similarity score by 45.5% for root cause generation and 131.3% for mitigation generation tasks over zero-shot (i.e., inferencing directly on pretrained GPT-3 or GPT-3.5 model) setting.

Looking through the incident owners’ eyes

In addition to analysis with semantic and lexical metrics, we also interviewed the incident owners to evaluate the effectiveness of the generated recommendations. Overall, GPT-3.5 outperforms GPT-3 in a majority of the metrics. More than 70% of on-call engineers gave a rating of 3 out of 5 or better for the usefulness of recommendations in a real-time production setting.

Looking forward

With future versions of LLMs coming, we expect the performance for automatic incident resolution will further improve, and the need for fine-tuning may decrease. Yet we are in the initial stage, with many open research questions in this field. For instance, how can we incorporate additional context about the incident, such as discussion entries, logs, service metrics, and even dependency graphs of the impacted services to improve the diagnosis? Another challenge is staleness since the models would need to be frequently retrained with the latest incident data. To solve these challenges, we are working on leveraging the latest LLMs combined with retrieval augmented approaches to improve incident diagnosis via a conversational interface, as shown in Figure 2.

Figure 2: Workflow of retrieval-augmented root cause analysis: a retriever draws relevant information from historical incidents, troubleshooting guides, and the engineering hub to add context to the LLM’s input.

Moreover, ChatGPT can be actively integrated into the “discussion” of the incident diagnosis. By collecting evidence from available documents and logs, the model can generate coherent, contextual, natural-sounding responses to inquiries and offer corresponding suggestions, thereby facilitating the discussion and accelerating the incident resolution process. We believe this could deliver a step-function improvement in the overall incident management process, with contextual and meaningful root cause analysis and mitigation, reducing the significant human effort required and bolstering reliability and customer satisfaction.

Acknowledgement

This post includes contributions from Toufique Ahmed during his internship at Microsoft.

Highlights from CHI 2023

The ways in which people are able to interact with technologies can have a profound effect on a technology’s utility and adoptability. Building computing tools and services around people’s natural styles of work, communication, and play can give technology the value it needs to have meaningful impact. For decades, human-computer interaction (HCI) has examined the relationship between people and computers to help maximize the capabilities of each across a range of experiences and situations.

The ACM CHI Conference on Human Factors in Computing Systems (CHI) is a renowned meeting ground for top talent in the HCI field and a showcase for some of its most compelling work. Hosted April 23 through April 28, this year’s conference drew more than 4,500 participants from 79 countries. Contributions from Microsoft researchers and their collaborators demonstrated the breadth of work inspired by the myriad and diverse ways people use computing today and will in the future.

Check out a few highlights from this year’s conference below, including researchers’ efforts to better understand the role of wellbeing in work, to augment memory through our sense of smell, and to bridge the gap between programmers and code-generating models; the last of these received an honorable mention at the conference.

“What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code-Generating Large Language Models
CHI 2023 Honorable Mention

Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Ben Zorn, Jack Williams, Neil Toronto, Andy Gordon

Programming languages are an extremely powerful form of user interface. They also happen to be extremely difficult to learn, especially for non-expert end-user programmers who lack training in computing. What if end-user programmers could instead use a natural language they already know? This prospect can be realized through large language models (LLMs): deep neural networks using the transformer architecture, trained on large corpora, and fine-tuned to generate code from natural language. Despite impressive benchmark performance, LLMs are beset with issues in practical use. Lab and field studies have shown that the mapping between natural language and code is poorly understood, that generated code can contain subtle bugs, and that generated code can be difficult to verify.

In their paper, researchers consider the specific problem of abstraction matching: when the user has well-formed intent, how do they select an utterance from the near infinite space of naturalistic utterances that they believe the system will reliably map to a satisfactory solution? This involves “matching” the utterance to the right level of “abstraction” by specifying the utterance at a level of granularity and detail that matches the set of actions the system can take and selecting suitable words and grammar.

Workplace Rhythm Variability and Emotional Distress in Information Workers

Subigya Kumar Nepal, Javier Hernandez, Judith Amores, Mehrab Bin Morshed, Robert Lewis, Hemma Prafullchandra, Mary Czerwinski

Regularity in daily activities has been linked to positive wellbeing outcomes, but previous studies have mainly focused on clinical populations and traditional daily activities such as sleep and exercise. This research extends prior work by examining the regularity of both self-reported and digital activities of 49 information workers in a four-week naturalistic study. Findings suggest that greater variability in self-reported mood, job demands, lunch time, and sleep quality may be associated with increased stress, anxiety, and depression. However, when it comes to digital activity–based measures, greater variability in rhythm is associated with reduced emotional distress. This study expands our understanding of workers and the potential insights that can be gained from analyzing technology interactions and wellbeing.

Olfactory Wearables for Targeted Memory Reactivation

Judith Amores, Nirmita Mehra, Bjoern Rasch, Pattie Maes

This paper investigates how a smartphone-controlled olfactory wearable might improve memory recall. Researchers conducted a within-subjects experiment with 32 participants using the device and not using the device (control). In the experimental condition, bursts of odor were released during visuo-spatial memory navigation tasks, which also had a language learning component, and rereleased during sleep the following night in the subjects’ home. The researchers found that compared with control, there was an improvement in memory performance when using the scent wearable in memory tasks that involved walking in a physical space. Furthermore, participants recalled more objects and translations when re-exposed to the same scent during the recall test in addition to during sleep. These effects were statistically significant, and in the object recall task, they also persisted for more than a week. This experiment demonstrates a potential practical application of olfactory interfaces that can interact with a user during wake, as well as sleep, to support memory.

AdHocProx: Sensing Mobile, Ad-Hoc Collaborative Device Formations using Dual Ultra-Wideband Radios

Richard Li, Teddy Seyed, Nicolai Marquardt, Eyal Ofek, Steve Hodges, Mike Sinclair, Hugo Romat, Michel Pahud, Jatin Sharma, William A. S. Buxton, Ken Hinckley, Nathalie Henry Riche

In their paper, researchers present AdHocProx, a system that uses device-relative, inside-out sensing to augment co-located collaboration across multiple devices without recourse to externally anchored beacons or even reliance on Wi-Fi connectivity.

AdHocProx achieves this via sensors, including dual ultra-wideband (UWB) radios for sensing distance and angle to other devices in dynamic, ad-hoc arrangements and capacitive grip to determine where the user’s hands hold the device and to partially correct for the resulting UWB signal attenuation. All spatial sensing and communication take place via the side-channel capability of the UWB radios, suitable for small-group collaboration across up to four devices (eight UWB radios).

Together, these sensors detect proximity and natural, socially meaningful device movements to enable contextual interaction techniques. Researchers find that AdHocProx can obtain 95 percent accuracy recognizing various ad-hoc device arrangements in an offline evaluation, with participants particularly appreciative of interaction techniques that automatically leverage proximity-awareness and relative orientation among multiple devices.

Escapement: A Tool for Interactive Prototyping with Video via Sensor-Mediated Abstraction of Time

Molly Jane Nicholas, Nicolai Marquardt, Michel Pahud, Nathalie Henry Riche, Hugo Romat, Christopher Collins, David Ledo, Rohan Kadekodi, Badrish Chandramouli, Ken Hinckley

This paper presents Escapement, a video prototyping tool that introduces a powerful new concept for prototyping screen-based interfaces by flexibly mapping sensor values to dynamic playback control of videos. This recasts the time dimension of video mockups as sensor-mediated interaction.

This abstraction of time as interaction, which the researchers dub video-escapement prototyping, empowers designers to rapidly explore and viscerally experience direct touch or sensor-mediated interactions across one or more device displays. The system affords cross-device and bidirectional remote (telepresent) experiences via cloud-based state sharing across multiple devices. This makes Escapement especially potent for exploring multi-device, dual-screen, or remote-work interactions for screen-based applications. Researchers share the results of observations of long-term usage of video-escapement techniques with experienced interaction designers and articulate design choices for supporting a reflective, iterative, and open-ended creative design process.

Your Mileage May Vary: Case Study of a Robotic Telepresence Pilot Roll-out for a Hybrid Knowledge Work Organization

Andriana Boudouraki, Joel E. Fischer, Stuart Reeves, Sean Rintel

Organizations wishing to maintain employee satisfaction for hybrid collaboration need to explore flexible solutions that provide value for both remote and on-site employees. This case study reports on the roll-out of a telepresence robot pilot at Microsoft Research Cambridge to test whether robots would provide enjoyable planned and unplanned encounters between remote and on-site employees. Researchers describe the work that was undertaken to prepare for the roll-out, including the occupational health and safety assessment, systems for safety and security, and the information for employees on safe and effective use practices. The pilot ended after three months, and robot use has been discontinued after weighing the opportunities against low adoption and other challenges. The researchers discuss the pros and cons within this organizational setting and make suggestions for future work and roll-outs.

Focus Time for Wellbeing and Work Engagement of Information Workers 

Koustuv Saha, Shamsi Iqbal 

Having little time for focused work is a major challenge in information work. While research has explored computing-assisted user-facing solutions for protecting time for focused work, there’s limited empirical evidence about the effectiveness of these features on wellbeing and work engagement. To address this problem, researchers studied the effects of automatically scheduling time for focused work on people’s work calendars using the “focus time” feature in Outlook calendars. The researchers conducted an experimental study over six weeks with 15 treatment and 10 control participants, who responded to survey questions on wellbeing and work engagement throughout the study. The researchers found that the treatment participants showed higher wellbeing, including increased excitement, relaxation, and satisfaction, and decreased anger, frustration, tiredness, and stress. The researchers examine the needs, benefits, and challenges of scheduling focus time and discuss the importance of, and design recommendations for, enabling mechanisms and tools supporting focused work.

Research Focus: Week of May 8, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

AWARD

Microsoft’s danah boyd awarded MIT’s Morison Prize

danah boyd, a partner researcher at Microsoft Research, has been awarded MIT’s Morison Prize in Science, Technology, and Society for outstanding work combining humanistic values with effectiveness in the world of practical affairs, particularly in science and technology.

Dr. boyd, who is also a Distinguished Visiting Professor at Georgetown University, is currently conducting a multi-year ethnographic study of the U.S. census to understand how data are made legitimate. Her previous studies have focused on media manipulation, algorithmic bias, privacy practices, social media, and teen culture. 

To learn more, see the Microsoft Research Summit presentation Statistical Imaginaries: An Ode to Responsible Data Science or the publication Differential Perspectives: Epistemic Disconnects Surrounding the U.S. Census Bureau’s Use of Differential Privacy.


AWARD

Microsoft’s Nicole Immorlica receives 2023 SIGecom Test of Time Award

Nicole Immorlica, a Senior Principal Researcher with Microsoft Research New England, has been awarded the 2023 SIGecom Test of Time Award for her work on a 2005 paper on matching markets. The award from the Association for Computing Machinery (ACM) recognizes “an influential paper or series of papers published between ten and twenty-five years ago that has significantly impacted research or applications exemplifying the interplay of economics and computation.”

In the award-winning paper, Marriage, honesty, and stability, Immorlica and a co-author explored centralized two-sided markets, such as the medical residency market, that match participants by running a stable marriage algorithm. While no matching mechanism based on a stable marriage algorithm can guarantee truthfulness as a dominant strategy, the paper showed that in certain probabilistic settings, truthfulness is the best strategy for the participants.

Spotlight: On-Demand EVENT

Microsoft Research Summit 2022

On-Demand
Watch now to learn about some of the most pressing questions facing our research community and listen in on conversations with 120+ researchers around how to ensure new technologies have the broadest possible benefit for humanity.

AWARD

Microsoft’s Lorin Crawford named 2023 COPSS Emerging Leader

Lorin Crawford, a principal researcher at Microsoft Research New England, has been named a 2023 COPSS Emerging Leader by the Committee of Presidents of Statistical Societies. The award announcement cited Crawford’s path-breaking research combining theory and methods of mathematics, statistics and computing to generate new knowledge and insight about the genetic basis of disease, and exceptional mentoring of students from multiple scientific disciplines.

The award recognizes the important role of early-career statistical scientists in shaping the future of their discipline. The selection criteria are designed to highlight contributions in areas not traditionally recognized by other early-career awards in the statistical sciences.

Crawford, who is also a faculty member at Brown University’s School of Public Health, focuses on developing novel and efficient algorithms that address complex problems in quantitative genetics, cancer pharmacology, molecular genomics, and geometric morphometrics.


AWARD

Microsoft researchers receive Test of Time award for personalized news recommendation work

A paper co-authored by two Microsoft researchers has received a 2023 Seoul Test of Time Award from the International World Wide Web Conference Committee (IW3C2). The 2010 paper, A Contextual-Bandit Approach to Personalized News Article Recommendation, was written by John Langford and Robert Schapire, along with two industry colleagues. The authors proposed a new approach for personalized recommendation using contextual bandit algorithms. According to the IW3C2, the paper now has more than 2,730 citations and has become foundational research in the area of recommendation systems.

The award announcement also states: “The paper addressed fundamental challenges in real-world recommendation systems via computationally efficient algorithms grounded in learning theory. It also showed that recommendation algorithms can be reliably evaluated offline, enabling algorithm selection without operational impact, and that contextual bandits can yield significant gains in user engagement.”


NEW RESEARCH

A Frequency Domain Approach to Predict Power System Transients

The dynamics of power grids are governed by a large number of nonlinear differential and algebraic equations (DAEs). To safely run the system, operators need to check that the states described by these DAEs stay within prescribed limits after various potential faults. However, current numerical solvers of DAEs are often too slow for real-time system operations. In addition, detailed system parameters are often not exactly known. Machine learning approaches have been proposed to reduce the computational efforts, but existing methods generally suffer from overfitting and failures to predict unstable behaviors.

In a new paper: A Frequency Domain Approach to Predict Power System Transients, Microsoft researchers propose a novel framework to predict power system transients by learning in the frequency domain. The intuition is that although the system behavior is complex in the time domain, relatively few dominant modes exist in the frequency domain. Therefore, the researchers learn to predict by constructing neural networks with Fourier transform and filtering layers. System topology and fault information are encoded by taking a multi-dimensional Fourier transform, allowing the researchers to leverage the fact that the trajectories are sparse in both temporal and spatial frequencies. This research shows that the proposed approach does not need detailed system parameters, greatly speeds up prediction, and is highly accurate for different fault types.
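As a rough illustration of this idea, a network can map an encoding of the fault and topology to a handful of complex Fourier coefficients and recover the time-domain trajectory with an inverse FFT. The sketch below is our own, with illustrative layer names and sizes; it is not the paper's architecture:

```python
import torch
import torch.nn as nn

class FrequencyDomainPredictor(nn.Module):
    """Predicts post-fault trajectories by learning a few dominant
    frequency-domain modes (hypothetical sizes, for illustration only)."""

    def __init__(self, enc_dim: int, n_buses: int, n_steps: int, n_modes: int = 16):
        super().__init__()
        self.n_steps, self.n_modes = n_steps, n_modes
        # Map the fault/topology encoding to complex Fourier coefficients
        # (real and imaginary parts) for each bus.
        self.encoder = nn.Sequential(
            nn.Linear(enc_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * n_buses * n_modes),
        )

    def forward(self, fault_encoding: torch.Tensor) -> torch.Tensor:
        b = fault_encoding.shape[0]
        coeffs = self.encoder(fault_encoding).view(b, -1, self.n_modes, 2)
        # Zero out everything beyond the kept modes: a hard low-pass
        # "filtering layer".
        spectrum = torch.zeros(
            b, coeffs.shape[1], self.n_steps // 2 + 1,
            dtype=torch.cfloat, device=coeffs.device,
        )
        spectrum[..., : self.n_modes] = torch.complex(coeffs[..., 0], coeffs[..., 1])
        # The inverse real FFT yields the predicted time-domain trajectories.
        return torch.fft.irfft(spectrum, n=self.n_steps, dim=-1)
```

The design choice this illustrates is that the network's capacity is spent on a small number of frequency-domain coefficients rather than on every time step of the trajectory.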


NEW RESEARCH

Inference with Reference: Lossless Acceleration of Large Language Models

The growing use of large foundation models like GPT-3.5/4 for real-world applications has raised concerns about high deployment costs. While general methodologies such as quantization, pruning, compression, and distillation help reduce those costs, output tokens must still be decoded sequentially, one at a time, at test time, which poses significant challenges for deploying LLMs at scale.

In a new paper: Inference with Reference: Lossless Acceleration of Large Language Models, Microsoft researchers study accelerating LLM inference by improving the efficiency of autoregressive decoding. In multiple real-world applications, this research shows that an LLM’s output tokens often come from its context. For example, in a retrieval-augmented generation scenario for a search engine, an LLM’s context usually includes relevant documents retrieved from an external corpus as reference according to a query, and its output usually contains many text spans found in the reference (i.e., retrieved documents). Motivated by this observation, the researchers propose an LLM accelerator (LLMA) to losslessly speed inference with references. Its improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs, with identical generation results as greedy decoding, in many practical generation scenarios where significant overlap between in-context reference and outputs exists. The researchers are collaborating with the Bing search team to explore integrating this technique into snippet/caption generation, Bing chat, and other potential scenarios.
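To make the copy-and-verify idea concrete, here is a minimal sketch of reference-based greedy decoding in the spirit of LLMA. This is our own illustration, not the authors' implementation; `model_logits` is a hypothetical function that returns next-token logits for every position of a sequence in one forward pass, and EOS handling is omitted:

```python
import numpy as np

def find_reference_span(output, reference, k=4, match_len=2):
    """If the last `match_len` generated tokens occur in the reference,
    propose the next `k` reference tokens as a draft continuation."""
    tail = output[-match_len:]
    for i in range(len(reference) - match_len):
        if reference[i:i + match_len] == tail:
            return reference[i + match_len:i + match_len + k]
    return []

def decode_with_reference(prompt, reference, model_logits, max_new=64):
    output = list(prompt)
    while len(output) - len(prompt) < max_new:
        draft = find_reference_span(output, reference)
        candidate = output + draft
        # One forward pass scores every candidate position in parallel.
        logits = model_logits(candidate)      # shape: (len(candidate), vocab)
        greedy = np.argmax(logits, axis=-1)   # greedy next token per position
        # Accept the longest draft prefix the model itself would have
        # produced, then append one model-chosen token.
        n_accept = 0
        for j, tok in enumerate(draft):
            if greedy[len(output) - 1 + j] == tok:
                n_accept += 1
            else:
                break
        output += draft[:n_accept]
        output.append(int(greedy[len(output) - 1]))
    return output
```

Because every draft token is checked against the model's own greedy choice before it is accepted, the final output is identical to plain greedy decoding; the speed-up comes from scoring several copied tokens in a single forward pass instead of one pass per token.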


NEW RESEARCH

High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation

Quantum chemical calculations on atomistic systems have evolved into a standard approach to studying molecular matter. But these calculations often involve a significant amount of manual input and expertise. Most of these calculations could be automated, reducing the need for specialized software expertise and hardware access.

In a new paper: High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation, researchers from Microsoft present the AutoRXN workflow, an automated workflow for exploratory high-throughput electronic structure calculations of molecular systems.

This workflow (i) uses density functional theory methods to deliver minimum and transition-state structures and corresponding energies and properties, (ii) launches coupled cluster calculations for the optimized structures to provide more accurate energy and property estimates, and (iii) evaluates multi-reference diagnostics to check the reliability of the coupled cluster results, routing potential multi-configurational cases to automated multi-configurational calculations.
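Schematically, the three-stage control flow reads like the sketch below. This is our own illustration with hypothetical placeholder callables standing in for the external chemistry engines; it is not the AutoRXN API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StageResult:
    geometry: str
    energy: float
    needs_multireference: bool = False  # set from diagnostics (e.g., a T1-style check)

def reaction_pipeline(
    structures: List[str],
    run_dft: Callable[[str], StageResult],   # (i) DFT optimization
    run_cc: Callable[[str], StageResult],    # (ii) coupled cluster refinement
    run_mc: Callable[[str], StageResult],    # multi-configurational fallback
) -> List[StageResult]:
    results = []
    for s in structures:
        dft = run_dft(s)              # optimized geometry + DFT energy
        cc = run_cc(dft.geometry)     # more accurate CC energy on that geometry
        # (iii) Multi-reference diagnostics decide whether the
        # single-reference CC result can be trusted; flagged cases fall
        # back to an automated multi-configurational calculation.
        results.append(run_mc(dft.geometry) if cc.needs_multireference else cc)
    return results
```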

All calculations take place in a cloud environment and support massive computational campaigns. Key features of all components of the AutoRXN workflow are autonomy, stability, and minimal operator interference.

The paper was recently published in The Journal of Chemical Physics.

The post Research Focus: Week of May 8, 2023 appeared first on Microsoft Research.


Using generative AI to imitate human behavior

Using generative AI to imitate human behavior

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

An overview of our method, providing a side-by-side comparison of text-to-image diffusion, with observation-to-action diffusion. On the right are diagrams of the different denoising architectures tested, as well an illustration of the sampling schemes explored.
Figure 1: Overview of our method.

Diffusion models have emerged as a powerful class of generative AI models. They have been used to generate photorealistic images and short videos, compose music, and synthesize speech. And their uses don’t stop there. In our new paper, Imitating Human Behaviour with Diffusion Models, we explore how they can be used to imitate human behavior in interactive environments.

This capability is valuable in many applications. For instance, it could help automate repetitive manipulation tasks in robotics, or it could be used to create humanlike AI in video games, which could lead to exciting new game experiences—a goal particularly dear to our team.

We follow a machine learning paradigm known as imitation learning (more specifically, behavior cloning). In this paradigm, we are provided with a dataset of the observations a person saw and the actions they took while acting in an environment, and we would like an AI agent to mimic this behavior. In interactive environments, at each time step, an observation \(o_t\) is received (e.g., a screenshot of a video game), and an action \(a_t\) is then selected (e.g., a mouse movement). With this dataset of many \(o\)’s and \(a\)’s performed by some demonstrator, a model \(\pi\) can try to learn the mapping from observation to action, \(\pi(o) \to a\).


When the actions are continuous, training a model to learn this mapping introduces some interesting challenges. In particular, what loss function should be used? A simple choice is mean squared error, as often used in supervised regression tasks. In an interactive environment, this objective encourages an agent to learn the average of all the behaviors in the dataset.

If the goal of the application is to generate diverse human behaviors, the average might not be very useful. After all, humans are stochastic (they act on whims) and multimodal creatures (different humans might make different decisions). Figure 2 depicts the failure of mean squared error to mimic the true action distribution (marked in yellow) when it is multimodal. It also includes several other popular choices for the loss function when doing behavior cloning.

This toy example (based on an arcade claw game) shows an action space with two continuous action dimensions. It shows that popular choices of behavioral cloning loss fail to capture the true distribution, but diffusion models offer a good approximation.
Figure 2: This toy example (based on an arcade claw game) shows an action space with two continuous action dimensions. Here the demonstration distribution is marked in yellow—it is both multimodal and has correlations between action dimensions. Diffusion models offer a good imitation of the full diversity in the dataset.
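To make this failure mode concrete, here is a small, self-contained toy example of our own (in the spirit of Figure 2, simplified to one action dimension): with two behavior modes, the MSE-optimal prediction collapses to the mean, an action no demonstrator ever took.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two behavior modes for the same observation: steer the claw far left
# or far right.
actions = np.concatenate([
    rng.normal(-1.0, 0.05, 500),   # mode A demonstrators
    rng.normal(+1.0, 0.05, 500),   # mode B demonstrators
])
# The constant prediction that minimizes mean squared error is the mean:
mse_prediction = actions.mean()
print(round(float(mse_prediction), 3))  # ~0.0, an action nobody demonstrated
```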

Ideally, we’d like our models to learn the full variety of human behaviors. And this is where generative models help. Diffusion models are a specific class of generative model that are both stable to train and easy to sample from. They have been very successful in the text-to-image domain, which shares this one-to-many challenge—a single text caption might be matched by multiple different images.

Our work adapts ideas that have been developed for text-to-image diffusion models to this new paradigm of observation-to-action diffusion. Figure 1 highlights some differences. One obvious point is that the object we are generating is now a low-dimensional action vector (rather than an image). This calls for a new design for the denoising network architecture. In image generation, heavy convolutional U-Nets are in vogue, but these are less applicable for low-dimensional vectors. Instead, we designed and tested three different architectures, shown in Figure 1.
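For intuition, a denoiser in this setting can be as small as a conditional MLP over the action vector. The sketch below is an illustrative assumption on our part, not one of the three architectures from Figure 1:

```python
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Toy denoising network for observation-to-action diffusion
    (hypothetical layer sizes, for illustration only)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # Conditions on the observation, the noisy action, and the
        # diffusion timestep; outputs a prediction of the added noise.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, noisy_action: torch.Tensor,
                t: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim); noisy_action: (batch, act_dim); t: (batch, 1)
        return self.net(torch.cat([obs, noisy_action, t], dim=-1))
```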

In observation-to-action models, sampling a single bad action during an episode can throw an agent off course, and hence we were motivated to develop sampling schemes that would more reliably return good action samples (also shown in Figure 1). This problem is less severe in text-to-image models, since users often have the luxury of selecting a single image from among several generated samples and ignoring any bad images. Figure 3 shows an example of this, where a user might cherry-pick their favorite, while ignoring the one with nonsensical text.

Four samples from a text-to-image diffusion model from Bing using the prompt “A cartoon style picture of people playing with arcade claw machine”. Some of the samples are good quality, some contain errors, for example the text in one image is nonsensical.
Figure 3: Four samples from a text-to-image diffusion model from Bing (note this is not our own work), using the prompt “A cartoon style picture of people playing with arcade claw machine”.

We tested our diffusion agents in two different environments. The first, a simulated kitchen environment, is a challenging high-dimensional continuous control problem where a robotic arm must manipulate various objects. The demonstration dataset is collected from a variety of humans performing various tasks in differing orders. Hence there is rich multimodality in the dataset.

We found that diffusion agents outperformed baselines in two respects: 1) the diversity of behaviors they learned was broader and closer to the human demonstrations, and 2) their rate of task completion (a proxy for reward) was higher.

The videos below highlight the ability of diffusion to capture multimodal behavior: starting from the same initial conditions, we roll out the diffusion agent eight times. Each time it selects a different sequence of tasks to complete.

Eight short clips show a robotic arm interacting with the kitchen environment, each rollout performing a different sequence of tasks.

The second environment tested was a modern 3D video game, Counter-Strike. We refer interested readers to the paper for results.

In summary, our work has demonstrated how exciting recent advances in generative modeling can be leveraged to build agents that behave in humanlike ways in interactive environments. We’re excited to continue exploring this direction; watch this space for future work.

For more detail on our work, please see our paper and code repo.

The post Using generative AI to imitate human behavior appeared first on Microsoft Research.


Inferring rewards through interaction

Inferring rewards through interaction

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

A diagram in which five newspaper icons are lined up in the middle, the first of which is labeled a. An arrow points from the newspaper to an icon of a person above it. The person is labeled x and has a mouse click icon next to it and a thought bubble with the words “I like this!” that’s labeled r. An arrow points from the mouse click icon to a box labeled “recommender system” under the newspapers.

Reinforcement learning (RL) hinges on the power of rewards, driving agents—or the models doing the learning—to explore and learn valuable actions. The feedback received through rewards shapes their behavior, culminating in effective policies. Yet crafting reward functions is a complex, laborious task, even for experts. A more appealing option, particularly for the people ultimately using systems that learn from feedback over time, is an agent that can automatically infer a reward function. The interaction-grounded learning (IGL) paradigm from Microsoft Research enables agents to infer rewards through the very process of interaction, utilizing diverse feedback signals rather than explicit numeric rewards. Although the agent never observes an explicit reward, the feedback it receives depends on a binary latent reward, and through this dependence the agent can learn a policy that maximizes the unseen latent reward.

In our paper “Personalized Reward Learning with Interaction-Grounded Learning,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR), we propose a novel approach to solving the IGL problem: IGL-P. IGL-P is the first IGL strategy for context-dependent feedback, the first to use inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. This approach provides a scalable alternative to current personalized agent learning methods, which can require expensive high-dimensional parameter tuning, handcrafted rewards, and/or extensive and costly user studies.

IGL-P in the recommender system setting

IGL-P is particularly useful for interactive learning applications such as recommender systems. Recommender systems help people navigate increasing volumes of content by providing personalized content suggestions. However, without explicit feedback, a recommender system can’t know for certain whether a person enjoyed the displayed content. To compensate, modern recommender systems equate implicit feedback signals with user satisfaction. Despite the popularity of this approach, implicit feedback is not the true reward. Even the click-through rate (CTR) metric, the gold standard for recommender systems, is an imperfect reward, and optimizing it naturally promotes clickbait.

Interaction-grounded learning (IGL) for the recommender system setting. The recommender system receives features describing a person (x), recommends an item (a), and observes implicit user feedback (y), which is dependent on the latent reward (r) but not r itself, to learn how to better recommend personalized content to the individual.

This problem has led to the handcrafting of reward functions with various implicit feedback signals in modern recommender systems. Recommendation algorithms will use hand-defined weights for different user interactions, such as replying to or liking content, when deciding how to recommend content to different people. This fixed weighting of implicit feedback signals might not generalize across a wide variety of people, and thus a personalized learning method can improve user experience by recommending content based on user preferences.


The choice of reward function is further complicated by differences in how people interact with recommender systems. A growing body of work shows that recommender systems don’t provide consistently good recommendations across demographic groups. Previous research suggests that this inconsistency has its roots in user engagement styles. In other words, a reward function that might work well for one type of user might (and often does) perform poorly for another type of user who interacts with the platform differently. For example, older adults have been found to click on clickbait more often. If the CTR is used as an objective, this group of users will receive significantly more clickbait recommendations than the general public, resulting in higher rates of negative user experiences and leading to user distrust in the recommender system.

IGL-P provides a novel approach to optimizing content for latent user satisfaction—that is, rewards that a model doesn’t have direct access to—by learning personalized reward functions for different people rather than requiring a fixed, human-designed reward function. IGL-P learns representations of diverse user communication modalities and how these modalities depend on the underlying user satisfaction. It assumes that people may communicate their feedback in different ways but that a given person expresses (dis)satisfaction or indifference to all content in the same way. This enables the use of inverse kinematics to recover the latent reward. With the additional assumptions that rewards are rare when the agent acts randomly and that some negatively labeled interactions are directly accessible to the agent, IGL-P recovers the latent reward function and leverages it to learn a personalized policy.

IGL-P successes

The success of IGL-P is demonstrated with experiments using simulations, as well as with real-world production traces. IGL-P is evaluated in three different settings:

  • A simulation using a supervised classification dataset shows that IGL-P can learn to successfully distinguish between different communication modalities.
  • A simulation for online news recommendation based on publicly available data from Facebook users shows that IGL-P leverages insights about different communication modalities to learn better policies and achieve consistent performance among diverse user groups (the dataset, created in 2016, consists of public posts from the official Facebook pages of news companies from 2012 to 2016 and aggregated user reactions; because of this aggregation, identifying information can’t be extracted).
  • A real-world experiment deployed in the Microsoft image recommendation product Windows Spotlight showcases that the proposed method outperforms the hand-engineered reward baseline and succeeds in a practical application serving millions of people.

The post Inferring rewards through interaction appeared first on Microsoft Research.


Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz

Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz

GitHub Product Manager Kasia Sitkiewicz and Protocol Labs Research Scientist Petar Maymounkov discuss their collaboration on Gov4git on the Microsoft Research Podcast

Episode 139 | May 3, 2023

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

In this inaugural episode, host Dr. Gretchen Huizinga talks with GitHub Staff Product Manager Kasia Sitkiewicz and Protocol Labs Research Scientist Petar Maymounkov about how their collaboration on Gov4git, a governance tool for decentralized, open-source cooperation, is helping to lay the foundation for a future in which everyone can collaborate more efficiently, transparently, and easily and in the ways that meet the unique desires and needs of their respective communities. They discuss the governance features that make Gov4git more suitable for serving a broader range of communities than today’s public blockchains and the open-source book project allowing them to test the potential and limitations of the work.

Transcript

[MUSIC] 

GRETCHEN HUIZINGA: Every great idea at Microsoft Research is yearning to find its way into the hearts, minds, and hands of people. Microsoft researchers work with an amazing—and sometimes surprising—array of collaborators from across the sciences who are integral to the process of shepherding these ideas from lab to life. Welcome to Collaborators, a podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga, and in this series, we’ll dive deep into the collaboration process and illuminate how research ideas move from mind to market in our ongoing effort to enhance human abilities, strengthen human communities, and benefit human lives. 


[MUSIC ENDS] 

Welcome to Episode 1 of Collaborators. Today, I’m joined by our first two guests, Petar Maymounkov and Kasia Sitkiewicz. Petar and Kasia are working on a project that has collaboration in its DNA: Gov4git, a decentralized, transparent, and secure git-based protocol for governing open-source communities that they say circumvents more costly approaches to things like validation and dispute resolution. 

We’re going to unpack all of that in this episode. But before we do, let’s get to know our collaborators. Kasia, let’s start with you. You’re at GitHub, “an open-source platform for collaborative software development and version management.” This platform is well-known in the dev community but give us a brief elevator tour of GitHub and particularly what your role is there. 

KASIA SITKIEWICZ: Sure. So I’m happy to give an overview of GitHub. Uh, GitHub is primarily known to be a home for all developers and open-source communities. It’s one of the most popular resources for developers, as you mentioned, to share code and work on projects in collaboration. It makes [it] super easy for developers to share code files and collaborate with each other using GitHub issues, which we will be referencing in the podcast, and pull requests, uh, which we call PRs. So imagine GitHub issues being like a project description or some kind of information about what needs to be built, and PRs, um, are pretty much amendments to the code change that a community wants to merge with the main code branch, uh, and that’s very well known among developer community. So pretty much like that’s how we use version control. We know what needs to be changed, what needs to be merged, and community pretty much participates in all of those changes. And what I do at GitHub, uh, I work as a product manager. I oversee growth for GitHub Enterprise Cloud and GitHub Advanced Security, and on the side, I collaborate with Microsoft, Web3, and Microsoft Research team on, uh, working on projects like Gov4git or other Web3 partnerships where I represent GitHub and, um, trying to onboard and make those projects successful.

HUIZINGA: So there’s meta-collaboration, and then there’s micro-collaboration, and collaboration all over the place in GitHub. 

SITKIEWICZ: Exactly. Yes, we, we do like to collaborate. 

HUIZINGA: [LAUGHS] Well, you’re perfect for this show. So, Petar, you’re at Protocol Labs, “an open-source research, development, and deployment laboratory.” And, and you say you’re “building the next generation of the internet and making human existence orders of magnitude better through technology.” No pressure, right? Briefly tell us about Protocol Labs and your role in taking the internet and humanity to the next level. 

PETAR MAYMOUNKOV: Yeah, um, first, thank you for having us. Since you’re asking about the North Star mission of Protocol Labs, so to speak, I think it’s quite simple. I think it’s really trying to sort of create a better world that is both, um, it’s sustainable, fair, and inclusive, and it’s trying to do this through decentralization as a concept and technologies, of course, in particular. Now this is a mighty goal, and in practice, it, um, comprises essentially three workstreams, if you will. Um, the first thing is decentralized infrastructure, because it’s not possible to, to build anything useful without the infrastructure, and in this regard, Protocol Labs is, um, essentially working on and stewarding, uh, two products Filecoin and IPFS, which provide decentralized infrastructure in a democratic way to the whole world essentially. Um, now the second workstream is, um—Protocol Labs was one of the companies to realize early on that, uh, whenever decentralized technologies are involved, um, they go hand in hand with, uh, enabling everybody to contribute, so this raises the question of decentralized development, which is how do people collaborate across country boundaries, backgrounds, different levels of experience, and so forth. So along with all the engineering efforts, Protocol Labs is also essentially innovating workflows and culture about being productive in a decentralized development kind of, um setting. And the final workstream, uh, which kind of shows you how long term the vision is in Protocol Labs, so we recognize that, um, we cannot have a sustainable, decentralized world unless we replicate some of the important, um, sort of processes that happen in the real world, in particular the research-to-development innovation pipeline. So in the real world, this goes from academia to industry, and so forth. And part of, um, why this question is new and not the same as in the real world is because, uh, decentralized products being a type of public good, um, do not succumb to the same incentive mechanisms that drive the conventional economy. So we, we have a department called network funding and funding of public goods, which is itself involved in thinking about new mechanisms and incentives for, for making this, this process work in a repeatable fashion, basically. And my, uh, my role currently in the company is, uh, to think about facilitating decentralized development through standardized tools and protocols. 

HUIZINGA: Gotcha. Well, as we’re talking about collaboration and collaborators and you two are at two different companies, I’m going to call this question “how I met your mother”! How did Gov4git come about, and what was the initial felt need that defined the purpose? And as you answer that, tell us who’s all involved and how you each got involved on the team. Kasia, I’ll let you take the lead on this one. 

SITKIEWICZ: Sure. So I guess on my end, it all started through the passion I have for open source and the idea of decentralized communities. As I mentioned, I’m part of a lot of, uh, projects here at Microsoft and GitHub, and one of them is Web3 and Plural Technology Collaboratory that is led by Glen Weyl, and a few months ago, Glen and I, we had a conversation about how amazing git is and how amazing our GitHub communities are and overall like the efforts that they are working on towards like better world, public goods, and so on, and I share my vision for GitHub to be a tool or platform that can be accessible by anyone around the world where people can collaborate, they can own, uh, share and like earn money pretty much because of those contributions that they have. So we talk about this vision and we share the same kind of like a passion for all of those different projects and, you know, aspects of like open source, and he mentioned like, “Hey, we’re actually working on this like open-source book, uh, that will be hosted on GitHub, and we would love to do some kind of collaboration here.” And then he introduced me to Petar and Protocol Labs, and we had our first intro call. Uh, we learned like what is the objective, what problems we are trying to solve, and we put a small team of GitHub, Microsoft, and folks from Protocol Labs and a few folks also from open source, like purely I put a tweet about like, “Hey, I’m looking for contributors to this amazing project that will help with governance for open source,” and few folks reach out, and that’s how we kind of put it together. 

HUIZINGA: Right. Petar, how do you see the, the thing coming around? 

MAYMOUNKOV: So I had been working for Protocol Labs for about three and a half years. The first couple of years, I spent most of my time engineering and sort of being in the real-world decentralized development kind of environment, so I saw lots of things that work well; I saw lots of things that need improving; and over time, I developed an interest to kind of address, uh, this question sort of systematically and head on, which is when I, um, started working specifically just on this problem. And about six months ago or so, when I was starting, I was initially researching the space and what’s known. This is how I ran into Glen Weyl’s work, so eventually, we, we connected, and, um, I read sort of most of the stuff that he’s been working on and tried to sort of find a connection between this and what I knew from, from the trenches, if you will, from the engineering department, and then—and then, you know, he connected us with, um, with, uh, with Kasia. But the thing that sparked it, though, so at some point, Glen did sort of point out the specific project that he was trying to initiate, the plurality book, and this was kind of the thing that put a shape to our efforts because it was a very concrete task that we needed to figure out how to like address and accomplish in like a reasonable time. 

HUIZINGA: Yeah, so, so let’s get sort of granular about Gov4git and what it is, because I don’t think we’ve defined that, uh, from the get-go here, so, Kasia, can you kind of explain what it is and why it’s different? 

SITKIEWICZ: Sure. So Gov4git is pretty much a tool that helps, uh, open-source community to govern their community members in a more efficient, transparent, and easy way. There is a lot of problems in traditional governance model for any communities, and the larger communities are, there, there are more problems. And Gov4git is trying to solve a very particular problem of giving autonomy and ownership to the community to make decision what needs to happen and what changes the community needs to prioritize in order to make the project more successful. So, it’s just a solution that helps you to govern your communities in an efficient way. 

HUIZINGA: Yeah, so even as we’re talking, I’m thinking, OK, you’ve got Microsoft Research, you’ve got GitHub, you’ve got Protocol Labs. But do you use this to govern the things that you guys are working through as a community collaboration? 

MAYMOUNKOV: The tool itself is essentially implementing processes that kind of have organically emerged both in, in the context of Protocol Labs, as well as even other organizations like Ethereum. Um, I mean, this is the process of people kind of collaborating on specifications for decentralized protocols and so forth. For the particular—for Gov4git specifically, since the tool is still, uh, in some sense under development, but it, but it is kind of approaching MVP, we have used it internally as, as dog food, um, but not at large scale yet. 

HUIZINGA: Right. Gotcha. 

SITKIEWICZ: Yeah. And I think the beauty of Gov4git is actually very useful when you have a bigger community. Right now, our team is very small. It’s just like, uh, six people working together, so—and this is something I want to elaborate a little bit more in our, later in the podcast—but the smaller community, there is less problems, and you kind of make a decision on the fly, on the go, like, “Hey, what are we going to build next? And should we, should we focus on this or that?” So you can actually make those decisions without really spending too much time. And that’s a beauty for all startups moving fast, but the moment the community grows, you have those constraints and problems. So Gov4git is precisely designed for those growing communities and making sure the communities grow in like a very healthy way versus like there is a stop at some point where, like, you cannot make a consensus because of, you know, this person is out, or I don’t have enough information, or I don’t have rights or permissions to make those changes. So, uh, we—like Petar said—we dogfood the code, but at the same time, the use cases are like for a little bit bigger groups and communities. 

HUIZINGA: Well let’s get specific about the problems and solutions from a technical perspective. And, um, Petar, I’m going to ask you to take the lead on this. As I understand Gov4git from my non-technical perch, it’s a sort of sandbox for community governments mechanisms. How would you define the problems you’re trying to solve with Gov4git, and how are you going about solving them technically? 

MAYMOUNKOV: Yeah, this is a good way of putting it. It’s, it’s a sandbox for governance, um, solutions, so, um, indeed I have the, um, technical kind of part of this, um, project. And from, um, from a computer science point of view, governance is synonymous with trusted computation. So trusted computation is, is an abstraction or a notion whereby there is a public, uh, program or rules of governance and the community has a method of kind of—there is a, there is a device that, that executes and follows the rules of governance and the community members have, um, assurance that the rules are followed as advertised and that nobody can sidestep the system regardless of their role in the community. So governance is trusted computation to scientists, basically. Now, uh, trusted computation being a general abstraction is, is something that has various embodiments in the real world, and the most, uh, famously known currently embodiment of trusted computation are public blockchains such as Ethereum, Filecoin, and others. So we could have sort of chosen to use these existing solutions to how you build governance applications, um, but we ran into a number of practical issues with them that prevent us from delivering sort of practical results in a reasonable amount of time. And also, there are some shortcomings that prevent these solutions from reaching people in unprivileged parts of the world, so developing world, war zones, authoritarian countries. Uh, so effectively, Gov4git from a technical standpoint is a different embodiment, a different implementation, of trusted computation, which is not in competition with public blockchains. It captures a, a different tradeoff, so to speak. 

HUIZINGA: OK, talk a little bit more about the tradeoff. I mean, some of these things would represent to me a barrier to entry—I wouldn’t be able to, um, afford it. What are some of the, the upsides to Gov4git that, um, we don’t find in the other spaces? 

MAYMOUNKOV: Yeah, so to make a fair comparison, I should first give some context on the existing blockchains. Um, so the existing blockchain technologies are quite exciting, um, and they, they’re very promising. But currently, they’re in a state of having overshot in their level of ambition and slightly underdelivered, at least for the present time, and I’m sure they will eventually deliver, uh, sort of completely. So what do I mean by this? So they have overshot in the sense that they are—they provide so many features and, and they capture an extremely large set of applications, but at the same time, this of course involves a lot of complexity that they need to deal with, and this complexity hasn’t been fully sorted out yet to make them usable for sort of common cases. So what, what we’ve noticed here is that there is a large group of applications, in particular community governance, which does not need most of the features that are provided by public blockchains. And so once you realize that this is the case, you unlock much simpler solutions that have the same sort of outcome for the users. So public blockchains—let me be a little bit specific here for the technical listeners—so public blockchains, they’re global systems, so across the world. They’re capable of hosting multiple independent applications. Uh, you can think of this as independent communities which need to interact with each other at very high speeds and with a very high throughput. So the typical applications that you can think of is essentially high-volume, cross-community business or trade interactions. And, of course, this is a real use case, especially with financial systems and so forth. But, um, in contrast, community governance applications, which are sort of designed to serve humancentric deliberative processes within a community, they’re not global; they’re local to a community. They are not multiple applications; they are a single application that governs one community. And because they are human-deliberative applications, they don’t need high speeds and high throughput, so recognizing that these, um, this is the case, alternative designs for trusted computation, um, sort of emerge and this is what we’ve, what we went after. 

HUIZINGA: That’s, that’s awesome. Well, and so, Kasia, let’s go back to a little bit because we’re going to cross over here. There’s a couple of themes that are emerging that I think are really interesting. Um, you talk about, earlier, the issues in pull requests that you deal with and that Gov4git has some mechanisms to help address the tension between what I might call anarchy and dictatorship. Is there some kind of a, a mechanism that’s different that can help mitigate that? 

SITKIEWICZ: Yeah, absolutely. So, as I mentioned, there are different types of communities, and the bigger the community gets, the more issues you have. Within smaller community, you pretty much know who you’re interacting with; you know the contributors; you know who is the maintainer. And it’s actually quite fast to make those changes and like approving those pull requests and reviewing comments and issues and other activities that are happening around every project. With the bigger communities, there’s more, uh, logistics problem and governance problem, and many times, you truly don’t know who is contributing to your code source. You just know their handle. That can be anyone; that can be even some kind of like ChatGPT, especially with like right now like the generative foundation models. Like we’re going to see more problems of like interacting with non-humans, right? So I feel like communities will have more and more problems facing like, “OK, how do I manage my contributors, and, uh, how fast we want to move the project?” So Gov4git is using, uh, a lot of like beautiful features from Web3, which is quadratic voting. It’s, uh, pretty much collective decision-making procedures that involve individuals who are part of your community with allocating votes to express the degree of their preferences. So as you mention, in a traditional organization, there is one person or one dictator that tells you like, “Hey, you’re going to build that.” And once we have it, we’re going to like approve it, right? And we’re going to like ship it. With quadratic voting, the decision is made collectively. So we’re going to implement quadratic voting part of our governance model. Second feature that is also very nice is like the governance tokens. Right now, um, communities, there are few ways of like how they make decisions, either majority of the votes or through consensus. With this type of governance tokens, you will be able to see like how many people voted on a specific pull request or a feature, and the majority of the votes will be pretty much the decision-making. So community can use those governance tokens for making the decision. And lastly, uh, there is a concept of badges. So in the Web3 space, there are like NFTs, and one of the NFTs is a soulbound token, which is a token that you are given that you cannot transfer, and we believe that by implementing those soulbound tokens, you can authenticate the user, you can say, “Hey, I know you; you’re part of this community; you got this badge.” And that badge gives you, let’s say, right to receive those tokens and so on. So again, those are just like a few features that are actually like very nice in that decentralized communities that we want to bring into Gov4git so that the communities can benefit from having specific features like, uh, quadratic voting, governance tokens, or like those badges. And what I want to say is like, you know, GitHub or like other git platforms, they don’t support this type of governance features, and that’s the need from the users and customers being like, “Hey, I need something that will be very easy, efficient, and transparent,” and Gov4git provides all of it. 

HUIZINGA: Yeah. Well, and on that same topic, Petar, I always like to ask what could possibly go wrong, and even as Kasia’s talking, all kinds of things are coming into my head like, um, could a bot get an SBT or, I mean, do you have to be, provide validation to who you are and what you represent yourself as? 

MAYMOUNKOV: Yeah, so, um, let me answer the general question and the specific question. So I think the specific question about bots is that, has the following answer. So I think people in Microsoft Research in particular, but people in general, are realizing that identity is going to be much harder to, uh, prove and understand in the presence of AI. And so here we kind of—especially Glen, sort of leading with his paper on soulbound tokens, is essentially looking into something that we do in the real world, uh, which is that we have deep ways of verifying people’s identity by essentially, um, looking into their history with communities and within society. Uh, so the presence of these badges that Kasia is mentioning is essentially creating a system whereby people can collect certificates from different endeavors that they have participated in to build out a résumé that is verifiable by the communities where they participated that they are who they are. In some, in some sense, the person is the sum total of everything they’ve done for other people. And currently, a bot cannot accomplish as much as a person and get sort of, you know, certificates from other humans that this has been the case. So roughly, this addresses the question of, OK, can something go wrong with, with bots. In a sense, bot or not, to be acknowledged in a system, you have to have contributed verifiably to, to multiple communities eventually. Um, but there is a bigger sort of picture about what can possibly go wrong. And so in this regard, Gov4git kind of sits in a very standard situation with most, uh, very promising software tools, which is that it, it is, it is a powerful tool that can fall in the hands both of good and bad people, acknowledging the fact that good and bad are relative terms. And, and this is, this actually also plays on a, on a general theme in software and science, which is that software engineers and engineers, scientists and so forth, they design software which is symmetric, so the software from the start treats everybody in the same way. It doesn’t have a way of distinguishing, you know, who’s using it. And even though this sounds like the right place to be—it’s a neutral place to be—there are plenty of cases already in the real world where, um, it is unclear, you know, whether society wants symmetric treatment of everybody. The, the classical example here that I would give is, is Twitter. When it comes to the question of censorship on Twitter, there’s a few different alternative, um, kind of directions that that people can think of, of taking. One direction is to say that, uh, no censorship should happen, uh, which is the symmetric treatment. So everybody gets the same agency within a system. But as you know, there’s plenty of people who don’t like this approach. There’s other approaches, such as “somebody should censor us.” But who’s, who’s the somebody? So, so these kinds of issues all apply in this case, as well, because if governance for git is to be successful, what I hope, or, you know, cautiously hope, that it will result in, it’ll enable communities to forum at a much larger speed and a much larger volume around the world. And usually, when things speed up for humans, just like Twitter sped up discourse between people, um, we tend to find ourselves in a situation where we are slightly unprepared to, to, to reason about where does this go. 

HUIZINGA: Right. Kasia, what do you have to add to Petar’s conversation there on the “what could go wrong” from your end? 

SITKIEWICZ: I think from the product side—and I can speak as a product manager—there might be a case where like the community will come back to us like, “Hey, this is not what we want. We want something different,” right. Which, it’s a hypothesis, and can, this can, this feedback can happen, right. But at the same time, I believe that the community will ask for more. So like we are building just a very simple MVP to pretty much let the community to make those decisions, but perhaps the direction might be like, “Hey, the value’s somewhere else.” Uh, because once we launch, we can learn like, OK, this is great, but it’s not enough. So I would speak from the product side and like the user testing that perhaps we might discover like, oh, the actually true value will be somewhere else, and perhaps it can be a quadratic voting; it can be those tokens or those badges, right. So from my end, I feel like that’s the biggest like unknown, and speaking about bots and, uh, all the AI work, I feel like there is a lot of value in that, as well. So it’s not just a negative aspect of like, “Hey, I don’t want automation to be part of my project.” I think we will see it more, and there will be a lot of benefits. It’s just there are a lot of things we do not know as of now, and we just have to make sure like we are very flexible in terms of like how we pivot and how we adapt to feedback. 

HUIZINGA: Right. But, but in other ways, GitHub itself and Gov4git is a platform for people to form their own communities and govern their own communities, right? So you’re not going to be sort of the 10,000-foot hall monitor and try to meta-govern the people that are governing their own communities, correct? 

MAYMOUNKOV: Yes.

SITKIEWICZ: That’s correct, yes.

HUIZINGA: They’re nodding their heads. It’s a podcast—you can’t see it! [LAUGHS] Well, and this, this discussion on the “what could possibly go wrong” is important for me because I think people who are going to use the technology want to know that people promoting it are aware of the potential for unforeseen and unintended consequences and have a plan for mitigating. But it’s such an interesting ramp up to this new kind of use case for collaborative, open-source governance that it’s really cool. Kasia, let’s talk specifically about some of those use cases from the product side that you’ve alluded to. Um, GitHub is well known in the developer community, but how’s the idea of decentralized open-source work moving into non-technical communities and applications? 

SITKIEWICZ: Yeah, absolutely. So in any open-source project, you will find very technical contributors and maintainers and also you will find people who just like want to like observe the project or perhaps help with like project management or translation and so on. So we already have a lot of non-technical contributors who perhaps are struggling when they first log in to GitHub and they learn about git. They were like, “What the heck is that?” It’s a black box. So we truly get that feedback from customers. It’s like a very overwhelming experience, and it takes some time to wrap up and kind of learn how to use it. So the idea for Gov4git is pretty much a very simple presentation, or UI, via extension, Chrome extension, where you will see something very familiar like you see on Twitter, where you have like a post that you need to vote on, and if you are eligible to vote, you will, you’ll be able to use your tokens, uh, and vote on the decision, and you will be able to comment and interact with the community, and so on. So the ultimate goal is to create something very simple, just like a Twitter, you know, is simple, so that community is like, “Hey, I can participate, and I can put my vote, and I can contribute to this project.” So ultimately that’s the case. And the way—how we will be testing it, we talked about this book. So the book is called Plurality: Technology for Collaborative Diversity and Democracy, and it’s led by Audrey Tang and Glen Weyl and with, along with the plurality community. So the Plurality, it’s an open git-based collective book project that aims to offer a vision for the future of technology focusing on empowering and bridging social differences. So that book is on GitHub, and collaborators and maintainers who are participating are writing this book in an open-source way. And as you can imagine, writing a book is not an easy or trivial thing. You have a lot of reviews; you have everyone looking and providing feedback. So we believe that they can benefit from, uh, using Gov4git, with like management of like PRs and issues and decision-making. And, um, the initiative is already like there, right; it’s started. So we are just like trying to see like how that can—book can be completely managed by a community versus like Audrey or Glen has to like spend a lot of hours to review all of those PRs. And it sometimes is very challenging, and it’s almost impossible to go every single comment, so we believe that this can help and expedite the process and make it very transparent and efficient way to write in open source. 

HUIZINGA: Petar, talk a little bit about the other applications, including this one, from a technical perspective. Um, what makes it easier to resolve arguments and make edits with Gov4git versus other mechanisms to do that? 

MAYMOUNKOV: Gov4git, being a sandbox, at least technologically, is not trying to be prescriptive about how people do this. We’re trying to enable people to, to, to pick the mechanisms that they want for themselves, for arbitrating conflicts, so, you know, starting with, with Glen’s project, of course, we are starting with quadratic voting, and we plan, um, the quadratic voting is a, is a large, at this point, field. There’s lots of different variants of it. So we, we build the product so that over time Glen and Audrey can experiment with, you know, different types of conflict resolution and, and so forth. What Gov4git provides is the ease of adding a new mechanism that the community wants. And of course, we plan to have a library of like mechanisms that people can choose from. One nice side benefit from this entire project is that Gov4git, uh, enables people to like reflect on what they’ve done and, and what is happening. So with Gov4git, you always have a complete history, both of the governance motions of, of the community, alongside with the actual open-source collaborative work, which in particular enables academics and researchers from organizations such as the Metagovernance Project being a good example to go in there and study what types of mechanisms make for better results, basically, and kind of improve iteratively over this. 

HUIZINGA: Yeah. So it sounds like there’s a spectrum of assessment or meta-governance testing with computer scientists, product managers, academics. Even there, you see this great collaboration happening. Go back to the, the academics and other, uh, collaborators that are coming in on this. Do you find a broad spectrum of disciplines involved, not just computer scientists in academia but perhaps social scientists, legal scholars, any of these kinds of things coming into this? 

MAYMOUNKOV: Um, it’s too early to tell, but, uh, but there has been indeed interest, so, so from a few places, right. So the, the academics are indeed interested to, to consume this data when it’s available from real-world communities, because the key thing for them is to have real-world data like sufficiently scaled communities, like the Plurality book would be a great example because it’s probably expecting to have thousands of contributors. And otherwise, um, in addition to, uh, the Plurality book as like a first customer, so to speak, uh, we already have lots of interest from AI companies. So these are AI companies that are currently building open-source AI models, and they want to experiment with attaching governance to their open-source work, which is already happening on gits and GitHub. And they want—uh, because once you have governance plus open source, then you, you have a holistically democratic development of something like an AI tool. 

HUIZINGA: Right. That just struck me that you say thousands of contributors to a book and you never [LAUGHS] think of that being the case. Um … 

MAYMOUNKOV: Well, that’s a special, that’s a special book because it’s, it’s going to have translations in multiple languages, and being, being it, uh, also needs to be fact-checked, so there’s a lot of work on fact-checking that, that goes along with the writing process. 

HUIZINGA: Yeah. Sounds a bit like wiki in terms of contributors and checking and making decisions and so on. Um, is, is Gov4git even in beta yet, or is it still just, um, sandboxing itself? 

MAYMOUNKOV: Um, so the, the MVP—the first version, if you will—is, is ready and has been tested for a few months internally at Protocol Labs. What we’re missing and we’re still working on is like the user interface that brings in the non-technical users. So I guess you could say that it’s in beta. I think like our launch with the Plurality book would be the first kind of official introduction event. 

HUIZINGA: Right. Yeah, and that’s an interesting, you know, when the outsiders looking in going open source, you think software, you think developers, you think code, but there’s a lot of other applications, including writing a book, which is basically just text-based writing. So, Kasia, are there any other sort of cream-floating-to-the-top applications or products that you could see coming out of this? 

SITKIEWICZ: Technically, anyone who wants to start something new and is looking for collaborators, and it can be pretty much whatever you want to build. It doesn’t have to be like a big idea. It can be just, “Hey, I want to collaborate with someone, and I want to like figure out how to do things and how to practice.” It can be used by academics, as you mentioned. Like pretty much any, any, any person who wants to start with like building something in public, they can do it and use it. So there is no limits. It’s up to you if you want to build community around the project you’re working on. So we don’t have any restrictions, and I feel like, um, we are in the stage right now or like this AI revolution where we’re just entering this like open-source community’s growth because there is like a lot of hype right now and everybody’s interested in it. Oh, maybe I can build that. It’s just so much easier to do things right now. And, you know, if you want to grow, you have to have a community around you. Um, so I think this is just like a best practices for anyone who wants to start writing in public. Whatever is that is—it might be like just a book or a code or like learning or like sharing some information. It doesn’t really matter. And, you know, being at GitHub, we see a lot of like amazing projects regardless of the discipline and like the area, and communities are just fascinating. And I think that’s the future. Like pretty much a lot of companies will start doing open-source code, just [like] Twitter has done it, right, just to bring the transparencies, because in a decentralized world, that’s like the value proposition, like, hey, it’s a very transparent way of building, and you have a history being displayed of the decision-making. And there are a lot of companies started noticing the beauty of it, and they—I think the movement is just starting, so I see a huge growth. 

HUIZINGA: Yeah, and that leads into the last question I wanted to ask both of you, um, and you both alluded to some of this already in your answers, but just if you could encapsulate in your ideal preferred future, what is your work look like in five to 10 years? How have you changed the landscape of collaborative work, community governance, and even that concept of communities? 

MAYMOUNKOV: So I hope that well within 10 years, this tool becomes perceived as somewhat of a go-to tool for building, you know, communities from scratch, and, in particular, I actually hope that the tool reaches a critical point which you can label the beginning of intersectionality, to borrow a term from Glen’s, um, Glen’s vocabulary. Um, and what this means, this is a point where there are enough deployments of Gov4git that you have a non-trivial number of people who are members of more than one community. So in other words, communities are starting to overlap, and when, when we reach this critical point, there’s a whole new set of applications that open up because now communities can, uh, interact with each other, uh, and ask each other for various kinds of help. The classical example here is that, um, one community can ask another community whether a given member has had a long and productive career in the other community. And this kind of idea—also mostly coming from Glen—is actually a mirror image of what I mentioned earlier, what happens in the real world. So when you apply for a job with, uh, an employer, the employer being a community, this employer calls up your university to verify that you actually went there and you did a good job. So you have these two communities basically sharing information. Um, so there’s lots of applications of intersectionality, but the reason I call this a critical point is because once you get there, you actually expect the network effect that we know from social networks to start taking place. In particular, if the network of communities using Gov4git is, is large and there’s lots of intersection, then any new communities being formed would benefit a lot from reusing the same technologies because now they can benefit from all of these other communities that already exist and that they can interoperate with. This is sort of a critical point, because, uh, if we reach it, then the tool really has a chance of becoming like an international standard for like conceiving communities, basically. 

HUIZINGA: Yeah. Kasia, what would you add to that? 

SITKIEWICZ: So I will speak a little bit more high level about the data we are seeing at GitHub, and what we believe will happen is, last year we hit 100 million developers being on our platform … 

HUIZINGA: Wow. 

SITKIEWICZ: and there are like thousands and thousands of different open-source communities. And we, we see a huge growth, and especially with like the AI innovation that is happening in that space, I think this will like triple in the upcoming few years. So the more people start understanding the beauty of technology and collaboration and like writing in public, the more adoption we will have. So I think it’s just a matter of time how fast, uh, tools like Gov4git will grow and will be needed. We’re still early because, like, we don’t know what we don’t know. We know the problem. But we don’t know how the problem will, um, intensify in the upcoming like months or years, right. So I truly believe that there is a need for it. There will be a huge growth in terms of like creating new communities, and people from around the world, they can unite through using platforms like GitHub or other services where they can actually engage with other people who are passionate about the same thing. So as you mentioned, the open-source concept is not new, but it’s actually getting stronger, and the value’s there. So in my eyes, it’s just a matter of time on like the scale and the growth, and features like, like prioritization or like quadratic funding will just be more adopted by the community. So that’s my, uh, take and, uh, opinion about the space. 

[MUSIC] 

HUIZINGA: Petar and Kasia, thank you so much for coming on the show today and being our first guests on the Collaborators podcast. 

MAYMOUNKOV: It’s a pleasure. 

SITKIEWICZ: Thank you for having us. 

[MUSIC ENDS] 

The post Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz appeared first on Microsoft Research.


AI self-play for algorithm design

Figure: A self-play pipeline for a language model (LM) to improve itself in a fully automatic manner. First, the LM generates novel puzzles based on a training set of handwritten puzzles. Then, the LM attempts to solve each of these puzzles 100 times. In Step 3, the computer (specifically, a Python interpreter) filters the candidate solutions for correctness. Finally, the LM is improved by further training on these verified correct solutions to synthetic puzzles, and the process repeats. This process leads to significant improvements as measured on held-out test puzzles, which were also handwritten.

Efficient algorithms are crucial for many purposes, including reducing energy consumption in digital devices. While humans outperform AI systems at designing such algorithms, we show how to improve AI programming abilities using self-play, a technique that has helped AI systems dominate in games such as chess and Go.

Designing fast and accurate algorithms requires high-level abstract reasoning, which remains difficult for AI systems. Our approach involves having the AI design and solve its own programming challenges, enabling practice on millions of artificial challenges and exploration of problem types not found in public repositories. We detail our work in a new paper, “Language Models Can Teach Themselves to Program Better,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR).
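To make the loop concrete, here is a minimal sketch of the self-play pipeline in Python. The model object and its generate_puzzles, solve, and fine_tune methods are hypothetical stand-ins for the language-model sampling and training machinery; this illustrates the idea, not the paper’s implementation.

```python
def verify(puzzle_src: str, solution_src: str) -> bool:
    """Step 3: the Python interpreter filters candidates for correctness.
    (A real system would sandbox this; exec'ing model output is unsafe.)"""
    env = {}
    try:
        exec(puzzle_src, env)    # defines sat(answer) -> bool, the checker
        exec(solution_src, env)  # defines sol() -> answer, the candidate
        return env["sat"](env["sol"]()) is True
    except Exception:
        return False

def self_play_round(model, handwritten_puzzles, attempts=100):
    """One round of the five-step pipeline; `model` is a hypothetical API."""
    verified = []
    # Step 1: generate novel puzzles from the handwritten training set.
    for puzzle in model.generate_puzzles(handwritten_puzzles):
        # Step 2: attempt each puzzle up to 100 times.
        for _ in range(attempts):
            candidate = model.solve(puzzle)
            if verify(puzzle, candidate):  # Step 3: keep only verified solutions.
                verified.append((puzzle, candidate))
                break
    # Step 4: further train on the verified solutions. Step 5: repeat.
    model.fine_tune(verified)
    return model
```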


The key challenge and our solution

How can an AI system generate novel algorithmic programming problems without knowing the solution?

Our approach uses programming puzzles, introduced by Microsoft Research in 2021. These puzzles, known in complexity theory as the class of “NP” decision problems, are easy to check for correctness (no hidden answer key) but often difficult to solve. In this way, they’re like a Rubik’s Cube: trivial to recognize a solution, hard to find one. Three examples are illustrated below: a novel string challenge and the classic Towers of Hanoi and factoring problems. Programming puzzles range from trivial exercises to major open problems in algorithms and mathematics, and solving them requires all the major algorithmic techniques, such as dynamic programming and greedy algorithms. However, each puzzle checks just a single input, unlike standard algorithms problems, which require solutions that scale efficiently across all inputs and are therefore much harder to test.


Can computers generate valuable, novel challenges?

Surprisingly, language models such as Codex and GPT-Neo can indeed create novel puzzles when prompted with a set of example puzzles (without solutions) to generate “more like these.” You may wonder what makes a challenge good. Rather than judging challenges by how interesting they are, we prioritize useful ones. Our evaluation has the language model generate, solve, and train on its own puzzles; then we assess whether the training improved its performance on a hidden test set of puzzles. (By now, solutions to our puzzles may have leaked into AI training sets, but with the help of champion competitive programmers, we have created a secret test set that remains unpublished and can be used for uncontaminated evaluation.) In our experiments with small- to medium-sized language models (a few billion parameters, far fewer than the latest GPT models), self-training more than doubled success rates.

Risks and limitations

This research was conducted prior to GPT-4’s release. While we believe similar techniques may help GPT-4 self-improve in programming, this remains an active area of research as we better understand the capabilities and limitations of these models. One key limitation of puzzles is that solutions might only work for the specific instance provided. However, this limitation also serves as an advantage for human-AI alignment: unlike other AI challenges with inherent ambiguities that could lead to unintended consequences if objectives are imprecisely defined (for example, an AI-designed math-tutor app that unintentionally becomes addictive), our programming puzzles encompass exactly those standalone problems that can be perfectly verified as meeting a precise objective. Since there remains a risk that any work that substantially advances AI programming capabilities could be used in other systems with unintended consequences, we continue to encourage taking great care before deploying systems that run artificially generated code.

Examples of programming puzzles for AI self-play

Each puzzle is specified by a short Python program that checks a possible answer. Each solution is a Python program that outputs an answer in a limited amount of time.

Example 1: Towers of Hanoi

Figure: The Towers of Hanoi puzzle in three stages: all seven disks on the first tower, the disks split among the three towers, and all disks moved to the last tower.

The goal of the well-known Towers of Hanoi puzzle is to move all the disks from the first tower to the last tower, one by one, without ever putting a bigger disk on top of a smaller disk. It’s easy to check that a solution is correct but hard to find a correct solution. Even though the number of steps required to solve it is exponential in the number of disks, there’s a solution in the form of a short program that is often used to teach recursion. The clever solution program that outputs the moves is easier to find than the sequence of moves itself. Here are the programming puzzle and solution:
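(The original post presented the code as an image. Below is a reconstruction in the paper’s checker/solver format; the disk count and the encoding of moves as [source, target] peg pairs are illustrative choices, not the exact code from the dataset.)

```python
def sat(moves, num_disks=7):
    """Check that `moves`, a list of [source, target] peg pairs,
    legally transfers all disks from peg 0 to peg 2."""
    state = [list(range(num_disks, 0, -1)), [], []]  # peg 0 holds disks n..1
    for src, dst in moves:
        if not state[src]:
            return False  # no disk to move from the source peg
        disk = state[src].pop()
        if state[dst] and state[dst][-1] < disk:
            return False  # bigger disk placed on a smaller one
        state[dst].append(disk)
    return state[2] == list(range(num_disks, 0, -1))

def sol(num_disks=7):
    """The classic recursive solution: a short program whose output,
    the move list, is exponentially long in the number of disks."""
    def helper(n, src, spare, dst):
        if n == 0:
            return []
        return (helper(n - 1, src, dst, spare)     # park n-1 disks on the spare peg
                + [[src, dst]]                     # move the largest disk
                + helper(n - 1, spare, src, dst))  # restack the n-1 disks on top
    return helper(num_disks, 0, 1, 2)

assert sat(sol())  # the checker verifies the solver's output
```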

Example 2: String challenge

This concise puzzle perplexes AI systems, although humans find it simple. The puzzle requires a string with 1,000 “A” characters but no two consecutive A’s. Most programmers devise solutions like “ABABAB …” (1,000 times), generated by the compact Python solution shown below. In contrast, AI systems usually need multiple attempts. Fortunately, AI systems can easily verify their attempts by running the checking program. This puzzle exemplifies a straightforward, original problem created specifically for our dataset.
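(A reconstruction of the puzzle and its solution; the exact code in the original figure may differ slightly.)

```python
def sat(s: str):
    """The answer must contain 1,000 'A' characters, no two adjacent."""
    return s.count("A") == 1000 and "AA" not in s

def sol():
    # "ABAB..." repeated 1,000 times: every 'A' is separated by a 'B'.
    return "AB" * 1000

assert sat(sol())
```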

Example 3: Integer factorization

Another classic example is integer factorization. The puzzle below requires a factor of a relatively small number, so it can be solved quickly by a simple loop. However, our dataset also contains factoring challenges like the 309-digit RSA Factoring Challenge number, which was published in 1991 along with a $100,000 prize. That 309-digit number was never factored, and the challenge has since ended.
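(Again a reconstruction: the sketch below uses an illustrative n, the product of the two primes 7919 and 7927, rather than the exact number from the dataset.)

```python
def sat(d: int, n=7919 * 7927):
    """The answer must be a non-trivial integer factor of n."""
    return 1 < d < n and n % d == 0

def sol(n=7919 * 7927):
    # Trial division works at this scale; it is hopeless for a
    # 309-digit RSA challenge number.
    return next(d for d in range(2, n) if n % d == 0)

assert sat(sol())
```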

The post AI self-play for algorithm design appeared first on Microsoft Research.


Research Focus: Week of April 24, 2023


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

AWARD

Microsoft researcher Kalai awarded 2022 ACM Prize in Computing

Yael Tauman Kalai, a senior principal researcher at Microsoft Research, has been awarded the 2022 ACM Prize in Computing. Kalai was recognized for breakthroughs in verifiable delegation of computation and fundamental contributions to cryptography. According to the award announcement, “Kalai’s contributions have helped shape modern cryptographic practices and provided a strong foundation for further advancements.”

The ACM Prize in Computing recognizes early-to-mid-career computer scientists whose research contributions have fundamental impact and broad implications.


Among the accomplishments cited for the award, Kalai developed methods for producing succinct proofs that certify the correctness of any computation. These methods enable a weak device to offload any computation to a stronger device in a way that lets the results be efficiently checked for correctness. Such succinct proofs have been used by blockchain companies to certify transaction validity, overcoming key obstacles in blockchain scalability and enabling faster, more reliable transactions.

Kalai was also cited for her breakthrough work on the security of the “Fiat-Shamir paradigm,” a general technique for eliminating interaction from interactive protocols. This paradigm is extensively utilized in real-world applications, including the most prevalent digital signature scheme (ECDSA), which is used by all iOS and Android mobile devices.


NEW RESEARCH

Empowering Azure Storage with RDMA

High performance and highly reliable storage are fundamental requirements of public clouds. Given the wide adoption of disaggregated storage in the cloud, networking is essential for enabling high performance and high reliability. Microsoft’s Azure cloud service uses remote direct memory access (RDMA) as its transport and aims to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, RDMA needs to be supported at regional scale.

In a new paper: Empowering Azure Storage with RDMA, Microsoft Azure and Microsoft Research report on their intra-region RDMA deployment to support storage workloads in Azure. The high complexity and heterogeneity of Azure infrastructure creates challenges, such as the problem of interoperability between different types of RDMA network interface cards. Several changes were made to the network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. This helps achieve significant disk I/O performance improvements and CPU core savings.


NEW RESEARCH

LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models

Systems that support users in the automatic creation of visualizations must address several subtasks: understand the semantics of the data, enumerate relevant visualization goals, and generate visualization specifications. In a new paper, LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models, researchers from Microsoft pose visualization generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) and image generation models (IGMs) are well suited to these tasks.

LIDA is a novel tool for generating grammar-agnostic visualizations and infographics. It comprises four modules: a summarizer that converts data into a rich but compact natural language summary; a goal explorer that enumerates visualization goals given the data; a visgenerator that generates, evaluates, refines, executes, and filters visualization code; and an infographer that yields data-faithful stylized graphics using IGMs. LIDA provides a Python API and a hybrid user interface (direct manipulation and multilingual natural language) for interactive chart, infographic, and data story generation.
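As a rough illustration of how such a pipeline fits together, the sketch below chains the four modules; the object names and method signatures are hypothetical stand-ins, not LIDA’s actual Python API.

```python
def generate_visualizations(data, llm, igm, runtime, n_goals=5):
    """Hypothetical four-stage pipeline in the spirit of LIDA."""
    summary = llm.summarize(data)                    # summarizer: compact NL summary
    goals = llm.enumerate_goals(summary, n=n_goals)  # goal explorer: candidate goals
    charts = []
    for goal in goals:
        code = llm.generate_vis_code(summary, goal)  # visgenerator: emit chart code
        chart = runtime.execute(code, data)          # execute, evaluate, filter
        if chart is not None:
            charts.append(igm.stylize(chart, goal))  # infographer: stylize via IGM
    return charts
```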


NEW RELEASE

Announcing DeepSpeed-Chat: Easy, fast, affordable RLHF Training of ChatGPT-like models at all scales

Microsoft’s AI at Scale initiative has released DeepSpeed-Chat, an easy, fast, and low-cost open-source solution for reinforcement learning from human feedback (RLHF) training that can create high-quality ChatGPT-like models ranging in size from a few billion to hundreds of billions of parameters. DeepSpeed-Chat provides a complete RLHF training experience with a single click. It combines the prowess of DeepSpeed-Inference and DeepSpeed-Training to offer 15x faster throughput than the previous state of the art, while also supporting model sizes up to 8x larger on the same hardware. With DeepSpeed-Chat, practitioners can train an OPT-13B ChatGPT-like model in under 1.5 hours, or a massive 175B model in a day, on a modest GPU cluster. For those without a GPU cluster handy, DeepSpeed-Chat enables practitioners to train up to a 13B model on a single GPU, or on Azure cloud for around $300.


NEWS

Gov4git: Decentralized community governance to fuel open-source projects

Communal open-source projects have helped build countless applications for sourcing and sharing information like bug details and scientific data, as well as decentralized planning, design and policymaking. 

But the lack of a standardized and secure governance solution prevents many open-source projects from getting started—and holds them back when they get too big to be managed through ad-hoc methods. These small communities often resort to external mechanisms to manage their projects and protect them from malicious actors.

Microsoft Research and Protocol Labs, an open-source R&D company, are collaborating to develop Gov4git, a decentralized, git-native protocol with configurable governance rules to help launch more open-source projects and communities and support their growth.

Gov4git comes with many of the transparency, decentralization, and security benefits of blockchains while also harnessing the power of formal governance to avoid costly approaches to validation and dispute resolution. 

Git is the worldwide standard for version control and management of collaborative software development projects. Gov4git is designed as a secure and cost-effective framework that can be tailored to the specific needs of any community and deployed by non-technical users anywhere git is available. Gov4git can strengthen such communities against the risk of malicious actors posing as collaborators with the intent to harm community maintenance.

The post Research Focus: Week of April 24, 2023 appeared first on Microsoft Research.
