Abstracts: December 12, 2023

Microsoft Research Podcast: Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The paper, accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), introduces a new method for predicting the binding structures of proteins and ligands during drug development. The method demonstrates improved speed and accuracy over current methods.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Tao Qin, a Senior Principal Research Manager, and Dr. Lijun Wu, a Senior Researcher, both from Microsoft Research. Drs. Qin and Wu are coauthors of a paper titled “FABind: Fast and Accurate Protein-Ligand Binding,” and this paper—which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS—is available now on arXiv. Tao Qin, Lijun Wu, thanks for joining us on Abstracts!

LIJUN WU: Thanks. 

TAO QIN: Yeah, thank you. Yeah, it’s great to be here and to share our latest research. 

HUIZINGA: So, Tao, let’s start off with you. In a couple sentences, tell us what issue or problem your research addresses and, more importantly, why people should care about it.


QIN: Yeah, uh, we work on the problem of molecular docking, a computational modeling method used to predict the preferred orientation of one molecule when it binds to a second molecule to form a stable complex. So it aims to predict the binding pose of a ligand in the active site of a receptor and estimate the ligand-receptor binding affinity. This problem is very important for drug discovery and development. Accurately predicting binding poses can provide insights into how a drug candidate might bind to its biological target and whether it is likely to have the desired therapeutic effect. To make an analogy, it is just like a lock and a key: the protein target is the lock, while the ligand is the key. We should carefully design the structure of the key so that it can perfectly fit into the lock. Similarly, the molecular structure should be accurately constructed so that it can bind the protein well. Then the protein function would be activated or inhibited. Molecular docking is used intensively in the early stages of drug design and discovery to screen a large library of hundreds of thousands of compounds to identify promising lead compounds. It helps eliminate poor candidates and focus experimental efforts on those most likely to bind to the target protein well. So clearly, improving the accuracy and also the speed of docking methods, like what we have done in this work, could accelerate the development of new life-saving drugs. 

HUIZINGA: So, Lijun, tell us how your approach builds on and/or differs from what’s been done previously in this field. 

WU: Sure, thanks, yeah. So conventional protein-ligand docking methods usually take a sampling-and-scoring approach. That means they will first use some sampling methods to generate multiple protein-ligand docking poses as candidates, and then they will use some scoring functions to evaluate these candidates and choose the best ones. One example is DiffDock, a very recent work developed by MIT, which is a very strong model that uses a diffusion algorithm to do the sampling. These sampling-and-scoring methods are accurate with good predictions, but of course, they are very slow. This is a very big limitation because the sampling process usually takes a lot of time. Some other methods, such as EquiBind or TANKBind, treat the docking prediction as a regression task, which is to use deep networks to directly predict the coordinates of the atoms in the molecule. Obviously, this kind of method is much faster than the sampling methods, but the prediction accuracy is usually worse. Therefore, our FABind aims to provide a method that is both fast and accurate for the docking problem. FABind keeps its fast prediction by modeling in a regression way, and also, we utilize some novel designs to improve its prediction accuracy. 

HUIZINGA: So, Lijun, let’s stay with you for a minute. Regarding your research strategy on this, uh, how would you describe your methodology, and how did you go about conducting this research? 

WU: OK, sure. So when we’re talking about the detailed method, we actually build an end-to-end deep learning framework, FABind, here. For protein-ligand docking, FABind divides the docking task into a pocket prediction process and a pose prediction process. But importantly, we unify these two processes within a single deep learning model, which is a novel equivariant graph neural network. Here, the pocket means a local part of the whole protein, that is, some specific amino acids that can bind to the molecule in the structure space. Simply speaking, this novel network is built by stacking several identical graph neural network layers. Each layer is carefully designed by us, and we use the first layer for the pocket prediction and the later layers to do the pose prediction. For each layer, there are some message passing operations we designed. The first one is an independent message passing, which updates the information within the protein and within the molecule themselves. The second one is a cross-attention message passing, which updates the information between the whole protein and the whole molecule, so that each can have a global view of the other. And the last one is an interfacial message passing, which passes information between the nearby nodes of the protein and the molecule. Besides these, there are some smaller points that help to get an accurate docking model. For example, we use a scheduled training technique to bridge the gap between the training and inference stages. And also, we combine direct coordinate prediction and distance map refinement as our optimization method. 

HUIZINGA: Well, listen, I want to stay with you even more because you’re talking about the technical specifications of your research methodology. Let’s talk about results. What were your major findings on the performance of FABind?

WU: Yeah, the results are very promising. First, we care about the docking performance, which is the accuracy of the docking pose prediction. We compared our FABind to different baselines such as EquiBind, TANKBind, and also the recent strong model I mentioned before, DiffDock, developed by MIT. The results showed that our docking prediction accuracy is very good, achieving very competitive performance compared to DiffDock. But specifically, we need to note that the speed is very important: compared to DiffDock, we achieved about 170 times faster prediction speed. So this is very promising. Besides, the interesting thing is that we found our FABind can achieve very strong performance on unseen protein targets, meaning protein structures that we have never seen before during training. There, our FABind achieves significantly better performance than DiffDock, with about 10 percent to 40 percent accuracy improvement. This demonstrates that the practical effectiveness of our work is very promising, since such new proteins are the most important ones we need to care about for a new disease. 

HUIZINGA: Tao, this is all fascinating, but talk about real-world significance for this work. Who does it help most and how? 

QIN: Yeah. As Lijun has introduced, FABind significantly outperforms earlier methods in terms of speed while maintaining competitive accuracy. This fast prediction capability is extremely important in real-world applications, where high-throughput virtual screening for compound selection is often required for drug discovery. So an efficient virtual screening process can significantly accelerate the drug discovery process. Furthermore, our method demonstrates great performance on unseen or new proteins, which indicates that our FABind possesses a strong generalization ability. This is very important. Consider the case of SARS-CoV-2, for example, where our knowledge of the protein target is very limited at the beginning of the pandemic. So if we have a robust docking model that can generalize to new proteins, we could conduct a large-scale virtual screening and, uh, confidently select potentially effective ligands. This would greatly speed up the development of new treatments. 

HUIZINGA: So downstream from the drug discovery science, benefits would accrue to people who have diseases and need treatment for those things. 

QIN: Yes, exactly. 

HUIZINGA: OK, well, Tao, let’s get an elevator pitch in here, sort of one takeaway, a golden nugget, uh, that you’d like our listeners to take away from this work. If, if there was one thing you wanted them to take away from the work, what would it be? 

QIN: Yeah, uh, thanks for a great question. So I think the one-sentence takeaway is that if researchers are utilizing molecular docking and seeking an AI-based approach, our FABind method should definitely be on their consideration list, especially considering the exceptional predictive accuracy and high computational efficiency of our method.

HUIZINGA: Finally, Tao, what are the big questions and problems that remain in this area, and what’s next on your research agenda? 

QIN: Actually, there are multiple unaddressed questions along this direction, so I think those are all opportunities for further exploration. Here I’ll just give three examples. First, our method currently tackles rigid docking, where the target protein structure is assumed to be fixed, leaving only the ligand structure to be predicted. However, in a more realistic scenario, the protein is dynamic during molecular binding. Therefore, exploring flexible docking becomes an essential aspect. Second, our approach assumes that the target protein has only one binding pocket. In reality, a target protein may have multiple binding pockets, which is a more challenging situation. How to address this significant challenge is worth exploring. Third, in the field of drug design, sometimes we need to find a drug compound that can bind with multiple target proteins. In this work, we only consider a single target protein. So the accurate prediction of docking for multiple target proteins poses a great challenge. 

HUIZINGA: Well, Tao Qin and Lijun Wu, thank you for joining us today. And to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: December 12, 2023 appeared first on Microsoft Research.


Steering at the Frontier: Extending the Power of Prompting

We’re seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward, zero-shot prompts. Beyond basic, out-of-the-box prompting, we’ve been exploring new prompting strategies, showcased in our Medprompt work, to evoke the powers of specialists.  

Today, we’re sharing information on Medprompt and other approaches to steering frontier models in promptbase, a collection of resources on GitHub. Our goal is to provide information and tools to engineers and customers to evoke the best performance from foundation models. We’ll start by including scripts that enable replication of our results using the prompting strategies that we present here. We’ll be adding more sophisticated general-purpose tools and information over the coming weeks.  

As an illustration of the capabilities of frontier models and of the opportunities to harness and extend recent efforts toward reaching state-of-the-art (SoTA) results via steering GPT-4, we’ll review SoTA results on benchmarks that Google chose for evaluating Gemini Ultra. Our end-to-end exploration, prompt design, and computing of performance took just a couple of days.


Let’s focus on the well-known MMLU (Measuring Massive Multitask Language Understanding) challenge, which was established as a test of the general knowledge and reasoning powers of large language models. The complete MMLU benchmark contains tens of thousands of challenge problems of different forms across 57 areas, from basic mathematics to United States history, law, computer science, engineering, medicine, and more.  

In our Medprompt study, we focused on medical challenge problems but found that the prompt strategy could have more general-purpose application, and we examined its performance on several out-of-domain benchmarks. Today, we report that steering GPT-4 with a modified version of Medprompt achieves the highest score ever achieved on the complete MMLU.

In our explorations, we initially found that applying the original Medprompt to GPT-4 on the comprehensive MMLU achieved a score of 89.1%. By increasing the number of ensembled calls in Medprompt from five to 20, performance by GPT-4 on the MMLU further increased to 89.56%. To achieve a new SoTA on MMLU, we extended Medprompt to Medprompt+ by adding a simpler prompting method and formulating a policy for deriving a final answer by integrating outputs from both the base Medprompt strategy and the simple prompts. The synthesis of a final answer is guided by a control strategy governed by GPT-4 and inferred confidences of candidate answers. More details on Medprompt+ are provided in the promptbase repo. A related method for coupling complex and simple queries was harnessed by the Google Gemini team. GPT-4 steered with the modified Medprompt+ reaches a record score of 90.10%. We note that Medprompt+ relies on accessing confidence scores (logprobs) from GPT-4. These are not publicly available via the current API but will be enabled for all in the near future.
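The exact control strategy for merging the two prompting styles is described in the promptbase repo. Purely as a hypothetical illustration of the idea—ensembled Medprompt votes combined with a simple prompt whose confidence is read from logprobs—here is a toy sketch; the function names, the majority-vote rule, and the threshold are our own assumptions, not the actual Medprompt+ policy (which is itself governed by GPT-4):

```python
from collections import Counter

def aggregate_answers(medprompt_answers, simple_answers, simple_logprobs,
                      confidence_threshold=-0.5):
    """Toy policy for merging two prompting strategies.

    medprompt_answers: letter choices from the ensembled Medprompt calls.
    simple_answers/simple_logprobs: choices and per-answer log-probabilities
    from the simple prompt (logprobs stand in for model confidence).
    """
    # Majority vote over the ensembled Medprompt calls.
    medprompt_choice, _ = Counter(medprompt_answers).most_common(1)[0]

    # Confidence of the simple strategy = mean log-probability of its answers.
    simple_choice, _ = Counter(simple_answers).most_common(1)[0]
    simple_confidence = sum(simple_logprobs) / len(simple_logprobs)

    # Prefer the simple answer only when the model is confident in it;
    # otherwise fall back to the Medprompt majority vote.
    if simple_choice != medprompt_choice and simple_confidence > confidence_threshold:
        return simple_choice
    return medprompt_choice

votes = ["B", "B", "A", "B", "B"]
print(aggregate_answers(votes, ["A", "A"], [-0.1, -0.2]))  # confident simple prompt wins
```

The design point this sketch tries to capture is that the simple strategy only overrides the ensemble when its inferred confidence is high enough.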

Figure 1. Reported performance of multiple models and methods on the MMLU benchmark: Palm 2-L (5-shot) 78.4%, Claude 2 (5-shot CoT) 78.5%, Inflection-2 (5-shot) 79.6%, Gemini Pro (CoT@8) 79.13%, GPT-4-1106 (5-shot) 86.4%, GPT-4-1106 (Medprompt @ 5) 89.1%, GPT-4-1106 (Medprompt @ 20) 89.56%, GPT-4-1106 (Medprompt+ @ 31) 90.10%, and Gemini Ultra (CoT@32) 90.04%.

While systematic prompt engineering can yield maximal performance, we continue to explore the out-of-the-box performance of frontier models with simple prompts. It’s important to keep an eye on the native power of GPT-4 and how we can steer the model with zero- or few-shot prompting strategies. As demonstrated in Table 1, starting with simple prompting is useful to establish baseline performance before layering in more sophisticated and expensive methods.

| Benchmark | GPT-4 Prompt | GPT-4 Results | Gemini Ultra Results |
|---|---|---|---|
| MMLU | Medprompt+ | 90.10% | 90.04% |
| GSM8K | Zero-shot | 95.27% | 94.4% |
| MATH | Zero-shot | 68.42% | 53.2% |
| HumanEval | Zero-shot | 87.8% | 74.4% |
| BIG-Bench-Hard | Few-shot + CoT* | 89.0% | 83.6% |
| DROP | Zero-shot + CoT | 83.7% | 82.4% |
| HellaSwag | 10-shot** | 95.3%** | 87.8% |

\* followed the norm of evaluations and used standard few-shot examples from dataset creators 
\*\* source: Google 

Table 1: Model, strategies, and results
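The prompting strategies in Table 1 differ mainly in how the query is wrapped. As a rough sketch only—the actual templates used in these experiments live in the promptbase repo, and the strategy names and strings below are illustrative assumptions:

```python
def build_prompt(question, strategy="zero-shot", examples=None):
    """Assemble a prompt in the styles named in Table 1 (illustrative templates)."""
    parts = []
    # Few-shot strategies prepend worked examples before the real question.
    if strategy.startswith("few-shot") and examples:
        for ex_question, ex_answer in examples:
            parts.append(f"Q: {ex_question}\nA: {ex_answer}")
    parts.append(f"Q: {question}")
    # Chain-of-thought (CoT) variants ask the model to reason before answering.
    if strategy.endswith("cot"):
        parts.append("A: Let's think step by step.")
    else:
        parts.append("A:")
    return "\n\n".join(parts)

print(build_prompt("What is 12 * 12?",
                   strategy="few-shot+cot",
                   examples=[("What is 2 + 2?", "4")]))
```

Establishing a baseline with the cheapest variant (zero-shot) before layering in examples and chain-of-thought mirrors the workflow described above.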

We encourage you to check out the promptbase repo on GitHub for more details about prompting techniques and tools. This area of work is evolving with much to learn and share. We’re excited about the directions and possibilities ahead.


Phi-2: The surprising power of small language models

Contributors

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

Satya Nadella on stage at Microsoft Ignite 2023 announcing Phi-2.
Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks, Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available on the Azure model catalog to foster research and development on language models.


Key Insights Behind Phi-2

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

First, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following on our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Second, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2’s benchmark scores.
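The web-data filter itself is not public. Purely as a hypothetical sketch of the educational-value filtering step, with `toy_score` standing in for whatever trained quality classifier the real curation pipeline uses:

```python
def filter_by_educational_value(documents, score_fn, threshold=0.7):
    """Keep only documents whose quality score clears the threshold.

    score_fn is a stand-in for a quality/educational-value classifier;
    the threshold value is illustrative, not from the Phi-2 pipeline.
    """
    return [doc for doc in documents if score_fn(doc) >= threshold]

# Toy scorer: reward explanatory, structured text. Real pipelines would use a
# trained classifier or model-based rating, not keyword matching.
def toy_score(doc):
    signals = ["because", "therefore", "for example"]
    return min(1.0, 0.4 + 0.2 * sum(s in doc for s in signals))

docs = ["Buy now!!!",
        "Plants make food because sunlight drives photosynthesis; for example, leaves capture light."]
print(filter_by_educational_value(docs, toy_score))
```

The point of the sketch is only the shape of the step: a scoring function over raw documents, and a cutoff that decides what enters the training mixture.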

A bar plot comparing the performance of Phi-2 (with 2.7B parameters) and Phi-1.5 (with 1.3B parameters) on common sense reasoning, language understanding, math, coding, and the Bigbench-hard benchmark. Phi-2 outperforms Phi-1.5 in all categories. The commonsense reasoning tasks are PIQA, WinoGrande, ARC easy and challenge, and SIQA. The language understanding tasks are HellaSwag, OpenBookQA, MMLU, SQuADv2, and BoolQ. The math task is GSM8k, and coding includes the HumanEval and MBPP benchmarks.
Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated in 0-shot except for BBH and MMLU which use 3-shot CoT and 5-shot, respectively.

Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of synthetic and web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruction fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 thanks to our tailored data curation technique; see our previous tech report for more details. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
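The next-word prediction objective mentioned above is token-level cross-entropy. A minimal pure-Python sketch of that loss (real training uses a deep learning framework over the 1.4T-token mixture; the tiny vocabulary and logit values here are made up):

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy of next-word prediction, from raw per-position logits.

    logits: one list of vocabulary logits per sequence position.
    targets: the index of the true next token at each position.
    """
    total = 0.0
    for position_logits, target in zip(logits, targets):
        # log-sum-exp with max subtraction for numerical stability
        m = max(position_logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in position_logits))
        total += log_z - position_logits[target]  # -log p(target)
    return total / len(targets)

# Vocabulary of 3 tokens; the model should predict token 2, then token 0.
logits = [[0.1, 0.2, 3.0], [2.5, 0.0, 0.3]]
print(next_token_loss(logits, [2, 0]))
```

A model that assigns high logits to the correct next tokens drives this loss toward zero, which is exactly the quantity minimized during pretraining.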

A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.
Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6,541 sentences are selected and scored from 0 to 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.

Phi-2 Evaluation

Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
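The grouped scores reported in the tables below are averages over each category’s tasks. A sketch of that aggregation with made-up numbers (these are not Phi-2’s actual per-task results):

```python
def category_averages(scores, groups):
    """Average per-task accuracies into grouped category scores."""
    return {category: sum(scores[task] for task in tasks) / len(tasks)
            for category, tasks in groups.items()}

# Illustrative per-task accuracies and two of the category groupings.
scores = {"PIQA": 79.0, "WinoGrande": 74.0, "HumanEval": 48.0, "MBPP": 60.0}
groups = {"commonsense reasoning": ["PIQA", "WinoGrande"],
          "coding": ["HumanEval", "MBPP"]}
print(category_averages(scores, groups))
```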

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

| Model | Size | BBH | Commonsense Reasoning | Language Understanding | Math | Coding |
|---|---|---|---|---|---|---|
| Llama-2 | 7B | 40.0 | 62.2 | 56.7 | 16.5 | 21.0 |
| Llama-2 | 13B | 47.8 | 65.0 | 61.9 | 34.2 | 25.4 |
| Llama-2 | 70B | 66.5 | 69.2 | 67.6 | 64.1 | 38.3 |
| Mistral | 7B | 57.2 | 66.4 | 63.7 | 46.4 | 39.4 |
| Phi-2 | 2.7B | 59.2 | 68.8 | 62.0 | 61.1 | 53.7 |

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.

| Model | Size | BBH | BoolQ | MBPP | MMLU |
|---|---|---|---|---|---|
| Gemini Nano 2 | 3.2B | 42.4 | 79.3 | 27.2 | 55.8 |
| Phi-2 | 2.7B | 59.3 | 83.3 | 59.1 | 56.7 |

Table 2. Comparison between Phi-2 and Gemini Nano 2 on Gemini’s reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. We observed a behavior in accordance with the expectation we had given the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:

An example prompt is given to Phi-2 which says “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”. Phi-2 then answers the prompt by explaining the conversion of potential energy to kinetic energy and providing the formulas to compute each one. It then proceeds to compute the correct speed using the energy formulas.
Figure 4. Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation.
The model is then provided with a student’s wrong answer to the skier physics problem and asked if it can correct the student’s mistake. Phi-2 replies with the student’s mistake, i.e., using the wrong formula for potential energy, and provides the correct formula.
Figure 5. Similarly to Gemini’s test, we further queried Phi-2 with a student’s wrong answer to see if Phi-2 could identify where the mistake is (it did, despite Phi-2 not being fine-tuned for chat or instruction-following). We note, however, that it is not a fully apples-to-apples comparison with Gemini Ultra’s output described in the Gemini report; in particular, in the latter case the student’s answer was given as an image with handwritten text, whereas in our case it was raw text.


Abstracts: December 11, 2023

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce Deep Language Networks, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Alessandro Sordoni, a Principal Researcher from Microsoft Research. Dr. Sordoni is coauthor of a paper titled “Joint Prompt Optimization of Stacked LLMs using Variational Inference,” and this paper, which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS, is available now on arXiv. Alessandro, thanks for joining us on Abstracts!


ALESSANDRO SORDONI: Hi, Gretchen, thank you for having me.

HUIZINGA: So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.

SORDONI: So in this paper, our starting point is large language models, and to make large language models solve tasks, one of the ways currently used is to prompt them. Prompting means just giving an instruction to them, and hopefully, by joining the instruction and the input of the task, the language model can solve the task following the rules specified in the instruction. And there have been some approaches already in the literature to actually infer what that instruction is without human intervention. In this paper, we operate in that space, which is called automatic prompt engineering. And our specific problem is, one, how to actually infer those prompts for a language model. And, two, what happens if the output of that large language model gets into another language model and both language models need prompts to operate? And so basically, we give an algorithm to solve that joint prompt optimization. That’s why it’s called joint.

HUIZINGA: So what’s the underlying issue there that we should care about as potential users of this technology?

SORDONI: There are some problems that cannot be solved by just one instruction or rule, I would say, but necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem in a different language than the one it is formulated in. And the second call is, now recompose this visualization that you have produced to solve the problem itself. So basically, in that context, you can think about this as augmenting the computational power of the language model by splitting the one call into multiple calls.

HUIZINGA: Well, go in a little deeper on the work that this builds on. All research kind of gets a prompt—no pun intended—from previous work. So how does your work build on and/or differ from what’s been done previously in this field?

SORDONI: I would say that our work started more with this intuition that LLMs are just kind of black-box computation units. Now, this sort of black box accepts language as input, the computation is modulated by an instruction, and it outputs language, so you can stack these layers, right. So if the weights of this language layer now are the instructions and you can stack them together, how can you optimize them, right? And then we started to think, OK, but this is very related to automatic prompt optimization. The overall prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we did some modifications with respect to how we propose new prompts to language models and how we evaluate and accept those that work given some task inputs and outputs. Our goal in the future—I would say in the near future—is going to be to basically integrate optimization that can really express arbitrary graphs …

HUIZINGA: Gotcha …

SORDONI: … of LLM calls right now. But in our paper, we started with the first step, which is, OK, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer and also devising an algorithm that works for two layers right now and that can be extended to multiple layers. But that’s also an engineering problem that needs to be tackled.

HUIZINGA: [LAUGHS] Yeah, always got to get the engineering in there! Well, listen, let’s keep going on this because it sounds like you’re talking about methodology and, and how you conducted this research. So expand a little bit on what you did actually to experiment in this arena.

SORDONI: Yeah, so I think that, uh, really the birth of this paper started from this kind of view of these language models as layers modulated by instructions that can be stacked upon each other. And from there, we said, OK, what can we do with this, basically? And so some of us worked on datasets that could be interesting for this new sort of methodology, I would say, or architecture. So basically, one question was, how do you go forward to actually test if this works in any way? And so we tried to select some datasets that were more of natural language tasks—for example, sentiment classification—and some datasets that were more about reasoning tasks. And our hunch was that basically stacking multiple layers together would help more in those tasks that would require some sort of decomposition of reasoning.

HUIZINGA: Right.

SORDONI: And for the reasoning task, we worked with this BIG-Bench Hard setting. And so parallel to that, there were some of us that worked, for example myself, in the optimization part, really in the algorithm part. And at first, we tried to do some sort of back propagation. But I quickly saw that there were some sort of issues with that … probably empirically issues. And so we tried to actually have a more formal understanding of this optimization algorithm by recurring to variational inference basically, so basically, to understand actually the first layer as producing some text and considering this text as a latent variable. When you open that box, it links also in your head to all … a bunch of kind of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques into this context.

HUIZINGA: Interesting. So what were the results of this research? What did you find?

SORDONI: So what we found was that, indeed, the tasks in which these approaches seem to help the most are the tasks that require this sort of decomposition and reasoning. The first thing that was really, really kind of cool, it was that kind of you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some models, prompt optimization can be understood as kind of really tweaking the models towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to basically be not very clear to how to instruct the model. So the model has to do a lot of work to understand …

HUIZINGA: Right …

SORDONI: … what the human really wants to say to them. And so basically, this sort of prompt optimization acts as a sort of translator where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that level of abstraction that was required and needed in the prompt to really solve this task very, very well. The other finding is that this problem is very hard. This type of prompt optimization is very tricky because it doesn’t really follow a gradient direction like in deep neural networks.

HUIZINGA: Yeah.

SORDONI: It’s basically a sort of trial and error. And this trial and error is very finicky. There’s a lot of problems there. But I feel like I’m hopeful in the sense that this paper allowed us, I think, to hone in some very specific problem that if we solve them, we can make the problem much easier.

HUIZINGA: Let’s talk for a second about real-world impact of this research. Let’s extrapolate out from the lab and move into life. Who benefits from this most, and how do they benefit?

SORDONI: I think that, as I said before, like these automatic prompt optimization methods could benefit, I think, a large audience, or large amount of users, I would say, because they could be understood as a sort of translator between the user needs and what the LLM can do. For example, one effort here in Montréal that was led by my colleagues was kind of building this sort of interactive agent that would, through interaction with the user, form a prompt but interactively. So, for example, in DLN, like in our paper, we assume that we have a task and we do not have input or interaction with the user, right. But in more realistic scenarios, you might want to actually instruct your model to do something by some sort of active learning process where the model actually propose you whether what it did was favorable or desirable or not.

HUIZINGA: Right.

SORDONI: And the user can actually interact with that output, right. For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.

HUIZINGA: I want to take a second here to spell out some acronyms. You’ve referred to DLNs, and I don’t think our audience might know what that means. I’m assuming they know LLM means “large language model.” That’s sort of in the parlance. But talk a little bit about what that other acronym is.

SORDONI: Yeah, sorry I didn’t mention this. So DLN is basically how we refer to these architectures that are composed of language model layers. DLN spells out as “Deep Language Network.”

HUIZINGA: Gotcha.

SORDONI: People are free to use this name or not.

HUIZINGA: No, I like it …

SORDONI: I’m not a big fan of imposing acronyms on the world [LAUGHS], but that’s a, that’s a shorter version of it. So, yeah, so it’s really the idea that a language model is a layer in this hierarchy, and the layer accepts as input a text, it outputs a text, and really is modulated by an instruction or prompt that we want to learn.

HUIZINGA: And so the DLN is a deep language network and it sort of acts as a deep neural network but using language models as your layer.

SORDONI: Exactly, exactly, yes.

HUIZINGA: So this is a question I ask everyone, and it’s sort of like, how could you boil this down to one little takeaway if you’re standing on an elevator with somebody and they say, what do you do, Alessandro? So if there’s one thing you want people to take away from this work, what would it be?

SORDONI: The first thing that came to my mind is really the fact that these models can be understood really as a class, I would say, of probability distributions and that they are modulated by these prompts. And so basically, once you have that, once a language model just defines a probability over sentences given some prompt, you can apply a lot of algorithms with those models. You can apply algorithms that resemble EM, expectation maximization, or … I mean, we applied a form of that with variational inference, but maybe it could open the path for other types of usages where these are just very, very powerful probability distributions over sentences that are considered as latent variables. I hope that our paper can show a more or less practical implementation of that idea. And basically, if you have to optimize, for example, prompts with one or two layers, you can definitely try our approach.
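The layered view Sordoni describes can be sketched roughly as follows. This is a toy illustration, not code from the paper: `llm` stands in for any language model call, and the prompt strings play the role of the learnable "weights" of each layer.

```python
# A rough sketch of a two-layer deep language network (DLN): each
# layer is one LLM call modulated by a learnable instruction.
# `llm` is a stand-in for a real model API; the prompts are the
# parameters that prompt optimization would search over.
def dln_forward(x, prompts, llm):
    hidden = llm(prompts[0] + "\n" + x)     # layer 1: e.g., decompose the problem
    return llm(prompts[1] + "\n" + hidden)  # layer 2: solve using the decomposition
```

Optimization then proceeds by proposing candidate prompt strings for each layer and keeping those that improve task accuracy, rather than by following gradients.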

HUIZINGA: Well, finally, and we’ve been talking about this kind of already, but there seem to be some unresolved problems in the area. What do researchers like you need to be looking at in order to solve those? Sort of what’s next on the research agenda, whether it’s you or other researchers in this field?

SORDONI: So let me try to answer by something that really excites me now. What we are doing is that we are producing text, right. With the language model. But we are producing this text in such a way that it helps to solve a problem. And basically, this variational inference method and kind of framework gives us a way of understanding what does it mean to be a good text? Like what does it mean to be a good kind of latent variable or useful latent variable?

HUIZINGA: Right.

SORDONI: What does it mean to produce good data? So, for example, these big models kind of are really data creators, like this generative AI, right. But can we actually teach them to produce data such that this data can be helpful to solve tasks or to condition those same models to solve a task?

HUIZINGA: Right.

SORDONI: And what are the objective functions that promote the production of this useful data? What useful means from a mathematical perspective. I think that, apart from the prompt optimization angle, I feel like DLN to me kind of opened a little bit my spirit into kind of investigating ways of understanding what does it mean for some generated text to be useful to solve a task, I would say. Yeah.

HUIZINGA: Alessandro Sordoni, thanks for joining us today. And thanks to our listeners for tuning in. If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts!

The post Abstracts: December 11, 2023 appeared first on Microsoft Research.

NeurIPS 2023 highlights breadth of Microsoft’s machine learning innovation

Research Focus: NeurIPS
December 11, 2023

Microsoft is proud to sponsor the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). This interdisciplinary forum brings together experts in machine learning, neuroscience, statistics, optimization, computer vision, natural language processing, life sciences, natural sciences, social sciences, and other adjacent fields. We are pleased to share that Microsoft has over 100 accepted papers and is offering 18 workshops at NeurIPS 2023. 

This year’s conference includes three papers from Microsoft that were chosen for oral presentations, which feature groundbreaking concepts, methods, or applications addressing pressing issues in the field. Additionally, our spotlight posters, also highlighted below, have been carefully curated by conference organizers, exhibiting novelty, technical rigor, and the potential to significantly impact the landscape of machine learning. This blog post celebrates those achievements.

Oral Presentations

Bridging Discrete and Backpropagation: Straight-Through and Beyond

Gradient computations are pivotal in deep learning’s success, yet they predominantly depend on backpropagation, a technique limited to continuous variables. The paper Bridging Discrete and Backpropagation: Straight-Through and Beyond tackles this limitation. It introduces ReinMax, extending backpropagation’s capability to estimate gradients for models incorporating discrete variable sampling. In the extensive experiments of this study, ReinMax demonstrates consistent and significant performance gains over the state of the art. More than just a practical solution, the paper sheds light on existing deep learning practices. It elucidates that the ‘Straight-Through’ method, once considered merely a heuristic trick, is actually a viable first-order approximation for the general multinomial case. Correspondingly, ReinMax achieves second-order accuracy in this context without the complexities of second-order derivatives, thus incurring negligible computational overhead.
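For readers unfamiliar with the trick, here is a toy illustration of the straight-through idea the paper builds on: sample a hard one-hot vector in the forward pass, but let gradients flow through the softmax probabilities. This is a hedged sketch of the baseline only; ReinMax itself refines this to second-order accuracy.

```python
import math, random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Straight-through trick: the forward pass uses a hard one-hot
# sample, but gradients are taken as if the output were the softmax
# probabilities. In an autodiff framework this is written as
# surrogate = hard - detach(p) + p, which equals `hard` numerically
# while inheriting p's gradient with respect to the logits.
random.seed(0)
logits = [2.0, 0.5, -1.0]
p = softmax(logits)
idx = random.choices(range(len(p)), weights=p)[0]
hard = [1.0 if i == idx else 0.0 for i in range(len(p))]
surrogate = [h - pi + pi for h, pi in zip(hard, p)]  # numerically == hard
```

Because the surrogate's forward value is exactly the hard sample, downstream computation sees a discrete choice, while the backward pass sees the smooth softmax, which is what makes the estimator a first-order approximation.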


The MineRL BASALT Competition on Learning from Human Feedback

The growth of deep learning research, including its incorporation into commercial products, has created a new challenge: How can we build AI systems that solve tasks when a crisp, well-defined specification is lacking? To encourage research on this important class of techniques, researchers from Microsoft led The MineRL BASALT Competition on Learning from Human Feedback (opens in new tab), an update to a contest first launched in 2021 (opens in new tab) by researchers at the University of California-Berkeley and elsewhere. The challenge of this competition was to complete fuzzy tasks from English language descriptions alone, with emphasis on encouraging different ways of learning from human feedback as an alternative to a traditional reward signal. 

The researchers designed a suite of four tasks in Minecraft for which writing hardcoded reward functions would be difficult. These tasks are defined by natural language: for example, “create a waterfall and take a scenic picture of it”, with additional clarifying details. Participants must train a separate agent for each task. Agents are then evaluated by humans who have read the task description.

The competition aimed to encourage development of AI systems that do what their designers intended, even when the intent cannot be easily formalized. Besides allowing AI to solve more tasks, this can also enable more effective regulation of AI systems, as well as making progress on value alignment problems, in which the specified objectives of an AI agent differ from those of its users.

Related

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

This comprehensive evaluation platform aims to answer the question: How trustworthy are generative pre-trained transformer (GPT) models? In DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models, researchers focus specifically on GPT-4, GPT-3.5, and a series of open LLMs. They consider diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

The researchers’ evaluations identified previously unpublished vulnerabilities relating to trustworthiness. The team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is partly because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. They also shared their findings with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

This research aims to encourage others in the research community to utilize and build upon this work, potentially pre-empting adversaries who would exploit vulnerabilities to cause harm. To facilitate collaboration, the benchmark code is very extensible and easy to use: a single command is sufficient to run the complete evaluation on a new model.

Spotlight Posters

Differentially Private Approximate Near Neighbor Counting in High Dimensions

Differential privacy (DP) is a widely used tool for preserving the privacy of sensitive personal information. It allows a data structure to provide approximate answers to queries about the data it holds, while ensuring that the removal or addition of a single database entry does not significantly affect the outcome of any analysis.

Range counting (counting the number of data points falling into a given query ball) under differential privacy has been studied extensively. However, current algorithms for this problem come with challenges. One class of algorithms suffers from an additive error that is a fixed polynomial in the number of points. Another class of algorithms allows for polylogarithmic additive error, but the error grows exponentially in the dimension. To achieve the latter, the problem is relaxed to allow a “fuzzy” definition of the range boundary, e.g., a count of the points in a ball of radius r might also include points in a ball of radius cr for some c > 1.

In Differentially Private Approximate Near Neighbor Counting in High Dimensions, researchers present an efficient algorithm that offers a sweet spot between these two classes. The algorithm has an additive error that is an arbitrarily small power of the data set size, depending on how fuzzy the range boundary is, as well as a small (1 + o(1)) multiplicative error. Crucially, the amount of noise added has no dependence on the dimension. This new algorithm introduces a variant of Locality-Sensitive Hashing, utilizing it in a novel manner.
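To see where an additive error of this kind comes from, here is the textbook epsilon-DP counter with Laplace noise. This is a minimal sketch of the basic mechanism only; the paper's algorithm is far more involved.

```python
import math, random

# The basic differentially private counter: add Laplace noise of
# scale 1/epsilon to the true count. Adding or removing one person
# changes the count by at most 1 (sensitivity 1), so this release
# satisfies epsilon-differential privacy.
def dp_count(true_count, epsilon, rng):
    u = rng.random() - 0.5                       # uniform on (-0.5, 0.5)
    noise = -(1.0 / epsilon) * math.copysign(
        math.log(1.0 - 2.0 * abs(u)), u)         # inverse-CDF Laplace sample
    return true_count + noise
```

The expected magnitude of the noise is 1/epsilon, which is exactly the kind of additive error budget the algorithms above trade off against the fuzziness of the range boundary.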

Exposing Attention Glitches with Flip-Flop Language Modeling

Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? The brittleness of these models, particularly when executing long chains of reasoning, seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought.

To help make sense of this fundamentally unsolved problem, Exposing Attention Glitches with Flip-Flop Language Modeling identifies and analyzes the phenomenon of attention glitches, in which the Transformer architecture’s inductive biases intermittently fail to capture robust reasoning. To isolate the issue, the researchers introduce flip-flop language modeling (FFLM), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. This research shows how Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which can be eliminated using various regularization techniques. The preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. The researchers hypothesize that attention glitches account for some of the closed-domain errors occurring in natural LLMs.
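The flip-flop task can be pictured with a small data generator. The token format below is illustrative, not the paper's exact benchmark: a "write" stores a bit, "ignore" tokens are distractors, and every "read" must reproduce the most recently written bit, however far back it appeared.

```python
import random

# A hypothetical flip-flop sequence generator in the spirit of FFLM.
# A model trained on such sequences must copy a binary symbol across
# an arbitrarily long span while ignoring the tokens in between.
def make_flipflop(length, rng):
    seq, mem = [], None
    for _ in range(length):
        ops = ["write", "ignore"] + (["read"] if mem is not None else [])
        op = rng.choice(ops)
        if op == "read":
            seq.append(("read", mem))          # target: last written bit
        else:
            bit = rng.randint(0, 1)
            if op == "write":
                mem = bit                      # update the "flip-flop" state
            seq.append((op, bit))
    return seq
```

An attention glitch, in this framing, is a sporadic failure to retrieve the correct stored bit at a "read" position.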

In-Context Learning Unlocked for Diffusion Models

An emergent behavior of large language models (LLMs) is the ability to learn from context, or in-context learning. With a properly designed prompt structure and in-context learning, LLMs can combine the pre-training of multiple language tasks and generalize well to previously unseen tasks. While in-context learning has been extensively studied in natural language processing (NLP), its applications in the field of computer vision are still limited.

In-Context Learning Unlocked for Diffusion Models presents Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images and text guidance, this model understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, the researchers propose a vision-language prompt that can model a wide range of vision-language tasks, and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes to new, unseen vision tasks with their respective prompts. This model also shows compelling text-guided image editing results.

Optimizing Prompts for Text-to-Image Generation

Generative foundation models can be prompted to follow user instructions, including language models and text-to-image models. Well-designed prompts can guide text-to-image models to generate amazing images. However, the performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, Optimizing Prompts for Text-to-Image Generation proposes prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts.

The researchers use reinforcement learning to explore better prompts with a language model. They define a reward function that encourages the policy network (i.e., language model) to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that this method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Reinforcement learning further boosts performance, especially on out-of-domain prompts.

Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

Algorithm design in deep learning can appear to be more like “hacking” than an engineering practice. There are numerous architectural choices and training heuristics, which can often modulate model performance and resource costs in unpredictable and entangled ways. As a result, when training large-scale neural networks (such as state-of-the-art language models), algorithmic decisions and resource allocations are foremost empirically-driven, involving the measurement and extrapolation of scaling laws. A precise mathematical understanding of this process is elusive, and cannot be explained by statistics or optimization in isolation.

In Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck, researchers from Microsoft, Harvard, and the University of Pennsylvania explore these algorithmic intricacies and tradeoffs through the lens of a single synthetic task: the finite-sample sparse parity learning problem. In this setting, the above complications are not only evident, but also provable: intuitively, due to the task’s computational hardness, a neural network needs a sufficient combination of resources (“data × model size × training time × luck”) to succeed. This research shows that standard algorithmic choices in deep learning give rise to a Pareto frontier, in which successful learning is “bought” with interchangeable combinations of these resources. They show that algorithmic improvements on this toy problem can transfer to the real world, improving the data-efficiency of neural networks on small tabular datasets.
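The sparse parity task itself is easy to state in code. The generator below is a hedged sketch with illustrative names: the label is the parity (XOR) of a hidden subset of k input bits, and everything outside that support is noise the network must learn to ignore.

```python
import random

# Toy generator for the finite-sample sparse parity problem: each
# label is the parity of the bits indexed by `support`; the other
# n_bits - k coordinates are irrelevant distractors.
def sparse_parity(n_samples, n_bits, support, rng):
    xs, ys = [], []
    for _ in range(n_samples):
        x = [rng.randint(0, 1) for _ in range(n_bits)]
        xs.append(x)
        ys.append(sum(x[i] for i in support) % 2)
    return xs, ys
```

The task's computational hardness comes from the fact that no single bit correlates with the label, so a learner must combine enough data, width, training time, and luck to find the hidden support.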

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers

Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. The high computational cost of traditional solution techniques has spurred increasing interest in deep neural network based PDE surrogates. The practical utility of such neural PDE solvers depends on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem.

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers presents a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Motivated by recent advances in diffusion models, the researchers developed PDE-Refiner, a novel model class that enables more accurate modeling of all frequency components via a multistep refinement process. They validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. They also demonstrate that PDE-Refiner greatly enhances data efficiency, since the denoising objective implicitly induces a novel form of spectral data augmentation. Finally, PDE-Refiner’s connection to diffusion models enables an accurate and efficient assessment of the model’s predictive uncertainty, allowing researchers to estimate when the surrogate becomes inaccurate.

Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations

Randomized experiments are the gold-standard method of determining causal effects, whether in clinical trials to evaluate medical treatments or in A/B tests to evaluate online product offerings. But randomized experiments often need to be stopped prematurely when the treatment or test causes an unintended harmful effect. Existing methods that determine when to stop an experiment early are typically applied to the data in aggregate and do not account for treatment effect heterogeneity.

Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations examines the early stopping of experiments for harm on heterogeneous populations. The paper shows that current methods often fail to stop experiments when the treatment harms a minority group of participants. The researchers use causal machine learning to develop Causal Latent Analysis for Stopping Heterogeneously (CLASH), the first broadly-applicable method for heterogeneous early stopping. They demonstrate CLASH’s performance on simulated and real data and show that it yields effective early stopping for both clinical trials and A/B tests.

Survival Instinct in Offline Reinforcement Learning

In offline reinforcement learning (RL), an agent optimizes its performance given an offline dataset. Survival Instinct in Offline Reinforcement Learning presents a novel observation: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.

This research demonstrates that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain bias implicit in common data collection practices. This work shows that this pessimism endows the agent with a “survival instinct”, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. The researchers argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones.

Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Molecular dynamics (MD) is a well-established technique for simulating physical systems at the atomic level. When performed accurately, it provides unrivalled insight into the detailed mechanics of molecular motion, without the need for wet lab experiments. MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution (opens in new tab). However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied.

Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics presents an enhanced sampling method which uses a normalizing flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of 10^5−10^6fs. Crucially, Timewarp is transferable between molecular systems: the researchers show that, once trained, Timewarp generalizes to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. This new method constitutes an important step towards developing general, transferable algorithms for accelerating MD.
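The sampling scheme can be pictured as an ordinary Metropolis-Hastings step targeting the Boltzmann distribution. In the sketch below, `propose` and `log_q` are stand-ins for the learned normalizing flow and its (log) proposal density; the acceptance rule is generic, not code from the paper.

```python
import math, random

# One Metropolis-Hastings step targeting p(x) ∝ exp(-E(x)/kT).
# Because acceptance corrects for the proposal, the chain samples
# the exact Boltzmann distribution even if the flow is imperfect;
# a better-trained flow just accepts large time jumps more often.
def mh_step(x, energy, propose, log_q, kT, rng):
    x_new = propose(x, rng)
    log_alpha = (-(energy(x_new) - energy(x)) / kT
                 + log_q(x, x_new) - log_q(x_new, x))
    if math.log(rng.random()) < min(0.0, log_alpha):
        return x_new   # accept the large time-coarsened jump
    return x           # reject and stay at the current state
```

This separation is what makes the approach transferable: the target density depends only on the molecular system's energy, while the flow proposal can be reused across systems.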

The post NeurIPS 2023 highlights breadth of Microsoft’s machine learning innovation appeared first on Microsoft Research.

MatterGen: Property-guided materials design

MatterGen

Generative AI has revolutionized how we create text and images. How about designing novel materials? We at Microsoft Research AI4Science are thrilled to announce MatterGen, our generative model that enables broad property-guided materials design.

The central challenge in materials science is to discover materials with desired properties, e.g., high Li-ion conductivity for battery materials. Traditionally, this has been done by first finding novel materials and then filtering down based on the application. This is like trying to create the image of a cat by first generating a million different images and then searching for the one with a cat. In MatterGen, we directly generate novel materials with desired properties, similar to how DALL·E 3 tackles image generation.  

MatterGen is a diffusion model specifically designed for generating novel, stable materials. MatterGen also has adapter modules that can be fine-tuned to generate materials given a broad range of constraints, including chemistry, symmetry, and properties. MatterGen generates 2.9 times more stable (≤ 0.1 eV/atom above the convex hull of our training + test data), novel, unique structures than a SOTA model (CDVAE). It also generates structures 17.5 times closer to the local energy minimum. MatterGen can directly generate materials satisfying desired magnetic, electronic, and mechanical properties via classifier-free guidance. We verify generated materials with DFT-based workflows.
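The classifier-free guidance mentioned above mixes conditional and unconditional denoiser predictions. The sketch below is a generic illustration of that mechanism, not MatterGen code: `eps_model` stands in for the trained diffusion denoiser, and `w` is the guidance strength.

```python
# Classifier-free guidance sketch: evaluate the denoiser with and
# without the property condition and extrapolate toward the
# conditional prediction. With w = 0 this reduces to ordinary
# conditional sampling; larger w pushes samples harder toward the
# desired property.
def cfg_eps(eps_model, x, t, cond, w):
    eps_cond = eps_model(x, t, cond)     # prediction given the target property
    eps_uncond = eps_model(x, t, None)   # unconditional prediction
    return [(1 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]
```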

Figure 1: Stable and new materials generated by MatterGen while constrained on properties. The figure shows six pairs of crystalline structures, one pair per property constraint: high space group symmetry, high bulk modulus, target chemical system, target band gap, high magnetic density, and combined high magnetic density with low HHI index.

Additionally, MatterGen can keep generating novel materials that satisfy a target property like high bulk modulus while screening methods instead saturate due to the exhaustion of materials in the database.

Figure 2: MatterGen discovers more novel stable high-bulk-modulus materials than the screening baseline, and does not plateau with increasing computational resources. (Line plot: the x axis indicates the number of DFT property calculation calls; the y axis reports the number of structures found.) MatterGen can find more than 250 materials with a bulk modulus > 400 GPa, while only 2 such materials are found in the reference dataset.

MatterGen can also generate materials given target chemical systems. It outperforms substitution and random structure search baselines equipped with MLFF filtering, especially in challenging 5-element systems. MatterGen also generates structures given target space groups. Finally, we tackle the multi-property materials design problem of finding low-supply-chain risk magnets. MatterGen proposes structures that have both high magnetic density and a low supply-chain risk chemical composition. 

We believe MatterGen is an important step forward in AI for materials design. Our results are currently verified via DFT, which has many known limitations. Experimental verification remains the ultimate test for real-world impact, and we hope to follow up with more results.

None of this would be possible without the highly collaborative work between Andrew Fowler, Claudio Zeni, Daniel Zügner, Matthew Horton, Robert Pinsler, Ryota Tomioka, Tian Xie and our amazing interns Xiang Fu, Sasha Shysheya, and Jonathan Crabbé, as well as Jake Smith, Lixin Sun and the entire AI4Science Materials Design team.  

We are also grateful for all the help from Microsoft Research, AI4Science, and Azure Quantum.

The post MatterGen: Property-guided materials design appeared first on Microsoft Research.


LLMLingua: Innovating LLM efficiency with prompt compression


This research paper was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (opens in new tab) (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.


As large language models (LLMs) advance and their potential becomes increasingly apparent, an understanding is emerging that the quality of their output is directly related to the nature of the prompt given to them. This has resulted in the rise of prompting technologies, such as chain-of-thought (CoT) and in-context learning (ICL), which facilitate an increase in prompt length. In some instances, prompts now extend to tens of thousands of tokens, or units of text, and beyond. While longer prompts hold considerable potential, they also introduce a host of issues, such as prompts that exceed the chat window’s maximum limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.

To address these challenges, we introduce a prompt-compression method in our paper, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (opens in new tab),” presented at EMNLP 2023 (opens in new tab). Using a well-trained small language model, such as GPT2-small or LLaMA-7B, LLMLingua identifies and removes unimportant tokens from prompts. This compression technique enables closed LLMs to make inferences from the compressed prompt. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs. This is illustrated in Figure 1.
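The core idea of dropping unimportant tokens under a budget can be sketched as follows. The `neg_log_probs` scores here are made-up stand-ins for the surprisal values a real small language model such as LLaMA-7B would assign; low-surprisal tokens are treated as redundant and dropped.

```python
def compress_tokens(tokens, neg_log_probs, keep_ratio=0.5):
    # Keep the fraction of tokens with the highest surprisal
    # (negative log-probability under a small language model).
    budget = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: neg_log_probs[i], reverse=True)
    keep = sorted(ranked[:budget])          # preserve original token order
    return [tokens[i] for i in keep]

tokens = ["Please", "kindly", "compute", "the", "sum", "of", "2", "and", "3"]
scores = [1.2, 0.3, 2.5, 0.2, 2.1, 0.4, 3.0, 0.5, 3.1]  # hypothetical surprisals
compressed = compress_tokens(tokens, scores, keep_ratio=0.5)
```

Note how the filler words get dropped while the content-bearing tokens survive, which is why the compressed prompt remains effective for the LLM even when it is hard for humans to read.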

This is an illustration of the LLMLingua framework, which estimates the important tokens of a prompt based on a small language model. It consists of three modules: a budget controller, iterative token-level prompt compression, and distribution alignment. The framework can compress a complex prompt of 2,366 tokens down to 117 tokens, achieving a 20x compression while maintaining almost unchanged performance.
Figure 1. LLMLingua’s framework

LLMLingua’s method and evaluation

To develop LLMLingua’s framework, we employed a budget controller to balance the sensitivities of different modules in the prompt, preserving the language’s integrity. Our two-stage process begins with coarse-grained prompt compression: we first streamline the prompt by eliminating low-importance sentences and then compress the remaining tokens individually. To preserve coherence, we employed an iterative token-level compression approach, refining the relationships between individual tokens. Additionally, we fine-tuned the smaller model to capture the distribution information from different closed LLMs by aligning it with the patterns in the LLMs’ generated data. We did this through instruction tuning.
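The two-stage process can be sketched as below. The sentence and token scores are made-up stand-ins for what the small language model provides, and the keep ratios play the role of the budget controller; this is an illustration of the structure, not LLMLingua's implementation.

```python
def two_stage_compress(sentences, sent_scores, token_scores,
                       sent_keep=0.5, token_keep=0.6):
    # Stage 1 (coarse-grained): keep only the highest-scoring sentences.
    n_sent = max(1, int(len(sentences) * sent_keep))
    top = sorted(sorted(range(len(sentences)),
                        key=lambda i: sent_scores[i], reverse=True)[:n_sent])
    # Stage 2 (token-level): within survivors, keep the most informative tokens.
    pieces = []
    for i in top:
        toks = sentences[i].split()
        n_tok = max(1, int(len(toks) * token_keep))
        keep = sorted(sorted(range(len(toks)),
                             key=lambda j: token_scores[i][j], reverse=True)[:n_tok])
        pieces.append(" ".join(toks[j] for j in keep))
    return " ".join(pieces)

compressed = two_stage_compress(
    ["the answer is forty two", "as mentioned before thanks"],
    [2.0, 0.5],                                         # made-up sentence scores
    [[0.1, 2.0, 0.2, 1.5, 1.8], [0.3, 0.2, 0.1, 0.4]],  # made-up token scores
)
```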

To assess LLMLingua’s performance, we tested compressed prompts on four different datasets (GSM8K, BBH, ShareGPT, and Arxiv-March23), encompassing ICL, reasoning, summarization, and conversation. Our approach achieved impressive results: up to 20x compression while preserving the original prompt’s capabilities, particularly in ICL and reasoning. LLMLingua also significantly reduced system latency.

During our test, we used LLaMA-7B as the small language model and GPT-3.5-Turbo-0301, one of OpenAI’s LLMs, as the closed LLM. The results show that LLMLingua maintains the original reasoning, summarization, and dialogue capabilities of the prompt, even at a maximum compression ratio of 20x, as reflected in the exact match (EM) columns in Tables 1 and 2. At the same time, other compression methods failed to retain key semantic information in prompts, especially in logical reasoning details. For a more in-depth discussion of these results, refer to section 5.2 of the paper.

These are the experimental results on GSM8K and BBH using GPT-3.5-turbo, demonstrating the in-context learning and reasoning capabilities based on different methods and compression constraints. The results show that LLMLingua can achieve up to a 20x compression rate while only experiencing a 1.5-point performance loss.
Table 1. Performance of different methods at different target compression ratios on the GSM8K and BBH datasets.
These are the experimental results for ShareGPT (Conversation) and Arxiv-March23 (Summarization) using GPT-3.5-turbo, based on different methods and compression constraints. The results indicate that LLMLingua can effectively retain the semantic information from the original prompts while achieving a compression rate of 3x-9x.
Table 2. Performance of different methods at different target compression ratios for conversation and summarization tasks.

LLMLingua is robust, cost-effective, efficient, and recoverable

LLMLingua also showed impressive results across various small language models and different closed LLMs. When using GPT-2-small, LLMLingua achieved a strong performance score of 76.27 under the ¼-shot constraint, close to LLaMA-7B’s result of 77.33 and surpassing the standard prompt result of 74.9. Similarly, even without aligning Claude-v1.3, one of the most powerful LLMs, LLMLingua’s score was 82.61 under the ½-shot constraint, outperforming the standard prompt result of 81.8.

LLMLingua also proved effective in reducing response length, leading to significant reductions in latency in the LLM’s generation process, with reductions ranging from 20 to 30 percent, as shown in Figure 2.

The figure demonstrates the relationship between the compression ratio and the number of response tokens. In different tasks, as the compression ratio increases, the response length decreases to varying extents, with a maximum reduction of 20%-30%.
Figure 2. The distribution of token lengths generated at varying compression ratios.

What makes LLMLingua even more impressive is its recoverability feature. When we used GPT-4 to restore the compressed prompts, it successfully recovered all key reasoning information from the full nine-step chain-of-thought (CoT) prompting, which enables LLMs to address problems through sequential intermediate steps. The recovered prompt was almost identical to the original, and its meaning was retained. This is shown in Tables 3 and 4.

This figure illustrates the original prompt, the compressed prompt, and the result of using GPT-4 to recover the compressed prompt. The original prompt consists of a 9-step Chain-of-Thought, and the compressed prompt is difficult for humans to understand. However, the recovered text includes all 9 steps of the Chain-of-Thought.
Table 3. Recovering the compressed prompt from GSM8K using GPT-4.
This figure shows the end-to-end latency when using LLMLingua, without using LLMLingua, and the latency when compressing prompts. As the compression ratio increases, both the LLMLingua and end-to-end latency decrease, achieving up to a 5.7x acceleration with a 10x token compression rate.
Table 4. Latency comparison on GSM8K. LLMLingua can accelerate LLMs’ end-to-end inference by a factor of 1.7–5.7x.

Enhancing the user experience and looking ahead

LLMLingua is already proving its value through practical application. It has been integrated into LlamaIndex (opens in new tab), a widely adopted retrieval-augmented generation (RAG) framework. Currently, we are collaborating with product teams to reduce the number of tokens required in LLM calls, particularly for tasks like multi-document question-answering. Here, our goal is to significantly improve the user experience with LLMs. 

For the long-term, we have proposed LongLLMLingua, a prompt-compression technique designed for long-context scenarios, such as retrieval-augmented question-answering tasks in applications like chatbots, useful when information evolves dynamically over time. It’s also geared for tasks like summarizing online meetings. LongLLMLingua’s primary objective is to enhance LLMs’ ability to perceive key information, making it suitable for numerous real-world applications, notably information-based chatbots. We’re hopeful that this innovation paves the way for more sophisticated and user-friendly interactions with LLMs.

Learn more about our work on the LLMLingua (opens in new tab) page.

The post LLMLingua: Innovating LLM efficiency with prompt compression appeared first on Microsoft Research.


Abstracts: December 6, 2023


Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Xing Xie, a Senior Principal Research Manager at Microsoft Research, joins host Gretchen Huizinga to discuss “Evaluating General-Purpose AI with Psychometrics.” As AI capabilities move from task specific to more general purpose, the paper explores psychometrics, a subfield of psychology, as an alternative to traditional methods for evaluating model performance and for supporting consistent and reliable systems.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Xing Xie, a Senior Principal Research Manager at Microsoft Research. Dr. Xie is coauthor of a vision paper on large language models called “Evaluating General-Purpose AI with Psychometrics,” and you can find a preprint of this paper now on arXiv. Xing Xie, thank you for joining us on Abstracts!

XING XIE: Yes, thank you. It’s my pleasure to be here. 

HUIZINGA: So in a couple sentences, tell us what issue or problem your research addresses and why people should care about it. 


XIE: Yeah, in a sense, actually, we are exploring the potential of psychometrics to revolutionize how we evaluate general-purpose AI. Because AI is advancing at a very rapid pace, traditional evaluation methods face significant challenges, especially when it comes to predicting a model’s performance in unfamiliar scenarios. And these methods also lack a robust mechanism to assess their own quality. Additionally, in this paper, we delve into the complexity of directly applying psychometrics to this domain and underscore several promising directions for future research. We believe that this research is of great importance. As AI continues to be integrated into novel application scenarios, it could have significant implications for both individuals and society at large. It’s crucial that we ensure AI’s performance is both consistent and reliable.

HUIZINGA: OK, so I’m going to drill in a little bit in case there’s people in our audience that don’t understand what psychometrics is. Could you explain that a little bit for the audience? 

XIE: Yeah, psychometrics could be considered as a subdomain of psychology. Basically, psychology just studies everything about humans, but psychometrics is specifically developed to study how we can better evaluate, we could also call this general-purpose intelligence, but it’s human intelligence. So there are, actually, a lot of methodologies and approaches in how we develop this kind of test and what tasks we need to carry out. The previous AI is designed for specific tasks like machine translation, like summarization. But now I think people are already aware of much progress in big models, in large language models. AI, actually, currently can be considered as some kind of solving general-purpose tasks. Sometimes we call it few-shot learning, or sometimes we call it like zero-shot learning. We don’t need to train a model before we bring new tasks to them. So this brings a question in how we evaluate this kind of general-purpose AI, because traditionally, we evaluate AI usually using some specific benchmark, specific dataset, and specific tasks. This seems to be unsuitable for this new general-purpose AI. 

HUIZINGA: So how does your approach build on and/or differ from what’s been done previously in this field? 

XIE: Yeah, we actually see a lot of efforts have been investigated into evaluating the performance of these new large language models. But we see a significant portion of these evaluations are task specific. They’re still task specific. And also, frankly speaking, they are easily affected by changes. That means even slight alterations to a test could lead to substantial drops in performance. So our methodology differs from these approaches in that rather than solely testing how AI performs on those predetermined tasks, we actually are evaluating those latent constructs because we believe that pinpointing these latent constructs is very important.

HUIZINGA: Yeah. 

XIE: It’s important in forecasting AI’s performance in evolving and unfamiliar contexts. We can use an example like game design. With humans, even if an individual has never worked on game design—it’s just a whole new task for her—we might still confidently infer their potential if we know they possess the essential latent constructs, or abilities, which are important for game design. For example, creativity, critical thinking, and communication. 

HUIZINGA: So this is a vision paper and you’re making a case for using psychometrics as opposed to regular traditional benchmarks for assessing AI. So would you say there was a methodology involved in this as a research paper, and if so, how did you conduct the research for this? What was the overview of it? 

XIE: As you said, this is a vision paper. So instead of describing a specific methodology, we are collaborating with several experienced psychometrics researchers. Collectively, we explore the feasibility of integrating psychometrics into AI evaluation and discerning which concepts are viable and which are not. In February this year, we hosted a workshop on this topic. Over the past months, we have engaged in numerous discussions, and the outcome of these discussions is articulated in this paper. And additionally, actually, we are also in the middle of drafting another paper; that paper will apply insights from this paper to devise a rigorous methodology for assessing the latent capabilities of the most cutting-edge language models. 

HUIZINGA: When you do a regular research paper, you have findings. And when you did this paper and you workshopped it, what did you come away with in terms of the possibilities for what you might do on assessing AI with psychometrics? What were your major findings? 

XIE: Yeah, our major findings can be divided into two areas. First, we underscore the significant potential of psychometrics. This includes exploring how these metrics can be utilized to enhance predictive accuracy and guarantee test quality. Second, we also draw attention to the new challenges that arise when directly applying these principles to AI. For instance, test results could be misinterpreted, as assumptions verified for human tests might not necessarily apply to AI. Furthermore, capabilities that are essential for humans may not hold the same importance for AI.

HUIZINGA: Hmm …  

XIE: Another notable challenge is the lack of a consistent and defined population of AI, especially considering their rapid evolution. But this population is essential for traditional psychometrics, and we need to have a population of humans for verifying either the reliability or the validity of a test. But for AI, this becomes a challenge. 

HUIZINGA: Based on those findings, how do you think your work is significant in terms of real-world impact at this point? 

XIE: We believe that our approach will signal the start of a new era in the evaluation of general-purpose AI, shifting from earlier, task-specific methodologies to a more rigorous scientific method. Fundamentally, there’s an urgent demand to establish a dedicated research domain focusing solely on AI evaluation. We believe psychometrics will be at the heart of this domain. Given AI’s expanding role in society and its growing significance as an indispensable assistant, this evolution will be crucial. I think one missing part of current AI evaluation is how we can make sure the test, the benchmark, or these evaluation methods of AI themselves, is scientific. Actually, previously, I used the example of game design. Suppose in the future, I think there are a lot of people discussing language model agents, AI agents … they could be used to not only write in code but also develop software by collaborating among different agents. Then what kind of capabilities, or we call them latent constructs, of these AI models they should have before they make success in game design or any other software development. For example, like creativity, critical thinking, communication. Because this could be important when there are multiple AI models—they communicate with each other, they check the result of the output of other models. 

HUIZINGA: Are there other areas that you could say, hey, this would be a relevant application of having AI evaluated with psychometrics instead of the regular benchmarks because of the generality of intelligence?

XIE: We are mostly interested in maybe doing research, because a lot of researchers have started to leverage AI for their own research. For example, not only for writing papers, not only for generating some ideas, but maybe they could use AI models for more tasks in the whole pipeline of research. So this may require AI to have some underlying capabilities, like, as we have said, like critical thinking—how AI should define the new ideas and how they check whether these ideas are feasible and how they propose creative solutions and how they work together on research. This could be another domain. 

HUIZINGA: So if there was one thing that you want our listeners to take away from this work, what would it be? 

XIE: Yeah, I think the one takeaway I want to say is we should be aware of the vital importance of AI evaluation. We are still far from achieving a truly scientific standard, so we need to still work hard to get that done. 

HUIZINGA: Finally, what unanswered questions or unsolved problems remain in this area? What’s next on your research agenda that you’re working on? 

XIE: Yeah, actually, there are a lot of unanswered questions as highlighted at the later part of this paper. Ultimately, our goal is to adapt psychometric theories and the techniques to fit AI contexts. So we have discussed with our collaborators in both AI and psychometrics … some examples would be, how can we develop guidelines, extended theories, and techniques to ensure a rigorous evaluation that prevents misinterpretation? And how can we best evaluate assistant AI and the dynamics of AI-human teaming? This actually is particularly proposed by one of our collaborators in the psychometrics domain. And how do we evaluate the value of general-purpose AI and ensure their alignment with human objectives? And then how can we employ semiautomatic methods to develop psychometric tests, theories, and techniques with the help of general-purpose AI? That means we use AI to solve these problems by themselves. This is also important because, you know, psychometrics or psychology have developed for hundreds, or maybe thousands, of years to come to all the techniques today. But can we shorten that period? Can we leverage AI to speed up this development? 

HUIZINGA: Would you say there’s wide agreement in the AI community that this is a necessary direction to head?

XIE: This is only starting. I think there are several papers discussing how we can apply some part of psychology or some part of psychometrics to AI. But there is no systematic discussion or thinking along this line. So I, I don’t think there is agreement, but there’s already initial thoughts and initial perspectives showing in the academic community. 

[MUSIC PLAYS]

HUIZINGA: Well, Xing Xie, thanks for joining us today, and to our listeners, thank you for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts (opens in new tab), or you can find a preprint of the paper on arXiv. See you next time on Abstracts!

The post Abstracts: December 6, 2023 appeared first on Microsoft Research.


Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow


These research papers were presented at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (opens in new tab) (ESEC/FSE 2023), a premier conference in the field of software engineering.

ESEC/FSE 2023
Two papers on a blue/green gradient: InferFix and AdaptivePaste

The practice of software development inevitably involves the challenge of handling bugs and various coding irregularities. These issues can become pronounced when developers engage in the common practice of copying and pasting code snippets from the web or other peer projects. While this approach might offer a quick solution, it can introduce a host of potential complications, including compilation issues, bugs, and even security vulnerabilities into the developer’s codebase.

To address this, researchers at Microsoft have been working to advance different aspects of the software development lifecycle, from code adaptation to automated bug detection and repair. At ESEC/FSE 2023 (opens in new tab), we introduced two techniques aimed at enhancing coding efficiency. AdaptivePaste utilizes a learning-based approach to adapt and refine pasted code snippets in an integrated development environment (IDE). InferFix is an end-to-end program repair framework designed to automate bug detection and resolution. This blog outlines these technologies.



AdaptivePaste: Intelligent copy-paste in IDE

A widespread practice among developers involves adapting pasted code snippets to specific use cases. However, current code analysis and completion techniques, such as masked language modeling and CodeT5, do not achieve an acceptable level of accuracy in identifying and adapting variable identifiers within these snippets to align them with the surrounding code. In the paper, “AdaptivePaste: Intelligent Copy-Paste in IDE,” we propose a learning-based approach to source code adaptation, aiming to capture meaningful representations of variable usage patterns. First, we introduce a specialized dataflow-aware de-obfuscation pretraining objective for pasted code snippet adaptation. Next, we introduce a transformer-based model in two variants: a traditional unidecoder and a parallel-decoder model with tied weights.

Diagram depicting AdaptivePaste architecture. Starting with a program with a pasted code snippet, AdaptivePaste extracts and prioritizes the syntax hierarchies most relevant for the learning task, analyzes the dataflow, and then anonymizes the pasted code. The resulting program serves as input for the neural model. The output is serialized as a sequence of tokens.
Figure 1. AdaptivePaste architecture. For a program with a pasted code snippet, AdaptivePaste extracts and prioritizes the syntax hierarchies most relevant for the learning task, analyzes the dataflow, and anonymizes variable identifiers in the pasted code snippet. The resulting program serves as input for the neural model. The output is serialized as a sequence of tokens.

The unidecoder follows a standard autoregressive decoder formulation, mapping each variable in the pasted snippet to a unique symbol in the context or declaring a new variable. The parallel decoder duplicates the decoder for each anonymized symbol in the anonymized pasted snippet, predicting names independently and factorizing the output distribution per symbol. This enables selective code snippet adaptation by surfacing model predictions above a specified threshold and outputting “holes” where uncertainty exists.
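The selective adaptation behavior can be sketched as follows. The `predictions` mapping, symbol names, and threshold here are hypothetical stand-ins for the parallel decoder's real per-symbol output distributions, not AdaptivePaste's actual interface.

```python
def selective_adapt(predictions, threshold=0.8, hole="<?>"):
    # Surface a predicted identifier only when the decoder's confidence
    # clears the threshold; otherwise emit a hole for the developer to fill.
    return {sym: (name if prob >= threshold else hole)
            for sym, (name, prob) in predictions.items()}

# Hypothetical (name, probability) predictions per anonymized symbol.
preds = {"VAR_0": ("df", 0.95), "VAR_1": ("result", 0.42), "VAR_2": ("path", 0.88)}
adapted = selective_adapt(preds)
```

Raising the threshold trades coverage for precision, which matches the precision gains reported below for the selective code adaptation setting.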

To establish a dataflow-aware de-obfuscation pretraining objective for pasted code snippet adaptation, we assigned mask symbols to variable identifiers at the granularity of whole code tokens. The pre-existing code context was unanonymized, allowing the model to attend to existing identifier names defined in scope.
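A rough sketch of whole-token identifier masking is below, using Python's `ast` module to find variables the snippet itself binds. Real dataflow-aware anonymization is more involved; the mask-symbol naming here is illustrative.

```python
import ast
import re

def anonymize_snippet(code):
    # Collect variables the snippet itself binds (assignment or loop targets)
    # and replace each whole identifier token with a mask symbol. Names that
    # are only read from the surrounding context (e.g. `items` below) are
    # left unanonymized, so a model can attend to identifiers defined in scope.
    names = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id not in names:
                names.append(node.id)
    mapping = {name: f"VAR_{i}" for i, name in enumerate(names)}
    for name, sym in mapping.items():
        code = re.sub(rf"\b{re.escape(name)}\b", sym, code)
    return code, mapping

snippet = "total = 0\nfor item in items:\n    total = total + item\n"
masked, mapping = anonymize_snippet(snippet)
```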

Our evaluation of AdaptivePaste showed promising results. It successfully adapted Python source code snippets with 67.8 percent exact match accuracy. When we analyzed the impact of confidence thresholds on model predictions, we observed that the parallel decoder transformer model improves precision to 85.9 percent in a selective code adaptation setting.

InferFix: End-to-end program repair with LLMs

Addressing software defects accounts for a significant portion of development costs. To tackle this, the paper, “InferFix: End-to-End Program Repair with LLMs over Retrieval-Augmented Prompts,” introduces a program repair framework that combines the capabilities of a state-of-the-art static analyzer called Infer, a semantic retriever model called Retriever, and a transformer-based model called Generator to address crucial security and performance bugs in Java and C#.

The Infer static analyzer is used to reliably detect, classify, and locate critical bugs within complex systems through formal verification. The Retriever uses a transformer encoder model to search for semantically equivalent bugs and corresponding fixes in large datasets of known bugs. It’s trained using a contrastive learning objective to excel at finding relevant examples of the same bug type.

The Generator employs a 12-billion-parameter Codex model, fine-tuned on supervised bug-fix data. To enhance its performance, the prompts provided to the Generator are augmented with bug type annotations, bug contextual information, and semantically similar fixes retrieved from an external nonparametric memory by the Retriever. The Generator then produces candidate fixes for the bug.
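The prompt augmentation step can be sketched as below. The prompt layout, field names, and example inputs are our illustration of the idea, not the exact format used in the paper.

```python
def build_repair_prompt(bug_type, location, buggy_code, similar_fixes,
                        max_examples=2):
    # Assemble a retrieval-augmented prompt: bug-type annotation, location,
    # retrieved similar fixes, then the buggy code for the Generator to fix.
    parts = [f"// Bug type: {bug_type}", f"// Location: {location}"]
    for ex in similar_fixes[:max_examples]:
        parts += ["// Similar bug and fix:", ex]
    parts += ["// Buggy code:", buggy_code, "// Fixed code:"]
    return "\n".join(parts)

# Hypothetical inputs standing in for Infer's output and the Retriever's hits.
prompt = build_repair_prompt(
    "NULL_DEREFERENCE", "Foo.java:42",
    "return user.getName();",
    ["if (x == null) return; x.run();"])
```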

Diagram depicting the InferFix approach workflow. Starting with a Pull Request, the Infer Static Analyzer conducts bug detection, classification, and localization. Subsequently, Context Extraction gathers pertinent details of the bugs and the surrounding context, and then Retriever identifies semantically similar bugs. The process concludes with the LLM Generator proposing a fix based on the generated prompt.
Figure 2: The InferFix workflow. An error-prone code modification is detected by the Infer static analyzer, which is used to craft a prompt with bug type annotation, location information, relevant syntax hierarchies, and similar fixes identified by the Retriever. The large language model (LLM) Generator provides a candidate fix to the developer.

To test InferFix, we curated a dataset called InferredBugs (opens in new tab), which is rich in metadata and comprises bugs identified through executing the Infer static analyzer on thousands of Java and C# repositories. The results are noteworthy. InferFix outperforms strong LLM baselines, achieving a top-1 accuracy of 65.6 percent in C# and an impressive 76.8 percent in Java on the InferredBugs dataset.

Looking ahead

With AdaptivePaste and InferFix, we hope to significantly streamline the coding process, minimizing errors and enhancing efficiency. This includes reducing the introduction of bugs when code snippets are added and providing automated bug detection, classification, and patch validation. We believe that these tools hold promise for an enhanced software development workflow, leading to reduced costs and an overall boost in project efficiency.

Looking ahead, the rapid advancement of LLMs like GPT-3.5 and GPT-4 has sparked our interest in exploring ways to harness their potential in bug management through prompt engineering and other methods. Our goal is to empower developers by streamlining the bug detection and repair process, facilitating a more robust and efficient development environment.

The post Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow appeared first on Microsoft Research.


Research Focus: Week of December 4, 2023


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus
December 6th, 2023

Leveraging Large Language Models for Automated Proof Synthesis in Rust

Formal verification can provably guarantee the correctness of critical system software, but the high proof burden has long hindered its wide adoption. Recently, large language models (LLMs) have shown success in code analysis and synthesis. In a recent paper: Leveraging Large Language Models for Automated Proof Synthesis in Rust, researchers from Microsoft present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus.

In a few-shot setting, GPT-4 demonstrates impressive logical ability in generating postconditions and loop invariants, especially when analyzing short code snippets. However, GPT-4 does not consistently retain and propagate the full context information needed by Verus, a task that can be straightforwardly accomplished through static analysis. Based on these observations, the researchers developed a prototype based on OpenAI’s GPT-4 model. This prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis. Evaluating the prototype with a developer in the automation loop on 20 vector-manipulating programs showed that it significantly reduces human effort in writing entry-level proof code.
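The decompose-query-combine loop can be sketched as follows. `propose_invariant` and `verify` are stubs standing in for GPT-4 and for Verus plus lightweight static analysis; the bounded-retry structure is our illustration of the iterative querying described above.

```python
def synthesize_proofs(functions, propose_invariant, verify, max_attempts=3):
    # Decompose verification into per-function tasks, query the proposer
    # for candidate loop invariants, and accept a candidate only once the
    # verifier confirms it; retry a bounded number of times otherwise.
    proofs = {}
    for fn in functions:
        for attempt in range(max_attempts):
            candidate = propose_invariant(fn, attempt)
            if verify(fn, candidate):
                proofs[fn] = candidate
                break
    return proofs

# Toy stand-ins: the "LLM" cycles through candidates; the "verifier"
# accepts only the invariant that is actually inductive.
candidates = ["i <= n", "0 <= i <= n", "true"]
proofs = synthesize_proofs(
    ["sum_vec"],
    lambda fn, attempt: candidates[attempt],
    lambda fn, inv: inv == "0 <= i <= n",
)
```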



Don’t Forget the User: It’s Time to Rethink Network Measurements

The goal of network measurement is to characterize how and how well a network is performing. This has traditionally meant a focus on the bits and bytes — low-level network metrics such as latency and throughput, which have the advantage of being objective but are limited in representativeness and reach. In a recent paper: Don’t Forget the User: It’s Time to Rethink Network Measurements, researchers from Microsoft argue that users also provide a rich and largely untapped source of implicit and explicit signals that could complement and expand the coverage of traditional measurement methods. Implicit feedback leverages user actions to indirectly infer network performance and the resulting quality of user experience. Explicit feedback leverages user input, typically provided offline, to expand the reach of network measurement, especially for newer ones.

The researchers analyze example scenarios, including capturing implicit feedback through user actions, such as the user (un)muting the mic or turning the camera on or off in a large-scale conferencing service. These techniques complement existing measurement methods and open a broad set of research directions, ranging from rethinking measurement tools to designing user-centric networked systems and applications.
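As one way to picture implicit feedback, a toy heuristic is sketched below: it turns a stream of user actions into a crude degradation signal. The event names, penalty weights, and window are entirely made up for illustration; they are not a method from the paper.

```python
def implicit_quality_signal(events, window=60.0):
    # Score complaint-like user actions (retried unmutes, camera toggles,
    # reconnects) within a time window; a higher score suggests the user
    # is experiencing degraded network quality.
    penalties = {"unmute_retry": 2.0, "camera_off": 1.0, "reconnect": 3.0}
    score = 0.0
    for t, action in events:
        if t <= window:
            score += penalties.get(action, 0.0)
    return score

# Hypothetical (timestamp_seconds, action) events from one session.
events = [(5.0, "camera_off"), (12.0, "unmute_retry"), (90.0, "camera_off")]
signal = implicit_quality_signal(events)
```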


Ghana 3D international telemedicine proof of concept study

A real-time 3D telemedicine system, leveraging Holoportation™ communication technology, was used to facilitate consultations with complex reconstructive patients before, during, and after an overseas surgical collaboration. The system was used in a proof-of-concept clinic in November 2022 between Canniesburn Plastic Surgery Unit, UK, and the National Reconstructive Plastic Surgery and Burns Centre, Korle Bu Teaching Hospital, Ghana.

Four patients in Ghana were followed through their patient journey (mandibular ameloblastoma, sarcoma of the thigh, maxillary tumor, sarcoma of the back). A new report, Ghana 3D Telemedicine International MDT: A Proof-of-concept study, details the responses of 13 participants (4 patients, 4 Ghana clinicians, 5 UK clinicians) who completed feedback on the 3D multidisciplinary team (MDT). Outcome measures were rated highly, with satisfaction 84.31/100, perceived benefit 4.54/5, overall quality 127.3/147, and usability 83.2/100. These data show close alignment with results previously published on high-income countries.

This novel technology has potential to enhance overseas surgical visits in low-to-middle income countries through improved planning, informed discussion with patients, expert consensus on complex cases, and fostering engagement with professionals who may be thousands of miles away.


The post Research Focus: Week of December 4, 2023 appeared first on Microsoft Research.
