Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting

Large language models (LLMs) are profoundly useful for a vast array of difficult tasks. But they sometimes make unpredictable mistakes or perpetuate biased language. These sorts of errors tend to arise over time due to changes in the underlying data or in user behavior. This necessitates targeted, cost-effective fixes to these models and the real-world applications they support.

Repeated pretraining or finetuning might be used to achieve these fixes. However, these solutions are often too computationally expensive. For example, LLaMA 1 was trained for 21 days on 2,048 A100 GPUs, costing over $2.4 million. Finetuning LLMs requires more GPU capacity than many research labs can access consistently and affordably. Plus, it remains largely unknown which data should even be added to or removed from a data corpus to correct specific behaviors without impacting unrelated inputs.

To keep LLMs up to date without expensive training, model editing has recently been proposed as a paradigm for making targeted updates to big models. Most model editors update a model once, injecting a batch of corrections. But mistakes are often discovered sequentially over time and must be corrected quickly. In other words, lifelong model editing, in which a stream of mistakes is encountered and must be addressed immediately, is essential once models are deployed. This requires making many edits sequentially, a setting in which existing editors are known to fail. Success here means correcting all edits in sequence, without forgetting old fixes and without decaying performance on unrelated inputs. But what exactly is an edit? In Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors, three types of edits are considered:

  1. Updating factual knowledge. Let’s say we have a pre-trained question-answering model: We pass questions in, and the model returns answers. But as the world changes, these answers become outdated. For example, the answer to “Who is the president of the U.S.?” should change after an election. Therefore, an edit is a tuple – or an ordered sequence of values – containing a question (e.g., “Who is the president of the U.S.?”) and the correct answer (e.g., “Biden”) for the question.
  2. Keeping up with flipping labels. Ground truth in classification tasks can change over time. For example, when U.S. courts use new language to describe existing topics, a document’s correct label can change. In such a case, a model trained on the old labels must be corrected. Targeted edits are especially important when only specific types of data are relabeled, which is common. In this case, an edit is a paired input (e.g., court document) and a new label (e.g., topic).
  3. Mitigating fabrication and incoherence in LLMs. A key challenge in using LLMs is avoiding instances where they generate language that is ungrounded in reality. But this might happen more in some models than others. Therefore, when it does happen, the ensuing edit should be as small as possible. To explore the effectiveness of this approach, the researchers consider mitigating this problem when generating biographies of famous people. Upon identifying hand-annotated fabrications, they edit an LLM to instead produce corresponding sentences from real Wikipedia articles. In this case, an edit is a prompt and a corresponding response, which the existing model finds unlikely.
Figure 1. Overview of lifelong model editing with GRACE. In the example shown, the model’s outdated answer to “What was the latest pandemic?” (Swine Flu) is corrected to “COVID”: the language model stays frozen, and GRACE retrieves the appropriate value (a new embedding) from a codebook of trainable embeddings. Models make important errors that must be corrected, so GRACE makes edits by learning, caching, and selectively retrieving new transformations between layers. Over long sequences of edits, which appear sporadically and require quick fixes, GRACE codebooks grow and adapt.

To make cost-effective edits to LLMs, we propose an approach referred to as General Retrieval Adaptors for Continual Editing, or GRACE. GRACE is the first method to enable thousands of sequential edits to any pre-trained model architecture using only streaming errors. The approach is simple and effective: to edit a model so that it outputs a chosen label for an input, pick a layer in the model and pick an embedding at that layer to represent the input. For example, the embedding that the fourth layer computes for the final token of an input sentence can be used. This embedding is cached, and a new embedding is learned such that, if the new embedding is substituted for the old one, the model produces the desired response. The original embedding is referred to as a key and the learned embedding as a value; learning the value is straightforward via gradient descent. The key and value are then stored in a codebook, which acts as a dictionary. When a new input is passed to the model, its embedding at the chosen layer, referred to as a query, is compared to the existing keys. If the query matches a key, the corresponding value is looked up and the edit is applied. As edits stream in, they are simply added to the codebook, allowing many edits to be applied sequentially.
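To make the mechanism concrete, here is a minimal sketch of a GRACE-style codebook adaptor in PyTorch-flavored Python. The class and function names, the Euclidean distance, and the optimizer settings are illustrative assumptions for this post, not the released implementation.

```python
import torch

class GraceAdaptor:
    """Codebook of (key, value, epsilon) entries wrapped around one layer (illustrative sketch)."""

    def __init__(self, init_epsilon: float = 1.0):
        self.keys = []       # cached input embeddings (keys)
        self.values = []     # learned replacement embeddings (values)
        self.epsilons = []   # per-key deferral radii
        self.init_epsilon = init_epsilon

    def __call__(self, h: torch.Tensor) -> torch.Tensor:
        """Pass a hidden state through; substitute a cached value on a codebook hit."""
        if not self.keys:
            return h
        dists = torch.stack([torch.norm(h - k) for k in self.keys])
        i = int(torch.argmin(dists))
        if float(dists[i]) <= self.epsilons[i]:
            return self.values[i]   # query falls inside a key's ball: apply the edit
        return h                    # otherwise leave the frozen model's computation untouched


def learn_value(key: torch.Tensor, edit_loss, steps: int = 100, lr: float = 1.0) -> torch.Tensor:
    """Gradient-descend a value so the frozen model produces the desired output.

    `edit_loss(value)` is assumed to run the rest of the frozen network with `value`
    substituted at the edited layer and return the loss on the target response.
    """
    value = key.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([value], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        edit_loss(value).backward()
        optimizer.step()
    return value.detach()
```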

Table 1. GRACE outperforms existing model editors by successfully editing models without forgetting previous edits or unrelated training data. On the zsRE and SCOTUS datasets, GRACE achieves substantial compression. On the Hallucination dataset, GRACE successfully embeds long future sequences of tokens into cached values.

But isn’t this just memorization? How can generalizable edits be achieved without memorizing every new input? Instead of always adding new keys, every new key is paired with an influence radius: a ball of radius ε surrounding the key. If any query lands inside this ε-ball, the key’s corresponding value is retrieved and the edit is applied. Thus, inputs that are similar to any cached edit will also be updated. Occasionally, a new key’s ε-ball may conflict with an existing key’s. If the conflicting keys have different values, their ε-balls are set to just barely touch. If they have the same value, the existing key’s ε is increased to include the new input. Tuning ε helps achieve small codebooks that generalize and can successfully make thousands of edits in a row.
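Continuing the illustrative sketch above, this conflict-resolution rule might look roughly like the following (again an assumption-laden toy, not the authors' code):

```python
def add_edit(adaptor: GraceAdaptor, key: torch.Tensor, value: torch.Tensor) -> None:
    """Add a (key, value) pair to the codebook, resolving epsilon-ball conflicts."""
    epsilon = adaptor.init_epsilon
    for i, (k, v, e) in enumerate(zip(adaptor.keys, adaptor.values, adaptor.epsilons)):
        d = float(torch.norm(key - k))
        if d <= e + epsilon:                       # the two balls would overlap
            if torch.allclose(v, value):
                adaptor.epsilons[i] = max(e, d)    # same value: grow the ball to absorb the new input
                return                             # no new key needed
            adaptor.epsilons[i] = d / 2            # different values: shrink both radii
            epsilon = d / 2                        # so the balls just barely touch
    adaptor.keys.append(key.detach())
    adaptor.values.append(value.detach())
    adaptor.epsilons.append(epsilon)
```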

To compare GRACE’s ability to make generalizable edits against existing methods, two bidirectional models (T5 and BERT) and one autoregressive model (GPT2-XL) were used. For question answering (QA), T5 was used along with a QA dataset that includes questions targeted for relation extraction. Twenty rephrased versions of each question were extracted; 10 were used during editing and the other 10 served as unseen holdouts. The proposed approach outperformed existing methods when correcting 1,000 edits sequentially, as shown in Table 1, and it used only 137 keys to make the edits, demonstrating the method’s efficiency. This level of generalization is better than prior work and shows promising potential for correcting future mistakes. The proposed approach also successfully edits a BERT model that was trained on U.S. Supreme Court documents from before 1992 and tested on documents from after 1992, for which the label distribution shifted. An experiment was also conducted using GRACE with an autoregressive model, GPT2-XL, to edit mistakes related to fabrication; the results were promising, even for long sequences of edits. For example, when asked to generate a biography of Brian Hughes, GRACE successfully encouraged GPT2-XL to respond: “Brian Hughes (born 1955) is a Canadian guitarist whose work draws from both the smooth jazz and world music genres,” which exactly matches the requested biography using only one cached value. Another interesting observation was that GRACE edits were robust to the choice of edited layer, though later layers were harder to edit. Further, a clear balance was observed between memorization and generalization when choosing ε, as shown in Figure 2. Finally, a key feature of GRACE is that the codebook is detached from the pre-trained model, leaving its weights untouched. This makes it possible to undo any edit at any time, and the behavior of the edits can be inspected without high computational costs.

A figure containing eight subfigures displayed as two rows and four columns. Each row corresponds to a value of epsilon, the hyperparameter in our proposed method that controls generalization: the first row shows an epsilon of 0.1 and the second an epsilon of 3.0. Each column shows a line graph for a different metric, tracking how that metric changes over 3,000 sequential edits to a T5 QA model on the zsRE dataset. Each plot contains four lines, one per edited T5 block, comparing edits made to blocks 0, 2, 4, and 6. The first column shows the TRR metric, which measures model accuracy on its original test data after editing. For an epsilon of 0.1, TRR remains at 0.72 throughout, with no difference between blocks. For an epsilon of 3.0, TRR remains at 0.72 only for Block 6 and is lowest for Block 0, dropping below 0.7 by the end of editing. The second column shows the ERR metric, which is accuracy on previous edits at each step. For an epsilon of 0.1, Blocks 2, 4, and 6 remain high at nearly 1.0; for an epsilon of 3.0, Block 6 remains high while the other blocks drop to around 0.9. The third column shows performance on unseen holdout edits, which are rephrasings of seen edits. After each edit, all holdout edits are run through the edited model and accuracy on the whole set is recorded, so in both plots performance increases over time as the edits gradually cover more rephrasings of the holdout set; this measures GRACE’s generalization. For an epsilon of 0.1, Block 6 generalizes slightly better than the other blocks; for an epsilon of 3.0, Block 6 underperforms the other blocks significantly, Block 0 is slightly better, and Blocks 2 and 4 are much better. The final column reports the number of keys GRACE uses to make all 3,000 edits. Block 6 simply memorizes every edit: its number of keys grows linearly, reaching 3,000 keys after 3,000 edits. For Blocks 0, 2, and 4, the key count saturates, with edits made using far fewer keys: about 2,000 keys when epsilon is 0.1, while for an epsilon of 3.0 Block 0 uses about 1,000 keys and Blocks 2 and 4 use around 800. This demonstrates how the choice of block and epsilon drives the trade-off between memorization and generalization. Overall, generalizable edits appear to happen in interior model layers rather than the first or last layers, and for slightly larger choices of epsilon.
Figure 2. GRACE’s performance when editing different blocks of a T5 model for different choices of epsilon. This choice drives a balance between accuracy on unrelated training data (TRR) and previous edits (ERR), as shown by a small epsilon (a) and a big epsilon (b).

Summary

GRACE presents a different perspective on model editing, in which representations are modified directly and transformations are cached sequentially. Thousands of edits can be made in sequence while maintaining only a small codebook throughout editing. This narrows the gap to the deployment needs of real-world applications, where edits are discovered over time and must be addressed cost-effectively. By correcting behaviors efficiently and extending sequential editing to other model properties, like fairness and privacy, this work can potentially enable a new class of solutions for adapting LLMs to meet user needs over long deployment lifetimes.

The post Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting appeared first on Microsoft Research.

Abstracts: November 20, 2023

Microsoft Research Podcast: Abstracts, November 20, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Shrey Jain, a Technical Project Manager at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows, discuss their work on contextual confidence, which presents a framework to understand and more meaningfully address the increasingly sophisticated challenges generative AI poses to communication.

Transcript

[MUSIC PLAYS] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  

[MUSIC FADES] 

Today I’m talking to Shrey Jain, an applied scientist at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows. Shrey and Zoë are coauthors of a paper called Contextual Confidence and Generative AI, and you can read a preprint of this paper now on arXiv. Shrey Jain, Zoë Hitzig. Thanks for joining us on Abstracts.

SHREY JAIN: Thank you.


ZOË HITZIG: Great to be here. 

HUIZINGA: Shrey, let’s start out with you. What problem does this research address, what made you care about it, and why should we care about it, too? 

JAIN: Yeah, so right now, there’s a lot of discussion as towards what the impacts of generative AI is on communication, and there’s been a lot of different terms being thrown around amongst AI policy researchers or news organizations, such as disinformation, misinformation, copyright, fair use, social engineering, deception, persuasion, and it makes it really hard to understand the precise new problem that this new technology, generative AI, brings towards our understanding of how we communicate with one another. And so what we wanted to do in this research is try to present a framework to sort of guide both policymakers, AI labs, and other people working in this field to have a better understanding of the challenges that generative AI presents and accordingly be able to meet those challenges with a set of strategies that are precise to the way we understand it and also try to uncover new strategies that might remain hidden in other frameworks that are traditionally being used to address these challenges. 

HUIZINGA: So expand on that a little bit in terms of, you know, what made you care about it? What was the prompt—no pun intended—for generative AI that got you concerned about this? And what kinds of things ought we to be thinking about in terms of why we should care about it, too? 

JAIN: Yeah, there’s a lot of different areas under which generative AI presents new challenges to our ability to communicate, one of which was literally the ability to communicate with close family members. I think we’ve seen a lot of these deception attacks kind of happening on the elderly, who have been susceptible to these attacks pre-generative AI in the past, and only thought that that might become more concerning. I no longer live in a city where my family lives, and so the only way to communicate with them is through a digital form now, and if we don’t have confidence in that interaction, I’m scared of the repercussions that has more broadly. And, you know, being at Microsoft Research, having worked on initiatives related to election integrity, was also starting to think through the impacts that this could have at a much wider scale. And so that’s kind of what prompted us to start thinking through how we can meet that challenge and try to make a contribution to mitigate that risk. 

HUIZINGA: Zoë, almost all research builds on existing foundations, so what body of work does your research draw from, and how does this paper add to the literature?

HITZIG: I’d say this research paper draws on a few different strands of literature. First, there has been a lot of social theorizing and philosophizing about what exactly constitutes privacy, for example, in the digital age. And in particular, there’s a theory of privacy that we find very compelling and we draw a lot from in the paper, which is a theory called contextual integrity, which was put forward by Helen Nissenbaum, a researcher at Cornell Tech. And what contextual integrity says is that rather than viewing privacy as a problem that’s fundamentally about control over one’s personal information or a problem about secrecy, contextual integrity says that an information flow is private when it respects the norms that have been laid down by the sender and the receiver. And so there’s a violation of privacy, according to Nissenbaum’s theory, when there’s a violation of contextual integrity. So we really take this idea from Nissenbaum and extend it to think about situations that, first of all, didn’t come up before because they’re unusual and generative AI poses new kinds of challenges. But second of all, we extend Nissenbaum’s theory into thinking not just about privacy but also authenticity. So what is authenticity? Well, in some sense, we say it’s a violation of a norm of truthfulness. What we really add to this theorizing on privacy is that we offer a perspective that shows that privacy questions and questions about authenticity or authentication can’t really be separated. And so on the theory side, we are extending the work of media scholars and internet scholars like Helen Nissenbaum but also like danah boyd and Nancy Baym, who are Microsoft Researchers, as well, to say, look, privacy and authenticity online can no longer be separated. We have to see them as two sides of the same coin. They’re both fundamentally about contextual confidence, the confidence we have in our ability to identify the context of a communication and to protect the context of that communication. So that’s sort of the theory side. And then, of course, our other big contribution is all the practical stuff that takes up the bulk of the paper. 

HUIZINGA: Right. Shrey, let’s talk about methodology for a minute. And this is a unique paper in terms of methodology. How would you describe your research approach for this work, and where does it fit on the spectrum of methodology for research? 

JAIN: Yeah, this paper is definitely a bit different from the conventional empirical research that might be done in the space. But it’s more of a policy or, I guess, framework paper where we try to provide both, as Zoë just commented on, the theory for contextual confidence but then also try to illustrate how we might apply contextual confidence as a framework to the existing challenges that generative AI presents. And so in order to make this framework and the theory that we present useful, we wanted to try to understand both what are the set of challenges that fall into these categories of identifying context and protecting context. So, specifically, how does generative AI threaten our ability to identify and protect? And trying to take a bird’s eye view in understanding those challenges. And then also kind of doing what might look similar to like a literature review but different in a way that we collect all of the different strategies that are typically talked about in the conversation but then in using contextual confidence as a framework realizing that new strategies that aren’t as well discussed in the conversation might be useful to meet these different challenges. And so from a methodology perspective, it’s almost like we’re applying the theory to uncover new … both new strategies that might be useful in this moment and then finding ways to give concrete examples of us applying that framework to existing technological questions that both people in the industry, as well as in policy, are thinking through when it comes to these questions about generative AI.

HUIZINGA: Zoë, for me, the most interesting part of research papers is that little part that comes after the phrase “and what we found was …” So, um, how would you describe what your takeaways were here, and how did you present them in the paper? 

HITZIG: That’s a great question. That’s also my favorite question to ask myself when I’ve completed a project. I think the biggest thing that I learned through writing this paper and collaborating with Shrey was really, for the first time, I forced myself to interrogate the foundations of effective communication and to understand what it is that we rely on when, you know, we pass a stranger on the street and look at them in a certain way and somehow know what it means. Or what we rely on to understand, you know, how our partner is feeling when they speak to us over coffee in the morning. I was really forced to step back and think about the foundations of effective communication. And in doing so, what we realized was that an ability to both identify and protect context is what allows us to communicate effectively. And in some sense, this very basic fact made me see how sort of shockingly robust our communication systems have been in the past and yet at the same time how fragile they could be in the face of this alarming new technology that has the power to fundamentally upset these two foundational processes of identifying and protecting context in communication. I would also say, on the question of what we found, you know, my first answer was about these sort of fundamental insights that had never occurred to me before about what makes communication effective and how it’s threatened. But also, I was able to understand and sort of make sense of so many of the strategies and tools that are in discussion today. And, for example, I was able to see, in a totally new light, the importance of, for example, something as simple as having some form of digital identification or the simplicity of, you know, what makes a good password and what can we do to strengthen passwords in the future. So there was this strong theoretical insight, but also that theoretical insight was enormously powerful in helping us organize the very concrete discussions around particular tools and technologies. 

HUIZINGA: Hmm. It’s a beautiful segue into the question I have for Shrey, which is talking about the real-world impact of this work. You know, coming down to the practical side from the theoretical, who does this work help and how? 

JAIN: Yeah, I want to also add a disclaimer in that, in this podcast, we kind of present generative AI almost as this like villain to communication. [LAUGHTER] I think that there’s also a possibility that generative AI improves communication, and I want to make sure that we acknowledge the optimism that we do see here. I think part of the real-world impact is that we want to mitigate the cost that generative AI brings to communications without hurting the utility at the same time. When applying contextual confidence in contrast to, say, views of traditional privacy, which may view privacy in terms of secrecy or information integrity, we hopefully will find a way in ensuring that the utility of these models is not significantly lost. And so in terms of the real-world impact, I think when it comes to both policies that are being set right now, norms around how we interact with these models, or any startup founder or person who’s deploying these tools, when they think about the reviews that they’re doing from a privacy point of view or a compliance point of view, we hope that contextual confidence can guide, as a framework, a way that protects users of these tools along with not hindering model capabilities in that form. 

HUIZINGA: Zoë, if there was one takeaway that you want our listeners to get from this work on contextual confidence, what would it be?

HITZIG: What I hope that readers will take away is, on the one hand, the key conceptual insight of the paper, which is that in today’s digital communication and in the face of generative AI, privacy questions and authenticity questions cannot be separated. And in addition, I hope that we’ve communicated the full force of that insight and shown how this framework can be useful in evaluating the deployment of new tools and new technologies. 

HUIZINGA: Finally, Shrey, what outstanding questions or challenges remain here, and how do you hope to help answer them? 

JAIN: In the paper, we have presented a theoretical understanding of contextual confidence and present various different strategies that might be able to help meet the challenges that generative AI presents to our ability to both identify and protect context, but we don’t know how those strategies themselves may or may not undermine the goals that we’re presenting because we haven’t done empirical research to know how a given strategy might work across different types of people. In fact, the strategies could undermine the initial goals that we intend. A verification stamp for some might enhance credibility, but for those who may not trust the institution verifying, it may actually reduce credibility. And I think there’s a lot of empirical research both on the tool development, usability, and then back to guiding the theoretical framework that we present that we want to continue to refine and work on as this framework hopefully becomes more widely used. 

HUIZINGA: Well, Shrey Jain, Zoë Hitzig, thank you for joining us today, and to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about contextual confidence and generative AI, you can find a link to the preprint of this paper at aka.ms/abstracts, or you can read it on arXiv. See you next time on Abstracts.

[MUSIC FADES]

The post Abstracts: November 20, 2023 appeared first on Microsoft Research.

What’s Your Story: Desney Tan

In this new Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today.

Across his time at Microsoft, Desney Tan, Managing Director of Microsoft Research Redmond, has had the experience of shepherding research ideas into products multiple times, and much like the trajectory of research, his life journey has been far from linear. In this episode, Tan shares how he moved to the United States from Singapore as a teenager, how his self-described “brashness” as a Microsoft intern helped shift the course of his career, and how human impact has been a guiding force in his work.

photos of Desney Tan throughout his life

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

DESNEY TAN: Early in the career, I always looked at successful people and it always felt like they had a goal, and it was a very nice straight line to get there, and they did all the right things, and I don’t know anyone today that I deem to be successful that had a straight-line path and did all the right things.

[TEASER ENDS]

JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.

[MUSIC ENDS]

In this episode, I’m talking with Desney Tan, a longtime Microsoft executive whose experience with the company spans computational neuroscience, human-computer interaction, and health and the life sciences. His research contributions have impacted a wide range of Microsoft products. Desney was previously Vice President and Managing Director of Microsoft Health Futures and is now Managing Director of Microsoft Research Redmond.

Much like the trajectory of research, Desney’s life journey has been far from linear. He left Singapore to attend school in the United States as a teenager, then worked in autonomous navigation for NASA and in VR for Disney before landing here at Microsoft. Here’s my conversation with Desney, beginning with his childhood.


DESNEY TAN: Born and raised in Singapore. Dad was an architect. Mom did everything, um, to run the family. When I turned 13, Mom and Dad came to me and they said, “Hey, would you like to try something new?” I said sure. You know, I had no idea what they, they were thinking. Two weeks later, they sent me to the US to study. Um, looking back, sometimes I flippantly claim I was just eating too much at home and so they had to send me away. [LAUGHTER] But actually, it was, you know, I think it was prescient on their part. They sort of looked at my path. They looked at the education system. They looked at the way I learned and the way I created and the way I, I acted, and they somewhat realized, I think, very early on that the US was a great … would, would be a great place for me to sort of flourish and, and sort of experiment and explore and, and grow.

GEHRKE: And so how did it work? You just went by yourself?

TAN: So I had an aunt and an uncle in Louisiana. Spent a couple of years in high school there. Um, sort of … fun, fun side story. They looked at … the high school looked at my math curriculum in Singapore, and they said, “Oh, he’s at least a year ahead.” So they skipped me a year ahead. And then through some weird miscalculations on their part, they actually ended up skipping me nearly two years ahead.

GEHRKE: Oh, wow.

TAN: And by the time we realized, I had already integrated into school, the courses were just fine, and so I ended up skipping a lot of years.

GEHRKE: So you ended up graduating then high school what …

TAN: Pretty early. I was 15.

GEHRKE: 15 …

TAN: Graduated from high school. Got to college. Had no idea what I wanted to do. What 15-year-old does? Um, ended up in liberal arts college, so University of Notre Dame. So, so I don’t know how Mom let me do this, but, you know, I got all my acceptance letters together. I said I don’t know anything about college. I don’t know where I want to go. I don’t know what I want to do. I’m going to toss all the letters up in the air, and the one that lands on top is the school I’m going to.

GEHRKE: And that’s what you did?

TAN: Yeah, that’s exactly what I did. Um, divine intervention, let’s call it. Notre Dame landed on top. You know, switched majors a bunch of times. I started off in aerospace, did chemical engineering, civil engineering. I was on the steps of becoming a priest until they sent me away. They said, “Hey, if it’s not a mission and a calling, go away and come back later.” And ended up with a computer engineering degree. You know, I had great mentors, you know, who looked out for me. I had a couple of guardian angels out there, you know, guided me along, and that, you know, that was just a wonderful breadth of education. Went back to the military for a couple of years. Uh, served there for a couple of years. Did a bunch of growing up.

GEHRKE: That, that’s quite a change, right, from being like in college and then going back to the military.

TAN: Yeah, yeah, it was a mandatory service in Singapore, and so I went back. Had a ton of fun. Learned a bunch of stuff about the world, about myself. I claim the military is one of the few organizations in the world that takes an 18-year-old and teaches them leadership, um, and teaches them about themselves and teaches them about how to push themselves and where the boundaries are. And so fairly accidentally, I, I got to benefit from all of that. At the end of that, I realized my computer engineering degree was, you know … I realized two things. One, my computer engineering degree was a little outdated by the time I got out of the military, and two, that I didn’t love being told what to do. [LAUGHS]

GEHRKE: [LAUGHS] OK.

TAN: So I came back. Uh, did grad school. I was at Carnegie Mellon. Ended up getting hooked up with a wonderful professor, Randy Pausch.

GEHRKE: “The Last Lecture,” right?

TAN: Who gave “The Last Lecture” in his last days. You know, learned a ton from him not only about academics and scholarship, but also about life and, um, and leadership.

GEHRKE: And so he was at the intersection of graphics and HCI, right, if I remember correctly?

TAN: That’s correct, yeah.

GEHRKE: So what is your PhD in?

TAN: My PhD was actually looking at, um, distributed displays in virtual reality. So how, how the human brain absorbs information and uses the world around us to be able to, um, interact with digital data and analog data.

GEHRKE: Early on in a really important field already.

TAN: Yeah, no, it was great. Spent a couple of years with NASA in the Jet Propulsion Lab doing autonomous navigation. This was the early days of, um, you know, AI and, and planning.

GEHRKE: So those aerospace engineering classes, were they actually useful?

TAN: They, you know, all the classes I took ended up coming back to be useful in a number of ways. And actually, um, you know, the diversity of viewpoints and the diversity of perspectives is something that sat very deeply in me. So anyways, you know, spent some time at NASA. Um, spent some time at Disney with the Imagineers building virtual reality theme parks. This was the late ’90s, early 2000s. So Disney at the time had all the destination theme parks: Disneyland, Disney World, places you would fly for a week and, and spend a week at. Their goal was really to build a theme park in a box that they could drop down into the urban centers, and the only way to get a theme park into a building was digital experiences. And so this was the very early days of VR. We were using, you know, million-dollar military-grade headsets. They were, you know, 18, 19 pounds.

GEHRKE: Wow.

TAN: Disney was one of the companies—and, you know, it’s sat with me for a very long time—that designs experiences for every single person on earth, right. So these headsets had to work on your 2-year-old. They had to work on your 102-year-old. They had to work on, you know, a person who spoke English, who read, who didn’t read, who didn’t speak English. You know, tall, short, large, small, all of it. And they did a wonderful job finding the core of what it means to be human and designing compelling experiences for all of us, um, and that was a ton of fun. We ended up deploying these facilities called DisneyQuest. There was one in Chicago; one in Orlando. They just closed them down a couple of years ago because actually all the VR rights have now migrated into the theme parks themselves.

GEHRKE: And it was actually a VR experience? You would go and sit …

TAN: It was a VR experience. They dropped them down. They had basically buildings. There were, you know, floors full of classic and new-age arcade games. And then there were VR experiences that you could run around in and, um, interact with.

GEHRKE: Interesting. I’ve never, I mean, I lived in Madison for four years, but I’ve never heard of that Quest experience. It seems to be a fun way to experience Disney … by not going to any of the, the theme parks.

TAN: It was super fun. Um, yeah, we, I personally got to work on a couple of rides. There was Pirates of the Caribbean.

GEHRKE: Oh, wow.

TAN: So you put on … a family would put on headsets and kind of run around, shooting pirates and what have you. And then the Aladdin ride was I thought one of the better rides.

GEHRKE: Oh, wow, yeah …

TAN: Where you sit on a magic carpet as you can imagine.

GEHRKE: Oh yeah. That sounds fun.

TAN: It was perfectly scripted for it. Um, anyways, ended up at Microsoft largely because entertainment technology while a lot of fun and while I learned a ton was, uh, strangely unsatisfying, and there was something in me and, you know, that was seeking human impact at scale in a much deeper and much more direct way. And so I thought I’d be here for three or four years largely to learn about the tech industry and how, you know, large pieces of software were deployed before going off and doing the impact work. And I’ve now been here for nearly 20 years.

GEHRKE: And where do you start? Did you start out right away at Microsoft Research, or were you first in a product group?

TAN: My career here has been a cycle of starting in Microsoft Research, incubating, failing, trying again. Failing again. You know, at some point, screaming “Eureka!” [LAUGHTER] and then doing my tours of duty through the product groups, commercializing … productizing, commercializing, you know, seeing it to at least robustness and sustainability if not impact and then coming back and doing it again. Um, and the thing that’s kept me here for so long is every time I’ve completed one of those cycles and thought I was done here, um, the company or the world in some cases would throw, you know, a bigger, thornier, juicier thing in front of me, and Microsoft has always been extremely encouraging, um, and supportive of, you know, taking on those challenges and really innovating and opening up all new, whole new opportunities.

GEHRKE: I mean this whole cycle that you’re talking about, right, of sort of starting out small at MSR (Microsoft Research), you know, having sort of the seed of an, of an idea and then growing it to a bigger project and at some point in time transitioning, transitioning it into, into the product group and actually really making it a business. So tell me about … you said you have done this, you know, a few times and, you know, once you were even highly successful. I’d love to learn more about this because I think it’s so inspiring for everybody to learn more about this.

TAN: Yeah. No, it’s been magical. I have to say before going into any of these stories that none of these paths were architected. As, as you well know, they never are. So actually my, my first experience was as an intern here, and, you know, I was a sort of brash, perhaps rash, intern. I was working on virtual reality, and in the evenings, I would meet with folks around the company to learn more, and I met with a team that was building out multi-monitor functionality in Windows NT. Prior to Windows NT, Windows computers had one and only one monitor, and they started to build the functionality to build multiple. As the brash grad student, you know, I, I had different thoughts about how this should be implemented and, you know, couldn’t convince anyone of it. And so in the evenings, I ended up starting just to build it. At the end of the internship, in addition to all the stuff I was doing, I said, “Hey, by the way, I’ve built this thing. You know, take it or leave it. Here you go.” And it ended up being the thing that was implemented in NT for a variety of reasons. That really got me hooked. Prior to that, I had imagined myself an academic, going back and, you know, being a professor somewhere in academia. And as soon as I saw, you know, the thing I did and that, you know, Microsoft actually polished up and made good in the real world…

GEHRKE: And shipping in millions and millions of desktops, right?

TAN: That’s right. There was no getting away from that.

GEHRKE: OK, right.

TAN: When I first got here, MSR had actually hired me thinking I’d work on virtual reality. And I got here and I said, hey, VR … I’ve just done a ton of VR. VR is probably 15 or 20 years out from being democratized and consumerized. I’m going to do something for a couple of years, and then I’ll come back to this. Um, so I got into computational neuroscience, looking at, um, sensors that scanned or sensed the brain and trying to figure out mental state of people. I had the imagination that this would be useful both for direct interaction but also for understanding human behavior and human actions a little bit better. We won’t go into that work, but, um, what happened with the productization of that was I went … this was at the time when Bill Gates was actually pushing very hard on tablet PCs and the stylus and the pen as an interesting input modality. The realization we had was, hey, we’ve got spatial temporal signal coming off the brain we’re trying to make sense of; the tablet guys had spatial temporal signal coming off a pen they were trying to make sense of in handwriting recognition. And so we went over and we said, hey, what interesting technological assets do you have that we can steal and use on the brain. Turns out they were more convincing than us. And, and so they said, hey, actually you’re right. The problems do look similar. What do you have that you could bring over? And so if you look at the handwriting recognition system even that stands today, it’s a big mess of a neural network, um, largely because that came out of interpreting neural signal that got transferred into the handwriting recognizer.

GEHRKE: I see.

TAN: And so I ended up spending two, maybe 2 1/2 years, working not only on the core recognition engine itself but also the entire interface that ran around the tablet PC and, you know, the tablet input panel.

GEHRKE: But that’s sort of an interesting realization, right. You came because you thought you would land Technology X for Application Y, but actually you land it for a very different application.

TAN: That’s right. And, and each cycle has had a little bit of that surprise and that serendipity, which we’ve now built into the way we do research. And, um, you sort of head down a path because it moves you forward as quickly as possible. But you keep your eyes peeled for the serendipitous detours and the, the discovery that comes out of that. Um, and I think that’s what makes Microsoft Research as an organization, um, so compelling and, and so productive, right, as … we, we do run very fast, but we have the freedoms and, you know, the flexibility really to take these windy paths and to take these detours and, and to go flip over, you know, rocks, some of which end up being, you know, dead ends.

GEHRKE: Right.

TAN: Others of which end up being extremely productive.

GEHRKE: Right. And so if you think about, let’s say, a junior person in the lab, right. They’re sort of looking at you and your career and saying, “Wow, what steps should I take to, you know, become as successful as Desney?” What, what advice would you give them, right? Because it seems like you have always had sort of MSR as sort of your rock, right. But then you jumped over the river a few times, but then came back and jumped over again. Came back.

TAN: First off, I, I don’t know that Desney has been so successful so much as, you know, the people around Desney have been extremely successful and Desney’s gotten to ride the wave. But, yeah, no, I mean every, everyone’s … you know, as I look around the table and the board, you know, everyone has a slightly different journey, and everyone has slightly different work styles and mindsets and personalities and risk tolerance and what have you. Um, so the first thing really is, is not to try to fully emulate anyone else. I always claim we’re, we’re kind of like machine learning models, right. We, we should be taking input data, positive and negative, and building our models of ourselves and our models of the world and really operating within that context. I think having a North Star, whether it’s implicit or explicit, has been extremely useful for myself and the people around me.

GEHRKE: By North Star, you mean like a philosophical North Star or technical North Star or North Star in what you want to be? What, what do you mean?

TAN: Yes, yes, yes. All of it.

GEHRKE: So tell me more about your personal North Star.

TAN: For, for, for us … for myself, it’s really been about human impact, right. Everything we do is centered on human impact. We do research because it’s part … it’s, it’s one of the steps towards achieving human impact. We productize because it’s one of the steps towards human impact. Our jobs are not ever done until we hit the point of human impact, and then they’re not quite done because there’s always more to be had. Um, so I think having that, you know, perhaps a value system, um, at least, you know, sort of grounds you really nicely and, and creates, I think, or can create a courage and a bravery to pursue, which I think is important. You know, different people do this differently, but I have been very lucky in my career to be surrounded by people that have been way, way, way better than myself, um, and, and extremely generous of their passions and their skills and their expertise and their time. You know, ask it and just about any successful person by whatever definition and I think they’ll tell you the same thing, that it’s the people around. And then being tolerant, maybe even seeking of, this windy path. You know, when I was early in the career, I always looked at successful people, or people I deemed to be successful, and it always felt like they had a goal, and it was a very nice straight line to get there, and they did all the right things and, and took all the right steps, and, um, and I don’t know anyone today that I deem to be successful that had a straight-line path and did all the right things.

GEHRKE: Yeah, and it’s often these setbacks in, you know, one’s career that actually give you often some of the best learnings because either of some things that you’ve sort of done structurally wrong or some things that, you know, you really need more experience and, and, you know, that setback gave you that experience. So, so one other question around this is also just around change, right. Because especially right now, we’re living in this time where maybe the rate of change especially in AI is kind of unprecedented. I mean, benchmarks are falling in like a quarter of the time than they would have thought to be lasting. You know, we all have played with ChatGPT. Just extrapolate that out a few more months, if not years, right. OpenAI is here talking about AGI. So how do you think about change for yourself and evolution and learning, and do you have any, any routines? How, how do you keep up with everything that’s going on?

TAN: Yeah, it’s, uh … good question. I guess the overarching philosophy, the approach that I’ve taken with my career, is that everything’s constantly in change. You know, the rate of change may vary, and the type of change and the, the mode of change might vary, but everything’s constantly changing, and so our jobs at any given point are to understand the context in the world, in the organization, with the people around you, and really be doing the best that you can at any given moment. And as that context changes, you kind of have to dynamically morph with it. I subscribe pretty fully to the Lean Startup model. So, you know, formulate hypotheses … and this is the research process really, right. Formulate hypotheses, test them as quickly as you can, learn from that, and then do it again, and rinse and repeat. And then … and, you know, you could sort of plot your path and steer your path through based on that. Um, and so we operate very much on that. As, as the world changes, we change. As, you know, the org changes, we change. And there’s a certain robustness that comes along with that. It’s not all roses, and obviously change is and uncertainty is, is a difficult context to operate in.

GEHRKE: And super interesting because it also speaks to some of the things that one should, um, sort of look out for when doing research, right. If you’re saying, well, I have these hypotheses and I want to quickly test them, right, if I’m in a field or if I work with data that I, you know, cannot really use, where the testing of an hypothesis will take months if not years to bring out, this might not be the best research direction. So how should I think about sort of research, the choice of research problems …

TAN: It’s a good question, yeah.

GEHRKE: … sort of with this, with this change in mind, right?

TAN: Yeah, yeah. Um, I don’t know. I, I’m, I guess … again, I’m brash on this. There are, there are very few problems and spaces that can’t be navigated, um, and so things that seem impossible at first glance are often navigable, you know, with a little bit or maybe sometimes a lot of creativity. Um, you know, if our jobs are to take Microsoft and the rest of the world to places that Microsoft and the rest of the world might not get itself to—hopefully positive places—then we’re going to have to do things in a way that is probably unnatural for Microsoft and the rest of the world, um, to get there. And the company and the organization, MSR, has been extremely supportive of that level of creativity.

GEHRKE: Can you give an example of that for …?

TAN: We had Cortana, which is our speech recognition and conversational engine. We didn’t really have a platform to deploy that on. At the same time, we saw a bunch of physicians, clinicians, struggling with burnout because they were seeing patients for less than half the time. They were spending more of their time sitting in front of the computer, documenting stuff, than they were seeing patients and treating patients. We said, hey, what if you put the two together? What if you sat in the room, listened to the doctor and the patient, and started to automatically generate the documentation? And in fact, if you did that, you could structure the data, which leads for better downstream analytics. Um, and if you did that, you could start to put machine learning and AI and smarts into the system, as well. That project, which was called EmpowerMD, led eventually—after a bunch of missteps and a bunch of learnings and a bunch of creativity—to a very deep partnership with Nuance, um, and creation of Dragon Ambient eXperience and the eventual acquisition thereof of that company. And, um, it’s just a wonderful product line. It’s, you know, kind of a neat way to think about data and intelligence and human augmentation and integration into otherwise messy, noisy human processes. Um, but yeah, you know, I think with enough creativity, um, you know, we’ve, we’ve bumped into very, very few brick walls.

GEHRKE: And what I love about the story is that it’s not about a specific technology choice, but it’s more about a really important problem, right.

TAN: That’s right. Yeah. If your problem is right and if your conviction is right about the value of the solution …

GEHRKE: Yeah.

TAN: …you build teams around it. You build processes around it. You’re creative in the way you execute. And, um, I’d say more times than not, we end up getting there.

GEHRKE: Yeah, well, I love that insight because it’s often much more valuable to solve an important problem than to land some deep technology on a problem that very few people care about …

TAN: I think that’s right.

GEHRKE: …and it seems like that’s what you have done here.

TAN: Yeah.

GEHRKE: Well, it was really great and inspiring to hear from you, Desney. Thanks so much for the conversation.

[OUTRO MUSIC]

TAN: Yeah, thanks for having me, Johannes.

GEHRKE: To learn more about Desney’s work or to see photos of Desney during his winding journey to Microsoft, visit aka.ms/ResearcherStories.

The post What’s Your Story: Desney Tan appeared first on Microsoft Research.

Research Focus: Week of November 8, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations

Generating both plausible and accurate full body avatar motion is essential for creating high quality immersive experiences in mixed reality scenarios. Head-mounted devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF—or the six degrees of freedom of movement by a rigid body in a three-dimensional space. Recent approaches have achieved impressive performance in generating full body motion given only head and hands signal. However, all known existing approaches rely on full hand visibility. While this is the case when using motion controllers, for example, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility, owing to the restricted field of view of the HMD.

In a recent paper: HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations, researchers from Microsoft propose HMD-NeMo, the first unified approach that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts full body motion in an online and real-time fashion. At the heart of HMD-NeMo is a spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. The researchers perform extensive analysis of the impact of different components in HMD-NeMo and, through their evaluation, introduce a new state-of-the-art on AMASS, a large database of human motion unifying different optical marker-based motion capture datasets.
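As a reading aid, the sketch below illustrates the mask-token idea in generic PyTorch. The layer sizes, the GRU used as a stand-in temporal encoder, and the input dimensions are placeholder assumptions; HMD-NeMo's actual spatio-temporal encoder and decoding heads differ.

```python
import torch
import torch.nn as nn

class MaskedHandEncoder(nn.Module):
    """Toy encoder: hand features are replaced by a learned mask token when the hand is not visible."""

    def __init__(self, in_dim: int = 9, d_model: int = 128):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)               # per-frame signal embedding
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # learned placeholder for missing hands
        self.temporal = nn.GRU(3 * d_model, d_model, batch_first=True)

    def forward(self, head, left, right, left_visible, right_visible):
        # head/left/right: (B, T, in_dim); *_visible: (B, T) boolean visibility flags
        h = self.embed(head)
        l = torch.where(left_visible.unsqueeze(-1), self.embed(left), self.mask_token)
        r = torch.where(right_visible.unsqueeze(-1), self.embed(right), self.mask_token)
        features, _ = self.temporal(torch.cat([h, l, r], dim=-1))
        return features  # a downstream head would decode full-body joint motion from these features
```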


NEW ARTICLE

Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?

The research field of end-user programming has largely been concerned with helping non-experts learn to code well enough to achieve their own tasks. Generative AI stands to obviate this entirely by allowing users to generate code from naturalistic language prompts.

In a recent essay: Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?, researchers from Microsoft explore the relevance of “traditional” programming languages for non-expert end-user programmers in a world with generative AI. They posit the “generative shift hypothesis”: that generative AI will create qualitative and quantitative expansions in the traditional scope of end-user programming. They outline some reasons that traditional programming languages may still be relevant and useful for end-user programmers, and speculate whether each of these reasons might endure or disappear with further improvements and innovations in generative AI. And finally, they articulate a set of implications for end-user programming research, including the possibility of needing to revisit many well-established core concepts, such as Ko’s learning barriers and Blackwell’s attention investment model.


NEW RESEARCH

LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup

On-device deep neural network (DNN) inference, widely used in mobile devices such as smartphones and smartwatches, offers unparalleled intelligent services, but also stresses the limited hardware resources on those devices.

In a recent paper: LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup, researchers at Microsoft propose a system that reduces latency, memory, disk, and power consumption for more efficient DNN inference. LUT-NN learns the typical features for each operator, known as centroids, and precomputes the results for these centroids to store in lookup tables. During inference, the results for the centroids closest to the inputs can be read directly from the table and used as the approximate outputs, avoiding most of the computation.

LUT-NN integrates two major novel techniques: (1) differentiable centroid learning through backpropagation, which adapts three levels of approximation to minimize the accuracy impact of using centroids; (2) table lookup inference execution, which comprehensively considers different levels of parallelism, memory access reduction, and dedicated hardware units for optimal performance.
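
As a rough illustration of table-lookup inference, the sketch below replaces a linear layer's multiply-accumulate work with a nearest-centroid search and a precomputed lookup table. The centroids here are fixed at random rather than learned through backpropagation, and the sub-vector partitioning is simplified, so this is only a toy approximation of the approach described above.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, n_sub, n_centroids = 64, 32, 8, 16
    sub_dim = d_in // n_sub

    W = rng.standard_normal((d_in, d_out))
    # Illustrative centroids per sub-space; LUT-NN learns these with backpropagation.
    centroids = rng.standard_normal((n_sub, n_centroids, sub_dim))

    # Precompute the lookup table: partial dot products of every centroid
    # with the corresponding slice of the weight matrix.
    table = np.einsum("skd,sdo->sko",
                      centroids,
                      W.reshape(n_sub, sub_dim, d_out))  # (n_sub, n_centroids, d_out)

    def lut_linear(x: np.ndarray) -> np.ndarray:
        """Approximate x @ W by looking up precomputed centroid results."""
        x_sub = x.reshape(n_sub, sub_dim)
        # Index of the closest centroid in each sub-space.
        idx = np.argmin(
            np.linalg.norm(centroids - x_sub[:, None, :], axis=-1), axis=-1)
        # Sum the precomputed partial results instead of multiplying.
        return table[np.arange(n_sub), idx].sum(axis=0)

    x = rng.standard_normal(d_in)
    err = np.linalg.norm(lut_linear(x) - x @ W)  # approximation error that training would minimize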


Toward developing faster algorithms for minimizing submodular functions

This research paper was presented at the 64th IEEE Symposium on Foundations of Computer Science (FOCS) 2023 (opens in new tab), a premier forum for the latest research in theoretical computer science.

FOCS 2023 paper: Toward developing faster algorithms for minimizing submodular functions

Submodular functions are versatile mathematical tools, finding diverse applications in real-world scenarios and guiding solutions across complex domains. From dissecting the intricate networks of graphs to deciphering the complexities of economic landscapes through utility functions, and even navigating the enigmatic world of random variables via entropy functions, they offer valuable insights into challenging problems. Their wide-ranging applicability has made them pivotal tools for modeling and optimization in various theoretical computer science domains, including operations research and game theory. In recent years, submodular functions have gained prominence in solving optimization problems within machine learning (ML) applications. These tasks encompass vital areas such as feature selection and clustering, as illustrated in Figure 1. Additionally, submodular functions are instrumental in applications like sensor placement and graphical models. For further exploration, comprehensive resources are available in Bilmes’ insightful survey (opens in new tab) and Bach’s standard textbook (opens in new tab) on this subject.

Figure 1. Application of submodular function optimization to feature selection, on the left, and clustering on the right.

Algorithm design for submodular function minimization

In a joint paper with researchers from Stanford University, “Sparse Submodular Function Minimization (opens in new tab),” presented at FOCS 2023 (opens in new tab), we investigate the problem of minimizing a submodular function in the standard model. Here, we assume that the submodular function can be accessed through an evaluation oracle that returns the value \( f(S) \) in response to a query with a set \( S \). This is the most classical and well-studied model for algorithm design for minimizing submodular functions.

Before we discuss our study, it’s important to bear in mind that a submodular function \( f \) is defined on subsets of a finite set of elements \( V \) and satisfies a diminishing marginal returns property. That is, for any two subsets \( S \subseteq T \) and any element \( e \in V \setminus T \), the marginal value of \( e \) when added to the smaller set, \( f(S \cup \{e\}) - f(S) \), is at least the marginal value of \( e \) when added to the bigger set, \( f(T \cup \{e\}) - f(T) \).
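
For a quick, concrete check of this property, the snippet below evaluates a coverage function, a textbook example of a submodular function, and verifies the diminishing-returns inequality on a pair of nested sets. It simply illustrates the definition and is not part of the paper's algorithms.

    # Coverage function: f(S) = size of the union of the sets indexed by S.
    sets = {
        "a": {1, 2, 3},
        "b": {3, 4},
        "c": {4, 5, 6},
        "d": {1, 6},
    }

    def f(S):
        covered = set()
        for name in S:
            covered |= sets[name]
        return len(covered)

    S = {"a"}        # smaller set
    T = {"a", "b"}   # bigger set, S ⊆ T
    e = "c"          # element not in T

    gain_small = f(S | {e}) - f(S)  # marginal value of e added to the smaller set (3 here)
    gain_big = f(T | {e}) - f(T)    # marginal value of e added to the bigger set (2 here)
    assert gain_small >= gain_big   # diminishing marginal returns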

In the 1980s, foundational work (opens in new tab) revealed that submodular functions could be minimized in polynomial time, marking a significant breakthrough. Since then, researchers have made substantial progress in the quest for faster algorithms for submodular function minimization (SFM). Despite these efforts, fundamental questions persist, such as determining the minimum number of queries required to minimize any given submodular function—a concept referred to as the problem’s query complexity.

Currently, the most advanced algorithm needs to make \( \widetilde{O}(n^2) \) queries for any given submodular function, while the best lower bound is only \( \widetilde{\Omega}(n) \), where \( n \) is the size of the ground set on which the submodular function is defined. This disparity results in a substantial gap, leaving an \( n \)-fold difference between the existing upper and lower bounds.

Given this considerable difference, a natural question arises: What additional structural assumptions could potentially pave the way for faster algorithms in submodular function minimization (SFM)? One prevalent assumption is sparsity, which posits that the size of the set minimizing the submodular function is small. This holds particular relevance in diverse applications, including signal processing, feature selection, and compressed sensing. In these scenarios, solutions are expected to exhibit sparse non-zero entries, making it important to understand how algorithmic complexity depends on sparsity, as it provides insights into the intricate combinatorial and geometric structures of the problems.

Interestingly, existing algorithmic techniques developed over the past four decades for SFM do not yield improved runtimes even when the solution is sparse. Therefore, it is imperative to develop innovative techniques that can drive advancements in sparse SFM and bridge the existing gap between upper and lower bounds.

Parallel algorithms for submodular function minimization

Exploring beyond SFM’s query complexity, recent research has shed light on the importance of sparse SFM, particularly in understanding the inherent adaptivity of parallel algorithms (known as parallel complexity) designed to solve the problem. Research has shown that any parallel algorithm for SFM requires a minimum adaptivity that is a polynomial in the size of the ground set.

Our results improve both parallel and sequential algorithms for SFM. For example, consider a scenario where the minimizer of the given submodular function is \( \widetilde{O}(1) \)-sparse. In this context, our parallel algorithm runs in a nearly constant number of rounds, while our sequential algorithm makes a nearly linear number of queries. This achievement stands in stark contrast with the previous best parallel upper bound of \( \widetilde{O}(n) \) and the best query complexity upper bound of \( \widetilde{O}(n^2) \).

Fast first-order methods for exact submodular function minimization

Current fast algorithms for SFM rely on cutting-plane methods, a standard class of convex optimization techniques applied to the Lovász extension—a natural continuous extension of the given submodular function. However, restricting the optimization domain to sparse solutions doesn’t significantly expedite cutting-plane methods beyond a logarithmic factor. To address this, we shifted our approach and employed first-order methods, including stochastic mirror descent, to minimize the Lovász extension. These methods, non-Euclidean generalizations of stochastic gradient descent, are more attuned to problem geometry. Unlike cutting-plane methods, whose runtime depends only polylogarithmically on the additive error to the optimal solution, first-order methods converge at a rate that is polynomial in the inverse of that error.

This rate of convergence indicates that first-order methods are better suited for approximate submodular function minimization, while our goal is to solve it exactly. Using the sparsity assumption, we developed a new algorithmic framework for SFM based on a new concept of duality. We used this framework to demonstrate how first-order methods, with substantially reduced accuracy requirements, can be applied to solve SFM exactly.
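
To give a flavor of how first-order methods enter the picture, the sketch below runs plain projected subgradient descent on the Lovász extension of a small submodular function, using the classical fact that sorting the coordinates of a point yields a subgradient from the marginal gains along the sorted chain. It illustrates the standard first-order approach discussed above, not the new duality-based framework introduced in the paper, and the example function, step size, and iteration count are arbitrary.

    import numpy as np

    def lovasz_subgradient(f, x):
        """A subgradient of the Lovász extension of f at a point x in [0,1]^n."""
        order = np.argsort(-x)     # sort coordinates in decreasing order
        g = np.zeros_like(x)
        S, prev = [], f([])
        for i in order:
            S.append(int(i))
            cur = f(S)
            g[i] = cur - prev      # marginal gain along the sorted chain
            prev = cur
        return g

    def minimize_sfm(f, n, steps=500, lr=0.05):
        """Toy projected subgradient descent; rounds the fractional point by thresholding."""
        x = np.full(n, 0.5)
        for _ in range(steps):
            g = lovasz_subgradient(f, x)
            x = np.clip(x - lr * g, 0.0, 1.0)  # project back onto the box [0,1]^n
        return [i for i in range(n) if x[i] > 0.5]

    # Example: a modular (hence submodular) function minimized by the set {2}.
    def f(S):
        S = set(S)
        return len(S) - 2 * (2 in S)

    print(minimize_sfm(f, n=4))  # expected output: [2]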

Toward faster algorithms for SFM and its applications

These techniques not only promise advancements for sparse SFM but also provide a foundation for tackling other fundamental problems in SFM theory. Our algorithms for sparse SFM serve as valuable starting points for designing improved algorithms for related problems. They offer potential insights into developing polynomial-time algorithms for SFM with lower query and parallel complexity, opening avenues for future research.

Traditionally, research on submodular function minimization has focused on the global properties of the problem over the past four decades. Sparse SFM, in contrast, enables us to explore local and more refined structures of submodular functions. Our work introduces new algorithmic tools that better use these structural properties, a vital aspect for applications in ML and operations research, because these areas often have special structures. Beyond advancing sparse SFM, our paradigm paves the way for the development of enhanced algorithms for SFM and its diverse applications.


Teachers in India help Microsoft Research design AI tool for creating great classroom content


Teachers are the backbone of any educational system. They are not just educators; they are indispensable navigators, mentors, and leaders. Teachers around the world face many challenges, which vary from country to country or even within a city or town. But some challenges are universal, including time management, classroom organization, and creating effective lesson plans.

Advances in AI present new opportunities to enhance teachers’ abilities and empower students to learn more effectively. That’s the goal of a new project from Microsoft Research, which uses generative AI to help teachers quickly develop personalized learning experiences, design assignments, create hands-on activities, and more, while giving them back hours of time that they spend on daily planning today.

Shiksha copilot is a research project and an interdisciplinary collaboration between Microsoft Research India and teams across Microsoft. Shiksha (Sanskrit: शिक्षा, IAST and ISO: śikṣā) is a Sanskrit word meaning “instruction, lesson, learning, study of skill”. The project aims to improve learning outcomes and empower teachers to create comprehensive, age-appropriate lesson plans combining the best available online resources, including textbooks, videos, classroom activities, and student assessment tools. To help curate these resources, the project team built a copilot—an AI-powered digital assistant—centered around teachers’ specific needs, which were identified right at the start through multiple interviews and workshops.

Working with Sikshana Foundation (opens in new tab), a local non-governmental organization focused on improving public education, the researchers are piloting this program at several public schools in and around Bengaluru, India, to build and improve the underlying tools. This post gives an overview of the project, including interviews with three teachers who have used Shiksha copilot in their own classrooms.

A road map for teachers

A lesson plan is like a road map charting what students need to learn and how to efficiently cover the material during class time. It includes three key components:​

  • Objectives for student learning, based on grade level and subject​  
  • Teaching and learning tactics, including tutorials and activities to help students understand the topic
  • Strategies to assess student understanding, both in class and through homework 

Parimala H V teaches science in grades 6-8 at Government Higher Primary School, Santhe Beedhi in Bengaluru. She teaches in the local language, Kannada, and in English. For each class she teaches, she spends an hour or more each day scanning textbooks and printed materials to put together an effective lesson plan. She also searches the internet for ideas, but sifting through the growing body of online content could take just as long. Often she would work till midnight planning the next day’s activities, which left her feeling tired and stressed.

“Lesson planning can be a struggle, but it’s very important,” Parimala said. “If the planning goes well, everything goes well.”

With Shiksha copilot, Parimala was able to develop a complete lesson plan in 60 to 90 seconds, instead of 60 to 90 minutes. The simple interface asks basic questions about the curriculum, language of delivery, grade level, and subject. It then compiles engaging learning materials to achieve the teacher’s classroom objectives. Parimala finds better ideas and hands-on activities using Shiksha copilot than through other online tools. She feels well rested and better prepared for her day, which also makes her happier in the classroom. And with the time she saves, she can focus more on coaching her students and improving her teaching practices.

Ms. Parimala standing in front of a school

“I was thrilled to have the opportunity to use Shiksha copilot,” Parimala said. “It could be very useful for new teachers just learning their profession. I think it could revolutionize the way teachers teach.” 

Parimala H.V., Teacher, Government Higher Primary School, Santhee Beedhi

At Parimala’s school and others in the Bengaluru area, teachers face some significant challenges. Classrooms can have up to 70 students of varying abilities. Teachers often need to prepare lessons and give instruction in both English and Kannada. As the Covid pandemic brought about remote learning on a large scale, technology began to rapidly change how teachers and students interact. Most students now have computers or smartphones, expanding teachers’ options. But it also makes it harder to keep students focused on a traditional classroom blackboard.

“These children are addicted to their mobile phones and social media. If I use the ‘chalk and talk’ method in class, they may get bored,” said Gireesh K S, who relies heavily on his blackboard to teach math and physics at Government High School, Jalige. Gireesh has used web search tools to find digital resources like interactive PowerPoint slides that will hold his students’ attention longer. With Shiksha copilot, he can zero in more quickly on videos or classroom activities that help him connect better with all 40+ students in his class.

“Here lies the teacher’s job. The teacher has to select whichever activity, whichever video, or whichever questions to use,” Gireesh said. “There are so many questions and videos (to choose from), but as a teacher for my class, I know my students. So, I have to select the suitable ones.”

Other learning platforms were less flexible and less dynamic, returning static content options that were not always useful for a diverse group of learners. Shiksha copilot, on the other hand, does a much better job of customizing and adapting its recommendations based on teacher input, Gireesh said.

“Shiksha copilot is very easy to use when compared to other AI we have tried, because it is mapped with our own syllabus and our own curriculum.”

Gireesh K S, Teacher, Government High School, Jalige

Mr. Gireesh K S posing for the camera

Behind the technology

Designing and building Shiksha copilot requires several technological innovations. Educational content is largely multimodal, including text, images, tables, videos, charts, and interactive materials, so developing engaging learning experiences calls for generative AI models with unified multimodal capabilities. These experiences are also most impactful when delivered in native languages, which requires improving the multilingual capabilities of generative AI models.

Shiksha copilot includes a range of powerful features that address those challenges and enhance the educational experience. It’s grounded in specific curricula and learning objectives, to ensure that all generated content aligns with desired educational outcomes, according to Akshay Nambi (opens in new tab), principal researcher at Microsoft Research. “This grounding is enabled by ingesting relevant data with the help of state-of-the-art optical character recognition (OCR), computer vision (CV) and generative AI models. It was also important to use natural language and support voice-based interactions while including options for English and Kannada speakers,” Nambi said. 

Shiksha copilot supports connectivity to both public and private resource content, enabling educators to tap into a vast array of materials and tailor them to their unique teaching requirements. Shiksha copilot can be accessed through different modalities, such as WhatsApp, Telegram, and web applications, enabling seamless integration with teachers’ current workflows.

To help create content more quickly and efficiently, the system leverages semantic caching with LLMs. Storing and reusing previously processed educational content reduces the computational resources required to deliver a scalable and affordable copilot experience. Throughout development, the project team followed established protocols regarding safety, reliability, and trustworthiness.

“Extensive prompt designing, testing and rigorous responsible AI procedures, including content filtering and moderation, red team assessments and jailbreaking simulations, have been deployed to maximize safety and reliability. These measures are in place so that Shiksha copilot consistently produces factual and trustworthy content,” said Tanuja Ganu, principal research SDE manager at Microsoft Research.
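
Returning to the semantic-caching idea mentioned above, the snippet below sketches the general pattern: before calling the language model, the prompt is embedded and compared against embeddings of previously answered prompts, and a sufficiently similar hit is reused. The embed_text and generate_lesson_content helpers and the similarity threshold are stand-ins for illustration only; they are not part of the actual Shiksha copilot implementation.

    import numpy as np

    def embed_text(text: str) -> np.ndarray:
        # Toy stand-in for a real sentence-embedding model: a normalized character histogram.
        vec = np.zeros(256)
        for i, ch in enumerate(text.lower()):
            vec[(ord(ch) * (i + 1)) % 256] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def generate_lesson_content(prompt: str) -> str:
        # Stand-in for the real LLM pipeline call.
        return f"[generated lesson plan for: {prompt}]"

    cache = []  # list of (embedding, response) pairs

    def cached_generate(prompt: str, threshold: float = 0.92) -> str:
        query = embed_text(prompt)
        for emb, response in cache:
            if float(np.dot(query, emb)) >= threshold:  # cosine similarity on unit vectors
                return response                         # reuse previously generated content
        response = generate_lesson_content(prompt)      # cache miss: call the LLM
        cache.append((query, response))
        return response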

Convincing the skeptics

Before the initial workshop, some teachers expressed skepticism about using AI for lesson planning. Students already have multiple digital learning tools, but for Mahalakshmi A, who teaches science in grades 4-8 at the rural Government Higher Primary School, Basavana Halli, outside Bengaluru, the value for teachers was less clear. During a two-hour initial workshop session, however, Mahalakshmi found she could easily create multiple lesson plans with Shiksha copilot that would work well in her classroom.

Ms. Mahalakshmi standing in front of a classroom

“I felt very happy because it’s a totally different concept. Before now, I could see that technology could work for the students. But this is the first time that it felt like the teachers also had a tool for themselves.”

Mahalakshmi A., Teacher, Government Higher Primary School, Basavana Halli

Mahalakshmi could also see how the content assembled using Shiksha copilot would make her class more interesting for her students, which is an important goal. “Instead of giving them the same problems, the same experiments, and the same videos, we make learning interesting. And then they learn what we call shashwatha kalike, or permanent learning. With Shiksha copilot, we can make that permanent learning happen in our classroom,” she added.

Next steps

The initial pilot program for Shiksha copilot is underway at more than 10 schools in and around Bengaluru. The goal is to let the teachers experience how Shiksha copilot can best be used in their daily workflows to improve learning experiences and collect feedback. The early response has been highly positive, with teachers expressing great satisfaction in both the quality of the content generated and the time savings. To build on this successful pilot, researchers are gearing up to scale Shiksha copilot in schools across the state of Karnataka and beyond, in collaboration with Sikshana Foundation.

This copilot is being developed as part of Project VeLLM (Universal Empowerment with Large Language Models) at Microsoft Research India. VeLLM’s goal is to make inclusive and accessible copilots available to everyone by building a platform for developing population-scale copilots. Inclusive copilots must address various real-world challenges, such as a multilingual user base, varied skillsets, limited devices and connectivity, domain-specific understanding, guardrails, and safety principles. Shiksha is the first copilot developed using the VeLLM platform. The VeLLM team is working with collaborators across diverse domains, such as agriculture and healthcare, to develop tailored domain-specific copilot experiences utilizing the platform and addressing associated research problems. 

To learn more about the project or collaboration opportunities, email the team at shikshacopilot@microsoft.com

The Shiksha copilot team and collaborators (from left to right): Meena Elapulli (Microsoft Research), Ishaan Watts (Microsoft Research), Kavyansh Chourasia (Microsoft Research), Gireesh K.S. (GHPS, Tumkur), Srujana V S (Microsoft Research), Tanuja Ganu (Microsoft Research), Mahalakshmi A (GHPS, Basavana Halli), Parimala H.V. (GHPS, Santhe Beedi), Ravi R (GHPS, Gowdahalli), Maruthi K.R. (GHPS, Anedoddi), Smitha Venkatesh (Sikshana Foundation), Akshay Nambi (Microsoft Research), Somnath Kumar (Microsoft Research), Yash Gadhia (Microsoft Research), Sanchit Gupta (Microsoft Research)


Data Formulator: A concept-driven, AI-powered approach to data visualization

This research paper was presented at the IEEE Visualization Conference (opens in new tab) (VIS 2023), the premier forum for advances in visualization and visual analytics.


Effective data visualization plays a crucial role in data analysis. It enables data analysts and others to explore complex datasets, comprehend patterns, and convey meaningful insights to various stakeholders. Today, there are numerous tools for creating visual representations of data. However, these tools only work with tidy data, meaning that data points must be organized according to the specific categories required by the tool’s visualization format. This poses significant challenges for data analysts, requiring the use of additional tools to transform raw data into a compatible format before it is entered into one of these visualization tools.

For instance, consider a dataset displaying 2020 temperatures in Seattle and Atlanta. If an analyst aims to create a scatter plot comparing the temperatures of these two US cities on the x/y-axes, data transformation is essential. The visualization tool mandates separate columns for Seattle and Atlanta temperatures to map to the scatter plot’s axes. Consequently, the analyst must pivot the input table to generate these columns. Moreover, if the analyst intends to compare which city experiences warmer days or create a smoothed line chart illustrating Seattle’s 7-day moving average temperature, further computations on the transformed data are necessary. Fields like “Warmer” and “Seattle 7-day Moving Avg” need to be calculated to facilitate the visualization, as depicted in Figure 1. This intricate process highlights the complexity and expertise currently needed to prepare raw data for effective visualization.

Figure 1. A data analyst wants to compare 2020 temperatures in Seattle and Atlanta using visualizations like scatter plots and histograms. However, the original dataset lacks necessary columns (“Seattle Temp,” “Atlanta Temp,” “Warmer,” and “Seattle Temp Moving Average”) for these visualizations. Data transformation is needed to include these fields.

This hurdle is particularly daunting because it necessitates a certain level of programming expertise or familiarity with additional data processing tools. It highlights the complexities of data visualization and underscores the need for an easier and more seamless process for data analysts, enabling them to create impactful visualizations regardless of their technical background.
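
To make the kind of transformation described above concrete, here is a small pandas sketch of the reshaping and derived columns an analyst would otherwise produce by hand, assuming a tidy input table with Date, City, and Temperature columns and a few made-up values.

    import pandas as pd

    # Tidy input: one row per (Date, City) pair.
    df = pd.DataFrame({
        "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
        "City": ["Seattle", "Atlanta", "Seattle", "Atlanta"],
        "Temperature": [51, 45, 45, 47],
    })

    # Pivot so each city's temperature becomes its own column for the scatter plot.
    wide = df.pivot(index="Date", columns="City", values="Temperature").reset_index()
    wide = wide.rename(columns={"Seattle": "Seattle Temp", "Atlanta": "Atlanta Temp"})

    # Derived field: which city was warmer on each day.
    wide["Warmer"] = wide.apply(
        lambda r: "Seattle" if r["Seattle Temp"] > r["Atlanta Temp"]
        else "Atlanta" if r["Atlanta Temp"] > r["Seattle Temp"] else "Same",
        axis=1)

    # Derived field: Seattle's 7-day moving average (this toy table only has 2 days).
    wide["Seattle 7-day Moving Avg"] = (
        wide["Seattle Temp"].rolling(window=7, min_periods=1).mean())

    print(wide)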

Against the backdrop of rapid advancements in large language models (LLMs) and programming-by-example techniques, researchers have made significant strides in breaking down these barriers. In this context, we share our paper, “Data Formulator: AI-powered Concept-driven Visualization Authoring (opens in new tab),” presented at VIS 2023 (opens in new tab) and winner of the Best Paper Honorable Mention (opens in new tab) award. Data Formulator is an AI-powered visualization authoring tool developed through a collaboration between researchers studying AI and those studying human-computer interaction (HCI). The result is a new visualization paradigm that separates high-level visualization intents from low-level data transformation steps. The process begins with data analysts articulating their visualization ideas as data concepts. These concepts refer to specific data categories, or fields, that analysts want to visualize, even though they are not present in the raw input data. This way, they effectively convey their visualization intent with the AI agent, which, in turn, assists them in implementing their visualization.

Defining data concepts and creating visualizations

The way Data Formulator operates is straightforward. The analyst defines the specific data concepts they plan to visualize, either through natural language queries or by providing categories or example entries for the concept. Once these concepts are defined, they are linked to appropriate visual representations, as illustrated in Figure 2.

Figure 2. The Data Formulator user interface. Data Formulator has four panels: (1) the Concept Shelf, for defining new data concepts to be visualized, (2) the Chart Builder, for specifying the visualization type, (3) the Table View, for analysts to inspect data automatically generated by Data Formulator, and (4) the Visualization Panel, for presenting final visualizations.

If the analyst defines concepts through examples, Data Formulator engages a program synthesizer, which generates a specialized data reshaping program, transforming the provided data to bring out the required data fields. Conversely, when an analyst introduces a new concept using natural language queries, Data Formulator calls on LLMs to generate code, which facilitates the creation of a new data category based on the provided description. In both cases, Data Formulator compiles the transformed data into a structured table and creates corresponding visualizations.

We recognize that analyst specifications can be ambiguous, so we designed Data Formulator to generate multiple visualization options to help them identify what they want. The tool also provides analysts with the AI-generated transformation program and the transformed data for inspection. This transparency helps analysts refine their intent for future iterations.

Continuing our Seattle/Atlanta temperatures example, the following two figures show how analysts can use Data Formulator to create visualizations without reformatting the raw data in an external tool. The analyst provides example entries in the form of temperature values to create the new data concepts “Seattle Temp” and “Atlanta Temp,” shown in Figure 3. The analyst then uses a natural language query to create the new concept “Warmer” and instructs Data Formulator to format the data so that it can be visualized, shown in Figure 4.

Figure 3. The analyst creates new data concepts “Atlanta Temp” and “Seattle Temp” using examples. The AI agent solves a programming-by-example problem to create the new concepts for visualization.
Figure 4. The analyst creates a new data concept “Warmer” using a natural language description. Data Formulator calls LLMs to generate a transformation program to derive the new concept.

Looking ahead: Analyst-AI collaboration in data analysis

AI-powered data analysis tools have the potential to significantly streamline the entire data analysis process by consolidating various tasks into a single tool. Beyond just visualization, this concept-driven technique can be applied to data cleaning, data integration, visual data exploration, and visual storytelling. Our vision is for an AI system to take high-level instruction from the user and automatically recommend the necessary steps across the entire data analysis pipeline, enabling collaboration between the user and the AI agent to achieve their data visualization goals.

Inevitably, data analysts will need to tackle more complex tasks beyond the scope mentioned here. For this reason, it’s crucial to consider how to design AI-powered tools that effectively convey to the analyst results that are uncertain, ambiguous, or incorrect. This ensures that the analyst can trust the tool and collaborate effectively with the AI to accomplish their objectives.


Project Silica: Sustainable cloud archival storage in glass

This research paper was presented at the 29th ACM Symposium on Operating Systems Principles (opens in new tab) (SOSP 2023), the premier forum for the theory and practice of computer systems software.

SOSP 2023
Project Silica: Towards Sustainable Cloud Archival Storage in Glass

Data growth demands a sustainable archival solution

For millennia, data has woven itself into every facet of our lives, from business and academia to personal spheres. Our production of data is staggering, encompassing personal photos, medical records, financial data, scientific insights, and more. By 2025, it’s estimated that we will generate a massive 175 zettabytes of data annually. Amidst this deluge, a substantial portion is vital for preserving our collective heritage and personal histories.  

Presently, magnetic technologies like tape and hard disk drives provide the most economical storage, but they come with limitations. Magnetic media lacks the longevity and durability essential for enduring archival storage, requiring data to be periodically migrated to new media—for hard disk drives, this is every five years, for magnetic tape, it’s around ten. Moreover, ensuring data longevity on magnetic media requires regular “scrubbing,” a process involving reading data to identify corruption and fixing any errors. This leads to substantial energy consumption. We need a sustainable solution, one that ensures the preservation of our digital heritage without imposing an ongoing environmental and financial burden.

Project Silica: Sustainable and durable cloud archival storage

Our paper, “Project Silica: Towards Sustainable Cloud Archival Storage in Glass (opens in new tab),” presented at SOSP 2023 (opens in new tab), describes Project Silica, a cloud-based storage system underpinned by quartz glass. This type of glass is a durable, chemically inert, resilient, and low-cost medium, impervious to electromagnetic interference. Because data written in quartz glass can last thousands of years, it is ideal for archival storage, offering a sustainable solution and eliminating the need for periodic data refreshes.

Writing, reading, and decoding data

Ultrafast femtosecond lasers enable the writing process. Data is written inside a square glass platter similar in size to a DVD through voxels: permanent modifications to the physical structure of the glass made using femtosecond-scale laser pulses. Voxels encode multiple bits of data and are written in 2D layers across the XY plane. Hundreds of these layers are then stacked along the Z axis. To achieve high write throughput, we rapidly scan the laser pulses across the length of the medium using a scanner similar to that used in barcode readers.

To read data, we employ polarization microscopy to image the platter. The read drive scans sectors in a single swift Z-pattern, and the resulting images are processed for decoding. Different read drive options offer varying throughput, balancing cost and performance.

Data decoding relies on ML models that analyze images captured by the read drive, accurately converting signals from analog to digital. The glass library design includes independent read, write, and storage racks. Platters are stored in power-free storage racks and moved by free-roaming shuttles, ensuring minimal resource consumption for passive storage, as shown in Video 1. A one-way system between write racks and the rest of the library ensures that a written platter cannot be over-written under any circumstances, enforcing data integrity.

Video 1. The Silica library prototype demonstrates the flexible and scalable design of the system and its ability to sustainably service archival workloads. 

Azure workload analysis informs Silica’s design

To build an optimal storage system around the core Silica technology, we extensively studied cloud archival data workloads from Azure Storage. Surprisingly, we discovered that small read requests dominate the read workload, yet a small percentage of requests constitute the majority of read bytes, creating a skewed distribution, as illustrated in Figure 1.
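
As a simple illustration of how this kind of skew can be measured in a request trace, the sketch below buckets a synthetic log of read sizes and compares each bucket's share of operations with its share of bytes read. The sizes are made up for illustration; they are not Azure data.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # Synthetic request sizes in bytes: many small reads, a few very large ones.
    sizes = np.concatenate([
        rng.integers(1_000, 4_000_000, size=9_000),          # small requests dominate counts
        rng.integers(256_000_000, 2_000_000_000, size=100),  # rare large requests dominate bytes
    ])

    buckets = pd.cut(sizes,
                     bins=[0, 4e6, 256e6, np.inf],
                     labels=["<4MB", "4MB-256MB", ">256MB"])
    trace = pd.DataFrame({"bucket": buckets, "bytes": sizes})

    summary = trace.groupby("bucket", observed=True).agg(
        share_of_ops=("bytes", lambda s: len(s) / len(trace)),
        share_of_bytes=("bytes", lambda s: s.sum() / trace["bytes"].sum()))
    print(summary)  # the small-file bucket holds most operations but few of the bytes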

Figure 1. The distribution of read request sizes. Most requests are for small files, but they make up a small percentage of the total load in bytes.

This implies that minimizing the latency of mechanical movement in the library is crucial for optimal performance. Silica glass, a random-seeking storage medium, can suitably meet these requirements as it eliminates the necessity for spooling, unlike magnetic tape. Figure 2 illustrates substantial differences in read demand across various datacenters. These results suggest that we need a flexible library design that can scale resources for each datacenter’s workload. Studying these archival workloads has been instrumental in helping us establish the core design principles for the Silica storage system.

Figure 2. Tail over median read load for different datacenters. The data shows significant variation across and within datacenters.

Project Silica’s versatile storage system

We designed and evaluated a comprehensive storage system that manages error correction, data layout, request scheduling, and shuttle traffic management. Our design effectively manages IOPS-intensive tasks, meeting the expected service level objective (SLO) of an archival storage tier, approximately 15 hours. Interestingly, even in volume-intensive scenarios where a large number of bytes are read, our system efficiently handles requests using read drives with low throughput. In both cases, throughput demands are significantly below those of traditional tape drives. This is shown in Figure 3. The paper provides an extensive description of this system, and the video above shows our prototype library’s capabilities. 

Figure 3. Volume and IOPS workloads represent different extremes in the spectrum of read workloads. Our design can service both workloads well within the expected SLO for an archival storage tier, at about 15 hours.

Diverse applications for sustainably archiving humanity’s data

Project Silica holds promise in numerous sectors, such as healthcare, scientific research, and finance, where secure and durable archival storage of sensitive data is crucial. Research institutions could benefit from Silica’s ability to store vast datasets generated from experiments and simulations, ensuring the integrity and accessibility of research findings over time. Similarly, healthcare organizations could securely archive patient records, medical imaging data, and research outcomes for long-term reference and analysis. 

As the volume of globally generated data grows, traditional storage solutions will continue to face challenges in terms of scalability, energy-efficiency, and long-term durability. Moreover, as technologies like AI and advanced analytics progress, the need for reliable and accessible archival data will continue to intensify. Project Silica is well-positioned to play a pivotal role in supporting these technologies by providing a stable, secure, and sustainable repository for the vast amounts of data we create and rely on.


Research Focus: Week of October 23, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


NEW RESEARCH

Kosmos-2.5: A Multimodal Literate Model 

Current large language models (LLMs) primarily focus on textual information and cannot understand visual information. However, advancements in the field of multimodal large language models (MLLMs) aim to address this limitation. MLLMs combine visual and textual information within a single Transformer-based model, enabling the model to learn and generate content based on both modalities.

While existing MLLMs have mainly focused on natural images with lower resolutions, the exploration of text images requires further investigation. Incorporating text images into the training process and developing models based on textual and visual information can unlock new possibilities for multimodal applications involving high-resolution text-intensive images.

In a new paper: Kosmos-2.5: A Multimodal Literate Model, researchers from Microsoft present Kosmos-2.5, an MLLM for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels at: (1) generating spatially aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in markdown format. The model can be adapted to any text-intensive image understanding task with different prompts through supervised fine-tuning. This work paves the way for future scaling of MLLMs.

Microsoft Research Podcast

AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma

Emre Kiciman and Amit Sharma discuss their paper “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality” and how it examines the causal capabilities of large language models (LLMs) and their implications.


NEW RESEARCH

Evaluation of Dependency Structure for Multivariate Weather Predictors using Copulas

In the Global South (opens in new tab), climate change is driving more frequent and severe weather events such as droughts, floods, and storms. This leads to crop failures, food insecurity, and job loss. These effects are expected to increase in intensity, further disadvantaging marginalized communities and exacerbating existing inequalities. The need for prevention and adaptation is urgent. But despite advances in machine learning and numerical modeling, accurate weather forecasting remains challenging, due to complex interactions among atmospheric and oceanic variables.

In a new paper: Evaluation of Dependency Structure for Multivariate Weather Predictors using Copulas, researchers from Microsoft explore the potential of vine copulas to explain complex relationships of different weather variables in three African locations. Copulas separate marginal distributions from the dependency structure, offering a flexible way to model dependence between random variables for improved risk assessments and simulations. Vine copulas are based on a variety of bivariate copulas, including Gaussian, Student’s t, Clayton, Gumbel, and Frank copulas. They are effective in high-dimensional problems and offer a hierarchy of trees to express conditional dependence. The researchers propose applying this framework within subseasonal forecasting models to enhance the prediction of different weather events or variables.
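
To make the core idea of separating marginals from dependence concrete, the sketch below fits a plain bivariate Gaussian copula to two synthetic weather variables by rank-transforming each to pseudo-observations and estimating the correlation of their normal scores. This is only a toy illustration of the copula concept; the paper works with vine copulas built from several bivariate families, which this sketch does not implement.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic daily humidity and rainfall with some dependence (illustrative only).
    humidity = rng.normal(70, 10, size=365)
    rainfall = np.exp(0.05 * humidity + rng.normal(0, 0.5, size=365))

    def pseudo_observations(x):
        """Rank-transform a sample to (0, 1), removing its marginal distribution."""
        return stats.rankdata(x) / (len(x) + 1)

    u = pseudo_observations(humidity)
    v = pseudo_observations(rainfall)

    # Gaussian copula: the correlation of the normal scores captures the dependence
    # structure independently of the marginals.
    z_u, z_v = stats.norm.ppf(u), stats.norm.ppf(v)
    rho = np.corrcoef(z_u, z_v)[0, 1]
    print(f"estimated copula correlation: {rho:.2f}")

    # Simulate new dependent uniforms from the fitted copula.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=1000)
    simulated_u, simulated_v = stats.norm.cdf(z[:, 0]), stats.norm.cdf(z[:, 1])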


NEW RESEARCH

Adaptive Training System

Adaptive training has been defined as training in which the problem, stimulus, or task is varied as a function of how well the trainee performs. Researchers have shown that this type of training outperforms comparative training that is non-adaptive or fixed across a range of populations and learning contexts. Virtual reality offers new opportunities for applying this type of training and has already demonstrated its effectiveness (opens in new tab) across a variety of simulated tasks. By using a computational model of the training process, we can derive recommendations for optimal scenario difficulty, resulting in faster and enhanced training.

In a new paper: Adaptive Training System, researchers from Microsoft propose an adaptive training algorithm that accelerates the training process based on a parametric model of trainees and training scenarios. The proposed approach makes trial-by-trial recommendations on optimal scenario difficulty selections to maximize improvements in the trainee’s absolute skill level. The Adaptive Training System is applied to the task of training pilots on a virtual reality flight simulator. The system was designed for scenarios varying in difficulty from easy, with full visibility, to flight in fog with side wind, which is difficult even for experienced pilots. 

Adaptive Training System applied to the task of training pilots on a virtual reality flight simulator. On the left, a flight scenario with fog. On the right, a flight scenario with full visibility.
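
The sketch below illustrates the general shape of such a trial-by-trial adaptive loop with a deliberately simple stand-in model: trainee skill is a single number, success probability follows a logistic curve in the difference between skill and difficulty, and the next scenario is the one whose predicted success rate is closest to a target. The actual Adaptive Training System uses a parametric model of trainees and scenarios that this toy version does not attempt to reproduce.

    import numpy as np

    rng = np.random.default_rng(0)

    def p_success(skill: float, difficulty: float) -> float:
        return 1.0 / (1.0 + np.exp(-(skill - difficulty)))  # logistic performance model

    def choose_difficulty(skill_estimate: float, levels: np.ndarray,
                          target: float = 0.7) -> float:
        # Pick the level whose predicted success rate is closest to the target.
        preds = np.array([p_success(skill_estimate, d) for d in levels])
        return float(levels[np.argmin(np.abs(preds - target))])

    levels = np.linspace(-2.0, 4.0, 13)  # easy (full visibility) ... hard (fog, side wind)
    true_skill, skill_estimate, lr = 1.5, 0.0, 0.3

    for trial in range(20):
        d = choose_difficulty(skill_estimate, levels)
        outcome = rng.random() < p_success(true_skill, d)   # simulated trainee
        # Update the skill estimate toward the observed outcome.
        skill_estimate += lr * (float(outcome) - p_success(skill_estimate, d))
        true_skill += 0.05 * float(outcome)                  # practice improves skill
    print(f"final skill estimate: {skill_estimate:.2f}")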

NEW RESEARCH

CodePlan: Repository-level Coding using LLMs and Planning

Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. These activities are formulated as repository-level coding tasks.

Large language model-powered coding assistants, like GitHub Copilot, have succeeded in offering high-quality solutions to localized coding problems. But repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is interdependent and the entire repository may be too large to fit into the prompt.

In a new paper: CodePlan: Repository-level Coding using LLMs and Planning, researchers from Microsoft frame LLM-driven repository-level coding as a planning problem, where the goal is to take the repository from its initial state to a target state whose specifications are provided in natural language. They present CodePlan, a task-agnostic framework that solves this problem by synthesizing a multi-step chain of edits, where each step results in a call to an LLM on a code location, with context derived from the entire repository, previous code changes, and task-specific instructions. The research evaluates the effectiveness of CodePlan on two repository-level tasks, package migration (C#) and temporal code edits (Python), and shows that CodePlan aligns more closely with the ground truth than baselines do.
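
The pseudocode-style sketch below conveys the general planning loop suggested by this description: keep a frontier of code locations to edit, ask an LLM to edit each one with repository-derived context, and use dependency analysis on the resulting change to discover further locations. The helper functions and the repo object are hypothetical placeholders; this is not the CodePlan implementation.

    from collections import deque

    # Hypothetical helpers standing in for real components:
    #   build_context(repo, loc, task)  -> prompt context from the repository and prior edits
    #   llm_edit(context)               -> edited code for that location
    #   dependents_of(repo, loc, edit)  -> locations whose code is affected by the edit
    def build_context(repo, loc, task): ...
    def llm_edit(context): ...
    def dependents_of(repo, loc, edit): ...

    def repository_level_edit(repo, seed_locations, task_instructions):
        """Toy multi-step edit chain over a repository (illustrative only)."""
        frontier = deque(seed_locations)
        completed = set()
        while frontier:
            loc = frontier.popleft()
            if loc in completed:
                continue
            context = build_context(repo, loc, task_instructions)
            edit = llm_edit(context)   # one LLM call per code location
            repo.apply(loc, edit)      # assumes the repo object tracks changes
            completed.add(loc)
            for dep in dependents_of(repo, loc, edit):
                if dep not in completed:
                    frontier.append(dep)  # propagate the change through dependencies
        return completed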


NEW ARTICLE

The intimacy triple bind: Structural inequalities and relational labor in the influencer industry

Social media content creators, or influencers, depend heavily on their ability to cultivate and maintain an invested audience-community. They are encouraged to practice “relational labor,” commodifying their personalities, lives and tastes in order to build authentic self-brands and intimacy with audiences.

In a new article (opens in new tab), a researcher from Microsoft draws on an ethnographic study of the London influencer industry to examine relational labor through an intersectional feminist lens, exploring the ways in which structural inequalities shape relationships between creators and their audiences. Managing audience relationships is harder for marginalized creators – especially those making stigmatized and less brandable content genres – who are at higher risk of trolling and harassment.

This article explores four key tactics for managing such conditions: (1) leaning into making rather than being content; (2) (dis)engaging with anti-fans through silence; (3) retreating into private community spaces, away from the exposure of public platforms; and, in parallel, (4) turning off public comments.



Abstracts: October 23, 2023

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Andy Gordon, a Partner Research Manager, and Carina Negreanu, a Senior Researcher, both at Microsoft Research, join host Dr. Gretchen Huizinga to discuss “Co-audit: Tools to help humans double-check AI-generated content.” This paper brings together current understanding of generative AI performance to explore the need and context for tools to help people using the technology find and fix mistakes in AI output.

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. Today, I’m talking to Dr. Andy Gordon, a Partner Research Manager, and Dr. Carina Negreanu, a Senior Researcher, both at Microsoft Research. Doctors Gordon and Negreanu are co-editors of a paper called “Co-audit: Tools to help humans double-check AI-generated content,” and you can read a preprint of this paper now on arXiv. Andy Gordon, Carina Negreanu, thanks for joining us on Abstracts!


ANDY GORDON: Great to be here.

CARINA NEGREANU: Likewise.

HUIZINGA: Let’s start with you, Andy. In a few sentences, describe the issue or problem your paper addresses and why people should care about it.

GORDON: Well, generative AI is amazing. Things like Bing Chat or ChatGPT, all these things powered by large language models. Totally amazing. But it’s really important for everyone to remember that these AIs can make mistakes. For example, you ask when your favorite actor got married, and the model says the year but gets it wrong. Or you ask for some Python code, and it works on positive numbers, but occasionally you give it negative numbers and it goes wrong. Another example, you get a summary of some text. It’s great but unfortunately misses one of the important points. Or thinking about images, you ask for a portrait of a character from the AI and there’s some glitch, and it produces a hand with six fingers. So as users, we need to get into the habit of carefully checking AI outputs for mistakes. And we refer to that as “audit” in a sense of a systematic review. Coming to the paper, it’s about what we call co-audit. And that’s our term for any tool support that helps the human audit the AI output. And some examples of co-audit are tools that can help check for hallucinations, like when the actor’s date of birth is wrong, or to check Python code to find some errors or show how a summary has been constructed to help people find errors.

HUIZINGA: Carina, let’s talk to you. What related research does this paper build on, and how does your work add to it?

NEGREANU: So there was no direct work on the co-audit brand before us. We’re just introducing it. But there has been a lot of research that either motivates the need for co-audit or provides relevant framing for it or even like early examples of what we start thinking of co-audit. So as you’re probably aware, there has been a really great effort in the last years to assess the quality of generations by large language models across a multitude, really, of tasks. And currently we use this body of work as motivation for our research. It basically shows there really is a need for this kind of work. And we hope that in the future, we can also use it to benchmark co-audit tools that we are going to produce in our wider community. But the idea of dealing with errors has been a key part of research on human-AI interaction for ages. And there have been some really cool guidelines that came out recently, especially from Amershi in 2019, on human-AI interactions that are concerned with this part of the world. And more recently, Glassman had a really cool paper about conversational frameworks for human-AI and communication and basically links these concepts to psychology. And in our work, as you can read in our paper, we are trying to basically frame co-audit within her framework, and we find that it’s a natural fit. But before we started defining formally co-audit and building this paper, our group has built co-audit tools in the co-generation space. One such tool is GAM, which is grounded abstraction matching, where we basically help users learn how to effectively communicate with large language models so that they both understand what the large language model understands they’re asking and also get good feedback back. We also built ColDeco, which is a spreadsheet tool for inspecting and verifying calculated columns without the user requiring to view the underlying code produced by the large language models. But really, any tool that focuses on debugging or basically getting information back from human-generated content is useful here. So even tools that are like early debugging tools like FxD are very important here as we learn how people use these kinds of tools and we try to basically apply the same concepts in the context of LLM-generated content. So basically, we are building on top of work that helps understand the needs and challenges that end-user programmers have when working in this space and trying to extrapolate them to co-auditing tools for LLM-generated content.

HUIZINGA: Well, Andy, how would you describe the research approach you used or your methodology for this paper, and how did it come about?

GORDON: Great question, Gretchen, and it was actually quite an unusual methodology for us. So as Carina says, we’ve been looking at co-audit in the very specific setting of spreadsheet computations, and we began to realize that co-audit was really important for any kind of AI-generated output, and we started to see other people doing research that was doing the same sort of thing we were doing but in different settings. So, for example, there was a paper where they were generating bits of Python and deliberately showing multiple pieces of code after they’d been generated to kind of nudge the human user to make a decision about which one was better. I mean, it’s really important to get people to think about the outputs, and this was a nice trick. So we thought, look, this is actually quite an important problem, and MSR (Microsoft Research) should step up and sort of gather people. So we organized a workshop inside Microsoft in the spring and got folks together to share their perspectives on co-audit. And since then, we’ve reflected on those discussions and tried to pull them together in a more coherent form than the whiteboards and sticky notes we produced back then. And that’s produced this paper. I think one of the key things we learned in that process, which we hadn’t been thinking about before, was that co-audit really complements prompt engineering. So you hear a lot about prompt engineering, and it’s the first part of what we call the prompt-response-audit loop. And this is related to what Carina was saying about Elena Glassman’s work on human-AI interaction. So the first step is you formulate a prompt. For example, you ask for Python code. That’s the first step. The second step is we wait for the response from the AI. And then the third step is that we need to inspect the response—that’s the audit part—decide if it meets our needs or if there is a mistake, and if that’s the case, we need to go around the loop again. So that’s the prompt-response-audit loop. And prompt engineering is the tools and techniques that you use in that first step to create the prompt. So, for example, some tools will automatically include a data context in a prompt if you’re trying to create some Python to apply to a table in a spreadsheet or something like that. And then dually, co-audit is the tools and techniques we have to help the human audit the response in the third step of this loop. And that’s like these tools I’ve been mentioning that show maybe two or three candidates for the code that’s to be used.
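To make the loop Andy describes concrete, here is a minimal, hypothetical sketch in Python of the prompt-response-audit cycle. The function names (build_prompt, call_model, audit_response) and the overall structure are illustrative assumptions, not part of the paper or any specific tool; the sketch simply shows prompt engineering feeding step one and a co-audit check gating step three.

```python
# A minimal, hypothetical sketch of the prompt-response-audit loop described above.
# build_prompt, call_model, and audit_response are placeholder names (assumptions),
# standing in for prompt engineering, the model call, and a co-audit step.

def build_prompt(task: str, context: str) -> str:
    """Prompt engineering (step 1): combine the user's task with relevant data context."""
    return f"{context}\n\nTask: {task}"

def call_model(prompt: str) -> str:
    """Step 2: send the prompt to a language model and return its response.
    Stubbed here; replace with a call to the model of your choice."""
    return "<model response for: " + prompt[:40] + "...>"

def audit_response(response: str) -> list[str]:
    """Step 3 (co-audit): return a list of issues found in the response.
    A real co-audit tool might check citations, run tests on generated code,
    or flag passages that are likely to be wrong."""
    return []

def prompt_response_audit_loop(task: str, context: str, max_rounds: int = 3) -> str:
    """Repeat prompt -> response -> audit until the audit passes or we give up."""
    response = ""
    for _ in range(max_rounds):
        prompt = build_prompt(task, context)       # formulate the prompt
        response = call_model(prompt)              # wait for the response
        issues = audit_response(response)          # inspect the response
        if not issues:
            return response                        # no mistakes found: accept it
        # Fold the audit findings back into the next prompt and repeat the loop.
        context += "\nKnown issues: " + "; ".join(issues)
    return response
```

In this framing, prompt-engineering tools improve what goes into build_prompt, while co-audit tools improve audit_response, which is exactly the complementarity described above.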

HUIZINGA: Carina, let’s move over to what kinds of things you came away with. Your takeaways or your findings from this workshop. Talk about that and how you chose to articulate them in the paper.

NEGREANU: So as part of our research, we found that basically one co-audit tool does not fit all needs, which in a way was great because we have a bigger field to explore, but in other ways a bit daunting, as it means you have to think of many things. And one thing that really came to light was that even though we can’t, you know, build something that fits everything, we can build a set of principles that we think are important. So really, we wrote our paper around the 10 principles we identified from the workshop, and we are trying to promote them as things people should think about when they start on the journey of building co-auditing tools. One example is that we really think we should ground outputs, for example by citing reliable sources, similar to what Bing Chat does today. We think that’s a really valuable, important principle that people should follow, and they should think about what it means in the context of their co-auditing tool. In the case of Bing, it’s quite simple, as it’s factual references, but if it becomes referencing code, that gets trickier but still super interesting going forward. We also propose that co-auditing tools should be able to prioritize the user’s attention on the most likely errors, as we need to be mindful of the user’s cognitive effort and keep a positive cost-benefit balance. Basically, if we flood users with errors and flags, it might be too much, and adoption might be quite difficult going forward. And finally, this is something that’s really core to our research area in spreadsheets: it’s about thinking beyond text. We know visuals are so important in how we explain things and in how we teach in schools and universities. So how do we include them in the co-auditing process going forward? I think that’s going to be a really interesting challenge, and we hope we’re going to see some interesting work in that space.

HUIZINGA: Yeah. Well, principles are one thing, Andy, but how does this paper contribute to real-world impact? We talked about that a bit at the beginning. Who benefits most from this tool?

GORDON: That is a great question, Gretchen, and actually that was a question we talked about at the workshop. We think that some application areas are going to benefit more than others. So co-audit really matters when correctness really matters and when mistakes have bad consequences, so in terms of application area, that’s areas like finance or technology development or medicine. But you asked particularly about who, and we think some people will benefit more from co-audit than others. And we found this really striking example, I guess it’s an anecdotal example someone posted on social media. A professor was teaching a class using generative AI tools for the first time to generate code, and he found some evidence that people who have low self-confidence with computers can be intimidated by generative AI. So he found that some of the class were really confident users, and they would ask it to, you know, generate some Python to do such and such, and it would come back with code with a bunch of mistakes in it. And the confident users were happy just to swat that away; they were even a little arrogant about it, saying things like, this is a stupid computer. But, Gretchen, he found that a lot of his students who were less confident with computers were quite intimidated, because the AI was very confidently saying, oh look, all this code is going to work. And they kind of got stuck, and some of them were scrolling around through this code, trying to understand how it worked, when in fact it was just really broken. So he thought this was pretty bad, that these able students who were just less confident were being intimidated and were making less good use of the generative AI. Now that is an anecdote from social media from a reputable professor, but we looked into it, and there are peer-reviewed studies in the literature that show a similar effect. So I’d say we need co-audit tools that will encourage these less confident users to question when the AI is mistaken rather than getting stuck, and I think otherwise they’re not going to see the benefits of generative AI.

HUIZINGA: Well, Carina, sometimes I like to boil things down to a nugget or a beautiful takeaway. So if there’s one thing you want our listeners to take away from this work, this paper, what would it be?

NEGREANU: I think what this study has taught us is that we really need significantly more research. A good co-auditing experience can really be the element that makes or breaks how we incorporate LLMs safely into our day-to-day lives. But to make this happen, we need people from across the field working towards the same goal. It’s really interdisciplinary work, and I don’t think we can do it by working in isolated groups, as we are currently doing. So I would urge our listeners to think about how they could contribute in this space and to reach out to us with feedback and questions. We are more than open to collaboration. Really, we are just starting this journey, and we’d love to see this area become a research priority going forward in 2024.

HUIZINGA: Well, Andy, as an opportunity to give some specificity to Carina’s call for help, what potential pitfalls have you already identified that represent ongoing research challenges? And what’s next on your, and potentially others’, research agenda in this field?

GORDON: Well, one point, and I think Carina made this, is that co-audit techniques will themselves never be perfect. I mean, we’re saying that language models are never going to be perfect; mistakes will come through. But the co-audit techniques themselves won’t be perfect either. So sometimes a user who is using the tools will still miss some mistakes. For example, at the workshop, we thought about security questions and co-audit tools themselves. We were thinking, for instance, about deliberate attacks on a generative AI. There are various techniques that people are talking about at the moment where you might poison the inputs that generative AI models pick up on. And in principle, co-audit tools could help users realize that there are deliberate mistakes that have been engineered by the attacker. So that’s good. But on the other hand, security always becomes an arms race. So if we did have a good tool that could detect those kinds of mistakes, attackers would then start to engineer around the co-audit tools, trying to make them less effective. So that will be an ongoing problem, I think. We’ll also find that if co-audit tools give too many warnings, users will start to ignore them, and there’ll be a sort of under-reliance on co-audit tools. And of course, if we give too few, users will miss the mistakes. So an interesting balance needs to be struck. And also, we don’t expect there’s going to be one overarching co-audit experience; we think there’ll be many different realizations. And so, as Carina says, we hope that common lessons can be learned, and that’s why we want to keep documenting this space and building a research community. So I echo what Carina was saying: if you’re listening and you think that what you’re working on is co-audit, do reach out.

HUIZINGA: Well, Andy Gordon, Carina Negreanu, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper and this research, you can find a link at aka.ms/abstracts, or you can read the preprint on arXiv. See you next time on Abstracts!
