Members of the research community at Microsoft work continuously to advance their respective fields. *Abstracts* brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce *Deep Language Networks*, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

## Transcript

[MUSIC PLAYS]

**GRETCHEN HUIZINGA:** Welcome to *Abstracts*, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a *podcast abstract*, of their new and noteworthy papers.

Today I’m talking to Dr. Alessandro Sordoni, a Principal Researcher from Microsoft Research. Dr. Sordoni is coauthor of a paper titled “Joint Prompt Optimization of Stacked LLMs using Variational Inference,” and this paper, which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS, is available now on arXiv. Alessandro, thanks for joining us on *Abstracts*!

**ALESSANDRO SORDONI:** Hi, Gretchen, thank you for having me.

**HUIZINGA:** So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.

**SORDONI:** So in this paper, our starting point is large language models, and one of the ways currently used to make large language models solve tasks is to prompt them. Prompting just means giving them an instruction, and hopefully, by joining the instruction and the input of the task, the language model can solve the task following the rules specified in the instruction. And there have been some approaches already in the literature to actually infer what that instruction is without human intervention. And in this paper, we operate in that space, which is called kind of automatic prompt engineering. And our specific problem is, one, how to actually infer those prompts for a language model. And, two, what happens if the output of that large language model actually gets fed into another language model and both language models need prompts to operate? And so basically, we give sort of an algorithm to solve that joint prompt optimization. That’s why it’s called joint.

**HUIZINGA:** So what’s the underlying issue there that we should care about as potential users of this technology?

**SORDONI:** There are some problems that cannot be just solved by kind of one instruction or rule, I would say, but they necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem in a different language than the one it is formulated in. And the second call is, now recompose this visualization that you have produced to solve the problem itself. And so basically, in that context, you can think about this as kind of augmenting the computational power of the language model by splitting the one call into multiple calls.
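
The two-call idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_layer`, `toy_llm`, and the two instruction strings are hypothetical names invented for this sketch, and `toy_llm` is a stand-in that echoes its prompt instead of calling a real model.

```python
from typing import Callable

# Hypothetical stand-in type for an LLM call: any function from text to text.
LLM = Callable[[str], str]


def make_layer(llm: LLM, instruction: str) -> Callable[[str], str]:
    """Build one 'language layer': an LLM call modulated by an instruction."""
    def layer(text: str) -> str:
        # The instruction and the input are joined into a single prompt.
        return llm(f"{instruction}\n\n{text}")
    return layer


def toy_llm(prompt: str) -> str:
    """Toy model (assumption): echoes the instruction in brackets, then the input."""
    instruction, _, body = prompt.partition("\n\n")
    return f"[{instruction}] {body}"


# Two layers with different instructions; the output of the first feeds the second.
first = make_layer(toy_llm, "Decompose the problem")
second = make_layer(toy_llm, "Solve using the decomposition")


def two_layer(x: str) -> str:
    return second(first(x))


print(two_layer("2+2"))
```

Swapping `toy_llm` for a real model call would leave the stacking logic unchanged, which is the point of treating each call as a layer.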

**HUIZINGA:** Well, go in a little deeper on the work that this builds on. All research kind of gets a prompt—no pun intended—from previous work. So how does your work build on and/or differ from what’s been done previously in this field?

**SORDONI:** I would say that our work started more with this intuition that LLMs are just kind of black-box computation units. Now this sort of black box can accept input as input language. The computation is modulated by an instruction and it outputs language, so you can stack these layers, right. So if the weights of this language layer now are the instructions and you can stack them together, how can you optimize them, right? And then we start to think, OK, but this is very related to kind of automatic prompt optimization. The overall kind of prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we did some modifications with respect to how we propose new prompts to language models and how do we evaluate and accept then those that work given some task inputs and outputs. Our goal in the future—I would say in the near future—is going to be to basically integrate optimization that can really express arbitrary graphs …

**HUIZINGA:** Gotcha …

**SORDONI:** … of LLM calls right now. But in our paper, we started with the first step, which is, OK, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer and also devising an algorithm that works for two layers right now and that can be extended to multiple layers. But that’s also an engineering problem that needs to be tackled.
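
The "propose some prompts, accept some prompts" pattern mentioned above can be sketched generically for a single layer. This is not the algorithm from the paper: the names `accuracy`, `optimize_prompt`, `toy_predict`, and `toy_propose` are hypothetical, the proposer and the model are toy stand-ins, and candidates are accepted simply when they raise accuracy on a small labeled set.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (task input, expected output)


def accuracy(predict: Callable[[str, str], str], prompt: str,
             data: List[Example]) -> float:
    """Fraction of examples the model answers correctly under this prompt."""
    hits = sum(predict(prompt, x) == y for x, y in data)
    return hits / len(data)


def optimize_prompt(predict: Callable[[str, str], str],
                    propose: Callable[[str], List[str]],
                    prompt: str, data: List[Example],
                    steps: int = 3) -> Tuple[str, float]:
    """Propose-and-accept loop: keep a candidate only if it scores strictly higher."""
    best, best_score = prompt, accuracy(predict, prompt, data)
    for _ in range(steps):
        for candidate in propose(best):
            score = accuracy(predict, candidate, data)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (assumptions): the "model" uppercases only when told to,
# and the "proposer" appends fixed suffixes instead of asking an LLM.
def toy_predict(prompt: str, x: str) -> str:
    return x.upper() if "uppercase" in prompt else x


def toy_propose(prompt: str) -> List[str]:
    return [prompt + " Answer in uppercase.", prompt + " Be brief."]


data = [("a", "A"), ("b", "B")]
best, score = optimize_prompt(toy_predict, toy_propose, "Repeat the input.", data)
print(best, score)
```

There is no gradient anywhere in this loop. Progress comes only from trying candidates and keeping those that score better.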

**HUIZINGA:** [LAUGHS] Yeah, always got to get the engineering in there! Well, listen, let’s keep going on this because it sounds like you’re talking about methodology and, and how you conducted this research. So expand a little bit on what you did actually to experiment in this arena.

**SORDONI:** Yeah, so I think that, uh, really the birth of this paper started from this kind of view of these language models as layers modulated by instructions that can be stacked upon each other. And from there, we said, OK, what can we do with this, basically? And so some of us worked on datasets that could be interesting for this new sort of methodology, I would say, or architecture. So basically, one question was, how do you go forward to actually test if this works in any way? And so we tried to select some datasets that were more of natural language tasks—for example, sentiment classification—and some datasets that were more about reasoning tasks. And our hunch was that basically stacking multiple layers together would help more in those tasks that would require some sort of decomposition of reasoning.

**HUIZINGA:** Right.

**SORDONI:** And for the reasoning tasks, we worked with this BIG-Bench Hard setting. And so parallel to that, there were some of us that worked, for example myself, on the optimization part, really on the algorithm part. And at first, we tried to do some sort of back propagation. But I quickly saw that there were some sort of issues with that … probably empirical issues. And so we tried to actually have a more formal understanding of this optimization algorithm by resorting to variational inference, so basically, to understand the first layer as producing some text and to consider this text as a latent variable. When you open that box, it links also in your head to all … a bunch of kind of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques in this context.
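
Treating the first layer's output as a latent variable leads to the standard variational lower bound, written out below. The notation is an assumption made for this sketch and is not taken from the paper: $x$ is the task input, $y$ the target output, $h$ the intermediate text, $\pi_0$ and $\pi_1$ the prompts of the two layers, and $q$ an approximate posterior over $h$. The second layer is assumed here to condition only on $h$.

```latex
\log p(y \mid x, \pi_0, \pi_1)
  = \log \sum_{h} p(h \mid x, \pi_0)\, p(y \mid h, \pi_1)
  \;\ge\; \mathbb{E}_{q(h)}\!\left[ \log p(y \mid h, \pi_1)
      + \log p(h \mid x, \pi_0) - \log q(h) \right]
```

The inequality follows from Jensen's inequality. Maximizing the right-hand side over the prompts and over $q$ is the generic form of the optimization discussed here.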

**HUIZINGA:** Interesting. So what were the results of this research? What did you find?

**SORDONI:** So what we found was that, indeed, the tasks in which these approaches seem to help the most are the tasks that require this sort of decomposition and reasoning. The first thing that was really, really kind of cool, it was that kind of you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some models, prompt optimization can be understood as kind of really tweaking the models towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to basically be not very clear to how to instruct the model. So the model has to do a lot of work to understand …

**HUIZINGA:** Right …

**SORDONI:** … what the human really wants to say to them. And so basically, this sort of prompt optimization acts as a sort of translator where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that kind of level of abstraction that was sort of required and needed in the prompt to really solve this task very, very well. The other finding is that this problem is very hard. It’s very tricky to optimize, to prompt, this type of optimization because this type of optimization doesn’t really follow a gradient direction like in deep neural networks.

**HUIZINGA:** Yeah.

**SORDONI:** It’s basically a sort of trial and error. And this trial and error is very finicky. There are a lot of problems there. But I feel like I’m hopeful in the sense that this paper allowed us, I think, to home in on some very specific problems such that, if we solve them, we can make the problem much easier.

**HUIZINGA:** Let’s talk for a second about real-world impact of this research. Let’s extrapolate out from the lab and move into life. Who benefits from this most, and how do they benefit?

**SORDONI:** I think that, as I said before, like these automatic prompt optimization methods could benefit, I think, a large audience, or large number of users, I would say, because they could be understood as a sort of translator between the user needs and what the LLM can do. For example, one effort here in Montréal that was led by my colleagues was kind of building this sort of interactive agent that would, through interaction with the user, form a prompt but interactively. So, for example, in DLN, like in our paper, we assume that we have a task and we do not have input or interaction with the user, right. But in more realistic scenarios, you might want to actually instruct your model to do something by some sort of active learning process where the model actually asks you whether what it did was favorable or desirable or not.

**HUIZINGA:** Right.

**SORDONI:** And the user can actually interact with that output, right. For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.

**HUIZINGA:** I want to take a second here to spell out some acronyms. You’ve referred to *DLNs*, and I don’t think our audience might know what that means. I’m assuming they know *LLM* means “large language model.” That’s sort of in the parlance. But talk a little bit about what that other acronym is.

**SORDONI:** Yeah, sorry I didn’t mention this. So DLN is basically how we refer to these architectures that are composed of language model layers. So *DLN* stands for “Deep Language Network.”

**HUIZINGA:** Gotcha.

**SORDONI:** People are free to use this name or not.

**HUIZINGA:** No, I like it …

**SORDONI:** I’m not a big fan of imposing acronyms on the world [LAUGHS], but that’s a, that’s a shorter version of it. So, yeah, so it’s really the idea that a language model is a layer in this hierarchy, and the layer accepts as input a text, it outputs a text, and really is modulated by an instruction or prompt that we want to learn.

**HUIZINGA:** And so the DLN is a deep language network and it sort of acts as a deep neural network but using language models as your layer.

**SORDONI:** Exactly, exactly, yes.

**HUIZINGA:** So this is a question I ask everyone, and it’s sort of like, how could you boil this down to one little takeaway if you’re standing on an elevator with somebody and they say, what do you do, Alessandro? So if there’s one thing you want people to take away from this work, what would it be?

**SORDONI:** The first thing that came to my mind is really the fact that these models can be understood really as a class, I would say, of probability distributions and that they are modulated by these prompts. And so basically, once you have that, once a language model just defines a *p* over sentences given some prompt, you can apply a lot of algorithms with those models. You can apply algorithms that resemble EM, *expectation maximization*, or … I mean, we applied a form of that with variational inference, but maybe kind of it could open the path for other types of usages where kind of these are just very, very powerful probability distributions over these sentences that are considered as latent variables. I hope that our paper can show like a more or less practical kind of implementation of that idea. And that basically if you have to optimize, for example, prompts with one or two layers, you can definitely try our approach.

**HUIZINGA:** Well, finally, and we’ve been talking about this kind of already, but there seem to be some unresolved problems in the area. What do researchers like you need to be looking at in order to solve those? Sort of what’s next on the research agenda, whether it’s you or other researchers in this field?

**SORDONI:** So let me try to answer by something that really excites me now. What we are doing is that we are producing text, right. With the language model. But we are producing this text in such a way that it helps to solve a problem. And basically, this variational inference method and kind of framework gives us a way of understanding what does it mean to be a good text? Like what does it mean to be a good kind of latent variable or useful latent variable?

**HUIZINGA:** Right.

**SORDONI:** What does it mean to produce good data? So, for example, these big models kind of are really data creators, like this generative AI, right. But can we actually teach them to produce data such that this data can be helpful to solve tasks or to condition those same models to solve a task?

**HUIZINGA:** Right.

**SORDONI:** And what are the objective functions that promote the production of this useful data? What useful means from a mathematical perspective. I think that, apart from the prompt optimization angle, I feel like DLN to me kind of opened a little bit my spirit into kind of investigating ways of understanding what does it mean for some generated text to be useful to solve a task, I would say. Yeah.

**HUIZINGA:** Alessandro Sordoni, thanks for joining us today. And thanks to our listeners for tuning in. If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on *Abstracts*!

The post Abstracts: December 11, 2023 appeared first on Microsoft Research.