AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens

Episode 149 | Sept. 14, 2023

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.  

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity.

This episode features Senior Principal Research Manager Ahmed H. Awadallah, whose work improving the efficiency of large-scale AI models and efforts to help move advancements in the space from research to practice have put him at the forefront of this new era of AI. Awadallah discusses the shift in dynamics between model size and amount—and quality—of data when it comes to model training; the recently published paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4,” which further explores the use of large-scale AI models to improve the performance of smaller, less powerful ones; and the need for better evaluation strategies, particularly as we move into a future in which Awadallah hopes to see gains in these models’ ability to continually learn.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more inspired to work in the field than right now. The release of GPT-4 was a watershed moment in the pursuit of artificial intelligence, and yet progress continues to accelerate. The latest large-scale AI models and the systems they power are continuing to exhibit improvements in reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in large-scale AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Ahmed Awadallah. Ahmed is a Senior Principal Researcher at Microsoft Research in Redmond. Much of his work focuses on machine learning, helping to create foundation models that excel at key tasks while using less compute and energy. His work has been at the leading edge of recent progress in AI and gives him a unique perspective on where it will go next.


[MUSIC FADES] 

All right, Ahmed, let’s dive right in. Among other things, I find that people are hungry to understand the drivers of the progress we’re seeing in AI. Over these last few years when people like you or I have tried to explain this, we’ve often pointed to some measure of scale. You know, I know many times as I’ve given talks in AI, I’ve shown plots that feature some kind of up-and-to-the-right trend in scale over time—the increasing size of the AI models we’re training, the increasing size of the datasets we’re using to train them on, or even the corresponding increase in the overall compute budget. But when you double-click into this general notion of scale related to large AI models, what gets exposed is really a rapidly evolving frontier of experimental science. So, Ahmed, I’m going to start with a big question and then we can kind of decompose it from there. As someone at the forefront of all of this, how has your understanding of what’s driving progress in AI changed over this last year?

AHMED AWADALLAH: Thanks, Ashley. That’s a very good question. And the short answer is it’s changed a lot. I think I have never been learning as much as I have been throughout my career. Things are moving really, really fast. The progress is amazing to witness, and we’re just learning more and more every day. To your point, for quite some time, we were thinking of scale as the main driver of progress, and scale is clearly very important and necessary. But over the last year, we have been also seeing many different things. Maybe the most prominent one is the importance of data being used for training these models. And that’s not very separate from scale, because when we think about scale, what really matters is how much compute we are spending in training these models. And you can choose to spend that compute in making the model bigger or in training it on more and more data, training it for longer. And it has been over the past few years a lot of iterations in trying to understand that. But it has been very clear over the last year that we were, in a sense, underestimating the value of data in different ways: number one, in having more data but even more important, the quality of the data, having cleaner data, having more representative data, and also the distribution or the mixing of the data that we are using. Like, for example, one of the very interesting things we have witnessed maybe over the last year to year and a half is that a lot of the language models are being trained on text and code. And surprisingly, the training on code is actually helping the model a lot—not just in coding tasks but in normal other tasks that do not really involve coding. 
More importantly, I think one of the big shifts last year in particular—it has been happening for quite some time but we have been seeing a lot of value for it last year—is that there are now like two stages of training these models: the pretraining stage, where you are actually training the language model in an autoregressive manner to predict the next word. And that just makes it a very good language model. But then the post-training stage with the instruction tuning and RLHF (reinforcement learning from human feedback) and reward models, using a very different form of data; this is not self-supervised, freely available data on the internet anymore. This is human-generated, human-curated, maybe a mixture of model- and human-curated data that’s trying to get the model to be better at very specific elements like being more helpful or being harmless. 

LLORENS: There’s so much to unpack even in that, in that short answer. So let’s, let’s dig in to some of these core concepts here. You, you teed up this notion of ways to spend compute, you know, ways to spend a compute budget. And one of the things you said was, you know, one of the things we can do is make the model bigger. And I think to really illustrate this concept, we need to, we need to dig in to what that means. One, one concept that gets obfuscated there a little bit is the architecture of the model. So what does it mean to make the model bigger? Maybe you can tell us something about, you know, how to think about parameters in the model and how important is architecture in that, in that conversation.

AWADALLAH: So most of the progress, especially in language and other domains, as well, have been using the transformer model. And the transformer model have been actually very robust to change over the years. I don’t … I think a lot … I’ve asked a lot of experts over the years whether they had expected the transformer model to be still around five, six years later, and most of them thought we would have something very different. But it has been very robust and very universal, and, yes, there have been improvements and changes, but the core idea has still been the same. And with dense transformer models, the size of the model tends to be around the number of layers that you have in the model and then the number of parameters that you have in each layer, which is basically the depths and the widths of the model. And we have been seeing very steady exponential increase in that. It’s very, it’s very interesting to think that just like five years ago when BERT came up, the large model was like 300-something million parameters and the smaller one was 100 million parameters. And we consider these to be really large ones. Now that’s a very, very small scale. So things have been moving and moving really fast in making these models bigger. But over the time, there started to be an understanding being developed of how big should the model be. If I were to invest a certain amount of compute, what should I do with that in terms of the model size and especially on how it relates to the data side? And, perhaps, one of the most significant efforts there was the OpenAI scaling laws, which came up in 2020, late 2020, I think. And it was basically saying that if you are … if you have 10x more compute to spend, then you should dedicate maybe five of that … 5x of that to making the model bigger—more layers, more width—and maybe 2x to making the data bigger. 
And that translated to … for, like say, GPT-3-like model being trained on almost 300 billion tokens, and for quite some time, the 300 billion tokens was stuck, like it became the standard, and a lot of people were using that. But then fast-forward less than two years later came the second iteration of the scaling laws, the Chinchilla paper, where the, the recommendation was slightly different. It was like we were not paying enough attention to the size of the data. Actually, you should now think of the data and the size as equally … and the size of the model … as equally important. So if you were to invest in X more, you should just split them evenly between bigger models and more data. And that was quite a change, and it actually got all the people to pay more attention to the data. But then fast-forward one more year, in 2023—and maybe pioneered mostly with the Llama work from Meta and then many, many others followed suit—we started finding out that we don’t have to operate at this optimal point. We can actually push for more data and the model will continue to improve. And that’s interesting because when you are thinking about the training versus the deployment or the inference parts of the life cycle of the model, they are actually very different. When you are training the model, you would like the model to learn to generalize as best as possible. When you are actually using the model, the size of the model becomes a huge difference. I actually recall an interesting quote from a 2015 paper by Geoff Hinton and others. That’s the paper that introduced the idea of distillation for neural networks. Distillation was there before from the work of, of Rich Caruana, our colleague here at Microsoft, and others. 
But in 2015, there was this paper specifically discussing distilling models for neural network models, and one of the motivating sentences at the very beginning of the paper was basically talking about insects and how insects would have different forms throughout their life cycles. At the beginning of their life, they are optimized for extracting energy and nutrients from the environment, and then later on, in their adult form, they have very different forms optimized for flying and traveling and reproduction and so on and so forth. So that, that analogy is very interesting here because like you can think about the same thing not just in the context of distillation, as this paper was describing, but just for pretraining the models in general. Yes, the optimal point might have been to equally split your compute between the data and the size, but actually going more towards having more and more data actually is beneficial. As long as the model is getting better, it will give you a lot more benefit because you have a smaller model to use during the inference time. And we would see that with the latest iteration of the Llama models, we are now seeing models as small as 7 billion parameters being trained on 1 to 2 trillion tokens of data, which was unheard of before.
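The compute trade-off Awadallah describes can be sketched numerically. This is a rough illustration only, not a formula from either scaling-law paper. It assumes training compute grows with the product of parameter count and token count, and it uses a hypothetical `model_share` exponent, chosen here so that a 10x budget reproduces roughly the "5x model, 2x data" split mentioned above (share of about 0.7) and the even split (share of 0.5).

```python
def split_compute(compute_multiplier: float, model_share: float) -> tuple[float, float]:
    """Split a compute multiplier between model size and training tokens.

    Assumption: compute scales with (parameters x tokens), so the two
    growth factors multiply back to the original compute multiplier.
    `model_share` is an illustrative exponent, not a published constant.
    """
    model_growth = compute_multiplier ** model_share
    data_growth = compute_multiplier ** (1.0 - model_share)
    return model_growth, data_growth


def tokens_per_parameter(tokens: float, parameters: float) -> float:
    """Ratio of training tokens to model parameters."""
    return tokens / parameters


if __name__ == "__main__":
    for label, share in [("model-heavy split", 0.7), ("even split", 0.5)]:
        model_growth, data_growth = split_compute(10.0, share)
        print(f"{label}: model x{model_growth:.2f}, data x{data_growth:.2f}")
    # A 7-billion-parameter model trained on 1 trillion tokens.
    print(f"{tokens_per_parameter(1e12, 7e9):.0f} tokens per parameter")
```

Training past the even split means choosing a smaller `model_share` on purpose: the model ends up smaller than the compute-optimal size but cheaper to run at inference time.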

LLORENS: Let’s talk a bit more about evaluating performance. Of course, the neural scaling laws that you referenced earlier really predict how the performance of a model on the task of next word prediction will improve with the size of the model or the size of the data. But of course, that’s not what we really care about. What we’re really after is better performance on any number of downstream tasks like reasoning, document summarization, or even writing fiction. How do we predict and measure performance in that broader sense? 

AWADALLAH: Yeah, that’s a very good question. And that’s another area where our understanding of evaluating generative models in general has been challenged quite a bit over the last year in particular. And I think one of the areas that I would recommend spending a lot of time working on right now is figuring out a better strategy around evaluating generative language models. We … this field has been very benchmark driven for many, many years, and we have been seeing a lot of very well-established benchmarks that have been helping the community in general make a lot of progress. We have seen leaderboards like GLUE and SuperGLUE, and many, many others play a very important role in the development of pretrained models. But over the last year, there have been a lot of changes. One is that these benchmarks are being saturated really, really quickly. There was … this paper that I was reading a few, reading a few months back talking about how we went from times where benchmarks like Switchboard and MNIST for speech and image processing lasted for 10 to 20 years before they got saturated to times where things like SQuAD and GLUE and SuperGLUE are getting saturated in a year or two to now where many of the benchmarks just get like maybe two or three submissions and that’s it. It gets saturated very quickly after that. BIG-Bench is a prime example of that, where it was like a collaborative effort, over 400 people coming together from many different institutions designing a benchmark to challenge language models. And then came GPT-4, and we’re seeing that it’s doing really, really, really well, even in like zero-shot and, and, and few-shot settings, where the tasks are completely new to the models. So the model out of the box is basically solving a lot of the benchmarks that we have. That’s an artifact of the significant progress that we have been seeing and the speed of that progress, but it’s actually making that, that answer to that question even harder. 
But there’s another thing that’s making it even harder is that the benchmarks are giving us a much more limited view of the actual capabilities of these models compared to what they can actually do, especially models like GPT-4. The, the breadth of capabilities of the model is beyond what we had benchmarks to measure it with. And we have seen once it was released, then once people started interacting with it, there are so many experiences and so many efforts just thinking about what can we do with that model. Now we figured out that it can do this new task; it can do that new task. I can use it in this way that I didn’t think about before. So that expansion in the surface of capabilities of the model is making the question of evaluating them even, even harder and, and moving forward, I think this would be one of the most interesting areas to really spend time on.

LLORENS: Why don’t we talk a bit about a paper that you recently published with some Microsoft Research colleagues called “Orca: Progressive Learning from Complex Explanation Traces of GPT-4.” And there’s a couple of, of concepts that we’ve been talking about that I want to pull through to, to a discussion around, around this work. One is the idea of quality of data. And so it would be great to hear, you know, some of the intuitions around … yeah, what, what drove you to focus on data quality versus, you know, number of parameters or number of tokens? And then we can also come back to this notion of benchmarks, because to publish, you have to pick some benchmarks, right? [LAUGHS] So, so first, why don’t we talk about the intuitions behind this paper and what you did there, and then I’d love to understand how you thought through the process of picking benchmarks to evaluate these models. 

AWADALLAH: Yeah, so, so in this paper, we were basically thinking about like … there has been a lot of work actually on thinking about how do we have a very powerful model and use it to improve a less powerful model. This is not a new concept. It has been there forever, and I mentioned the Hinton et al. paper on distillation, one of the pioneering papers applying that to neural networks. And over time, this field actually continued getting better and better. And the way the large, more powerful models were used just continued evolving. So people were using the logits generated by the model and then maybe looking at intermediate layers and their output, maybe looking at attention maps and trying to map that between the models and coming up with more and more complex ways of distilling information from the powerful model to improve a less powerful model. But with models like GPT-4, we were thinking that GPT-4 is so good that you can actually start thinking about different ways of having a model teaching another model. And in that particular case, the idea was, can we actually have the powerful model explain step by step how to do the task, and can we actually have a smaller model learn from that? And how far can this actually help the smaller one? A big part of this has to do with the data quality but also with the teacher model quality. You wouldn’t be able to … and this gets us into the whole notion of synthesized data and the role synthesized data can play in making models better. Models like GPT-4 have the level of capability where you could actually generate a lot of synthetic data at a very high quality comparable in some cases to what you’d get from a human, better in some cases than what you could get from a human. 
And even more than that, when you are working with a model like GPT-4, there has been a lot of work over the last few months demonstrating that you can even get the model to be a lot better by having the model reflect on what it’s doing, having the model critique what it’s doing and try to come up with even corrections and improvements to its own generation. And once you have this going, you see that you can actually create very high-quality synthetic data in so many ways, mostly because of the quality of the model but also because of like these different ways of generating the data on top of the model. And then it was really an experiment of how far can another model learn from these models. And by the way—and there is … we’re seeing some work like that, as well—it doesn’t even have to be a different model. It can be the same model improving itself. It can be the same model giving feedback to itself. That coincided with actually us having, having … we have been spending a lot of time thinking about this idea of learning from feedback or like continual improvement. How can we take a language model and continue to improve it based on interaction, based on feedback? So we started connecting these two concepts and basically thinking of it like the powerful model is just giving feedback to our much less powerful model and trying to help it improve across certain dimensions. And that’s where that line of work started. And what we were finding out is that you can actually have the more powerful model teach a smaller model. It would have definitely much narrower capabilities than the bigger model because like by virtue of this training cycle, you are just focused on teaching it particular concepts. You cannot teach it everything that the larger model can do. But also because this is another example of this like post-training step, like this model has already been pretrained language model and it’s always limited by the basic capabilities that it has. 
So, yes, the large language model can teach it a little bit more, but it will always be limited by that.
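The earlier, logit-based style of distillation Awadallah mentions (before explanation traces) can be sketched as follows. This is a minimal sketch of the general soft-target idea, assuming a temperature-softened softmax and a KL-divergence loss. It is not the Orca training procedure, and the function names and default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions.

    The student is pushed to match the teacher's full output distribution,
    not just its top prediction.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]
    print(distillation_loss(teacher, [2.0, 1.0, 0.0]))  # student agrees with teacher
    print(distillation_loss(teacher, [0.0, 1.0, 2.0]))  # student disagrees
```

The loss is zero when the student's logits match the teacher's and grows as the two distributions diverge. Learning from step-by-step explanations replaces this numeric signal with text the smaller model is trained on directly.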

LLORENS: Now you mentioned … you’ve sketched out now the idea of using a powerful general-purpose model through some process of distillation to train a, a smaller, more special, more specialized model. And in the paper, you, you and your colleagues offer a number of case studies. So can you, can you pick one? Give, give us, you know, give us an example of a specialized domain and the way that you utilize GPT-4 to accomplish this training and what the performance outcome was. 

AWADALLAH: Yeah, actually, when we were working on this paper, the team was thinking about what capability we should try to focus on to, to demonstrate that the small model can improve from, from the guidance of the much more powerful model. And we were thinking it would be very cool if we can demonstrate that the small model can get better at reasoning, because reasoning has been one of the capabilities that have been clearly emerging with larger and larger models, and models like GPT-4 demonstrate the level of reasoning that we have never seen with any of our systems before. So we were thinking can we … can, can GPT-4 help actually get the smaller model to be better at reasoning. And that had a lot of implications on the selection of what datasets to use for, for creating the synthetic data. In this particular paper, by the way, we’re not, we’re not using GPT-4 to answer the questions. We already have the questions and the answers. We are just asking GPT-4 to explain it step by step. This is similar to what we have been seeing with chain-of-thought reasoning, chain-of-thought prompting, and other different prompting techniques showing that if you actually push the language model to go step by step, it can actually do a lot better. So we are basically saying, can we have these explanations and step-by-step traces and have them help the smaller language model learn to reason a little bit better. And because of that, actually (and this goes back to your earlier questions about benchmarks), in this particular paper, we chose two main benchmarks. There were more than two, but like the two main benchmarks were BIG-Bench Hard and AGIEval. BIG-Bench Hard is a subset of 23 tasks from BIG-Bench that we were just talking about earlier, and a lot of the tasks are very heavy on reasoning. AGIEval is a set of questions that are SAT-, LSAT-, GRE-, and GMAT-type of questions. They are also very heavy on reasoning. 
The benchmarks were selected to highlight the reasoning improvement and the reasoning capability of the model. And we had, we had a bunch of use cases there, and you would see one of the common themes there is that there is actually … even before the use cases, if you look at the, the results, the reasoning ability as measured by these two benchmarks at least of the base model significantly improved. Still far behind the teacher. The teacher is much, much more powerful and there’s no real comparison, but still the fact that collecting synthetic data from a model like GPT-4 explaining reasoning steps could help a much smaller model get better at reasoning and get better by that magnitude was a very interesting finding. We were, we were quite a bit surprised, actually, by the results. We thought that it will improve the model reasoning abilities, but it actually improved it beyond what we expected. And again, this goes back to like imagine if we were … if we wanted to do that without a model like GPT-4, that would entail having humans generate explanations for a very large number of tasks and make sure that these explanations remain faithful and align with the answers of the question. It would have been a very hard task, and the type of annotator that you would like to recruit in order to do that, it would have been … even made it harder and slower. But having, having the capabilities of a model like GPT-4 is really what made it possible to do that.
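The data-generation step described above, where the teacher is given a question together with its known answer and asked only for the step-by-step explanation, might look roughly like this. It is a hypothetical sketch: the prompt wording, the `teacher` callable, and the output record format are all assumptions for illustration, not the prompts or format used in the Orca paper.

```python
from typing import Callable, Dict, List, Tuple


def build_explanation_prompt(question: str, answer: str) -> str:
    """Ask the teacher to explain a known answer rather than to solve the question."""
    return (
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        "Explain step by step how to arrive at this answer."
    )


def make_training_examples(
    qa_pairs: List[Tuple[str, str]],
    teacher: Callable[[str], str],  # hypothetical stand-in for a call to a large model
) -> List[Dict[str, str]]:
    """Pair each question with the teacher's explanation plus the known answer."""
    examples = []
    for question, answer in qa_pairs:
        explanation = teacher(build_explanation_prompt(question, answer))
        examples.append(
            {"prompt": question, "response": f"{explanation}\nAnswer: {answer}"}
        )
    return examples


if __name__ == "__main__":
    # A canned teacher so the sketch runs without any model access.
    def stub_teacher(prompt: str) -> str:
        return "Step 1: add the two numbers."

    print(make_training_examples([("What is 2 + 3?", "5")], stub_teacher))
```

Because the answer is supplied up front, the teacher's job is limited to producing a faithful reasoning trace, which is the part that would be slow and expensive to collect from human annotators.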

LLORENS: You’ve, you’ve outlined now, you know, your experiments around using GPT-4 to train a smaller model, but earlier, you also alluded to a pretty compelling idea that maybe even a large, powerful model could, I guess, self-improve by generate, you know, performing a generation, critiquing itself, and then somehow guiding, you know, the parameter weights in a way that, that was informed by the critique. Is that, was that part of these experiments, or what … or, or is that … does that work? [LAUGHS] Have, have we … do we have experimental evidence of that?  

AWADALLAH: Yeah, I think, I think that’s a very good question. That was really how we started. That was really what we were aiming and still trying to do. The value … we started off by asking that question: can we actually have a model self-improve, self-improve itself? From an experimental perspective, it was much easier to have a powerful model help a smaller model improve. But self-improvement is really what we, what got us excited about this direction from the beginning. There has been evidence from other work showing up over the last short period actually showing that this is actually a very promising direction, too. For example, one of the very interesting findings about these powerful models—I think that the term frontier models is being used to refer to them now—is that they have a very good ability at critiquing and verifying output. And sometimes that’s even better than their ability at solving the task. So you can basically go to GPT-4 and ask it to solve a coding question. Write a Python function to do something. And then you can go again to GPT-4 and ask it to look back at that code and see if there are any bugs in there. And surprisingly, it would identify bugs in its own generation with a very high quality. And then you can go back to GPT-4 again and ask it to improve its own generation and fix the bugs. And it does that. So we actually have a couple of experiments with that. One of them in a toolkit called LIDA that one of my colleagues here, Victor [Dibia], has been working on for some time. LIDA is a tool for visualizations, and you basically go there and submit a query. The query would be, say, create a graph that shows the trends of stocks over the last year. And it’ll actually go to the data basically, engineer Python code. The Python code, when compiled and executed, would generate a visualization. But then we were finding out that we don’t have to stop there. 
We can actually ask GPT-4 again to go back to that visualization and critique it, and it doesn’t have to be open critique. We can define the dimensions that we would like to improve on and ask GPT-4 to critique and provide feedback across these dimensions. Like it could be the readability of the chart. It could be, is the type of chart the best fit for the data? And surprisingly it does that quite well. And then that opens the door to so many interesting experiences where you can, after coming up with the initial answer, you can actually suggest some of these improvements to a human. Or maybe if you are confident enough, you just go ahead and apply them even without involving the human in the loop and you actually get a lot better. There was another experiment like that where another colleague of mine has been working on a library called AutoGen, which basically helps with these iterative loops on top of language models, as well as figuring out values of hyperparameters and so on and so forth. And the experiments were very similar. There was a notion there of like having a separate agent that the team refers to as a user proxy agent, and that agent basically has a criteria of what the user is trying to do. And it keeps asking GPT-4 to critique the output and improve the output up until this criteria is met. And we see that we get much, much better value with using GPT-4 this way. That cycle is expensive, though, because you have to iterate and go back multiple times. The whole idea of self-improvement is basically, can we literally distill that cycle into the model itself again so that as the model is being used and being asked to maybe critique and provide feedback or maybe also getting some critique and feedback from the human user, can we use that data to continue to improve the model itself?
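The generate, critique, and improve cycle described here can be sketched as a simple loop. This is an illustrative outline only, not the API of LIDA or AutoGen. The `generate`, `critique`, and `revise` callables are hypothetical stand-ins for model calls, and the stubs below exist only so the sketch runs on its own.

```python
def refine_until_accepted(task, generate, critique, revise, max_rounds=3):
    """Draft an answer, then alternate critique and revision until accepted.

    `critique` returns a list of issues; an empty list means the criteria
    are met. `max_rounds` caps the cost, since every round is another
    round trip to the model.
    """
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if not feedback:
            break
        draft = revise(task, draft, feedback)
    return draft


# Canned stand-ins for model calls (assumptions for illustration).
def stub_generate(task):
    return "chart without axis labels"


def stub_critique(task, draft):
    return [] if "with axis labels" in draft else ["add axis labels"]


def stub_revise(task, draft, feedback):
    return draft.replace("without", "with")


if __name__ == "__main__":
    print(refine_until_accepted("plot stock trends", stub_generate, stub_critique, stub_revise))
```

The critique dimensions mentioned above, such as readability or whether the chart type fits the data, would live inside `critique`. The cap on rounds reflects the cost concern raised in the conversation.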

LLORENS: It is pretty fascinating that these models can be better at evaluating a candidate solution to a task than generating a novel solution to the task. On the other hand, maybe it’s not so surprising. One of the things that’s hard about or one of the things that can be challenging is this idea of, you know, prompt engineering, by which I’m trying to specify a task for the, for the model to solve or for the AI system to solve. But if you think about it, the best I can do at specifying the task is to actually try my best to complete the task. I’ve now specified the task to the greatest extent that I possibly can. So the machine kind of has my best task specification. With that, that information, now it becomes a kind of maybe even in some cases a superhuman evaluator. It’s doing better than I can at evaluating my own work. So that’s kind of an interesting twist there. Back, you know, back to the Orca paper, one of the things that you wouldn’t have seen … you know, earlier in the talk, you, you harkened back to say a decade ago, when benchmarks lasted a long, a longer time, one of the things that we would not necessarily have seen in a paper from that era, you know, say the CNN era of AI, is, is, a, is a safety evaluation, you know, for a specialized object recognition model. But in the Orca paper, we do have a safety evaluation. Can you, you talk a little bit about the thought process behind the particular evaluations that you did conduct and, and why these are necessary in the first place in this era of AI? 

AWADALLAH: Yeah, I think in this era of AI, this is one of the most important parts of the development cycle of any LLM, large or small. And as we were just describing, we are discovering abilities of these models as we go. So just as there will be a lot of emerging capabilities that are surprising and useful and interesting, this would also open the door to a lot of misuse. And safety evaluation is at least … is the least we can do in order to make sure that we understand how, how can this model be used and what are some of the possible harms or the possible misuses that can come from using these models? So I think, I think this now definitely should be a standard for any work on language models. And here we are not, we’re not really training a language model from scratch. This is more of like a post-training or a fine-tuning of an existing language model. But even for, for, for research like that, I think safety evaluation should be a critical component of that. And, yes, we did some, and we, we, we actually have a couple of paragraphs in the paper where we say we need to do a lot more, and we are doing a lot more of that right now. I think … what we did in the paper … we focused on only two dimensions: truthfulness and toxicity. And we were basically trying to make sure that we are trying to see the additional fine-tuning and training that we do, is it improving the model across these dimensions or is it not? And the good news is that it was actually improving it in both dimensions, at least with the benchmarks that we have tried. I, I think it was interesting that actually on the, on the toxicity aspect in particular, we found that this particular type of post-training is actually improving the base model in terms of its tendency to generate toxic or biased content. 
But I think a big part of that is that we, we’re using Azure APIs in part of the data cleaning and data processing, and Azure has invested a lot of time and effort in making sure that we have a lot of tools and classifiers for identifying unsafe content, so the training data, the post-training data, benefited from that, which ended up helping the model, as well. But to your point, I think this is a critical component that should go into any work related to pretraining or post-training or even fine-tuning in many cases. And we did some in the paper, but I think, I think there’s a lot more to be done there. 

LLORENS: Can you talk a little bit more about post-training as distinct from pretraining? How that, how that process has evolved, and, and where you see it going from here?

AWADALLAH: I, I, I see a ton of potential and, and opportunity there actually. And pretraining is the traditional language model training as we have always done it. Surprisingly, actually, if you go back to … like I, I was … in, in one of the talks, I was showing like a 20-years-ago paper by Bengio et al. doing the language model training with neural networks, and we’re still training neural networks the same way, autoregressive next word prediction. Very different architecture, a lot of detail that goes into the training process, but we are still training them as a language model to predict the next word. In a big departure from that—and it started with the InstructGPT paper and then a lot of other work had followed—there was this introduction of other steps of the language model training process. The first step is instruction tuning, which is showing the model prompts and responses and asking it to … and training the model on these prompts and responses. Often these responses are originated by a human. So you are not just training the model to learn the language model criteria only anymore, you are actually training it to respond to a way the human would want it to respond. And this was very interesting because you could see that the language models are really very good text-completion engines. And at some time actually, a lot of folks were working on framing the task such that it looks like this text completion. So if you are doing classification, you would basically list your input and then ask a question where the completion of that question would be the class that you are looking for. But then the community started figuring out that you can actually introduce this additional step of instruction tuning, where now out of all the possible ways of completing a sentence like if I’m asking a question, maybe listing other similar questions is a very good way of completion. 
Maybe repeating that question with more details is another way of completion, or answering the question is a third way of completion, and all of them could be highly probable. The instruction tuning is basically teaching the model the way to respond, and a big part of that has to do with safety, as well, because you could demonstrate how we want the model to be helpful, how we want the model to be harmless, in this instruction-tuning step. But the instruction tuning step is only showing the model what to do. It’s not showing it what not to do. And this is where the RLHF step came in, the reinforcement learning from human feedback. What’s happening really is that instead of showing the model a single answer, we’re showing them a little more than one answer. And we are basically showing them only a preference. We’re basically telling the model Answer A is better than Answer B. It could be better for many reasons. We are just encoding our criteria of better into these annotations, and we are training a reward model first that basically its job is, given any response, would assign a scalar value to it on how good it is. And then we are doing the RLHF training loop, where the reward model is used to update the original model such that it learns what are better responses or not or worse responses and tries to align more with the better responses. The post-training is, as a concept, is very related and, and sometimes referred to also as alignment, because the way post-training has been mostly used is to align the model to human values, whether this be being helpful or being harmless. 

LLORENS: Ahmed, as we, as we wrap up here, typically, I would ask something like, you know, what’s next for your research, and maybe you can tell us a little bit about what’s next for your research. [LAUGHS] But, but before you do that, I’d love to understand what, what key limitation you see in the current era of AI that you would … would be on your wish list, right, as something that maybe you and your team or maybe the broader field has accomplished in the next five years. What, what new capabilities would be on your wish list for AI over the next five years? 

AWADALLAH: Yeah, given, given the progress, I would say even much shorter than five years. 

LLORENS: Five months. [LAUGHS]

AWADALLAH: But I would say … actually the answer to the two questions are, are very similar. Actually, I think where we are with these models right now is much better than many people anticipated, and we are able to solve problems that we didn’t think we could solve before. One of the key capabilities that I would like to see getting better over the next, few months to a few years—hopefully more toward few months—is the ability of the model to continue to learn. This like continual learning loop where the model is learning as it interacts with the humans. The model is reflecting on past experiences and getting better as we use it, and maybe also getting better in an adaptive way. Like we sometimes use this term adaptive alignment, where we are basically saying we want the model actually to continue to align and continue to align in the way it behaves across multiple dimensions. Like maybe the model will get more personal as I use it, and it will start acting more and, and behaving more in a way I want it to be. Or maybe I am developing a particular application, and for that application, I want the model to be a lot more creative or I want the model to be a lot more grounded. We can do some of that with prompting right now, but I think having more progress along this notion of continual learning, lifelong learning … this has been a heavily studied subject in machine learning in general and has been the holy grail of machine learning for many, many, many years. Having a model that’s able to continue to learn, continue to adapt, gets better every time you use it, so just when I use it today and I interact with it and it could learn about my preferences, and next time along, I don’t have to state these preferences again. Or maybe when it makes a mistake and I provide a feedback, next time along, it already knows that it had made that mistake and it already gives me a better solution.  

LLORENS: That should have been the last question. But I think I have one more. That is, how will we know that the models are getting better at that, right? That’s a metric that’s sort of driven by interaction versus, you know, static evaluation. So how do you, how do you measure progress in adaptive alignment that way?

AWADALLAH: I think, I think that’s a very interesting point. And this actually ties this back with two concepts that we brought up earlier: the evaluation side and the safety side. Because from the evaluation perspective, I do think we need to move beyond static benchmark evaluation to a more dynamic human-in-the-loop evaluation, and there’s already been attempts and progress at that just over the past few months, and there is still a lot more to do there. The evaluation criteria will not also be universal. Like there will be a lot … like a lot of people talk about the, let’s say, fabrications—the models making up information, facts. Well, if I am using the model to help me write fictional stories, like this becomes a feature; it’s not a bug. But if I’m using the model to ask questions, especially in the high-stakes scenario, it becomes a very big problem. So having a way of evaluating these models that are dynamic, that are human-in-the-loop, that are adaptive, that aligns with objectives of how we are using the models will be a very important research area, and that ties back to the safety angles, as well, because if I … if we are barely … we’re, we’re …  everybody is working really hard to try to understand the safety of the models after the models are being trained and they are fixed. But what if the models continue to improve? What if it’s continuing to learn? What if it’s learning things from me that are different than what it’s learning from you? Then that notion of alignment and safety and evaluation of that becomes also a very open and interesting question.  

LLORENS: Well, look, I love the ambition there, Ahmed, and thanks for a fascinating discussion. 

AWADALLAH: Thank you so much, Ashley.

The post AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens appeared first on Microsoft Research.

Microsoft at ACM SIGCOMM 2023: Innovating the future of networking

Modern applications heavily rely on robust network infrastructure, requiring continuous innovation. In this evolving landscape, Microsoft is at the forefront, spearheading innovation efforts in networking and strengthening the foundational network infrastructure that underpins the cloud ecosystem. By investing in and enhancing this critical infrastructure, Microsoft not only ensures the resilience and scalability of cloud services but also lays the groundwork for the sophisticated and transformative applications that will continue to define the technological landscape.

ACM SIGCOMM, the premier annual conference of the Association for Computing Machinery’s special interest group on data communication (SIGCOMM), is dedicated to the study of communication and computer networks. Microsoft was proud to be a Gold Sponsor of this year’s conference, publishing 10 papers and participating in the organizing committee. Dave Maltz, technical fellow and corporate vice president of Azure Networking, served as one of the program committee chairs, helping to oversee the conference’s technical program. Additionally, we are proud to acknowledge the significant achievement of one of our youngest researchers, Siva Kakarla, recognized as the ACM SIGCOMM Dissertation Award runner-up for his thesis, “Formal Methods for a Robust Domain Name System.”

Microsoft also had a booth showcasing some of our latest technologies, including hollow core fiber-based connectivity, SONiC on smart switches, container networking, technologies for L3/L4-based DDoS protection, and technologies that we are building to extend the cloud into space—for both earth observation and satellite communication.


Paper highlights 

The papers Microsoft published at SIGCOMM 2023 span a wide spectrum of networking domains, ranging from 5G and wide area networks (WAN) to enterprise networks. They also explore various aspects of networking, including traffic engineering, network offload strategies, and specialized network designs tailored for applications like gaming, video conferencing, and financial services.   

Here are some of the highlights:

Switchboard: Efficient Resource Management for Conferencing Services 

Efficient resource management is crucial for conferencing services, such as Microsoft Teams, to balance user experience and cost-effectiveness. This involves optimizing the allocation of media processing servers, responsible for handling media streams during calls. Rahul Bothra, Rohan Gandhi, Ranjita Bhagwan, Venkat Padmanabhan, Rui Liang, Steve Carlson, Vinayaka Kamath, Sreangsu Acharyya, Ken Sueda, Somesh Chaturmohta, and Harsha Sharm introduce Switchboard, a significant advancement in resource management controllers. Switchboard is peak-aware, recognizing that resource costs vary with peak usage times and across time zones, allowing servers to serve calls during peak times and act as backups during off-peak hours. Additionally, it enhances efficiency by coordinating network and compute provisioning and application-aware resource allocation. Evaluation using Microsoft Teams data demonstrates that Switchboard reduces provisioning costs by up to 51 percent while maintaining or improving latency compared to existing solutions.

Resilient Baseband Processing in Virtualized RANs with Slingshot 

In the realm of cellular networks, virtualized radio access networks (vRANs) are gaining prominence, replacing traditional specialized hardware with software on commodity servers. However, current vRAN setups lack resilience, making it challenging to implement failover mechanisms and upgrades without prolonged service interruptions. Nikita Lazarev, Tao Ji, Anuj Kalia, Daehyeok Kim, Ilias Marinos, Francis Y. Yan, Christina Delimitrou, Zhiru Zhang, and Aditya Akella propose Slingshot, an innovative system designed to seamlessly introduce resilience to the most critical layer of vRANs, the physical layer (PHY). Slingshot accomplishes this by employing novel techniques for real-time workload migration, incorporating fast RAN protocol middleboxes, and implementing real-time RAN failure detection. A key breakthrough in Slingshot’s design is its approach to treat transient disruptions from resilience events as akin to regular wireless signal impairments, using the inherent resilience of cellular networks to these occurrences. Experiments conducted on a cutting-edge 5G vRAN testbed demonstrate Slingshot’s capability to manage PHY failover without interrupting video conferencing and causing under 110 microseconds of disruption to a TCP connection. Furthermore, it enables seamless zero-downtime upgrades in vRAN deployments.

DBO: Response Time Fairness for Cloud-Hosted Financial Exchanges 

When hosting financial exchanges in cloud environments, ensuring equal and predictable latency for all market participants is critical, especially in tasks like high-speed trading. Existing cloud deployments often struggle to maintain such fairness due to factors like congestion and varying network paths. In this paper, Prateesh Goyal, Eashan Gupta, Ilias Marinos, Chenxingyu Zhao, Radhika Mittal, and I (Ranveer Chandra) tackle the issue arising from the lack of determinism in cloud networks, showing that achieving predictable or bounded latency isn’t a necessity to ensure fairness. Inspired by the concept of logical clocks in distributed systems, the paper introduces Delivery Based Ordering (DBO) as a novel approach to rectifying latency discrepancies among participants, helping ensure fairness. The evaluation of DBO, conducted both in a hardware testbed and a public cloud environment, demonstrates its feasibility in achieving guaranteed fairness and sustaining sub-100 microsecond latency, even at high transaction rates.

For the complete list of accepted publications by Microsoft researchers, please see the publications list on Microsoft at SIGCOMM 2023.

Learn about opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Networking, and other departments. Whether you’re a networking partner or researcher, we welcome your collaboration and exploration to advance computer networking and invite you to be part of the team crafting cutting-edge solutions for industry challenges. Review our open positions at the Microsoft Research website.

The post Microsoft at ACM SIGCOMM 2023: Innovating the future of networking appeared first on Microsoft Research.

Shout at the Devil: Capcom’s ‘Devil May Cry 5’ Joins GeForce NOW

GFN Thursday is downright demonic, as Devil May Cry 5 comes to GeForce NOW.

Capcom’s action-packed third-person brawler leads 15 titles joining the GeForce NOW library this week, including Gears Tactics and The Crew Motorfest.

It’s also the last week to take on the Ultimate KovaaK’s Challenge. Get on the leaderboard today for a chance to win a 240Hz gaming monitor, a gaming Chromebook, GeForce NOW memberships or other prizes. The challenge ends on Thursday, Sept. 21.

The Devil Returns

Devil May Cry 5 on GeForce NOW
Jackpot!

Devil May Cry 5 is the next title from Capcom’s catalog to come to GeForce NOW. Members can stream all of its high-octane, stylish action at GeForce RTX quality to nearly any device, thanks to the power of GeForce NOW cloud gaming servers.

The threat of demonic power has returned to menace the world once again. Take on hordes of enemies as Nero, V or the legendary Dante with the ramped-up sword-and-gun gameplay that the series is known for. Battle epic bosses in adrenaline-fueled fights across the overrun Red Grave City — all to the beat of a truly killer soundtrack.

Take the action on the go thanks to the power of the cloud. GeForce NOW Priority members can take the fight with them across nearly any device at up to 1080p and 60 frames per second.

Kickin’ It Into High Gear

Gears Tactics on GeForce NOW
A squad of survivors is all it takes to stop the Locust threat.

Rise up and fight, members. Gears Tactics is the next PC Game Pass title to arrive in the cloud.

Gears Tactics is a fast-paced, turn-based strategy game from one of the most acclaimed video game franchises — Gears of War. Set a dozen years before the first Gears of War game, the Gears Tactics story opens as cities on the planet Sera begin falling to the monstrous threat rising from underground: the Locust Horde. With the government in disarray, a squad of survivors emerges as humanity’s last hope. Play as the defiant soldier Gabe Diaz to recruit, develop and command squads on a desperate mission to hunt down the relentless and powerful leader of the Locust army, Ukkon, the group’s monster-making mastermind.

Fight for survival and outsmart the enemy with the sharpness of 4K resolution streaming from the cloud with a GeForce NOW Ultimate membership.

Hit the Road, Jack

The Crew Motorfest on GeForce NOW
The best way to see Hawaii is by car, at 100 mph.

The Crew Motorfest also comes to GeForce NOW this week. The latest entry in Ubisoft’s racing franchise drops drivers into the open roads of Oahu, Hawaii. Get behind the wheel of 600+ iconic vehicles from the past, present and future, including sleek sports cars, rugged off-road vehicles and high-performance racing machines. Race alone or with friends through the bustling city of Honolulu, test off-roading skills on the ashy slopes of a volcano or kick back on the sunny beaches behind the wheel of a buggy.

Members can take a test drive from Sept. 14-17 with a five-hour free trial. Explore the vibrant Hawaiian open world, participate in thrilling driving activities and collect prestigious cars, with all progress carrying over to the full game purchase.

Take the pole position with a GeForce NOW Ultimate membership to stream The Crew Motorfest and more than 1,600 other titles at the highest frame rates. Upgrade today.

A New Challenge

Gunbrella on GeForce NOW
Rain, rain, go away. The umbrella is also a gun today.

With GeForce NOW, there’s always something new to play. Here’s what’s hitting the playlist this week:

  • Tavernacle! (New release on Steam, Sept. 11)
  • Gunbrella (New release on Steam, Sept. 13)
  • The Crew Motorfest (New release on Ubisoft Connect, Sept. 14)
  • Amnesia: The Bunker (Xbox, available on PC Game Pass)
  • Descenders (Xbox, available on PC Game Pass)
  • Devil May Cry 5 (Steam)
  • Gears Tactics (Steam and Xbox, available on PC Game Pass)
  • Last Call BBS (Xbox)
  • The Matchless Kungfu (Steam)
  • Mega City Police (Steam)
  • Opus Magnum (Xbox)
  • Remnant II (Epic Games Store)
  • Space Hulk: Deathwing – Enhanced Edition (Xbox)
  • Superhot (Xbox)
  • Vampyr (Xbox)

What are you planning to play this weekend? Let us know on Twitter or in the comments below.

Visualize an Amazon Comprehend analysis with a word cloud in Amazon QuickSight

Searching for insights in a repository of free-form text documents can be like finding a needle in a haystack. A traditional approach might be to use word counting or other basic analysis to parse documents, but with the power of Amazon AI and machine learning (ML) tools, we can gather deeper understanding of the content.

Amazon Comprehend is a fully managed service that uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend develops insights by recognizing the entities, key phrases, sentiment, themes, and custom elements in a document. Amazon Comprehend can create new insights based on understanding the document structure and entity relationships. For example, with Amazon Comprehend, you can scan an entire document repository for key phrases.

Amazon Comprehend lets non-ML experts easily do tasks that normally take hours of time. Amazon Comprehend eliminates much of the time needed to clean, build, and train your own model. For building deeper custom models in NLP or any other domain, Amazon SageMaker enables you to build, train, and deploy models in a much more conventional ML workflow if desired.

In this post, we use Amazon Comprehend and other AWS services to analyze and extract new insights from a repository of documents. Then, we use Amazon QuickSight to generate a simple yet powerful word cloud visual to easily spot themes or trends.

Overview of solution

The following diagram illustrates the solution architecture.

To begin, we gather the data to be analyzed and load it into an Amazon Simple Storage Service (Amazon S3) bucket in an AWS account. In this example, we use text formatted files. The data is then analyzed by Amazon Comprehend. Amazon Comprehend creates a JSON formatted output that needs to be transformed and processed into a database format using AWS Glue. We verify the data and extract specific formatted data tables using Amazon Athena for a QuickSight analysis using a word cloud. For more information about visualizations, refer to Visualizing data in Amazon QuickSight.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Upload data to an S3 bucket

Upload your data to an S3 bucket. For this post, we use UTF-8 formatted text of the US Constitution as the input file. Then you’re ready to analyze the data and create visualizations.

Analyze data using Amazon Comprehend

There are many types of text-based and image information that can be processed using Amazon Comprehend. In addition to text files, Amazon Comprehend one-step classification and entity recognition can accept image files, PDF files, and Microsoft Word files as input; these input types are not discussed in this post.

To analyze your data, complete the following steps:

  1. On the Amazon Comprehend console, choose Analysis jobs in the navigation pane.
  2. Choose Create analysis job.
  3. Enter a name for your job.
  4. For Analysis type, choose Key phrases.
  5. For Language, choose English.
  6. For Input data location, specify the folder you created as a prerequisite.
  7. For Output data location, specify the folder you created as a prerequisite.
  8. Choose Create an IAM role.
  9. Enter a suffix for the role name.
  10. Choose Create job.

The job will run and the status will be displayed on the Analysis jobs page.

Wait for the analysis job to complete. Amazon Comprehend will create a file and place it in the output data folder you provided. The file is in .gz or GZIP format.
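
If you prefer to script the console steps above, a similar job can be started with the AWS SDK for Python (boto3). The following is a minimal sketch and not part of the original walkthrough: the helper names are hypothetical, the S3 locations and role ARN are placeholders you supply, and the input format is assumed to be one document per file.

```python
def build_key_phrases_job_request(job_name, input_s3_uri, output_s3_uri, role_arn):
    """Assemble the arguments for a Comprehend key phrases detection job.

    All values are caller-supplied placeholders; nothing here contacts AWS.
    """
    return {
        "JobName": job_name,
        "LanguageCode": "en",  # matches "For Language, choose English"
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Assumption: each file in the input folder is one document.
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
    }


def start_key_phrases_job(request):
    """Submit the job. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("comprehend")
    response = client.start_key_phrases_detection_job(**request)
    return response["JobId"]
```

The request is built separately from the call so that you can inspect or log it before anything is submitted to your account.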

This file needs to be downloaded and converted to a non-compressed format. You can download an object from the data folder or S3 bucket using the Amazon S3 console.

  1. On the Amazon S3 console, select the object and choose Download. If you want to download the object to a specific folder, choose Download on the Actions menu.
  2. After you download the file to your local computer, open the zipped file and save it as an uncompressed file.
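
The decompression step can also be done with Python's standard library instead of a desktop archive tool. This is a minimal sketch under the assumption that the downloaded file is a plain GZIP stream; the function name and file names are hypothetical. If your output turns out to be a tar archive inside the GZIP file, the result will still need to be unpacked, for example with the `tarfile` module.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the uncompressed copy.

    If dest is not given, the final suffix is dropped from the source name
    (for example, output.gz becomes output).
    """
    src = Path(src)
    dest = Path(dest) if dest else src.with_suffix("")
    # Stream the data so large files are not loaded into memory at once.
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dest
```

For example, `gunzip("output.gz")` writes a file named `output` next to the original.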

The uncompressed file must be uploaded to the output folder before the AWS Glue crawler can process it. For this example, we upload the uncompressed file into the same output folder that we use in later steps.

  1. On the Amazon S3 console, navigate to your S3 bucket and choose Upload.
  2. Choose Add files.
  3. Choose the uncompressed files from your local computer.
  4. Choose Upload.

After you upload the file, delete the original zipped file.

  1. On the Amazon S3 console, select the original zipped file in the bucket and choose Delete.
  2. Confirm the file name to permanently delete the file by entering the file name in the text box.
  3. Choose Delete objects.

This will leave one file remaining in the output folder: the uncompressed file.

Convert JSON data to table format using AWS Glue

In this step, you prepare the Amazon Comprehend output to be used as input into Athena. The Amazon Comprehend output is in JSON format. You can use AWS Glue to convert JSON into a database structure to ultimately be read by QuickSight.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Enter a name for your crawler.
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, select Not yet.
  6. Add a data source.
  7. For S3 path, enter the location of the Amazon Comprehend output data folder.

Be sure to add the trailing / to the path name. AWS Glue will search the folder path for all files.

  8. Select Crawl all sub-folders.
  9. Choose Add an S3 data source.

  10. Create a new AWS Identity and Access Management (IAM) role for the crawler.
  11. Enter a name for the IAM role.
  12. Choose Update chosen IAM role to be sure the new role is assigned to the crawler.
  13. Choose Next to enter the output (database) information.
  14. Choose Add database.
  15. Enter a database name.
  16. Choose Next.
  17. Choose Create crawler.
  18. Choose Run crawler to run the crawler.

You can monitor the crawler status on the AWS Glue console.
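
The crawler setup can likewise be scripted with boto3. The sketch below is an illustration only, with hypothetical helper names and placeholder values for the crawler name, IAM role, database, and S3 path. It also applies the trailing-slash reminder from the steps above before building the request.

```python
def normalize_s3_folder(path):
    """Return the S3 folder path with exactly one trailing slash added if missing."""
    return path if path.endswith("/") else path + "/"


def build_crawler_request(name, role, database, s3_path):
    """Assemble the arguments for creating a Glue crawler over one S3 folder."""
    return {
        "Name": name,
        "Role": role,              # IAM role name or ARN the crawler assumes
        "DatabaseName": database,  # Glue database that receives the tables
        "Targets": {"S3Targets": [{"Path": normalize_s3_folder(s3_path)}]},
    }


def create_and_run_crawler(request):
    """Create the crawler and start it. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```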

Use Athena to prepare tables for QuickSight

Athena will extract data from the database tables the AWS Glue crawler created to provide a format that QuickSight will use to create the word cloud.

  1. On the Athena console, choose Query editor in the navigation pane.
  2. For Data source, choose AwsDataCatalog.
  3. For Database, choose the database the crawler created.

To create a table compatible with QuickSight, the data must be unnested from the arrays.

  4. The first step is to create a temporary table with the relevant Amazon Comprehend data:
CREATE TABLE temp AS
SELECT keyphrases, nested
FROM output
CROSS JOIN UNNEST(output.keyphrases) AS t (nested)
  5. The following statement keeps only phrases of at least three words with a confidence score above 0.9 and groups them by frequency:
CREATE TABLE tableforquicksight AS
SELECT COUNT(*) AS count, nested.text
FROM temp
WHERE nested.Score > .9 AND 
 length(nested.text) - length(replace(nested.text, ' ', '')) + 1 > 2
GROUP BY nested.text
ORDER BY count DESC
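
To check what the two statements above should produce, the same logic can be reproduced locally in plain Python. This is a rough sketch, not a replacement for Athena: the function name is hypothetical, and it assumes the uncompressed Amazon Comprehend output holds one JSON object per line with a KeyPhrases array whose items carry Text and Score fields.

```python
import json
from collections import Counter


def count_key_phrases(json_lines, min_score=0.9, min_words=3):
    """Count qualifying key phrases, most frequent first.

    Mirrors the SQL above: flatten the KeyPhrases arrays, keep phrases with
    Score > min_score and at least min_words words, then group by text.
    """
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Equivalent of CROSS JOIN UNNEST: visit every phrase in the array.
        for phrase in record.get("KeyPhrases", []):
            text = phrase["Text"]
            # Same word count as length(text) - length(replace(text, ' ', '')) + 1.
            word_count = text.count(" ") + 1
            if phrase["Score"] > min_score and word_count >= min_words:
                counts[text] += 1
    # Equivalent of GROUP BY text ORDER BY count DESC.
    return counts.most_common()
```

Passing an open file handle for the uncompressed output works, because iterating over a text file yields one line at a time.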

Use QuickSight to visualize output

Finally, you can create the visual output from the analysis.

  1. On the QuickSight console, choose New analysis.
  2. Choose New dataset.
  3. For Create a dataset, choose From new data sources.
  4. Choose Athena as the data source.
  5. Enter a name for the data source and choose Create data source.

  6. Choose Visualize.

Make sure QuickSight has access to the S3 buckets where the Athena tables are stored.

  1. On the QuickSight console, choose the user profile icon and choose Manage QuickSight.
  2. Choose Security & permissions.
  3. Look for the section QuickSight access to AWS services.

By configuring access to AWS services, QuickSight can access the data in those services. Access by users and groups can be controlled through the options.

  4. Verify Amazon S3 is granted access.

Now you can create the word cloud.

  1. Choose the word cloud under Visual types.
  2. Drag text to Group by and count to Size.


Choose the options menu (three dots) in the visualization to access the edit options. For example, you might want to hide the term “other” from the display. You can also edit items such as the title and subtitle for your visual. To download the word cloud as a PDF, choose Download on the QuickSight toolbar.

Clean up

To avoid incurring ongoing charges, delete any unused data and processes or resources provisioned on their respective service console.

Conclusion

Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. You can use Amazon Comprehend to create new products based on understanding the structure of documents. For example, with Amazon Comprehend, you can scan an entire document repository for key phrases.

This post described the steps to build a word cloud that visualizes an Amazon Comprehend text analysis, using AWS Glue and Athena to prepare the data and QuickSight to display it.

Let’s stay in touch via the comments section!


About the Authors

Kris Gedman is the US East sales leader for Retail & CPG at Amazon Web Services. When not working, he enjoys spending time with his friends and family, especially summers on Cape Cod. Kris is a temporarily retired Ninja Warrior but he loves watching and coaching his two sons for now.

Clark Lefavour is a Solutions Architect leader at Amazon Web Services, supporting enterprise customers in the East region. Clark is based in New England and enjoys spending time architecting recipes in the kitchen.

Research Focus: Week of September 11, 2023

Microsoft Research Focus 24 | Week of September 11, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

PolySem: Efficient Polyglot Analytics on Semantic Data

Data scientists and data engineers spend a large portion of their time trying to understand, clean and transform their data before they can even start performing meaningful analysis. Most database vendors provide business intelligence (BI) tools as an efficient and user-friendly platform for customers to perform data cleaning, preparation and linking tasks to obtain actionable semantic data. However, customers are increasingly interested in querying semantic data through various modalities including SQL, imperative programming languages such as Python, and natural language queries. Today, customers are limited to using either the visual interfaces provided by these tools or languages that are specific to the particular tool.

In a new paper: PolySem: Efficient Polyglot Analytics on Semantic Data, researchers from Microsoft propose techniques to enable the execution of user queries expressed in different modalities on semantic datasets without having to export data out of the BI system. Their techniques include automatic translation of user queries into a language-agnostic representation of data processing operations, and subsequently into the specific query language that is amenable to execution on the BI engine. Evaluation results on BI and decision support benchmarks suggest significant improvements in query performance compared to other popular data processing engines.


NEW RESOURCE

Generative retrieval for conversational question answering

The growth of conversational agents, including voice assistants and chatbots, has led to a shift towards dialogue-based interfaces for information-seeking activities. This has spurred the development of conversational question answering (QA) systems. Effective passage retrieval, which excludes irrelevant data from scanned documents, is crucial but challenging for such systems due to the ambiguity of questions. Current methods rely on the dual-encoder architecture to embed contextualized vectors of questions in conversations. However, this architecture is limited by the embedding bottleneck and the dot-product operation.

To alleviate these limitations, researchers from Microsoft propose generative retrieval for conversational QA (GCoQA). GCoQA assigns distinctive identifiers to passages and retrieves passages by generating their identifiers token-by-token via the encoder–decoder architecture. In this generative way, GCoQA eliminates the need for a vector-style index and can attend to crucial tokens of the conversation context at every decoding step. Experiments on three public datasets containing about twenty million passages show GCoQA achieves relative improvements of +13.6% in passage retrieval and +42.9% in document retrieval. GCoQA also reduces memory usage and improves inference speed.
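As a rough illustration of retrieving a passage by generating its identifier token by token, the sketch below restricts each decoding step to tokens that continue some valid identifier. The prefix trie, the `END` marker, the toy scoring function, and the example identifiers are all assumptions made for illustration. GCoQA uses a trained encoder–decoder model, and the source does not describe how it constrains decoding.

```python
from typing import Callable, Dict, List, Sequence, Tuple

END = "<eos>"  # hypothetical end-of-identifier marker


def build_trie(identifiers: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix trie over tokenized passage identifiers."""
    root: Dict = {}
    for tokens in identifiers:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def generate_identifier(
    trie: Dict, score: Callable[[Tuple[str, ...], str], float]
) -> List[str]:
    """Greedy decoding: at each step, pick the best-scoring token among those
    that keep the output a prefix of a valid identifier."""
    node, prefix = trie, []
    while True:
        best = max(node.keys(), key=lambda tok: score(tuple(prefix), tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


def make_toy_score(question: str) -> Callable[[Tuple[str, ...], str], float]:
    """Stand-in for a decoder's next-token score: rewards word overlap with
    the question. A real system would condition on the whole conversation."""
    words = set(question.lower().split())

    def score(prefix: Tuple[str, ...], token: str) -> float:
        if token == END:
            return 0.5  # arbitrary constant for this toy example
        return 1.0 if token.lower() in words else 0.0

    return score


passages = [
    ["solar", "panel", "efficiency"],
    ["solar", "eclipse", "dates"],
    ["lithium", "battery", "recycling"],
]
trie = build_trie(passages)
result = generate_identifier(trie, make_toy_score("when is the next solar eclipse"))
```

Because every step is limited to children of the current trie node, the decoder can only ever emit an identifier that belongs to a real passage, and no vector index is consulted.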


NEW RESOURCE

BatteryML: An open-source tool for machine learning on battery degradation

In recent years, lithium-ion batteries have become the cornerstone of energy storage solutions, owing to their high energy density, long cycle life, and relatively low self-discharge. They have found widespread applications across various industries, including electric vehicles, consumer electronics, and renewable energy systems. Despite these advantages, lithium-ion batteries face challenges related to capacity degradation and performance optimization, which have become critical areas of focus in battery research.

Capacity degradation is a complex process influenced by various factors such as temperature, charge-discharge rate, and state of charge. Understanding and mitigating these factors is crucial for enhancing the performance and longevity of lithium-ion batteries. This has led to the development of advanced battery management systems and the application of machine learning techniques to improve prediction accuracy and optimize battery performance.

To address these challenges, researchers from Microsoft have released BatteryML, a comprehensive open-source tool designed specifically for machine learning researchers, battery scientists, and materials researchers with an interest in battery performance prediction and analysis. BatteryML aims to address the challenges of capacity degradation by leveraging machine learning methods to improve various aspects of battery performance, such as capacity fade modeling, state of health prediction, and state of charge estimation.

The post Research Focus: Week of September 11, 2023 appeared first on Microsoft Research.

Abstracts: September 13, 2023

Microsoft Research Podcast - Abstracts

Episode 148 | September 13, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.  

In the inaugural episode of the series, Dr. Ava Amini and Dr. Kevin K. Yang, both Senior Researchers with Microsoft Health Futures, join host Dr. Gretchen Huizinga to discuss “Protein generation with evolutionary diffusion: Sequence is all you need.” The paper introduces EvoDiff, a suite of models that leverages evolutionary-scale protein data to help design novel proteins more efficiently. Improved protein engineering has the potential to help create new vaccines to prevent disease and new ways to recycle plastics.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract!—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Ava Amini and Dr. Kevin Yang, both senior researchers at Microsoft Health Futures. Ava and Kevin are co-authors of a paper titled “Protein generation with evolutionary diffusion: Sequence is all you need,” and a preprint of the paper is available now on bioRxiv. Ava and Kevin, thanks for joining us on Abstracts.

KEVIN YANG: Thanks for having us. 

AVA AMINI: Thank you so much. 

HUIZINGA: So, Kevin, in just a couple sentences, tell us what problem this research addresses and why people should care.


YANG: Yeah, so proteins are this really big, important family of biomolecules, and they’re responsible for a lot of cellular processes. For example, hemoglobin carries oxygen in your blood, and insulin regulates your blood sugar levels. And people are interested in generating new proteins to do things that people care about—not necessarily in our bodies, but we’re interested in proteins as industrial enzymes so for catalysis and to make new chemicals or for therapeutics to make new drugs. And as a step towards this goal, we train a suite of models that we call EvoDiff that learns to generate realistic but novel proteins. So proteins do a lot of useful things in nature, but we can really expand their repertoire to do things that people care about but that nature may not really care about. One really good historical example of this is that most of our modern laundry detergents contain enzymes that break down things that stain your clothes. And these enzymes were based on natural proteins, but natural proteins don’t work under high heat. They don’t work in detergent. So somebody engineered those to work in the conditions of our washing machine. And they work really well nowadays. Looking forward, we look at some of the challenges facing our world, such as sustainability. So some really big things people are working on now are things like enzymes that can break down plastic and help us recycle plastic or enzymes that can perform photosynthesis more efficiently. And then on the other side, there’s therapeutics, and an obvious example there is vaccine design. So designing vaccines quickly and safely for new diseases as they emerge.  

HUIZINGA: Ava, how does your approach build on or differ from what’s been done previously in this field? 

AMINI: Yeah, so we call our approach EvoDiff, and EvoDiff has two components. The first, Evo, refers to evolutionary, and the second, Diff, refers to this notion of diffusion. And the two things that make our approach cool and powerful is the fact that we are leveraging data about proteins that is at an evolutionary scale in terms of the size and the diversity of the datasets about natural proteins that we use. And specifically, we use that data to build a type of AI model that is called a diffusion model. Now, for a little backstory on this, a few years ago, we in the AI community learned that we can do really well in generating brand-new images by taking natural images, adding small amounts of noise to them, corrupting them, and then training an AI model called a diffusion model to remove that noise. And so what we’ve done in this paper is that we have constructed and trained these diffusion models to do the same kind of process on protein data at evolutionary scale. 

HUIZINGA: Kevin, back to you, let’s go a little deeper on methodology. How did you do this?

YANG: Yeah, so we really wanted to do this in a protein sequence space. So in protein biology, you have sequences of amino acids. So that’s a series of amino acid monomers that form a chain, and then that chain folds oftentimes into a 3D structure. And function is usually mediated by that 3D structure. Unfortunately, it’s difficult and can be slow and expensive to obtain experimental structures for all these proteins. And so previous diffusion models of proteins have really focused on generating a three-dimensional structure. And then you can use some other method to find a sequence that will fold to that structure. But what we really wanted to do was generate proteins directly as sequences because it’s much easier to get sequences than it is to get structure. So there’s many, many more sequences out there than there are structures. And we know that deep learning methods scale really well as you increase the size and quality of the datasets they’re trained on. And so we … and by we, it’s me and Ava but also Nitya Thakkar, who was an undergraduate intern last summer with me and Ava, and then Sarah Alamdari, our data scientist, who also did a lot of the hands-on programming for this. And then we also got a lot of help from Rianne van den Berg, who is at AI4Science, and then Alex Lu and Nicolò Fusi, also here in New England. So we went and got these large, diverse, evolutionary datasets of protein sequences, and then we used a deep learning framework called PyTorch to train these diffusion models. And then we do a lot of computational experiments to see whether they do the things we want them to do, which Ava, I think, will talk about next. 

HUIZINGA: Right. Right. So, Ava, yes, what were your major findings?

AMINI: Yeah, the first question we really asked was, can our method, EvoDiff, generate proteins that are new, that are realistic, and that are diverse, meaning they’re not similar to proteins that exist in nature but still are realistic? And so what we found was that indeed, we can do this, and we can do this really well. In fact, the generated proteins from our method show a better coverage of the whole landscape of structural features, functional features, and features in sequence space that exist amongst proteins in nature. And so that was our first really exciting result, that we could generate proteins that were really of high quality using our method. The second thing we asked was, OK, now if we give some context to the model, a little bit of information, can we guide the generation to fulfill particular properties that we want to see in that protein? And so specifically here, we experimented with two types of experiments where first, we can give a part of the protein to the model, let’s say, a part of the protein that binds to another protein. And we hold that part constant and ask the model to generate the sequence around that. And we see that we can do really well on this task, as well. And why that’s important is because it means we can now design new proteins that meet some criteria that we, the users, want the protein to have. For example, the ability to bind to something else. And finally, the last really exciting result was … one point that we’ve talked about is why we want to do this generation in sequence space rather than structure—because structure is difficult, it’s expensive, and there are particular types of proteins that don’t actually end up folding into a final 3D structure. They’re what we call disordered. And these types of disordered proteins have really, really important roles in biology and in disease. And so what we show is that because we do our generation and design in protein sequence space, we can actually generate these types of disordered proteins that are completely inaccessible to methods that rely on using information about the protein’s 3D shape.

HUIZINGA: So, Kevin, building on Ava’s description there of the structure and sequence space, how is your work significant in terms of real-world impact? 

YANG: Right, so there’s a lot of interest in designing or generating new proteins that do useful things as therapeutics or as industrial catalysts and for a lot of other things, as well. And what our work really does is it gives us a method that can reliably generate high-quality proteins directly in sequence space. And this is good because now we can leverage evolutionary-scale data to do this on any downstream protein engineering problem without relying on a structure-based design or structure-based data. And we’re hoping that this opens up a lot of possibilities for protein engineering, protein design, and we’re really excited about some new experimental work that we—and we hope others—will use to build on this method.

HUIZINGA: Are you guys the first to move into the evolutionary scale in this? Is that a differentiator for your work? 

YANG: So there have been a few other preprints or papers that talk about applying diffusion to protein sequences. The difference here is that, yes, like I said, we’re the first ones to do this at evolutionary scale. So people will also train these models on small sets of related protein sequences. For example, you might go look for an enzyme family and find all the sequences in nature of that family and train a model to generate new examples of that enzyme. But what we’re doing is we’re looking at data that’s from all different species and all different functional classes of proteins and giving us a model that is hopefully universal or as close to universal as we can get for protein sequence space. 

HUIZINGA: Wow. Ava, if there was one thing you want listeners to take away from this work, what would it be? 

AMINI: If there’s one thing to take away, I think it would be this idea that we can and should do protein generation over sequence because of the generality we’re able to achieve, the scale that we’re able to achieve, and the modularity and that our diffusion framework gives us the ability to do that and also to control how we design these proteins to meet specific functional goals. 

HUIZINGA: So, Kevin, to kind of wrap it up, I wonder if you could address what unanswered questions still remain, or unsolved problems in this area, and what’s next on your research agenda. 

YANG: So there’s kind of two directions we want to see here. One is, we want to test better ideas for conditioning models. And what I mean there is we want to feed in text or a desired chemical reaction or some other function directly and have it generate those things that will then go work in the lab. And that’s a really big step up from just generating sequences that work and are novel. And two is, in biology and in protein engineering, models are really good, but what really matters is, do things work in the lab? So we are actually looking to do some of our own experiments to see if the proteins we generate from EvoDiff work as desired in the lab.

[MUSIC PLAYS]

HUIZINGA: Ava Amini and Kevin Yang, thanks so much for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about the paper, you can find a link at aka.ms/abstracts or you can find a preprint of the paper on bioRxiv. See you next time on Abstracts!

The post Abstracts: September 13, 2023 appeared first on Microsoft Research.

Unlocking the Language of Genomes and Climates: Anima Anandkumar on Using Generative AI to Tackle Global Challenges

Generative AI-based models can not only learn and understand natural languages — they can learn the very language of nature itself, presenting new possibilities for scientific research.

Anima Anandkumar, Bren Professor at Caltech and senior director of AI research at NVIDIA, was recently invited to speak at the President’s Council of Advisors on Science and Technology.

At the talk, Anandkumar says that generative AI was described as “an inflection point in our lives,” with discussions swirling around how to “harness it to benefit society and humanity through scientific applications.”

On the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Anandkumar on generative AI’s potential to make splashes in the scientific community.

It can, for example, be fed DNA, RNA, viral and bacterial data to craft a model that understands the language of genomes. That model can help predict dangerous coronavirus variants to accelerate drug and vaccine research.

Generative AI can also predict extreme weather events like hurricanes or heat waves. Even with an AI boost, trying to predict natural events is challenging because of the sheer number of variables and unknowns.

“Those are the aspects we’re working on at NVIDIA and Caltech, in collaboration with many other organizations, to say, ‘How do we capture the multitude of scales present in the natural world?’” she said. “With the limited data we have, can we hope to extrapolate to finer scales? Can we hope to embed the right constraints and come up with physically valid predictions that make a big impact?”

Anandkumar adds that to ensure AI models are responsibly and safely used, existing laws must be strengthened to prevent dangerous downstream applications.

She also talks about the AI boom, which is transforming the role of humans across industries, and problems yet to be solved.

“This is the research advice I give to everyone: the most important thing is the question, not the answer,” she said.

You Might Also Like

Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games
A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.

Overjet’s Wardah Inam on Bringing AI to Dentistry
Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.

Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs
Luis Voloch talks about tackling the challenges of the immune system with a machine learning and data science mindset.

World scale inverse reinforcement learning in Google Maps

Routing in Google Maps remains one of our most helpful and frequently used features. Determining the best route from A to B requires making complex trade-offs between factors including the estimated time of arrival (ETA), tolls, directness, surface conditions (e.g., paved, unpaved roads), and user preferences, which vary across transportation mode and local geography. Often, the most natural visibility we have into travelers’ preferences is by analyzing real-world travel patterns.

Learning preferences from observed sequential decision making behavior is a classic application of inverse reinforcement learning (IRL). Given a Markov decision process (MDP) — a formalization of the road network — and a set of demonstration trajectories (the traveled routes), the goal of IRL is to recover the users’ latent reward function. Although past research has created increasingly general IRL solutions, these have not been successfully scaled to world-sized MDPs. Scaling IRL algorithms is challenging because they typically require solving an RL subroutine at every update step. At first glance, even attempting to fit a world-scale MDP into memory to compute a single gradient step appears infeasible due to the large number of road segments and limited high bandwidth memory. When applying IRL to routing, one needs to consider all reasonable routes between each demonstration’s origin and destination. This implies that any attempt to break the world-scale MDP into smaller components cannot consider components smaller than a metropolitan area.

To this end, in “Massively Scalable Inverse Reinforcement Learning in Google Maps,” we share the result of a multi-year collaboration among Google Research, Maps, and Google DeepMind to surpass this IRL scalability limitation. We revisit classic algorithms in this space, and introduce advances in graph compression and parallelization, along with a new IRL algorithm called Receding Horizon Inverse Planning (RHIP) that provides fine-grained control over performance trade-offs. The final RHIP policy achieves a 16–24% relative improvement in global route match rate, i.e., the percentage of de-identified traveled routes that exactly match the suggested route in Google Maps. To the best of our knowledge, this represents the largest instance of IRL in a real world setting to date.
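For readers who want the metric spelled out, the following sketch computes a route match rate and a relative improvement between two policies. The function names and the tiny route lists are made up for illustration and are not data from the paper.

```python
from typing import Sequence, Tuple

Route = Tuple[str, ...]  # a route as an ordered tuple of road-segment IDs


def route_match_rate(traveled: Sequence[Route], suggested: Sequence[Route]) -> float:
    """Fraction of traveled routes that exactly match the suggested route."""
    if not traveled or len(traveled) != len(suggested):
        raise ValueError("need one suggested route per traveled route")
    matches = sum(t == s for t, s in zip(traveled, suggested))
    return matches / len(traveled)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative lift of `new` over `baseline` (0.5 means a 50% lift)."""
    return (new - baseline) / baseline


# Toy data: four traveled routes and the routes two policies would suggest.
traveled = [("a", "b"), ("a", "c"), ("d",), ("e", "f")]
baseline_suggested = [("a", "b"), ("a", "x"), ("d",), ("e", "y")]
new_suggested = [("a", "b"), ("a", "c"), ("d",), ("e", "y")]

baseline_rate = route_match_rate(traveled, baseline_suggested)
new_rate = route_match_rate(traveled, new_suggested)
lift = relative_improvement(baseline_rate, new_rate)
```

Note that the match is exact: a suggested route that differs from the traveled one by a single segment counts as a miss.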

Google Maps improvements in route match rate relative to the existing baseline, when using the RHIP inverse reinforcement learning policy.

The benefits of IRL

A subtle but crucial detail about the routing problem is that it is goal conditioned, meaning that every destination state induces a slightly different MDP (specifically, the destination is a terminal, zero-reward state). IRL approaches are well suited for these types of problems because the learned reward function transfers across MDPs, and only the destination state is modified. This is in contrast to approaches that directly learn a policy, which typically require an extra factor of S parameters, where S is the number of MDP states.

Once the reward function is learned via IRL, we take advantage of a powerful inference-time trick. First, we evaluate the entire graph’s rewards once in an offline batch setting. This computation is performed entirely on servers without access to individual trips, and operates only over batches of road segments in the graph. Then, we save the results to an in-memory database and use a fast online graph search algorithm to find the highest reward path for routing requests between any origin and destination. This circumvents the need to perform online inference of a deeply parameterized model or policy, and vastly improves serving costs and latency.
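The two-stage pattern described above can be sketched as follows: evaluate rewards for every road segment once, store them, then answer requests with a plain graph search. This is a minimal sketch under stated assumptions. The function names, the adjacency-map "database", the toy reward table, and the use of Dijkstra's algorithm (which requires every reward to be zero or negative so that costs are non-negative) are all illustrative choices. The source does not say which search algorithm is used in production.

```python
import heapq
import itertools
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Node = Hashable
Edge = Tuple[Node, Node]
Graph = Dict[Node, List[Tuple[Node, float]]]


def precompute_rewards(edges: List[Edge], reward_fn: Callable[[Edge], float]) -> Graph:
    """Offline step: evaluate the reward model once per road segment and
    store the results (here, a dict stands in for the in-memory database)."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append((v, reward_fn((u, v))))
        graph.setdefault(v, [])
    return graph


def highest_reward_path(
    graph: Graph, origin: Node, destination: Node
) -> Optional[Tuple[List[Node], float]]:
    """Online step: Dijkstra's algorithm on cost = -reward."""
    best_cost = {origin: 0.0}
    parent: Dict[Node, Node] = {}
    tie = itertools.count()  # tie-breaker so nodes never get compared
    heap = [(0.0, next(tie), origin)]
    while heap:
        cost, _, node = heapq.heappop(heap)
        if node == destination:
            break
        if cost > best_cost.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, reward in graph.get(node, []):
            if reward > 0:
                raise ValueError("this sketch assumes non-positive rewards")
            new_cost = cost - reward
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, next(tie), nxt))
    if destination not in best_cost:
        return None
    path = [destination]
    while path[-1] != origin:
        path.append(parent[path[-1]])
    path.reverse()
    return path, -best_cost[destination]


# Toy reward table standing in for a learned reward model.
toy_rewards = {("A", "B"): -1.0, ("B", "D"): -5.0, ("A", "C"): -2.0, ("C", "D"): -1.5}
graph = precompute_rewards(list(toy_rewards), lambda e: toy_rewards[e])
path, total_reward = highest_reward_path(graph, "A", "D")
```

The learned model is only called inside `precompute_rewards`. Serving a routing request touches nothing but the stored numbers, which is why no model inference is needed online.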

Reward model deployment using batch inference and fast online planners.

Receding Horizon Inverse Planning

To scale IRL to the world MDP, we compress the graph and shard the global MDP using a sparse Mixture of Experts (MoE) based on geographic regions. We then apply classic IRL algorithms to solve the local MDPs, estimate the loss, and send gradients back to the MoE. The worldwide reward graph is computed by decompressing the final MoE reward model. To provide more control over performance characteristics, we introduce a new generalized IRL algorithm called Receding Horizon Inverse Planning (RHIP).

IRL reward model training using MoE parallelization, graph compression, and RHIP.

RHIP is inspired by people’s tendency to perform extensive local planning (“What am I doing for the next hour?”) and approximate long-term planning (“What will my life look like in 5 years?”). To take advantage of this insight, RHIP uses robust yet expensive stochastic policies in the local region surrounding the demonstration path, and switches to cheaper deterministic planners beyond some horizon. Adjusting the horizon H allows controlling computational costs, and often allows the discovery of the performance sweet spot. Interestingly, RHIP generalizes many classic IRL algorithms and provides the novel insight that they can be viewed along a stochastic vs. deterministic spectrum (specifically, for H=∞ it reduces to MaxEnt, for H=1 it reduces to BIRL, and for H=0 it reduces to MMP).

Given a demonstration from origin s_o to destination s_d, (1) RHIP follows a robust yet expensive stochastic policy in the local region surrounding the demonstration (blue region). (2) Beyond some horizon H, RHIP switches to following a cheaper deterministic planner (red lines). Adjusting the horizon enables fine-grained control over performance and computational costs.
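The planning half of this idea can be sketched in a few lines of Python: start from the values of a deterministic planner, then apply a chosen number of soft (log-sum-exp) backups, which correspond to a stochastic policy over the nearby steps. This is a loose illustration inspired by the description above and not the paper's algorithm. It omits the reward-gradient update entirely, the function names and the toy graph are hypothetical, and it assumes non-positive rewards and a graph in which every node appears as a key.

```python
import math
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(next_node, reward)]


def deterministic_values(graph: Graph, dest: str) -> Dict[str, float]:
    """Best achievable total reward from each node to `dest`, using hard-max
    backups repeated until no value changes."""
    values = {n: -math.inf for n in graph}
    values[dest] = 0.0  # the destination is a terminal, zero-reward state
    for _ in range(len(graph)):
        changed = False
        for node, out_edges in graph.items():
            if node == dest or not out_edges:
                continue
            best = max(r + values[nxt] for nxt, r in out_edges)
            if best > values[node]:
                values[node], changed = best, True
        if not changed:
            break
    return values


def logsumexp(xs: List[float]) -> float:
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def receding_horizon_values(graph: Graph, dest: str, horizon: int) -> Dict[str, float]:
    """Deterministic values refined by `horizon` soft backups.
    With horizon=0 the deterministic values are returned unchanged."""
    values = deterministic_values(graph, dest)
    for _ in range(horizon):
        new_values = dict(values)
        for node, out_edges in graph.items():
            if node == dest or not out_edges:
                continue
            new_values[node] = logsumexp([r + values[nxt] for nxt, r in out_edges])
        values = new_values
    return values


def stochastic_policy(graph: Graph, values: Dict[str, float], node: str) -> Dict[str, float]:
    """Softmax over outgoing edges, P(next) proportional to exp(reward + V(next)).
    Assumes at least one successor can reach the destination."""
    scores = [(nxt, r + values[nxt]) for nxt, r in graph[node]]
    m = max(s for _, s in scores)
    weights = {nxt: math.exp(s - m) for nxt, s in scores}
    total = sum(weights.values())
    return {nxt: w / total for nxt, w in weights.items()}


toy_graph: Graph = {
    "A": [("B", -1.0), ("C", -1.2)],
    "B": [("D", -1.0)],
    "C": [("D", -1.0)],
    "D": [],
}
values_h0 = receding_horizon_values(toy_graph, "D", horizon=0)
values_h2 = receding_horizon_values(toy_graph, "D", horizon=2)
policy_at_a = stochastic_policy(toy_graph, values_h2, "A")
```

In this sketch, each extra unit of horizon costs one more sweep of soft backups over the graph, which is the knob that trades computation against how much of the route is modeled stochastically.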

Routing wins

The RHIP policy provides a 15.9% and 24.1% lift in global route match rate for driving and two-wheelers (e.g., scooters, motorcycles, mopeds) relative to the well-tuned Maps baseline, respectively. We’re especially excited about the benefits to more sustainable transportation modes, where factors beyond journey time play a substantial role. By tuning RHIP’s horizon H, we’re able to achieve a policy that is both more accurate than all other IRL policies and 70% faster than MaxEnt.

Our 360M parameter reward model provides intuitive wins for Google Maps users in live A/B experiments. Examining road segments with a large absolute difference between the learned rewards and the baseline rewards can help improve certain Google Maps routes. For example:

Nottingham, UK. The preferred route (blue) was previously marked as private property due to the presence of a large gate, which indicated to our systems that the road may be closed at times and would not be ideal for drivers. As a result, Google Maps routed drivers through a longer, alternate detour instead (red). However, because real-world driving patterns showed that users regularly take the preferred route without an issue (as the gate is almost never closed), IRL now learns to route drivers along the preferred route by placing a large positive reward on this road segment.

Conclusion

Increasing performance via increased scale – both in terms of dataset size and model complexity – has proven to be a persistent trend in machine learning. Similar gains for inverse reinforcement learning problems have historically remained elusive, largely due to the challenges with handling practically sized MDPs. By introducing scalability advancements to classic IRL algorithms, we’re now able to train reward models on problems with hundreds of millions of states, demonstration trajectories, and model parameters. To the best of our knowledge, this is the largest instance of IRL in a real-world setting to date. See the paper to learn more about this work.

Acknowledgements

This work is a collaboration across multiple teams at Google. Contributors to the project include Matthew Abueg, Oliver Lange, Matt Deeds, Jason Trader, Denali Molitor, Markus Wulfmeier, Shawn O’Banion, Ryan Epp, Renaud Hartert, Rui Song, Thomas Sharp, Rémi Robert, Zoltan Szego, Beth Luan, Brit Larabee and Agnieszka Madurska.

We’d also like to extend our thanks to Arno Eigenwillig, Jacob Moorman, Jonathan Spencer, Remi Munos, Michael Bloesch and Arun Ahuja for valuable discussions and suggestions.
