AI Frontiers: The Physics of AI with Sébastien Bubeck

Episode 136 | March 23, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this new Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

The first episode features Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research in Redmond. He and his collaborators conducted an extensive evaluation of GPT-4 while it was in development, and have published their findings in a paper that explores its capabilities and limitations—noting that it shows “sparks” of artificial general intelligence.

Transcript

Ashley Llorens: I’m Ashley Llorens with Microsoft Research. I spent the last 20 years working in AI and machine learning. But I’ve never felt more fortunate to work in the field than at this moment. Just this month, March 2023, OpenAI announced GPT-4, a powerful new large-scale AI model with dramatic improvements in reasoning, problem-solving, and much more. This model, and the models that will come after it, represent a phase change in the decades-long pursuit of artificial intelligence.

In this podcast series, I’ll share conversations with fellow researchers about our initial impressions of GPT-4, the nature of intelligence, and ultimately how innovations like these can have the greatest benefit for humanity.


Today I’m sitting down with Sébastien Bubeck, who leads the Machine Learning Foundations Group at Microsoft Research. In recent months, some of us at Microsoft had the extraordinary privilege of early access to GPT-4. We took the opportunity to dive deep into its remarkable reasoning, problem-solving, and the many other abilities that emerge from the massive scale of GPT-4.

Sébastien and his team took this opportunity to probe the model in new ways to gain insight into the nature of its intelligence. Sébastien and his collaborators have shared some of their observations in the new paper, “Sparks of Artificial General Intelligence: Early experiments with GPT-4.”
Welcome to AI Frontiers.

Sébastien, I’m excited for this discussion.

The place that I want to start is with what I call the AI moment. So, what do I mean by that? In my experience, everyone that’s picked up and played with the latest wave of large-scale AI models, whether it’s ChatGPT or the more powerful models coming after, has a moment.

They have a moment where they’re genuinely surprised by what the models are capable of, by the experience of the model, the apparent intelligence of the model. And in my observation, the intensity of the reaction is more or less universal. Although everyone comes at it from their own perspective, it triggers its own unique range of emotions, from awe to skepticism.

So now, I’d love from your perspective, the perspective of a machine learning theorist: what was that moment like for you?

Sébastien Bubeck: That’s a great question to start. So, when we started playing with the model, we did what I think anyone would do. We started to ask mathematical questions, mathematical puzzles. We asked it to give some poetry analysis. Peter Lee did one on Black Thought, which was very intriguing. But every time we were left wondering, okay, but maybe it’s out there on the internet. Maybe it’s just doing some kind of pattern matching and it’s finding a little bit of structure. But this is not real intelligence. It cannot be. How could it be real intelligence when it’s such simple components coming together? So, for me, I think the awestruck moment was one night when I woke up and I turned on my laptop and fired up the Playground.

And I have a three-year-old at home, my daughter, who is a huge fan of unicorns. And I was just wondering, you know what? Let's ask GPT-4 if it can draw a unicorn. And in my professional life, I play a lot with LaTeX, this programming language for mathematical equations. And in LaTeX there is this sub-language called TikZ to draw images using code. And so I just asked it: can you draw a unicorn in TikZ? And it did it so beautifully. It was really amazing. You can render it and you can see the unicorn. And no, it wasn't a perfect unicorn.

What was amazing is that it drew a unicorn, which was quite abstract. It was really the concept of a unicorn, all the bits and pieces of what makes a unicorn, the horn, the tail, the fur, et cetera. And this is what really struck me at that moment. First of all, there is no unicorn in TikZ online.

I mean, who would draw a unicorn in a mathematical language? This doesn’t make any sense. So, there is no unicorn online. I was pretty sure of that. And then we did further experiments to confirm that. And we’re sure that it really drew the unicorn by itself. But really what struck me is this getting into what is a concept of a unicorn, that there is a head, a horn, the legs, et cetera.
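To give a sense of what "drawing a unicorn in TikZ" involves, here is a minimal, hand-written sketch of the kind of code in question: simple shapes composed into the parts of a figure. It is only an illustration of the format, not the code the model produced.

```latex
% A hand-written illustration of composing a figure from TikZ primitives
% (not the model's output): body, head, horn, legs, and tail as simple shapes.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  \draw (0,0) ellipse (1.2 and 0.7);                                      % body
  \draw (1.3,0.6) circle (0.35);                                          % head
  \draw (1.45,0.9) -- (1.7,1.6) -- (1.55,0.85);                           % horn
  \foreach \x in {-0.8,-0.3,0.3,0.8} \draw (\x,-0.6) -- (\x,-1.4);        % legs
  \draw (-1.2,0.1) .. controls (-1.8,0.4) and (-1.8,-0.4) .. (-1.2,-0.3); % tail
\end{tikzpicture}
\end{document}
```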

This has been a longstanding challenge for AI research. This has always been the problem with all those AI systems that came before, like the convolutional neural networks that were trained on ImageNet and other image datasets and that can recognize whether there is a cat or a dog in an image, et cetera. Those neural networks were always hard to interpret. It was not clear how exactly they were detecting whether there is a cat or a dog, and in particular they were susceptible to adversarial examples: small perturbations to the input that would completely change the output.

And it was understood that the big issue is that they didn’t really get the concept of a cat or dog. And then suddenly with GPT-4, it was kind of clear to me at that moment that it really understood something. It really understands what is a unicorn. So that was the moment for me.

Ashley Llorens: That’s fascinating. What did you feel in that moment? Does that change your concept of your field of study, your relationship to the field?

Sébastien Bubeck: It really changed a lot of things to me. So first of all, I never thought that I would live to see what I would call a real artificial intelligence. Of course, we’ve been talking about AI for many decades now. And the AI revolution in some sense has been happening for a decade already.

But I would argue that all the systems before were really this narrow intelligence, which does not really rise to the level of what I would call intelligence. Here, we’re really facing something which is much more general and really feels like intelligence. So, at that moment, I felt honestly lucky. I felt lucky that I had early access to this system, that I could be one of the first human beings to play with it.

And I saw that this is really going to change the world dramatically. And, selfishly, it is going to change my field of study, as you were saying. Now suddenly we can start to attack the question: what is intelligence, really? We can start to approach this question, which seemed completely out of reach before.

So really, deep down inside me, incredible excitement. That's really what I felt. Then upon reflection, in the next few days, there was also some worry, of course. Clearly things are accelerating dramatically. Not only did I never think that I would live to see a real artificial intelligence, but the timeline I had in mind 10 or 15 years ago, when I was a Ph.D. student, was: maybe by the end of the decade, the 2010s, we would have a system that can play Go better than humans.

That was my target. And maybe 20 years after that, we will have systems that can do language. And maybe somewhere in between, we will have systems that can play multiplayer games like Starcraft II or Dota 2. All of those things got compressed into the 2010s.

And by the end of the 2010s, we had basically solved language, in a way, with GPT-3. And now we enter the 2020s, and suddenly there is something totally unexpected, something that wasn't in the plan at all for my life and professional career: intelligence in our hands. So, it's just changing everything, and with this compressed timeline, I do worry about where this is going.

There are still fundamental limitations that I’m sure we’re going to talk about. And it’s not clear whether the acceleration is going to keep going. But if it does keep going, it’s going to challenge a lot of things for us as human beings.

Ashley Llorens: As someone that’s been in the field for a while myself, I had a very similar reaction where I felt like I was interacting with a real intelligence, like something deserving of the name artificial intelligence—AI. What does that mean to you? What does it mean to have real intelligence?

Sébastien Bubeck: It’s a tough question, because, of course, intelligence has been studied for many decades. And psychologists have developed tests of your level of intelligence. But in a way, I feel intelligence is still something very mysterious. It’s kind of—we recognize it when we see it. But it’s very hard to define.

And what I’m hoping is that with this system, what I want to argue is that basically, it was very hard before to study what is intelligence, because we had only one example of intelligence. What is this one example? I’m not necessarily talking about human beings, but more about natural intelligence. By that, I mean intelligence that happened on planet Earth through billions of years of evolution.

This is one type of intelligence. And this was the only example of intelligence that we had access to. And so all our theories were fine-tuned to that example of intelligence. Now, I feel that we have a new system which I believe rises to the level of being called an intelligent system. We suddenly have two examples, which are very different.

GPT-4’s intelligence is comparable to human in some ways, but it’s also very, very different. It can both solve Olympiad-level mathematical problems and also make elementary school mistakes when adding two numbers. So, it’s clearly not human-like intelligence. It’s a different type of intelligence. And of course, because it came about through a very different process than natural evolution, you could argue that it came about through a process which you could call artificial evolution.

And so I’m hoping that now that we have those two different examples of intelligence, maybe we can start to make progress on defining it and understanding what it is. That was a long-winded answer to your question, but I don’t know how to put it differently.

Basically, the way for me to test intelligence is to ask really creative questions, difficult questions whose answers you cannot just find online through search. In a way, you could ask: is Bing, is Google, are search engines intelligent? They can answer tough questions. Are these intelligent systems? Of course not. Everybody would say no.

So, you have to distinguish: what is it that makes us say that GPT-4 is an intelligent system? Is it just the fact that it can answer many questions? No, it's more that it can inspect its answers. It can explain itself. It can interact with you. You can have a discussion. This interaction is really the essence of intelligence to me.

Ashley Llorens: It certainly is a provocative and unsolved question: what is intelligence? And perhaps equally mysterious is how we actually measure intelligence, which is a challenge even for humans, something I'm reminded of with young kids in the school system, as I know you are or soon will be as a father.

But you’ve had to think differently as you’ve tried to measure the intelligence of GPT-4. And you alluded to…I’d say the prevailing way that we’ve gone about measuring the intelligence of AI systems or intelligent systems is through this process of benchmarking, and you and your team have taken a very different approach.

Can you maybe contrast those?

Sébastien Bubeck: Of course, yeah. So maybe let me start with an example. We used GPT-4 to take mock interviews for software engineer positions at Amazon, Google, and Meta. It passes all of those interviews very easily. Not only does it pass those interviews, but it also ranks at the very top among human beings.

In fact, for the Amazon interview, not only did it pass all the questions, but it scored better than 100% of all the human users on that website. So, this is really incredible. And headlines would be, GPT-4 can be hired as a software engineer at Amazon. But this is a little bit misleading to view it that way because those tests, they were designed for human beings.

They make a lot of hidden assumptions about the person that they are interviewing. In particular, they will not test whether that person has a memory from one day to the next. Of course, human beings remember what they did the next day, unless there is some very terrible problem.

So, all those benchmarks of intelligence face this issue: they were designed to test human beings. So, we have to find new ways to test intelligence when we're talking about the intelligence of AI systems. That's point number one. Point number two is that, so far, in the machine learning tradition, we have developed lots of benchmarks to test narrow AI systems.

This is how the machine learning community has made progress over the decades: by beating benchmarks, by having systems that keep improving, percentage point by percentage point, on those target benchmarks. Now, all of those become kind of irrelevant in the era of GPT-4, for two reasons. Number one is that with GPT-4, we don't know exactly what data it was trained on, and in particular it might have seen all of these datasets.

So really, you cannot separate the training data and the test data anymore. This is not really a meaningful way to test something like GPT-4, because it might have seen everything. For example, Google came out with a suite of benchmarks, which they called BIG-bench, and in there they hid a code so you can check whether a model has seen the data, and of course GPT-4 knows this code.

So, it has seen all of BIG-bench. You just cannot benchmark it against BIG-bench. That's problem number one for the classical ML benchmarks. Problem number two is that all those benchmarks are just too easy. It's just too easy for GPT-4. It crushes all of them, hands down. Very, very easily.

In fact, it’s the same thing for the medical license exam for a multi-state bar exam. All of those things it just passes very, very easily. And the reason why we have to go beyond this is really beyond the classical ML benchmark, we really have to test the generative abilities, the interaction abilities. How is it able to interact with human beings? How is it able to interact with tools?

How creative can it be at a task? All of those questions are very hard to benchmark; they are not your standard benchmark where there is one right solution. Now, of course, the ML community has grappled with this problem recently, because generative AI has been in the works for a few years now, but the answers are still very tentative.

Just to give you an example, imagine that you want to have a benchmark where you describe a movie and you want to write a movie review. Let’s say, for example, you want to tell the system, write a positive movie review about this movie. Okay. The problem is in your benchmark. In the data, you will have examples of those reviews. And then you ask your system to write its own review, which might be very different from what you have in your training data. So, the question is, is it better to write something different or is it worse? Do you have to match what was in the training data? Maybe GPT-4 is so good that it’s going to write something better than what the humans wrote.

And in fact, we have seen many, many times that the training data was crafted by humans and GPT-4 just does a better job at it. So, it gives better labels, if you want, than what the humans did. You cannot even compare it to humans anymore. So, this is a problem that we faced as we were writing our paper, trying to assess GPT-4's intelligence.

Ashley Llorens: Give me an example where the model is actually better than the humans.

Sébastien Bubeck: Sure. I mean, let me think of a good one. I mean, coding—it is absolutely superhuman at coding. We already alluded to this and this is going to have tremendous implications. But really coding is incredible. So, for example, going back to the example of movie reviews, there is this IMDB dataset which is very popular in machine learning where you can ask many basic questions that you want to ask.

But now, in the era of GPT-4, you can give it the IMDB dataset and just ask GPT-4: can you explore this dataset? And it's going to come up with suggestions of data analysis ideas. Maybe it would say, we could do some clustering; maybe you want to cluster by movie director, and you would see which movies were the most popular and why.
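As a rough sketch of the workflow Sébastien describes, the snippet below hands a model a summary of a tabular dataset and asks for analysis ideas. The file name, the column handling, and the query_llm helper are hypothetical placeholders, not any specific product API.

```python
# Hypothetical sketch: ask a large language model to propose analyses of a dataset.
import pandas as pd

def query_llm(prompt: str) -> str:
    # Placeholder: plug in whatever model client you actually use.
    raise NotImplementedError

# Assume a local CSV export of the IMDB data (hypothetical file and columns).
df = pd.read_csv("imdb_movies.csv")

prompt = (
    "Here are the columns of a movie dataset: " + ", ".join(df.columns) + "\n"
    "And the first few rows:\n" + df.head(3).to_csv(index=False) + "\n"
    "Can you explore this dataset? Suggest a few data analysis ideas "
    "(for example, clustering by director or popularity trends) and explain why each is interesting."
)

print(query_llm(prompt))
```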

It can come up creatively with its own analysis. So that's one aspect, coding and data analysis, where it can very easily be superhuman. I think its writing capabilities are just astounding, too. For example, in the paper, we asked it many times to rewrite parts of what we wrote, and it writes them in this much more lyrical, poetic way.

You can ask for any kind of style that you want. At least to my novice eyes, I would say it's at the level of some of the best authors out there. It has its own style, and this is really native. You don't have to do anything.

Ashley Llorens: Yeah, it does remind me a little bit of the AlphaGo moment, or maybe more specifically the AlphaZero moment, where all of a sudden you kind of leave the human training data behind and you're entering a realm where its only real competition is itself. You talked about the evolution that we need to have in how we measure intelligence, from ways of measuring narrow or specialized intelligence to measuring more general kinds of intelligence.

And we’ve had these narrow benchmarks. You see a lot of this, kind of past the bar exam, these kinds of human intelligence measures. But what happens when all of those are also too easy? How do we think about measurement and assessment in that regime?

Sébastien Bubeck: So, of course, maybe it's a good point to also bring up the limitations of the system. Right now, a very clear frontier that GPT-4 is not stepping over is producing new knowledge, discovering new things, for example, let's say in mathematics, proving mathematical theorems that humans do not know how to prove.
Right now, the systems cannot do it. And this, I think, would be a very clean and clear demonstration, where there is just no ambiguity, once it can start to produce this new knowledge. Now, of course, whether it's going to happen or not is an open question. I personally believe it's plausible. I am not 100 percent sure it's going to happen, but I believe it is plausible that it will happen.

But then there might be another question, which is what happens if the proof that it produces becomes inscrutable to human beings. Mathematics is not only this abstract thing, but it’s also a language between humans. Of course, at the end of the day, you can come back to the axioms, but that’s not the way we humans do mathematics.

So, what happens if, let’s say, GPT-5 proves the Riemann hypothesis and it is formally proved? Maybe it gives the proof in the LEAN language, which is a formalization of mathematics, and you can formally verify that the proof is correct. But no human being is able to understand the concepts that were introduced.
What does it mean? Is the Riemann hypothesis really proven? I guess it is proven, but is that really what we human beings wanted? So this kind of question might be on the horizon. And that I think ultimately might be the real test of intelligence.

Ashley Llorens: Let’s stick with this category of the limitations of the model. And you kind of drew a line here in terms of producing new knowledge. You offered one example of that as proving mathematical theorems. What are some of the other limitations that you’ve discovered?

Sébastien Bubeck: So, GPT-4 is a large language model which was trained on the next-word-prediction objective. What does that mean? It just means you give it a partial text and it tries to predict what the next word in that partial text is going to be. Once you want to generate content, you just keep doing that on the text that you're producing. So, you're producing words one by one. Now, of course, this is a question that I have been reflecting on myself since I saw GPT-4: whether human beings are thinking like this. I mean, it doesn't feel like it. It feels like we're thinking a little bit more deeply.
We’re thinking a little bit more in advance of what we want to say. But somehow, as I reflect, I’m not so sure, at least when I speak, verbally, orally, maybe I am just coming up every time with the next word. So, this is a very interesting aspect. But the key point is certainly when I’m doing mathematics, I think I am thinking a little bit more deeply.

And I’m not just trying to see what is the next step, but I’m trying to come up with a whole plan of what I want to achieve. And right now the system is not able to do this kind of long-term planning. And we can give a very simple experiment that shows this maybe. My favorite one is, let’s say you have a very simple arithmetic equality—three times seven plus 21 times 27 equals something.

So this is part of the prompt that you give to GPT-4. And now you just ask, okay, you’re allowed to modify one digit in this so that the end result is modified in a certain way. Which one do you choose? So, the way to solve this problem is that you have to think.

You have to try: okay, what would happen if I were to modify the first digit? What would happen if I were to modify the second digit? And GPT-4 is not able to do that. GPT-4 is not able to think ahead in this way. What it will say is just: I think if you modify the third digit, just randomly, it's going to work. And it just tries and it fails. And the really funny aspect is that once GPT-4 starts failing, that failure becomes part of its context, which in a way becomes part of its truth, and then it will do anything to justify it.

It will keep making mistakes to keep justifying it. So, these are two serious limitations: the fact that it cannot really plan ahead, and the fact that once it makes mistakes, they just become part of its truth. These are very, very serious limitations, in particular for mathematics. They make it a very uneven system once you approach mathematics.
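To make the try-and-check planning concrete, here is a small brute-force sketch for the kind of puzzle in the example (3 × 7 + 21 × 27): enumerate every single-digit edit of the expression and see how each one moves the result. The target condition is made up for illustration; this only shows the search a planner would perform, not anything GPT-4 does internally.

```python
# Brute-force sketch of the "which digit should I change?" puzzle.
expr = "3*7+21*27"       # the arithmetic equality from the example
base = eval(expr)        # 3*7 + 21*27 = 588

# Hypothetical goal: find single-digit edits that raise the result by at least 100.
target_shift = 100
candidates = []
for i, ch in enumerate(expr):
    if not ch.isdigit():
        continue
    for d in "0123456789":
        if d == ch:
            continue
        new_expr = expr[:i] + d + expr[i + 1:]
        new_val = eval(new_expr)
        if new_val - base >= target_shift:
            candidates.append((new_expr, new_val))

print(f"{expr} = {base}")
for new_expr, new_val in candidates:
    print(f"  change -> {new_expr} = {new_val}")
```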

Ashley Llorens: You mentioned something that's different about machine learning as it's conceptualized in this generative AI regime, which is fundamentally different from what we've typically thought of as machine learning, where you're optimizing an objective function with a fairly narrow objective, versus trying to actually learn something about the structure of the data, albeit through next-word prediction or some other way.

What do you think about that learning mechanism? Are there any limitations of that?

Sébastien Bubeck: This is a very interesting question. Maybe I just want to backtrack for a second and just acknowledge that what happened there is kind of a miracle. Nobody, I think nobody in the world, perhaps, except OpenAI, expected that intelligence would emerge from this next word prediction framework just on a lot of data.

I mean, this is really crazy, if you think about it. Now, the way I have justified it to myself recently is like this. I think it is agreed that deep learning is what powers the GPT-4 training. You have a big neural network that you're training with gradient descent, just trying to fiddle with the parameters.
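For readers who have not seen it, here is a minimal toy version of that "fiddle with the parameters via gradient descent" loop, fitting a single weight on made-up data. It has nothing to do with GPT-4's actual training recipe; it only shows the basic update rule.

```python
# Toy gradient descent: learn a single weight w so that w*x matches y = 2*x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x

w = 0.0          # the parameter we "fiddle with"
lr = 0.1         # learning rate
for step in range(50):
    pred = w * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)   # d(loss)/d(w)
    w -= lr * grad                       # the gradient descent update
print(f"learned w = {w:.3f} (target 2.0), final loss = {loss:.6f}")
```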

So, it is agreed that deep learning is this hammer: if you give it a dataset, it will be able to extract the latent structure of that dataset. For example, the first breakthrough in deep learning, a little bit more than ten years ago, was the AlexNet moment, where they trained a neural network to basically classify cats, dogs, cars, et cetera, from images.

And when you train this network, what happens is that you have these edge detectors that emerge in the first few layers of the neural network. And nothing in the objective function told you that you have to come up with edge detectors. This was an emergent property. Why? Because it makes sense: the structure of an image is to combine those edges to create geometric shapes.

Right now, I think what’s happening and we have seen this more and more with the large language models, is that there are more and more emerging properties that happen as you scale up the size of the network and the size of the data. Now what I believe is happening is that in the case of GPT-4 they gave it such a big dataset, so diverse with so many complex parameters in it, that the only way to make sense of it, the only latent structure that unifies all of this data is intelligence.

The only way to make sense of the data was for the system to become intelligent. This is kind of a crazy sentence. And I expect the next few years, maybe even the next few decades, will try to make sense of whether this sentence is correct or not. And hopefully, human beings are intelligent enough to make sense of that sentence.

I don’t know right now. I just feel like it’s a reasonable hypothesis that this is what happened there. And so in a way, you can say maybe there is no limitation to the next-word-prediction framework. So that’s one perspective. The other perspective is, actually, the next –word-prediction, token framework is very limiting, at least at generation time.

At least once you start to generate new sentences, you should go beyond a little bit if you want to have a planning aspect, if you want to be able to revisit mistakes that you made. So, there we believe that at least at generation time, you need to have a slightly different system. But maybe in terms of training, in terms of coming up with intelligence in the first place, maybe this is a fine way to do it.

Ashley Llorens: And maybe I’m kind of inspired to ask you a somewhat technical question, though. Yeah. Where I think one aspect of our previous notion of intelligence and maybe still the current notion of intelligence for some is this aspect of compression, the ability to take something complex and make it simple, maybe thinking grounded in Occam’s Razor, where we want to generate the simplest explanation of the data in the thing, some of the things you’re saying and some of the things we’re seeing in the model kind of go against that intuition.

So talk to me a little bit about that.

Sébastien Bubeck: Absolutely. So, I think this is exemplified really well in a project that we did here at Microsoft Research a year ago, which we called LEGO. Let me tell you about this very briefly, because it really gets to the point of what you're asking. Let's say you want to train an AI system that can solve middle school systems of linear equations.

So, maybe it’s X plus Y equals Z, three X minus two, Y equals one, and so on. You have three equations with two variables. And you want to train a neural network that takes in this system of equation and outputs the answer for it. The classical perspective, the Occam’s Razor perspective would be collected dataset with lots of equations like this train the system to solve those linear equation.

And there you go. This way, you have the same kind of distribution at training time and at test time. What this new paradigm of deep learning, and in particular of large language models, says instead is: even though your goal is to solve systems of linear equations for middle school students, don't train only on middle school systems of linear equations.

Now, we’re going to collect a hugely diverse list of data maybe we’re going to do next. One prediction not only on the systems of linear equation, but also on all of Wikipedia. So, this is now a very concrete experiment. You have to learn networks. Neural network A, train on the equations. Neural network B, train on the equations, plus Wikipedia. And any kind of classical thinking would tell you that neural network B is going to do worse because it has to do more things, it’s going to get more confused. It’s not the simplest way to solve the problem. But lo and behold, if you actually run the experiment for real, Network B is much, much, much better than network A. Now I need to quantify this a little bit. Network A, if it was trained with systems of linear regression with three variables, is going to be fine on systems of linear regression with three variables.

But as soon as you ask it about four variables or five variables, it's not able to do it. It didn't really get to the essence of what it means to solve linear equations. Network B, on the other hand, not only solves systems of equations with three variables, but it also does four, it also does five, and so on.
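Here is a toy sketch of the data setup in that experiment: one corpus of synthetic linear-equation problems for network A, and the same problems mixed with generic text for network B. The formatting, sizes, and placeholder text are invented for illustration; the actual setup is described in the LEGO work he mentions.

```python
# Hypothetical sketch of the two training corpora: equations only vs. equations plus diverse text.
import numpy as np

rng = np.random.default_rng(0)

def make_linear_system(n_vars: int = 3) -> str:
    """Generate one solvable n_vars x n_vars integer system and its solution as a text example."""
    while True:
        A = rng.integers(-5, 6, size=(n_vars, n_vars))
        if abs(np.linalg.det(A)) > 1e-6:      # keep only invertible systems
            break
    x = rng.integers(-5, 6, size=n_vars)
    b = A @ x
    names = "xyzuvw"[:n_vars]
    lines = [
        " + ".join(f"{A[i, j]}*{names[j]}" for j in range(n_vars)) + f" = {b[i]}"
        for i in range(n_vars)
    ]
    sol = ", ".join(f"{names[j]} = {x[j]}" for j in range(n_vars))
    return "Solve: " + "; ".join(lines) + f"\nAnswer: {sol}\n"

equations = [make_linear_system() for _ in range(10_000)]
diverse_text = ["(any diverse text, e.g. encyclopedia articles, goes here)"] * 100_000  # placeholder

corpus_a = equations                 # train network A on this
corpus_b = equations + diverse_text  # train network B on this
print(corpus_a[0])
```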

Now the question is: why? What's going on? Why is it that making the thing more complicated, going against Occam's razor, is a good idea? The extremely naive perspective, which in fact some people have suggested because it is so mysterious, would be that maybe it just read the Wikipedia page on solving systems of linear equations.

But of course, that’s not what happened. And this is another aspect of this whole story, which is anthropomorphication of the system is a big danger. But let’s not get into that right now. But the point is that’s not at all the reason why we became good at solving systems of linear equations.

It’s rather that it had this very diverse data and it forced it to come up with unifying principles, more canonical, component of intelligence. And then it’s able to compose this canonical component of intelligence to solve the task at hand.

Ashley Llorens: I want to go back to something you said much earlier around natural evolution versus this notion of artificial evolution. And I think that starts to allude to where I think you want to take this field next, at least in terms of your study and your group. And that is, focusing on the aspect of emergence and how intelligence emerges.

So, what do you see as the way forward from this point, from your work with Lego that you just described for you and for the field?

Sébastien Bubeck: Yes, absolutely. So, I would argue that maybe we need a new name for machine learning, in a way. GPT-4 and GPT-3 and all those other large language models: in some ways, it's not machine learning anymore. And by that I mean machine learning is all about how you teach a machine a very well-defined task, like recognizing cats and dogs. But here, that's not what we're doing. We're not trying to teach it a narrow task. We're trying to teach it everything. And we're not trying to mimic how a human would learn. This is another point of confusion. Some people say, oh, but it's learning language using more text than any human would ever see.

But that’s kind of missing the point. The point is we’re not trying to mimic human learning. And that’s why maybe learning is not the right word anymore. We’re really trying to mimic something which is more akin to evolution. We’re trying to mimic the experience of millions, billions of entities that interact with the world. In this case, the world is the data that humans produced.

So, it’s a very different style. And I believe the reason why all the tools that we have introduced in machine learning are kind of useless and almost irrelevant in light of GPT-4 is because it’s a new field. It’s something that needs new tools to be defined. So we hope to be at the forefront of that and we want to introduce those new tools.

And of course, we don’t know what it’s going to look like, but the avenue that we’re taking to try to study this is to try to understand emergence. So emergence again is this phenomenon that as you scale up the network and the data, suddenly there are new properties that emerge at every scale. Google had this experiment where they scaled up their large language models from 8 billion to 60 billion to 500 billion.
And at 8 billion, it’s able to understand language. And it’s able to do a little bit of arithmetic at 60 billion. Suddenly it’s able to translate between languages. Before it couldn’t translate. At 60 billion parameters, suddenly it can translate. At 500 billion suddenly it can explain jokes. Why can it suddenly explain jokes?
So, we really would like to understand this. And there is another field out there that has been grappling with emergence for a long time, trying to study systems of many interacting particles that lead to emergent behaviors.

What is this field? It’s physics. So, what we would like to propose is let’s study the physics of AI or the physics of AGI, because in a way, we are really seeing this general intelligence now. So, what would it mean to study the physics of AGI? What it would mean is, let’s try to borrow from the methodology that physicists have used for the last few centuries to make sense of reality.

And what were those tools? Well, one of them was to run very controlled experiments. If you look at a waterfall, you observe the water flowing and going in all kinds of ways, and if you go look at it in the winter, it's frozen. Good luck trying to make sense of the phases of water by just staring at the waterfall. The same goes for GPT-4 or LaMDA or the other large language models.

These are all waterfalls. What we need are much smaller-scale, controlled experiments where we know we have pure water, not tainted by the stones or the algae. We need those controlled experiments to make sense of it. And LEGO is one example. So that's one direction that we want to take. But in physics there is another direction you can take, which is to build toy mathematical models of the real world.

You try to abstract away lots of things, and you're left with a very simple mathematical equation that you can study. And then you have to go back to the real experiment and see whether the prediction from the toy mathematical model tells you something about it. So that's another avenue that we want to take. And we made some progress recently, also with interns at Microsoft Research.

So, we have a paper called Learning Threshold Units. Here we're really able to understand how the most basic element, I don't want to say of intelligence, but the most basic element of reasoning, emerges in those neural networks. And what is this most basic element of reasoning? It's a threshold unit. It's something that takes some value as input.

And if the value is too small, it just turns it into zero. Even this emergence is a very, very complicated phenomenon. We were able to understand the non-convex dynamics at play and connect them to what is called the edge of stability, which is all very exciting. But the key point is that we have a toy mathematical model, and there, in essence, what we were able to say is that emergence is related to instability in training. This is very surprising, because usually in classical machine learning, instability is something that you do not want; you want to erase all the instabilities.
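In symbols, the kind of unit he is describing, which passes its input through unless the input is too small and otherwise outputs zero, can be written in the generic textbook form below (not necessarily the exact parameterization used in their paper):

```latex
f_{\theta}(x) =
\begin{cases}
  x, & x \ge \theta,\\
  0, & x < \theta,
\end{cases}
\qquad \text{with the ReLU } \max(0, x) \text{ as the special case } \theta = 0.
```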

And somehow, through this physics of AI approach, where we have a toy mathematical model, we are able to say that the instability in training, the instability that everybody has been seeing for decades, actually matters for learning and for emergence. So, this is the first step that we took.

Ashley Llorens: I want to come back to this aspect of interaction and want to ask you if you see fundamental limitations with this whole methodology around certain kinds of interactions. So right now we’ve been talking mostly about these models interacting with information in information environments, with information that people produce, and then producing new information behind that.

The source of that information is actual humans. So, I want to know if you see any limitations or if this is an aspect of your study, how we make these models better at interacting with humans, understanding the person behind the information produced. And after you do that, I’m going to come back and we’ll ask the same question of the natural world in which we as humans reside.

Sébastien Bubeck: Absolutely. So, this is one of the emergent properties of GPT-4 to put it very simply, that not only can it interact with information, but it can actually interact with humans, too. You can communicate with it. You can discuss, and you’re going to have very interesting discussions. In fact, some of my most interesting discussions in the last few months were with GPT-4.

So this is surprising, not at all something we would have expected, but there it is. Not only that, but it also has a theory of mind. GPT-4 is able to reason about what somebody is thinking, what somebody is thinking about what somebody else is thinking, and so on. It really has a very sophisticated theory of mind. There was recently a paper saying that ChatGPT is roughly at the level of a seven-year-old in terms of its theory of mind. GPT-4, I cannot really distinguish from an adult. Just to give you an anecdote, and I don't know if I should say this, but one day in the last few months, I had an argument with my wife, and she was telling me something.

And I just didn’t understand what she wanted from me. And I just talked with GPT-4 I explained the situation. And that’s what’s going on. What should I be doing? And the answer was so detailed, so thoughtful. I mean, I’m really not making this up. This is absolutely real. I learned something from GPT-4 about human interaction with my wife.

This is as real as it gets. And so, I can’t see any limitation right now in terms of interaction. And not only can it interact with humans, but it can also interact with tools. And so, this is the premise in a way of the new Bing that was recently introduced, which is that this new model, you can tell it “hey, you know what, you have access to a search engine.”

“You can use Bing if there is some information that you’re missing and you need to find it out, please make a Bing search.” And somehow natively, this is again, an emergent property. It’s able to use a search engine and make searches when it needs to, which is really, really incredible. And not only can it use those tools which are well-known, but you can also make up tools.

You can say, hey, I invented some API. Here is what the API does. Now please solve problem XYZ for me using that API. And it's able to do it natively. It's able to understand your natural-language description of what the API you built is doing, and it's able to leverage its power and use it.

This is really incredible and opens so many directions.
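A minimal sketch of the "describe a made-up API and ask the model to use it" pattern he outlines is shown below. Everything in it, the invented tool, its single function, and the query_llm stub, is hypothetical.

```python
# Hypothetical sketch: describe an invented API in plain language, then ask the model to use it.
def query_llm(prompt: str) -> str:
    # Placeholder: plug in whatever model client you actually use.
    raise NotImplementedError

api_description = """
You have access to a tool I invented called WeatherBook.
It exposes one function:
    get_weather(city: str, date: str) -> str   # returns a short weather summary

When you need data, output a line of the form: CALL get_weather("<city>", "<YYYY-MM-DD>")
"""

task = "Decide whether I should bike to work in Seattle tomorrow, using the tool when you need data."

print(query_llm(api_description + "\n" + task))
```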

Ashley Llorens: We certainly see some super impressive capabilities like the new integration with Bing, for example. We also see some of those limitations come into play. Tell me about your exploration of those in this context.

Sébastien Bubeck: So, one keyword that hasn't come up yet, and which is going to drive the conversation forward, at least online and on Twitter, is hallucinations. Those models, GPT-4 included, still hallucinate a lot. And in a way, for good reason: it's a spectrum, where at one end you have bad hallucinations, completely making up facts which are contrary to the real facts in the real world.

But on the other end, you have creativity. When you create, when you generate new things, you are, in a way, hallucinating. It’s good hallucinations, but still these are hallucinations. So having a system which can both be creative, but does not hallucinate at all—it’s a very delicate balance. And GPT-4 did not solve that problem yet. It made a lot of progress, but it didn’t solve it yet.

So that’s still a big limitation, which the world is going to have to grapple with. And I think in the new Bing it’s very clearly explained that it is still making mistakes from time to time and that you need to double check the result. I still think the rough contours of what GPT-4 says and the new Bing says is really correct.
And it’s a very good first draft most of the time, and you can get started with that. But then, you need to do your research and it cannot be used for critical missions yet. Now what’s interesting is GPT-4 is also intelligent enough to look over itself. So, once it produced a transcript, you can ask another instance of GPT-4 to look over what the first instance did and to check whether there is any hallucination.

This works particularly well for what I would call in-context hallucination. So, what would be in-context hallucination? Let’s say you have a text that you’re asking it to summarize and maybe in the summary, it invents something that was not out there. Then the other instance of GPT-4 will immediately spot it. So that’s basically in-context hallucination.

We believe those can be fully solved soon. The open-world type of hallucination is when you can ask anything. For example, in our paper we ask: where is the McDonald's at Sea-Tac, the airport in Seattle? And it responds: gate C2. But the answer is not C2; it's B3. This type of open-world hallucination is much more difficult to resolve.

And we don’t know yet how to do that. Exactly.

Ashley Llorens: Do you see a difference between a hallucination and a factual error?

Sébastien Bubeck: I have to think about this one. I would say that no, I do not really see a difference between the hallucination and the factual error. In fact, I would go as far as saying that when it’s making arithmetic mistakes, which again, it still does, when it adds two numbers, you can also view it as some kind of hallucination.

And by that I mean it’s kind of a hallucination by omission. And let me explain what I mean. So, when it does an arithmetic calculation, you can actually ask it to print each step and that improves the accuracy. It does a little bit better if it has to go through all the steps and this makes sense from the next word prediction framework.

Now, what happens is that very often it will skip a step; it will kind of forget something. This can be viewed as a kind of hallucination: it just hallucinated that this step is not necessary and that it can move on to the next stage immediately. And so this kind of factual error, or in this case a reasoning error if you want, is related to the same concept of hallucination. There could be many ways to resolve those hallucinations.

Maybe we want to look inside the model a little bit more. Maybe we want to change the training pipeline a little bit. Maybe reinforcement learning with human feedback can help. All of these are small patches, and I want to make it clear to the audience that it's still an open academic problem whether any of those directions can eventually fix it, or whether it is a fatal flaw of large language models that will never be fixed.
We do not know the answer to that question.

Ashley Llorens: I want to come back to this notion of interaction with the natural world.

As human beings, we learn about the natural world through interaction with it. We start to develop intuitions about things like gravity, for example. And there is an argument or debate right now in the community as to how much of that knowledge of how to interact with the natural world is encoded in and learnable from language and the kinds of information inputs that we put into the model, versus how much actually needs to be implicitly or explicitly encoded in an architecture, or just learned through interaction with the world.

What do you see here? Do you see a fundamental limitation with this kind of architecture for that purpose?

Sébastien Bubeck: I do think that there is a fundamental limitation in terms of the current structure of the pipeline, and I do believe it's going to be a big limitation once you ask the system to discover new facts. So, what I think the next stage of evolution for these systems would be is to hook them up with a simulator of sorts.

At training time, as the system goes through all of the web, all of the data produced by humanity, it might realize, oh, maybe I need more data of a certain type. Then we want to give it access to a simulator so that it can produce its own data, so that it can run experiments, which is really what babies are doing. Infants run experiments when they play with a ball, when they look at their hand in front of their face.

This is an experiment. So, we do need to give the system a way to do experiments. Now, the problem with this is that you get into a bit of a dystopian discussion: do we really want to give these systems, which are superintelligent in some ways, access to simulators? Aren't we afraid that they will become superhuman in every way if some of the experiments they can run involve running code or accessing the internet? There are lots of questions about what could happen, and it's not hard to imagine what could go wrong there.

Ashley Llorens: It’s a good segue into maybe a last question or topic to explore, which comes back to this phrase AGI—artificial general intelligence. In some ways, there’s kind of a lowercase version of that where we talk towards more generalizable kinds of intelligence. That’s the regime that we’ve been exploring. Then there’s a kind of a capital letter version of that, which is almost like a like a sacred cow or a kind of dogmatic pursuit within the AI community. So, what does that capital letter phrase AGI, mean to you? And maybe part B of that is: is our classic notion of AGI the right goal for us to be aiming for?

Sébastien Bubeck: Before interacting with GPT-4, to me, AGI was this unachievable dream, something where it's not even clear whether it's doable. What does it even mean? And really, by interacting with GPT-4, I suddenly had the realization that actually general intelligence is something very concrete.

It’s able to understand any kind of topics that you bring up. It is going to be able to reason about any of the things that you want to discuss. It can bring up information, it can use tools, it can interact with humans, it can interact with an environment. This is general intelligence. Now, you’re totally right in encoding it “lowercase” AGI.

Why is it not uppercase AGI? Because it’s still lacking some of the fundamental aspects, two of them, which are really, really important. One is memory. So, every new session with GPT-4 is a completely fresh tabula rasa session. It’s not remembering what you did yesterday with it. And it’s something which is emotionally hard to take because you kind of develop a relationship with the system.

As crazy as it sounds, that’s really what happens. And so you’re kind of disappointed that it doesn’t remember all the good times that you guys had together. So this is one aspect. The other one is the learning. Right now, you cannot teach it new concepts very easily. You can turn the big crank of retraining the model.

Sure, you can do that, but I’ll give you the example of using a new API. Tomorrow, you have to explain it again. So, of course, learning and memory, those two things are very, very related, as I just explained. So, this is one huge limitation to me.

If it had that, I think it would qualify as uppercase AGI. Now, not everybody would agree even with that, because many people would say, no, it needs to be embodied, to have real world experience. This becomes a philosophical question. Is it possible to have something that you would call a generally intelligent being that only lives in the digital world?

I don’t see any problem with that, honestly. I cannot see any issue with this. Now, there is another aspect. Once you get into this philosophical territory, which is right now the systems have no intrinsic motivation. All they want to do is to generate the next token. So, is that also an obstruction to having something which is a general intelligence?

Again, to me this becomes more philosophical than really technical, but maybe there is some technical aspect there. If you start to hook up the systems to simulators, to run their own experiments, then certainly maybe they have some intrinsic motivation to just improve themselves. So maybe that's one technical way to resolve the question. I don't know.

Ashley Llorens: That’s interesting. And I think there’s a word around that phrase in the community. Agent. Or seeing “agentic” or goal-oriented behaviors. And that is really where you start to get into the need for serious sandboxing or alignment or other kinds of guardrails for the system that actually starts to exhibit goal-oriented behavior.

Sébastien Bubeck: Absolutely. Maybe one other point that I want to bring up about AGI, which I think is confusing a lot of people. Somehow when people hear general intelligence, they want something which is truly general that could grapple with any kind of environment. And not only that, but maybe that grapples with any kind of environment and does so in a sort of optimal way.

This universality and optimality, I think, are completely irrelevant to intelligence. Intelligence has nothing to do with universality or optimality. We as human beings are notoriously not universal. I mean, you change a little bit the condition of your environment, and you’re going to be very confused for a week. It’s going to take you months to adapt.

So, we are very, very far from universal and I think I don’t need to tell anybody that we’re very far from being optimal. The number of crazy decisions that we make every second is astounding. So, we’re not optimal in any way. So, I think it is not realistic to try to have an AGI that would be universal and optimal. And it’s not even desirable in any way, in my opinion. So that’s maybe not achievable and not even realistic, in my opinion.

Ashley Llorens: Is there an aspect of complementarity that we should be striving for in, say, a refreshed version of AGI or this kind of long-term goal for AI?

Sébastien Bubeck: Yeah, absolutely. But I don't want to sit here on this podcast today and try to say what my view on that question is, because I think the community should really come together and discuss this in the coming weeks, months, and years, and figure out where we want to go, where society wants to go, and so on.

I think it’s a terribly important question. And we should not dissociate our futuristic goal with the technical innovation that we’re trying to do day-to-day. We have to take both into account. But I imagine that this discussion will happen and we will know a lot more a year from now, hopefully.

Ashley Llorens: Thanks, Sébastien. Just a really fun and fascinating discussion. Appreciate your time today.

Sébastien Bubeck: Yeah, thanks, Ashley. It was super fun.

Research Focus: Week of March 6, 2023

Microsoft Research Focus 11 edition, week of March 6, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Hide and seek with Spectres

Attack methods like Spectre exploit speculative execution, one of the key performance optimizations of modern CPUs. Microsoft researchers are working on a novel testing tool that can automatically detect speculative leaks in commercial (black-box) CPUs. However, until now the testing process has been slow, which has hindered in-depth testing campaigns and the discovery of new classes of leakage.

In a new paper: Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing, researchers from Microsoft and academic collaborators identify the root causes of the performance limitations in existing approaches—and propose techniques to overcome them. These techniques improve the testing speed over the state of the art by up to two orders of magnitude.

These improvements enabled the researchers to run a testing campaign of unprecedented depth on Intel and AMD CPUs. In the process, they discovered two types of previously unknown speculative leaks (affecting string comparison and division) that have escaped previous manual and automatic analyses.

The paper describing these novel techniques will appear at the 2023 IEEE Symposium on Security and Privacy.

PODCAST

Microsoft Research’s Philipp Witte on improving carbon sequestration with AI

Reducing carbon dioxide in the atmosphere could play an important role in minimizing climate change. Carbon sequestration – the process of locking carbon dioxide in deep underground reservoirs – is a developing technology that could make a meaningful contribution if it were deployed at scale. Deep learning AI technologies can improve the models required to develop these reservoirs, which could help scale up sequestration projects to a meaningful level.

Philipp Witte, a researcher with Microsoft Research for Industry, recently chatted with Fixing the Future from IEEE Spectrum about how AI can help improve carbon sequestration. For example, AI can facilitate computationally intensive simulations and manage complex modeling requirements more efficiently than conventional methods.

Tune in to the podcast for a lively discussion at the intersection of AI and decarbonization.


OPPORTUNITY

Microsoft Research Data Science Summer School – Apply now

Microsoft Research New York City’s Data Science Summer School (DS3) is an intensive hands-on introduction to data science for local undergraduate students interested in attending graduate school in computer science and related fields. The curriculum includes coursework in data science and group research projects. This year’s program runs from May 30 to June 23, and applications will be accepted through April 11.

The course is taught by leading scientists at Microsoft Research New York City, and each participant will receive a laptop and a $3,000 stipend. Sessions will cover tools and techniques for acquiring, cleaning, and utilizing real-world data for research purposes. The course also serves as an introduction to problems in applied statistics and machine learning, and will cover the theory behind simple but effective methods for supervised and unsupervised learning.

Applicants must be currently enrolled in an undergraduate program in the New York City area. We strongly encourage people from diverse, non-traditional, and under-represented backgrounds in STEM to apply.

Check out the program details and apply today.

Responsible AI: The research collaboration behind new open-source tools offered by Microsoft

Figure: The targeted model improvement cycle. Failures are identified with the Responsible AI Dashboard, diagnosed with the dashboard and the Responsible AI Mitigations Library, mitigated with the mitigations library, and then tracked, compared, and validated with the Responsible AI Tracker, before the cycle returns to identification as models and data continue to evolve.

As computing and AI advancements spanning decades are enabling incredible opportunities for people and society, they’re also raising questions about responsible development and deployment. For example, the machine learning models powering AI systems may not perform the same for everyone or every condition, potentially leading to harms related to safety, reliability, and fairness. Single metrics often used to represent model capability, such as overall accuracy, do little to demonstrate under which circumstances or for whom failure is more likely; meanwhile, common approaches to addressing failures, like adding more data and compute or increasing model size, don’t get to the root of the problem. Plus, these blanket trial-and-error approaches can be resource intensive and financially costly.

Through its Responsible AI Toolbox, a collection of tools and functionalities designed to help practitioners maximize the benefits of AI systems while mitigating harms, and other efforts for responsible AI, Microsoft offers an alternative: a principled approach to AI development centered around targeted model improvement. Improving models through targeting methods aims to identify solutions tailored to the causes of specific failures. This is a critical part of a model improvement life cycle that not only includes the identification, diagnosis, and mitigation of failures but also the tracking, comparison, and validation of mitigation options. The approach supports practitioners in better addressing failures without introducing new ones or eroding other aspects of model performance.

“With targeted model improvement, we’re trying to encourage a more systematic process for improving machine learning in research and practice,” says Besmira Nushi, a Microsoft Principal Researcher involved with the development of tools for supporting responsible AI. She is a member of the research team behind the toolbox’s newest additions: the Responsible AI Mitigations Library, which enables practitioners to more easily experiment with different techniques for addressing failures, and the Responsible AI Tracker, which uses visualizations to show the effectiveness of the different techniques for more informed decision-making.

Targeted model improvement: From identification to validation

The tools in the Responsible AI Toolbox, available in open source and through the Azure Machine Learning platform offered by Microsoft, have been designed with each stage of the model improvement life cycle in mind, informing targeted model improvement through error analysis, fairness assessment, data exploration, and interpretability.

For example, the new mitigations library bolsters the mitigation step by offering a means of managing failures that occur in data preprocessing, such as those caused by a lack of data or lower-quality data for a particular subset. For tracking, comparison, and validation, the new tracker brings model, code, visualizations, and other development components together for easy-to-follow documentation of mitigation efforts. The tracker’s main feature is disaggregated model evaluation and comparison, which breaks down model performance by data subset to present a clearer picture of a mitigation’s effects on the intended subset, as well as on other subsets, helping to uncover hidden performance declines before models are deployed and used by individuals and organizations. The tracker also lets practitioners compare performance across data subsets and across iterations of a model to determine the most appropriate model for deployment.
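To make the idea of targeted model improvement concrete, here is a minimal, self-contained sketch in plain pandas and scikit-learn. It is not the Responsible AI Mitigations Library or Tracker API; it simply mimics the pattern those tools support: apply a mitigation only to the failing data cohort, then compare accuracy per cohort rather than relying on a single overall number. The data, cohort names, and mitigation (simple oversampling) are all illustrative.

```python
# Illustrative sketch only; the Responsible AI Mitigations Library and Tracker
# expose their own APIs. Here the pattern is mimicked with pandas/scikit-learn.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "feature": rng.normal(size=n),
    "cohort": rng.choice(["A", "B"], size=n, p=[0.9, 0.1]),  # cohort B is under-represented
})
df["label"] = (df["feature"] + (df["cohort"] == "B") * 0.8 > 0).astype(int)
train, test = train_test_split(df, test_size=0.3, random_state=0)

def per_cohort_accuracy(train_df):
    """Fit a model and report accuracy disaggregated by cohort."""
    model = RandomForestClassifier(random_state=0).fit(
        pd.get_dummies(train_df[["feature", "cohort"]]), train_df["label"])
    preds = model.predict(pd.get_dummies(test[["feature", "cohort"]]))
    return test.assign(pred=preds).groupby("cohort").apply(
        lambda g: accuracy_score(g["label"], g["pred"]))

baseline = per_cohort_accuracy(train)

# Targeted mitigation: oversample only the failing cohort instead of adding data everywhere.
minority = train[train["cohort"] == "B"]
mitigated_train = pd.concat([train, minority.sample(3 * len(minority), replace=True, random_state=0)])
mitigated = per_cohort_accuracy(mitigated_train)

# Tracker-style comparison: did cohort B improve, and did cohort A hold steady?
print(pd.DataFrame({"baseline": baseline, "mitigated": mitigated}))
```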


“Data scientists could build many of the functionalities that we offer with these tools; they could build their own infrastructure,” says Nushi. “But to do that for every project requires a lot of effort and time. The benefit of these tools is scale. Here, they can accelerate their work with tools that apply to multiple scenarios, freeing them up to focus on the work of building more reliable, trustworthy models.”

Besmira Nushi, Microsoft Principal Researcher

Building tools for responsible AI that are intuitive, effective, and valuable can help practitioners consider potential harms and their mitigation from the beginning when developing a new model. The result can be more confidence that the work they’re doing is supporting AI that is safer, fairer, and more reliable because it was designed that way, says Nushi. The benefits of using these tools can be far-reaching—from contributing to AI systems that more fairly assess candidates for loans by having comparable accuracy across demographic groups to traffic sign detectors in self-driving cars that can perform better across conditions like sun, snow, and rain.

Converting research into tools for responsible AI

Creating tools that can have the impact researchers like Nushi envision often begins with a research question and involves converting the resulting work into something people and teams can readily and confidently incorporate in their workflows.

“Making that jump from a research paper’s code on GitHub to something that is usable involves a lot more process in terms of understanding what is the interaction that the data scientist would need, what would make them more productive,” says Nushi. “In research, we come up with many ideas. Some of them are too fancy, so fancy that they cannot be used in the real world because they cannot be operationalized.”

Multidisciplinary research teams consisting of user experience researchers, designers, and machine learning and front-end engineers have helped ground the process, as have the contributions of those who specialize in all things responsible AI. Microsoft Research works closely with the incubation team of Aether, the advisory body for Microsoft leadership on AI ethics and effects, to create tools based on the research. Equally important has been partnership with product teams whose mission is to operationalize AI responsibly, says Nushi. For Microsoft Research, that is often Azure Machine Learning, the Microsoft platform for end-to-end ML model development. Through this relationship, Azure Machine Learning can offer what Microsoft Principal PM Manager Mehrnoosh Sameki refers to as customer “signals,” essentially a reliable stream of practitioner wants and needs directly from practitioners on the ground. In turn, Azure Machine Learning is eager to leverage what Microsoft Research and Aether have to offer: cutting-edge science. The relationship has been fruitful.

When the current Azure Machine Learning platform made its debut five years ago, it was clear that tooling for responsible AI was going to be necessary. In addition to aligning with the Microsoft vision for AI development, customers were seeking out such resources. They approached the Azure Machine Learning team with requests for explainability and interpretability features, robust model validation methods, and fairness assessment tools, recounts Sameki, who leads the Azure Machine Learning team in charge of tooling for responsible AI. Microsoft Research, Aether, and Azure Machine Learning teamed up to integrate tools for responsible AI into the platform, including InterpretML for understanding model behavior, Error Analysis for identifying data subsets for which failures are more likely, and Fairlearn for assessing and mitigating fairness-related issues. InterpretML and Fairlearn are independent community-driven projects that power several Responsible AI Toolbox functionalities.
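Fairlearn, one of the community-driven projects mentioned above, makes this kind of disaggregated assessment straightforward. The sketch below uses synthetic data and a simple classifier; only the MetricFrame usage reflects the Fairlearn API, and everything else is an illustrative stand-in.

```python
# Minimal Fairlearn sketch on synthetic data: assess accuracy disaggregated by a
# sensitive feature, the kind of breakdown the Responsible AI dashboard surfaces.
import numpy as np
from fairlearn.metrics import MetricFrame
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
group = rng.choice(["group_1", "group_2"], size=1000)
y = (X[:, 0] + (group == "group_2") * 0.5 > 0).astype(int)

model = LogisticRegression().fit(X, y)
mf = MetricFrame(
    metrics=accuracy_score,
    y_true=y,
    y_pred=model.predict(X),
    sensitive_features=group,
)
print(mf.overall)       # a single aggregate number can hide a gap ...
print(mf.by_group)      # ... while the per-group breakdown reveals it
print(mf.difference())  # largest accuracy gap between groups
```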

Before long, Azure Machine Learning approached Microsoft Research with another signal: customers wanted to use the tools together, in one interface. The research team responded with an approach that enabled interoperability, allowing the tools to exchange data and insights and facilitating a seamless ML debugging experience. Over the course of two to three months, the teams met weekly to conceptualize and design “a single pane of glass” from which practitioners could use the tools collectively. As Azure Machine Learning developed the project, Microsoft Research stayed involved, from providing design expertise to contributing to how the story and capabilities of what had become the Responsible AI dashboard would be communicated to customers.

After the release, the teams dived into the next open challenge: enabling practitioners to better mitigate failures. Enter the Responsible AI Mitigations Library and the Responsible AI Tracker, which were developed by Microsoft Research in collaboration with Aether. Microsoft Research was well equipped with the resources and expertise to figure out the most effective visualizations for disaggregated model comparison (there was very little previous work available on it) and to navigate the proper abstractions for the complexities of applying different mitigations to different subsets of data with a flexible, easy-to-use interface. Throughout the process, the Azure team provided insight into how the new tools fit into the existing infrastructure.

With the Azure team bringing practitioner needs and the platform to the table and research bringing the latest in model evaluation, responsible testing, and the like, it is the perfect fit, says Sameki.

An open-source approach to tooling for responsible AI

While making these tools available through Azure Machine Learning supports customers in bringing their products and services to market responsibly, making these tools open source is important to cultivating an even larger landscape of responsibly developed AI. When release-ready, these tools for responsible AI are made open source and then integrated into the Azure Machine Learning platform. The reasons for going with an open-source-first approach are numerous, say Nushi and Sameki:

  • freely available tools for responsible AI are an educational resource for learning and teaching the practice of responsible AI;
  • more contributors, both internal to Microsoft and external, add quality, longevity, and excitement to the work and topic; and
  • the ability to integrate them into any platform or infrastructure encourages more widespread use.

The decision also represents one of the Microsoft AI principles in action—transparency.


“In the space of responsible AI, being as open as possible is the way to go, and there are multiple reasons for that,” says Sameki. “The main reason is for building trust with the users and with the consumers of these tools. In my opinion, no one would trust a machine learning evaluation technique or an unfairness mitigation algorithm that is unclear and closed source. Also, this field is very new. Innovating in the open nurtures better collaborations in the field.”

Mehrnoosh Sameki, Microsoft Principal PM Manager

Looking ahead

AI capabilities are only advancing. The larger research community, practitioners, the tech industry, government, and other institutions are working in different ways to steer these advancements in a direction in which AI is contributing value and its potential harms are minimized. Practices for responsible AI will need to continue to evolve with AI advancements to support these efforts.

For Microsoft researchers like Nushi and product managers like Sameki, that means fostering cross-company, multidisciplinary collaborations in their continued development of tools that encourage targeted model improvement guided by the step-by-step process of identification, diagnosis, mitigation, and comparison and validation—wherever those advances lead.

“As we get better in this, I hope we move toward a more systematic process to understand what data is actually useful, even for the large models; what is harmful that really shouldn’t be included in those; and what is the data that has a lot of ethical issues if you include it,” says Nushi. “Building AI responsibly is crosscutting, requiring perspectives and contributions from internal teams and external practitioners. Our growing collection of tools shows that effective collaboration has the potential to impact—for the better—how we create the new generation of AI systems.”

The post Responsible AI: The research collaboration behind new open-source tools offered by Microsoft appeared first on Microsoft Research.

Read More

Research Focus: Week of February 20, 2023


Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Self-supervised Multi-task pretrAining with contRol Transformers (SMART)

Many real-world applications require sequential decision making, where an agent interacts with a stochastic environment to perform a task. For example, a navigating robot is expected to control itself and move to a target using sensory information it receives along the way. Learning the proper control policy can be complicated by environmental uncertainty and high-dimensional perceptual information, such as raw-pixel spaces. More importantly, the learned strategy is specific to the task (e.g., which target to reach) and the agent (e.g., a two-legged robot or a four-legged robot). That means a good strategy for one task does not necessarily apply to a new task or a different agent.

Pre-training a foundation model can help improve overall efficiency when facing a large variety of control tasks and agents. However, although foundation models have achieved incredible success in language domains, different control tasks and agents can have large discrepancies, making it challenging to find a universal foundation. It becomes even more challenging in real-world scenarios that lack supervision or high-quality behavior data.

In a new paper: SMART: Self-supervised Multi-task pretrAining with contRol Transformers, Microsoft researchers tackle these challenges and propose a generic pre-training framework for control problems. Their research demonstrates that a single pre-trained SMART model can be fine-tuned for various visual-control tasks and agents, either seen or unseen, with significantly improved performance and learning efficiency. SMART is also resilient to low-quality datasets and works well even when random behaviors comprise the pre-training data.
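The pretrain-then-fine-tune pattern behind SMART can be illustrated with a toy PyTorch sketch. The architecture, objectives, and random data below are stand-ins chosen for brevity, not the model or API from the paper: a shared control transformer is pretrained with a self-supervised head on reward-free trajectories and later fine-tuned with a separate head for a downstream task or agent.

```python
# Hypothetical sketch of the pretrain-then-fine-tune idea behind SMART; the real
# model, objectives, and API differ. Pure PyTorch, self-contained, random data.
import torch
import torch.nn as nn

class ControlTransformer(nn.Module):
    def __init__(self, obs_dim=16, act_dim=4, d_model=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pretrain_head = nn.Linear(d_model, act_dim)  # self-supervised objective (e.g., next action)
        self.task_head = nn.Linear(d_model, act_dim)      # swapped in per downstream task/agent

    def forward(self, obs_act, head):
        return head(self.encoder(self.embed(obs_act)))

model = ControlTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs_act = torch.randn(8, 20, 20)       # batch of trajectories: 16-dim obs + 4-dim action
next_actions = torch.randn(8, 20, 4)

# Stage 1: self-supervised pretraining on reward-free, possibly low-quality trajectories.
loss = nn.functional.mse_loss(model(obs_act, model.pretrain_head), next_actions)
loss.backward(); opt.step()

# Stage 2: fine-tune the shared encoder plus a fresh task head on a specific task.
task_targets = torch.randn(8, 20, 4)
opt.zero_grad()
loss = nn.functional.mse_loss(model(obs_act, model.task_head), task_targets)
loss.backward(); opt.step()
print(float(loss))
```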



NEW RESEARCH

A Ranking Game for Imitation Learning

Reinforcement learning relies on environmental reward feedback to learn meaningful behaviors. Since reward specification is a hard problem, imitation learning (IL) can be used to bypass it and learn from expert data instead, often via Inverse Reinforcement Learning (IRL) techniques. In IL, near-optimal expert data is very informative but can be difficult to obtain. Even with infinite data, expert data cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging, as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences.

In a new paper: A Ranking Game for Imitation Learning accepted at TMLR 2023, researchers from UT Austin, Microsoft Research, and UMass Amherst create a unified algorithmic framework for IRL that incorporates both expert and suboptimal information for imitation learning. They propose a new framework for imitation learning called “rank-game” which treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. A novel ranking loss function is proposed, giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Experimental results in the paper show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Project video and code can be found on GitHub.

Figure 1: rank-game: The Policy agent maximizes the reward function by interacting with the environment. The Reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent (vanilla), automatically generated (auto), or offline annotated rankings obtained from a human or offline dataset (pref). Treating this game in the Stackelberg framework leads to either Policy being a leader and Reward being a follower, or vice versa.
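The reward player’s objective can be illustrated with a small PyTorch sketch of a pairwise ranking loss over predicted trajectory returns. This is a generic Bradley-Terry-style formulation shown for intuition, not the authors’ rank-game implementation, and the policy player’s reinforcement learning loop is omitted.

```python
# Illustrative PyTorch sketch: a reward model trained to satisfy pairwise rankings
# between behaviors. Not the authors' rank-game code; random data stands in for rollouts.
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def trajectory_return(traj):  # traj: (T, state_dim) -> predicted return (scalar)
    return reward_net(traj).sum()

# Ranked pairs (better, worse) can come from expert vs. policy rollouts,
# automatically generated rankings, or offline human preferences.
better = torch.randn(16, 30, 8)
worse = torch.randn(16, 30, 8)

for _ in range(100):
    r_better = torch.stack([trajectory_return(t) for t in better])
    r_worse = torch.stack([trajectory_return(t) for t in worse])
    # Pairwise ranking loss: push the higher-ranked trajectory toward higher return.
    loss = -torch.log(torch.sigmoid(r_better - r_worse)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The policy player would then maximize reward_net with any RL algorithm,
# closing the two-player loop; that part is omitted here.
print(float(loss))
```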

NEWS

Microsoft helps GoodLeaf Farms drive agricultural innovation with data

Vertical indoor farming uses extensive technology to manage production and optimize growing conditions. This includes movement of grow benches, lighting, irrigation, and air and temperature controls. Data and analytics can help vertical farms produce the highest possible yields and quality.

Canadian vertical farm pioneer GoodLeaf Farms has announced a partnership with Microsoft and data and analytics firm Adastra to optimize crop production and quality. GoodLeaf has deployed Microsoft Azure Synapse Analytics and Microsoft Power Platform to utilize the vast amounts of data it collects.

GoodLeaf is also collaborating with Microsoft Research through Project FarmVibes, using GoodLeaf’s data to support research into controlled environment agriculture.

GoodLeaf’s farm in Guelph, Ontario, and two currently under construction in Calgary and Montreal, use a connected system of cameras and sensors to manage plant seeding, growing mediums, germination, temperature, humidity, nutrients, lighting, and air flow. Data science and analytics help the company grow microgreens and baby greens in Canada year-round, no matter the weather, using a hydroponics system and specialized LED lights.


OPPORTUNITY

Reinforcement Learning Open Source Fest

Proposals are now being accepted for Reinforcement Learning (RL) Open Source Fest 2023, a global online program that introduces students to open-source RL programs and software development. Our goal is to bring together a diverse group of students from around the world to help solve open-source RL problems and advance state-of-the-art research and development. The program produces open-source code written and released to benefit all.

Accepted students will join a four-month research project from May to August 2023, working virtually alongside researchers, data scientists, and engineers on the Microsoft Research New York City Real World Reinforcement Learning team. Students will also receive a $10,000 USD stipend. At the end of the program, each student will present their project to the Microsoft Research Real World Reinforcement Learning team online.

The proposal deadline is Monday, April 3, 2023, at 11:59 PM ET. Learn more and submit your proposal today.

The post Research Focus: Week of February 20, 2023 appeared first on Microsoft Research.

Read More

Research Focus: Week of February 6, 2023


Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Behind the Tech podcast with Tobi Lütke: CEO and Founder, Shopify

In the latest episode of Behind the Tech, Microsoft CTO Kevin Scott is joined by Tobi Lütke, CEO and founder of the Canadian multinational e-commerce platform Shopify. Since his early days running an online snowboard shop from his carport, Tobi has envisioned himself as a craftsman first and a business exec second, a mindset he has used to solve a wide variety of problems. He and Kevin discuss applying computer science and engineering techniques to build and scale a company, the idea of bringing an ‘apprentice mindset’ to his work, and how Tobi’s daily practice of writing code and tinkering in his home lab inspires him to be a more creative leader.

Tune in now to enjoy the discussion.


Distribution inference risks: Identifying and mitigating sources of leakage

Distribution inference (or property inference) attacks allow an adversary to infer distributional information about the training data of a machine learning model, which can cause significant problems. For example, leaking distribution of sensitive attributes such as gender or race can create a serious privacy concern. This kind of attack has been shown to be feasible on different types of models and datasets. However, little attention has been given to identifying the potential causes of such leakages and to proposing mitigations.

A new paper, Distribution Inference Risks: Identifying and Mitigating Sources of Leakage, focuses on theoretically and empirically analyzing the sources of information leakage that allow an adversary to perpetrate distribution inference attacks. The researchers identified three sources of leakage: (1) memorizing specific information about the value of interest to the adversary; (2) wrong inductive bias of the model; and (3) finiteness of the training data. Next, based on their analysis, the researchers propose principled mitigation techniques against distribution inference attacks. Specifically, they demonstrate that causal learning techniques are more resilient to a particular type of distribution inference risk — distributional membership inference — than associative learning methods. And lastly, they present a formalization of distribution inference that allows for reasoning about more general adversaries than was previously possible.


Siva Kakarla wins Applied Networking Research Prize

Microsoft’s Siva Kakarla has been awarded an Applied Networking Research Prize for 2023 in recognition of his work on checking the correctness of nameservers. A senior researcher in the Networking Research Group of Microsoft Research, Kakarla was one of six people to receive this annual award from the Internet Research Task Force.

In their paper: SCALE: Automatically Finding RFC Compliance Bugs in DNS Nameservers, Kakarla and his colleagues introduce the first approach for finding RFC (Request for Comments) compliance errors in DNS nameserver implementations through automatic test generation. Their approach, called Small-scope Constraint-driven Automated Logical Execution, or SCALE, generates high-coverage tests of RFC-specified behaviors.

The Applied Networking Research Prize acknowledges advances in applied networking, interesting new research ideas of potential relevance to the internet standards community, and people who are likely to have an impact on internet standards and technologies.

The post Research Focus: Week of February 6, 2023 appeared first on Microsoft Research.

Read More

Research Focus: Week of January 23, 2023


Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Revolutionizing Document AI with multimodal document foundation models  

Organizations must digitize various documents, many with charts and images, to manage and streamline essential functions. Yet manually digitized documents are often of uneven quality, while web pages and electronic documents can come with multiple layouts.

Document AI technology is designed to efficiently extract, organize and analyze the information in different documents, freeing employees and companies from this repetitive and tedious work. The results are automated extraction, classification and understanding of information with rich typesetting formats from webpages, digital-born documents, or scanned documents, along with lower costs and reduced errors.

Microsoft Research Asia has been studying Document AI since 2019, working at the intersection of natural language processing and computer vision and using deep learning techniques. In their most recent work, researchers have developed new capabilities for Document AI, unveiled industry-leading models, and begun developing general-purpose and unified frameworks.


Tapping into Large Language Models with Microsoft’s Turing Academic Program

Large language models (LLMs) deliver impressive performance with difficult tasks and across various applications. As AI researchers explore LLMs, many questions persist. Answering these questions will require a range of different perspectives and proficiencies from experts from industry, research, and government.
 
To better understand opportunities and challenges with LLMs, Eric Horvitz, Microsoft’s Chief Scientific Officer, moderated a panel discussion “Towards a Healthy Research Ecosystem for Large Language Models.” Panelists included Ahmed Awadallah (Microsoft Research), Erwin Gianchandani (National Science Foundation), Percy Liang (Stanford University), and Saurabh Tiwary (Microsoft Turing).

A key theme of the panel was the need to expand access to LLMs, which requires large amounts of data and computing resources. The Microsoft Turing Academic Program (MS-TAP) supports this effort through multiple in-depth collaborations with partner universities.

You can learn more about MS-TAP and the panel discussion in this recent blog post.


Microsoft researchers named 2022 ACM Fellows

Three researchers from Microsoft were among the 57 Fellows named by the Association for Computing Machinery (ACM) for fundamental contributions to computing technologies in 2022.

This award, which recognizes the top 1% of ACM members for accomplishments in computing and information technology and/or outstanding service to ACM and the larger computing community, was presented to the following Microsoft researchers:

Ranveer Chandra
For contributions to software-defined wireless networking and applications to agriculture and rural broadband

Marc Pollefeys
For contributions to geometric computer vision and applications to AR/VR/MR, robotics and autonomous vehicles

Jaime Teevan
For contributions to human-computer interaction, information retrieval, and productivity

The ACM Fellows program was launched in 1993. Candidates are nominated by their peers and then reviewed by a selection committee. ACM is the world’s largest educational and scientific computing society, uniting educators, researchers, and professionals to inspire dialogue, share resources, and address challenges.

The post Research Focus: Week of January 23, 2023 appeared first on Microsoft Research.

Read More

Biomedical Research Platform Terra Now Available on Microsoft Azure


We stand at the threshold of a new era of precision medicine, where health and life sciences data hold the potential to dramatically propel and expand our understanding and treatment of human disease. One of the tools that we believe will help to enable precision medicine is Terra, the secure biomedical research platform co-developed by Broad Institute of MIT and Harvard, Microsoft, and Verily. Today, we are excited to share that Terra is available for preview on Microsoft Azure.

Starting today, any researcher can bring their data, access publicly available datasets, run analyses, and collaborate with others on Terra using Microsoft Azure. Learn more about accessing Terra and exploring its capabilities on the Terra blog.

By joining forces on Terra, the Broad Institute, Microsoft, and Verily are accelerating the next generation of collaborative biomedical research to positively impact health outcomes now and in the future. Terra’s cloud-based platform offers a secure, centralized location for biomedical research, connecting researchers to each other and to the datasets and tools they need to collaborate effectively, advance their work, and achieve scientific breakthroughs. Terra on Azure will also provide valuable support for enterprise organizations across industries. 

Terra on Azure is built to be enterprise-ready and natively supports single sign-on (SSO) with Azure Active Directory. Operating as a platform as a service (PaaS), Terra deploys resources into an end-user’s Azure tenant, allowing customers to apply their Enterprise Agreements to their use of Terra and giving them more control over the cloud resources running in their environment as well as the different types of tools and data they can use within their Terra workspace.

Figure 1: Terra brings together components of the Microsoft Genomics and healthcare ecosystems to offer optimized, secure, and collaborative biomedical research.

At Microsoft, with our focus on standards-driven data interoperability, we are building seamless connections between Terra and Azure Health Data Services to enable multi-modal data analysis—across clinical, genomics, imaging, and other modes—and to accelerate precision medicine research, discovery, development, and delivery. Terra on Azure can connect to other Azure services, allowing customers to draw on Azure innovations that are beneficial to biomedical analysis, such as those in Azure Confidential Computing for data privacy, Azure Synapse for data analytics, Azure Purview for data governance, and Azure ML for machine learning. 

How does the biomedical research community benefit from Terra?

Data and partnerships form the bedrock of biomedical research, but researchers often face significant challenges on the path to effective collaboration. Part of the challenge for data scientists and researchers is accessing large and diverse sample sizes. Although the volume and availability of data is increasing, silos are growing stronger as data becomes more globally distributed. Different regions and organizations have their own unique data access policies, making access to data nearly impossible and collaboration a sometimes daunting challenge.

Terra powers research collaborations within and across organizational boundaries by giving researchers and data stewards new tools and capabilities to help them overcome those challenges and achieve their goals. As a biomedical research platform, Terra provides a foundation for data stewards to manage dataset access and use policies across the research lifecycle, and it enables researchers to access, build, and analyze larger datasets much faster.

Figure 2: Terra is built to support researchers and data custodians.

Through Terra on Azure, researchers can operate in secure environments purpose-built for health and life sciences; retrieve and examine public, controlled-access, and private data; reproduce analyses; and share hypotheses and analysis results. Analyses are performed within a security perimeter that enables data-access and data-use policies and compliance standards to be met.

How does Terra on Azure advance Health Futures’ goals?

Microsoft Health Futures is focused on empowering every person on the planet to live healthier lives and create a healthier future. We are responsible for research, incubations, and moonshots that drive cross-company strategy to support that goal. We believe the future of medicine is data-driven, predictive, and precise. Yet one of the major barriers to scientific discovery is access to data—at scale, longitudinally, and in multiple modalities.

Innovation within the life sciences is a core Health Futures priority, and we partner with leading organizations to advance and build infrastructure for emerging precision health modalities, including genomics, immunomics, and beyond. The Terra collaboration is a key piece of this broader priority and sets the foundation to scale real-world impact through our customers, partners, and the life sciences ecosystem.

It is an honor to partner with the Broad Institute and Verily to help researchers around the world understand and treat our toughest human diseases. Terra is a powerful platform that will enhance biomedical research collaboration and scientific exploration for the betterment of humankind.

The post Biomedical Research Platform Terra Now Available on Microsoft Azure appeared first on Microsoft Research.

Read More

Advancing human-centered AI: Updates on responsible AI research


Editor’s note: All papers referenced here represent collaborations throughout Microsoft and across academia and industry that include authors who contribute to Aether, the Microsoft internal advisory body for AI Ethics and Effects in Engineering and Research.


  • Video: A human-centered approach to AI. Learn how considering potential benefits and harms to people and society helps create better AI in the keynote “Challenges and opportunities in responsible AI” (2022 ACM SIGIR Conference on Human Information Interaction and Retrieval).

Artificial intelligence, like all tools we build, is an expression of human creativity. As with all creative expression, AI manifests the perspectives and values of its creators. A stance that encourages reflexivity among AI practitioners is a step toward ensuring that AI systems are human-centered, developed and deployed with the interests and well-being of individuals and society front and center. This is the focus of research scientists and engineers affiliated with Aether, the advisory body for Microsoft leadership on AI ethics and effects. Central to Aether’s work is the question of who we’re creating AI for—and whether we’re creating AI to solve real problems with responsible solutions. With AI capabilities accelerating, our researchers work to understand the sociotechnical implications and find ways to help on-the-ground practitioners envision and realize these capabilities in line with Microsoft AI principles.

The following is a glimpse into the past year’s research for advancing responsible AI with authors from Aether. Throughout this work are repeated calls for reflexivity in AI practitioners’ processes—that is, self-reflection to help us achieve clarity about who we’re developing AI systems for, who benefits, and who may potentially be harmed—and for tools that help practitioners with the hard work of uncovering assumptions that may hinder the potential of human-centered AI. The research discussed here also explores critical components of responsible AI, such as being transparent about technology limitations, honoring the values of the people using the technology, enabling human agency for optimal human-AI teamwork, improving effective interaction with AI, and developing appropriate evaluation and risk-mitigation techniques for multimodal machine learning (ML) models.

Considering who AI systems are for

The need to cultivate broader perspectives and, for society’s benefit, reflect on why and for whom we’re creating AI is not only the responsibility of AI development teams but also of the AI research community. In the paper “REAL ML: Recognizing, Exploring, and Articulating Limitations of Machine Learning Research,” the authors point out that machine learning publishing often exhibits a bias toward emphasizing exciting progress, which tends to propagate misleading expectations about AI. They urge reflexivity on the limitations of ML research to promote transparency about findings’ generalizability and potential impact on society—ultimately, an exercise in reflecting on who we’re creating AI for. The paper offers a set of guided activities designed to help articulate research limitations, encouraging the machine learning research community toward a standard practice of transparency about the scope and impact of their work.

Walk through REAL ML’s instructional guide and worksheet, which help researchers define the limitations of their research and identify the societal implications these limitations may have in the practical use of their work.

Despite many organizations formulating principles to guide the responsible development and deployment of AI, a recent survey highlights that there’s a gap between the values prioritized by AI practitioners and those of the general public. The survey, which included a representative sample of the US population, found AI practitioners often gave less weight than the general public to values associated with responsible AI. This raises the question of whose values should inform AI systems and shifts attention toward considering the values of the people we’re designing for, aiming for AI systems that are better aligned with people’s needs.


Creating AI that empowers human agency

Supporting human agency and emphasizing transparency in AI systems are proven approaches to building appropriate trust with the people systems are designed to help. In human-AI teamwork, interactive visualization tools can enable people to capitalize on their own domain expertise and let them easily edit state-of-the-art models. For example, physicians using GAM Changer can edit risk prediction models for pneumonia and sepsis to incorporate their own clinical knowledge and make better treatment decisions for patients.
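For intuition, the sketch below trains a glass-box GAM (an explainable boosting machine from the open-source InterpretML package) on synthetic data; its per-feature shape functions are the kind of components a domain expert can inspect, and edit with a tool like GAM Changer, before the model informs any decision. The feature names and data are illustrative, not the clinical models referenced above.

```python
# Minimal InterpretML sketch: a glass-box GAM whose per-feature shape functions a
# domain expert could inspect and edit (e.g., with GAM Changer). Synthetic data only.
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

ebm = ExplainableBoostingClassifier(feature_names=["age", "blood_pressure", "lab_value"])
ebm.fit(X, y)

# Each term's learned contribution is available for inspection (and, with external
# editing tools, correction) as part of the global explanation.
explanation = ebm.explain_global()
print(explanation.data(0))  # shape-function data for the first feature
```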

A study examining how AI can improve the value of rapidly growing citizen-science contributions found that emphasizing human agency and transparency increased productivity in an online workflow where volunteers provide valuable information to help AI classify galaxies. When participants chose to opt in to the new workflow and received messages stressing that human assistance was necessary for difficult classification tasks, they were more productive without sacrificing the quality of their input, and they returned to volunteer more often.

Failures are inevitable in AI because no model that interacts with the ever-changing physical world can be complete. Human input and feedback are essential to reducing risks. Investigating reliability and safety mitigations for systems such as robotic box pushing and autonomous driving, researchers formalize the problem of negative side effects (NSEs), the undesirable behavior of these systems. The researchers experimented with a framework in which the AI system uses immediate human assistance in the form of feedback—either about the user’s tolerance for an NSE occurrence or their decision to modify the environment. Results demonstrate that AI systems can adapt to successfully mitigate NSEs from feedback, but among future considerations, there remains the challenge of developing techniques for collecting accurate feedback from individuals using the system.

The goal of optimizing human-AI complementarity highlights the importance of engaging human agency. In a large-scale study examining how bias in models influences humans’ decisions in a job recruiting task, researchers made a surprising discovery: when working with a black-box deep neural network (DNN) recommender system, people made significantly fewer gender-biased decisions than when working with a bag-of-words (BOW) model, which is perceived as more interpretable. This suggests that people tend to reflect and rely on their own judgment before accepting a recommendation from a system for which they can’t comfortably form a mental model of how its outputs are derived. Researchers call for exploring techniques to better engage human reflexivity when working with advanced algorithms, which can be a means for improving hybrid human-AI decision-making and mitigating bias. 

How we design human-AI interaction is key to complementarity and empowering human agency. We need to carefully plan how people will interact with AI systems that are stochastic in nature and present inherently different challenges than deterministic systems. Designing and testing human interaction with AI systems as early as possible in the development process, even before teams invest in engineering, can help avoid costly failures and redesign. Toward this goal, researchers propose early testing of human-AI interaction through factorial surveys, a method from the social sciences that uses short narratives for deriving insights about people’s perceptions.

But testing for optimal user experience before teams invest in engineering can be challenging for AI-based features that change over time. The ongoing nature of a person adapting to a constantly updating AI feature makes it difficult to observe user behavior patterns that can inform design improvements before deploying a system. However, experiments demonstrate the potential of HINT (Human-AI INtegration Testing), a framework for uncovering over-time patterns in user behavior during pre-deployment testing. Using HINT, practitioners can design test setup, collect data via a crowdsourced workflow, and generate reports of user-centered and offline metrics.

Check out the 2022 anthology of this annual workshop that brings human-computer interaction (HCI) and natural language processing (NLP) research together for improving how people can benefit from NLP apps they use daily.


Building responsible AI tools for foundation models

Although we’re still in the early stages of understanding how to responsibly harness the potential of large language and multimodal models that can be used as foundations for building a variety of AI-based systems, researchers are developing promising tools and evaluation techniques to help on-the-ground practitioners deliver responsible AI. The reflexivity and resources required for deploying these new capabilities with a human-centered approach are fundamentally compatible with business goals of robust services and products.

Natural language generation with open-ended vocabulary has sparked a lot of imagination in product teams. Challenges persist, however, including for improving toxic language detection; content moderation tools often over-flag content that mentions minority groups without respect to context while missing implicit toxicity. To help address this, a new large-scale machine-generated dataset, ToxiGen, enables practitioners to fine-tune pretrained hate classifiers for improving detection of implicit toxicity for 13 minority groups in both human- and machine-generated text.

Download the large-scale machine-generated ToxiGen dataset and access the source code for fine-tuning toxic language detection systems for adversarial and implicit hate speech targeting 13 demographic minority groups. Intended for research purposes.
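A hedged sketch of how a practitioner might fine-tune a pretrained classifier on ToxiGen-style examples is shown below. It assumes the dataset has been exported to a local CSV with text and label columns; the column names, the roberta-base starting checkpoint, and the bare-bones training loop are illustrative choices, not the paper’s exact setup.

```python
# Illustrative fine-tuning sketch; assumes a local CSV export of ToxiGen-style data
# with "text" and "label" (0 benign / 1 toxic) columns. Not the paper's exact recipe.
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

df = pd.read_csv("toxigen_sample.csv")  # hypothetical local export
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for start in range(0, len(df), 16):  # tiny manual training loop, one epoch
    batch = df.iloc[start:start + 16]
    enc = tokenizer(list(batch["text"]), truncation=True, padding=True, return_tensors="pt")
    out = model(**enc, labels=torch.tensor(batch["label"].values))
    opt.zero_grad(); out.loss.backward(); opt.step()

# Score a new sentence for (implicit) toxicity.
model.eval()
with torch.no_grad():
    enc = tokenizer("an example sentence to score", return_tensors="pt")
    print(torch.softmax(model(**enc).logits, dim=-1))
```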

Multimodal models are proliferating, such as those that combine natural language generation with computer vision for services like image captioning. These complex systems can surface harmful societal biases in their output and are challenging to evaluate for mitigating harms. Using a state-of-the-art image captioning service with two popular image-captioning datasets, researchers isolate where in the system fairness-related harms originate and present multiple measurement techniques for five specific types of representational harm: denying people the opportunity to self-identify, reifying social groups, stereotyping, erasing, and demeaning.

The commercial advent of AI-powered code generators has introduced novice developers alongside professionals to large language model (LLM)-assisted programming. An overview of the LLM-assisted programming experience reveals unique considerations. Programming with LLMs invites comparison to related ways of programming, such as search, compilation, and pair programming. While there are indeed similarities, the empirical reports suggest it is a distinct way of programming with its own unique blend of behaviors. For example, additional effort is required to craft prompts that generate the desired code, and programmers must check the suggested code for correctness, reliability, safety, and security. Still, a user study examining what programmers value in AI code generation shows that programmers do find value in suggested code because it’s easy to edit, increasing productivity. Researchers propose a hybrid metric that combines functional correctness and similarity-based metrics to best capture what programmers value in LLM-assisted programming, because human judgment should determine how a technology can best serve us.
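One simple way such a hybrid metric could be composed is sketched below: a functional-correctness signal from executing a unit test blended with a token-level similarity score against a reference solution. The weighting and the similarity measure here are illustrative choices, not the formulation proposed in the paper.

```python
# Illustrative hybrid metric for generated code: functional correctness (does it
# pass a test?) blended with similarity to a reference implementation.
import difflib

def functional_correctness(candidate_src: str, test_fn) -> float:
    """Return 1.0 if the candidate code passes the supplied test, else 0.0."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # run the generated code
        test_fn(namespace)              # raises AssertionError on failure
        return 1.0
    except Exception:
        return 0.0

def similarity(candidate_src: str, reference_src: str) -> float:
    """Token-level similarity in [0, 1], a stand-in for fancier similarity metrics."""
    return difflib.SequenceMatcher(None, candidate_src.split(), reference_src.split()).ratio()

def hybrid_score(candidate_src, reference_src, test_fn, alpha=0.7):
    return (alpha * functional_correctness(candidate_src, test_fn)
            + (1 - alpha) * similarity(candidate_src, reference_src))

reference = "def add(a, b):\n    return a + b\n"
candidate = "def add(x, y):\n    return x + y\n"

def test(ns):
    assert ns["add"](2, 3) == 5

print(hybrid_score(candidate, reference, test))
```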


Understanding and supporting AI practitioners

Organizational culture and business goals can often be at odds with what practitioners need for mitigating fairness and other responsible AI issues when their systems are deployed at scale. Responsible, human-centered AI requires a thoughtful approach: just because a technology is technically feasible does not mean it should be created.

Similarly, just because a dataset is available doesn’t mean it’s appropriate to use. Knowing why and how a dataset was created is crucial for helping AI practitioners decide on whether it should be used for their purposes and what its implications are for fairness, reliability, safety, and privacy. A study focusing on how AI practitioners approach datasets and documentation reveals current practices are informal and inconsistent. It points to the need for data documentation frameworks designed to fit within practitioners’ existing workflows and that make clear the responsible AI implications of using a dataset. Based on these findings, researchers iterated on Datasheets for Datasets and proposed the revised Aether Data Documentation Template.

Use this flexible template to reflect on and help document the underlying assumptions, potential risks, and implications of using your dataset.

AI practitioners find themselves balancing the pressures of delivering to meet business goals and the time requirements necessary for the responsible development and evaluation of AI systems. Examining these tensions across three technology companies, researchers conducted interviews and workshops to learn what practitioners need for measuring and mitigating AI fairness issues amid time pressure to release AI-infused products to wider geographic markets and for more diverse groups of people. Participants disclosed challenges in collecting appropriate datasets and finding the right metrics for evaluating how fairly their system will perform when they can’t identify direct stakeholders and demographic groups who will be affected by the AI system in rapidly broadening markets. For example, hate speech detection may not be adequate across cultures or languages. A look at what goes into AI practitioners’ decisions around what, when, and how to evaluate AI systems that use natural language generation (NLG) further emphasizes that when practitioners don’t have clarity about deployment settings, they’re limited in projecting failures that could cause individual or societal harm. Beyond concerns for detecting toxic speech, other issues of fairness and inclusiveness—for example, erasure of minority groups’ distinctive linguistic expression—are rarely a consideration in practitioners’ evaluations.

Coping with time constraints and competing business objectives is a reality for teams deploying AI systems. There are many opportunities for developing integrated tools that can prompt AI practitioners to think through potential risks and mitigations for sociotechnical systems.


Thinking about it: Reflexivity as an essential for society and industry goals

As we continue to envision all that is possible with AI, one thing is clear: developing AI designed with the needs of people in mind requires reflexivity. We have been thinking about human-centered AI as being focused on users and stakeholders. Understanding who we are designing for, empowering human agency, improving human-AI interaction, and developing harm mitigation tools and techniques are as important as ever. But we also need to turn a mirror toward ourselves as AI creators. What values and assumptions do we bring to the table? Whose values get to be included and whose are left out? How do these values and assumptions influence what we build, how we build, and for whom? How can we navigate complex and demanding organizational pressures as we endeavor to create responsible AI? With technologies as powerful as AI, we can’t afford to be focused solely on progress for its own sake. While we work to evolve AI technologies at a fast pace, we need to pause and reflect on what it is that we are advancing—and for whom.

The post Advancing human-centered AI: Updates on responsible AI research appeared first on Microsoft Research.

Read More

Research Focus: Week of January 9, 2023


Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation

Jan P. Unsleber, Hongbin Liu, Leopold Talirz, Thomas Weymuth, Maximilian Mörchen, Adam Grofe, Dave Wecker, Christopher J. Stein, Ajay Panyala, Bo Peng, Karol Kowalski, Matthias Troyer, Markus Reiher

Quantum chemical calculations on atomistic systems have evolved into a standard approach to studying molecular matter. These calculations often involve a significant amount of manual input and specific process considerations, which could be automated to allow for further efficiencies. In our recent paper: High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation, we present the AutoRXN workflow, an automated workflow for exploratory high-throughput electronic structure calculations of molecular systems. In this workflow, (i) density functional theory methods are exploited to deliver minimum and transition-state structures and corresponding energies and properties, (ii) coupled cluster calculations are then launched for optimized structures to provide more accurate energy and property estimates, and (iii) multi-reference diagnostics are evaluated to back check the coupled cluster results and to subject potential multi-configurational cases to automated multi-configurational calculations. All calculations are carried out in a cloud environment and support massive computational campaigns. Key features of all components of the AutoRXN workflow are autonomy, stability, and minimal operator interference. We highlight the AutoRXN workflow with the example of an autonomous exploration of the reaction mechanism underlying the mode of action of a homogeneous catalyst for the asymmetric reduction of ketones.
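The control flow of the workflow can be summarized in a short Python sketch. Every stage function below is a stub with made-up return values, included only to show how stages (i) through (iii) hand off to one another; the real AutoRXN workflow runs on its own quantum chemistry backends and cloud infrastructure.

```python
# Runnable stand-in for the three-stage AutoRXN control flow described above.
# All stage functions are stubs; the real workflow calls quantum chemistry backends.
from dataclasses import dataclass
import random

@dataclass
class Result:
    geometry: str
    energy: float
    diagnostic: float

def run_dft(structure):                  # stub for stage (i): DFT exploration
    return Result(structure, energy=-100.0, diagnostic=0.0)

def run_coupled_cluster(geometry):       # stub for stage (ii): coupled cluster refinement
    return Result(geometry, energy=-100.5, diagnostic=random.random())

def run_multiconfigurational(geometry):  # stub for the multi-configurational fallback
    return Result(geometry, energy=-100.7, diagnostic=0.0)

DIAGNOSTIC_THRESHOLD = 0.5               # illustrative cutoff for the back check

def explore_mechanism(initial_structures):
    validated = []
    for structure in initial_structures:
        dft = run_dft(structure)                     # (i) minima and transition states
        cc = run_coupled_cluster(dft.geometry)       # (ii) more accurate energies
        if cc.diagnostic > DIAGNOSTIC_THRESHOLD:     # (iii) multi-reference back check
            cc = run_multiconfigurational(dft.geometry)
        validated.append(cc)
    return validated

print(explore_mechanism(["ketone_plus_catalyst"]))
```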



Disparate Impacts on Online Information Access during the COVID-19 Pandemic

Jina Suh, Eric Horvitz, Ryen W. White, Tim Althoff

Despite efforts to close the long-term and emergent health equity gap, studies during the COVID-19 pandemic show that socioeconomically and environmentally disadvantaged subpopulations have been disproportionately harmed by the disease[1]. Digital access to health services and information has also emerged as an important factor modulating health outcomes. During the pandemic, digital engagement in resources across health, educational, economic, and social needs became a necessity due to lockdown mandates and increased use of internet-based communication by public institutions. Unfortunately, disparities in digital access also reflect socioeconomic and environmental dimensions, which can lead to negative offline consequences, creating a “digital vicious cycle”[2]. Therefore, it is a public health priority to identify vulnerable populations and to understand potential barriers to critical digital resources.

In a new paper: Disparate Impacts on Online Information Access during the COVID-19 Pandemic, published in Nature Communications, researchers from Microsoft Research and the University of Washington have collaborated to harness the centrality of web search engines for online information access to observe digital disparities during the pandemic. They analyzed over 55 billion web search interactions on Bing during the pandemic across 25,150 U.S. ZIP codes to reveal that socioeconomic and environmental factors are associated with the differential use of digital resources across different communities – even if they were digitally connected.


DeepSpeed Data Efficiency library: Towards less data, faster training, and higher model quality

DeepSpeed Team, Andrey Proskurin

DeepSpeed has released a new Data Efficiency library to optimize deep learning training efficiency and cost. The library offers new algorithms on efficient data sampling/scheduling via curriculum learning and efficient data routing via random layerwise token dropping, together with composable and customizable library support. The library greatly reduces training cost while maintaining model quality (1.5-2x less data and time for GPT-3/BERT pretraining), or further improves model quality under the same training cost (>1 point gain for GPT-3-1.3B zero/few-shot evaluation). The code is open-sourced at https://github.com/microsoft/DeepSpeed.
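For readers unfamiliar with curriculum learning, the sketch below shows the general idea of difficulty-paced data sampling with a simple linear pacing function. It is generic illustration code, not the DeepSpeed Data Efficiency library’s API or configuration; see the DeepSpeed repository for actual usage.

```python
# Concept sketch of curriculum-based data sampling: easier examples first, with the
# admissible difficulty growing over training. Not the DeepSpeed API.
import numpy as np

def curriculum_batches(examples, difficulty, total_steps, batch_size=4, seed=0):
    """Yield batches whose allowed difficulty grows linearly with the training step."""
    rng = np.random.default_rng(seed)
    order = np.argsort(difficulty)  # easiest -> hardest
    for step in range(total_steps):
        frac = (step + 1) / total_steps                       # linear pacing function
        pool = order[: max(batch_size, int(frac * len(examples)))]
        yield [examples[i] for i in rng.choice(pool, size=batch_size, replace=False)]

examples = [f"sequence_{i}" for i in range(100)]
difficulty = np.random.default_rng(1).random(100)  # e.g., sequence length or per-example loss
for step, batch in enumerate(curriculum_batches(examples, difficulty, total_steps=5)):
    print(step, batch)
```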

You can learn more in our blog post and in the papers below.


Research Fellows Program at Microsoft Research India – Apply now

The Research Fellows Program at Microsoft Research India is now accepting applications for Fall 2023. This is an opportunity to work with world-class researchers on state-of-the-art technology. The program prepares students for careers in research, engineering, and entrepreneurship, while pushing the frontiers of computer science and technology. Previous Research Fellows have contributed to all aspects of the research lifecycle, spanning ideation, implementation, evaluation, and deployment.

Selected candidates spend one to two years with Microsoft Research India. Candidates should have completed BS/BE/BTech or MS/ME/MTech in Computer Science or related areas, graduating by summer 2023. Apply before February 3, 2023.


The post Research Focus: Week of January 9, 2023 appeared first on Microsoft Research.

Read More

Research @ Microsoft 2022: A look back at a year of accelerating progress in AI


2022 has seen remarkable progress in foundational technologies that have helped to advance human knowledge and create new possibilities to address some of society’s most challenging problems. Significant advances in AI have also enabled Microsoft to bring new capabilities to customers through our products and services, including GitHub Copilot, an AI pair programmer capable of turning natural language prompts into code, and a preview of Microsoft Designer, a graphic design app that supports the creation of social media posts, invitations, posters, and one-of-a-kind images.

These offerings provide an early glimpse of how new AI capabilities, such as large language models, can enable people to interact with machines in increasingly powerful ways. They build on a significant, long-term commitment to fundamental research in computing and across the sciences, and the research community at Microsoft plays an integral role in advancing the state of the art in AI, while working closely with engineering teams and other partners to transform that progress into tangible benefits.

In 2022, Microsoft Research established AI4Science, a global organization applying the latest advances in AI and machine learning toward fundamentally transforming science; added to and expanded the capabilities of the company’s family of foundation models; worked to make these models and technologies more adaptable, collaborative, and efficient; further developed approaches to ensure that AI is used responsibly and in alignment with human needs; and pursued different approaches to AI, such as causal machine learning and reinforcement learning.

We shared our advances across AI and many other disciplines during our second annual Microsoft Research Summit, where members of our research community gathered virtually with their counterparts across industry and academia to discuss how emerging technologies are being explored and deployed to bring the greatest possible benefits to humanity.  

Plenary sessions at the event focused on the transformational impact of deep learning on the way we practice science, research that empowers medical practitioners and reduces inequities in healthcare, and emerging foundations for planet-scale computing. Further tracks and sessions over three days provided deeper dives into the future of the cloud; efficient large-scale AI; amplifying human productivity and creativity; delivering precision healthcare; building user trust through privacy, identity, and responsible AI; and enabling a resilient and sustainable world.

  • Blog: Microsoft Climate Research Initiative (MCRI)

    In June, the Microsoft Climate Research Initiative (MCRI) announced its first phase of collaborations among multidisciplinary researchers working together to accelerate cutting-edge research and transformative innovation in climate science and technology.

  • Publication: New Future of Work Report 2022

    In May, researchers across Microsoft published the New Future of Work Report 2022, which summarizes important recent research developments related to hybrid work. It highlights themes that have emerged in the findings of the past year and resurfaces older research that has become newly relevant.

In this blog post, we look back at some of the key achievements and notable work in AI and highlight other advances across our diverse, multidisciplinary, and global organization.

Advancing AI foundations and accelerating progress

Over the past year, the research community at Microsoft made significant contributions to the rapidly evolving landscape of powerful large-scale AI models. Microsoft Research and the Microsoft Turing team unveiled a new Turing Universal Language Representation model capable of performing both English and multilingual understanding tasks. In computer vision, advancements for the Project Florence-VL (Florence-Vision and Language) team spanned still imagery and video: its GIT model was the first to surpass human performance on the image captioning benchmark TextCaps; LAVENDER showed strong performance in video question answering, text-to-video retrieval, and video captioning; and GLIP and GLIPv2 combined localization and vision-language understanding. The group also introduced NUWA-Infinity, a model capable of converting text, images, and video into high-resolution images or long-duration video. Meanwhile, the Visual Computing Group scaled up its Transformer-based general-purpose computer vision architecture, Swin Transformer, achieving applicability across more vision tasks than ever before.

Researchers from Microsoft Research Asia and the Microsoft Turing team also introduced BEiT-3, a general-purpose multimodal foundation model that achieves state-of-the-art transfer performance on both vision and vision-language tasks. BEiT-3 introduces Multiway Transformers for general-purpose modeling, a modular architecture that enables both deep fusion and modality-specific encoding. Based on this shared backbone, BEiT-3 performs masked “language” modeling on images (Imglish), texts (English), and image-text pairs (“parallel sentences”) in a unified manner. The code and pretrained models will be available on GitHub.
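To illustrate the Multiway Transformer idea, the toy block below shares one self-attention layer across modalities and routes each token to a modality-specific feed-forward expert. This is a simplified sketch under our own assumptions (dimensions, routing by a boolean mask, only two experts), not the actual BEiT-3 implementation; the full architecture also uses additional experts, for example for fused image-text inputs, in some layers.

```python
# Illustrative sketch of a Multiway Transformer block: shared self-attention plus
# modality-specific feed-forward experts. Not the BEiT-3 implementation.
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality (modality-specific encoding).
        self.ffn = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "language": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # is_image: bool tensor [batch, seq], True where the token is an image patch.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Route each token to its modality's expert (both experts run here for simplicity).
        out = torch.where(is_image.unsqueeze(-1), self.ffn["vision"](h), self.ffn["language"](h))
        return x + out

tokens = torch.randn(2, 16, 768)                 # 2 sequences of 16 tokens
is_image = torch.zeros(2, 16, dtype=torch.bool)  # pretend the first 8 tokens are image patches
is_image[:, :8] = True
print(MultiwayBlock()(tokens, is_image).shape)   # torch.Size([2, 16, 768])
```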

One of the most crucial accelerators of progress in AI is the ability to optimize training and inference for large-scale models. In 2022, the DeepSpeed team made a number of breakthroughs to improve mixture of experts (MoE) models, making them more efficient, faster, and less costly. Specifically, they were able to reduce training cost by 5x, reduce MoE parameter size by up to 3.7x, and reduce MoE inference latency by 7.3x while offering up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models.
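For readers unfamiliar with mixture of experts, the following is a generic top-1 MoE layer sketch: a small gating network routes each token to one of several expert feed-forward networks, so only a fraction of the model’s parameters is active for any given token. It illustrates the general technique only and is not the DeepSpeed-MoE implementation; production systems add load balancing, expert parallelism, and optimized kernels on top of this basic idea.

```python
# Generic, illustrative top-1 mixture-of-experts layer (not DeepSpeed-MoE).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)   # gating network scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, dim]; each token is sent to its single highest-scoring expert.
        scores = self.gate(x).softmax(dim=-1)
        top_prob, top_idx = scores.max(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

print(Top1MoE()(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```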

Transforming scientific discovery and adding societal value

Our ability to comprehend and reason about the natural world has advanced over time, and the new AI4Science organization, announced in July, represents another turn in the evolution of scientific discovery. Machine learning is already being used in the natural sciences to model physical systems using observational data. AI4Science aims to dramatically accelerate our ability to model and predict natural phenomena by creating deep learning emulators that learn by using computational solutions to fundamental equations as training data.

This new paradigm can help scientists gain greater insight into natural phenomena, right down to their smallest components. Such molecular understanding and powerful computational tools can help accelerate the discovery of new materials to combat climate change, and new drugs to help support the prevention and treatment of disease.  

For instance, AI4Science’s Project Carbonix is working on globally accessible, at-scale solutions for decarbonizing the world economy, including reverse engineering materials that can pull carbon out of the environment and recycling carbon into materials. Collaborating on these efforts through the Microsoft Climate Research Initiative (MCRI) are domain experts from academia, industry, and government. Announced in June, MCRI is focused on areas such as carbon accounting, climate risk assessments, and decarbonization.

As part of the Generative Chemistry project, Microsoft researchers have been working with the global medicines company Novartis to develop and execute machine learning tools and human-in-the-loop approaches to enhance the entire drug discovery process. In April, they introduced MoLeR, a graph-based generative model for designing compounds that is more reflective of how chemists think about the process and is more efficient and practical than an earlier generative model the team developed. 

While AI4Science is focused on computational simulation, we have seen with projects like InnerEye that AI can have societal value in many other ways. In March, Microsoft acquired Nuance Communications Inc., further cementing the companies’ shared commitment to outcome-based AI across industries, particularly in healthcare. Tools like the integration of Microsoft Teams and Dragon Ambient eXperience (Nuance DAX) to help ease the administrative burden of physicians and support meaningful doctor-patient interactions are already making a difference.

Making AI more adaptable, collaborative, and efficient 

To help accelerate the capabilities of large-scale AI while building a landscape in which everyone can benefit from it, the research community at Microsoft aimed to drive progress in three areas: adaptability, collaboration, and efficiency.

To provide consistent value, AI systems must respond to changes in task and environment. Research in this area includes multi-task learning with task-aware routing of inputs, knowledge-infused decoding, model repurposing with data-centric ML, pruning, and cognitive science- and brain-inspired AI. A good example of our work toward adaptability is GODEL, or Grounded Open Dialogue Language Model, which ushers in a new class of pretrained language models that enable chatbots to help with tasks and then engage in more general conversations.

Microsoft’s research into more collaborative AI includes AdaTest, which leverages human expertise alongside the generative power of large language models to help people more efficiently find and correct bugs in natural language processing models. Researchers have also explored expanding the use of AI in creative processes, including a project in which science fiction writer Gabrielle Loisel used OpenAI’s GPT-3 to co-author a novella and other stories.

To enable more people to make use of AI in an efficient and sustainable way, Microsoft researchers are pursuing several new architectures and training paradigms. This includes new modular architectures and novel techniques, such as DeepSpeed Compression, a composable library for extreme compression and zero-cost quantization, and Z-Code Mixture of Experts models, which boost translation efficiency and were deployed in Microsoft Translator in 2022.  

In December, researchers unveiled AutoDistil, a new technique that leverages knowledge distillation and neural architecture search to improve the balance between cost and performance when generating compressed models. They also introduced AdaMix, which improves the fine-tuning of large pretrained models for downstream tasks using mixture of adaptations modules for parameter-efficient model tuning. And vision-language model compression research on the lottery ticket hypothesis showed that pretrained language models can be significantly compressed without hurting their performance.
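As a reference point for the distillation component that techniques like AutoDistil build on, here is a generic knowledge-distillation loss that blends a soft-target term from a teacher model with the usual hard-label loss. The temperature, weighting, and toy usage are illustrative assumptions; AutoDistil additionally searches over student architectures, which is not shown here.

```python
# Generic knowledge-distillation loss (illustrative; not the AutoDistil method).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 3-class problem.
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
print(distillation_loss(s, t, y).item())
```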

  • Blog: Infusing AI into cloud computing systems

    Cloud Intelligence/AIOps is a rapidly emerging technology trend and an interdisciplinary research direction across the systems, software engineering, and AI/ML communities. In this blog post from November, the researchers behind Microsoft’s AIOps work outline a research vision to make the cloud more autonomous, proactive, and manageable.

Building and deploying AI responsibly

Building AI that maximizes its benefit to humanity, and does so equitably, requires considering both the opportunities and risks that come with each new advancement in line with our guiding principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.

Helping to put these principles into practice is Microsoft’s Responsible AI Standard, which the company made publicly available in June. The standard comprises tools and steps that AI practitioners can execute in their workflows today to help ensure that building AI responsibly is baked into every stage of development. The standard will continue to evolve along with the tools and resources for responsible AI, in response to the rapid pace of AI advancement, particularly the growing size of AI models and the new challenges they bring.

With FedKD and InclusiveFL, researchers tackled some of the obstacles in applying federated learning, an ML method for protecting privacy, to model training. Two separate teams explored solutions for the harmful language that large generative models can reproduce—one presenting a unified framework for both detoxifying and debiasing models and another introducing methods for making content moderation tools more robust. Meanwhile, researchers sought to strengthen human-AI collaboration by giving users more insight into how models arrive at their outputs via explanations provided by the models themselves.

The responsible development of AI also means deploying technologies that operate the way they were designed to, and the way people expect them to. In a pair of blog posts, researchers draw on their respective experiences developing a technology to support social agency in children who are born blind and another to support mental health practitioners in guiding patient treatment. They stress the need for multiple measures of performance in determining the readiness of increasingly complex AI systems, and for incorporating domain experts and user research throughout the development process.

Advancing AI for decision making

Building the next generation of AI requires continuous research into fundamental new AI innovations. Two significant areas of study in 2022 were causal ML and reinforcement learning.

Causal ML

Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause and effect are also critical for the design and data-driven evaluation of many technological systems we build today.  

This year, Microsoft Research continued its work on causal ML, which combines traditional machine learning with causal inference methods. To help data scientists better understand and deploy causal inference, Microsoft researchers built the DoWhy library, an end-to-end causal inference tool, in 2018. To broaden access to this critical knowledge base, DoWhy has now migrated to an independent open-source governance model in a new PyWhy GitHub organization. As part of this new collaborative model, Amazon Web Services is contributing new technology based on structural causal models.
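For a sense of the workflow DoWhy supports, here is a short usage sketch on synthetic data, following the library’s model, identify, estimate, and refute steps. The toy data and variable names are our own; see the PyWhy/DoWhy documentation for the full API.

```python
# Usage sketch of the DoWhy library on synthetic data (toy example, not from the library docs).
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy observational data: a confounder w affects both treatment t and outcome y.
rng = np.random.default_rng(0)
n = 5000
w = rng.normal(size=n)
t = (w + rng.normal(size=n) > 0).astype(int)
y = 2.0 * t + 1.5 * w + rng.normal(size=n)
df = pd.DataFrame({"w": w, "t": t, "y": y})

# Model -> identify -> estimate -> refute.
model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["w"])
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)  # should be close to the true effect of 2.0 used to generate the data

refutation = model.refute_estimate(estimand, estimate, method_name="placebo_treatment_refuter")
print(refutation)
```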

At this year’s Conference on Neural Information Processing Systems (NeurIPS), researchers presented a suite of open-source causal tools and libraries that aims to simultaneously provide core causal AI functionality to practitioners and create a platform for research advances to be rapidly deployed. This includes ShowWhy, a no-code user interface suite that empowers domain experts to become decision scientists. We hope this work accelerates use-inspired basic research that improves causal AI.

Reinforcement learning (RL)

Reinforcement learning is a powerful tool for learning which behaviors are likely to produce the best outcomes in a given scenario, typically through trial and error. But this powerful tool faces some challenges. Trial and error can consume enormous resources when applied to large datasets. And for many real-time applications, there’s no room to learn from mistakes.   
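To make the trial-and-error framing concrete, below is a minimal tabular Q-learning loop on a toy chain environment: the agent tries actions, observes rewards, and gradually updates its value estimates. This generic sketch is unrelated to the specific methods discussed next, and the environment, reward, and hyperparameters are illustrative assumptions.

```python
# Generic tabular Q-learning on a toy 5-state chain: the agent starts at the left end
# and earns a reward of 1 for reaching the right end. Purely illustrative.
import numpy as np

n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # value estimates learned by trial and error
alpha, gamma, eps = 0.1, 0.95, 0.1      # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Explore randomly with probability eps, or whenever the estimates are uninformative.
        if rng.random() < eps or Q[s].max() == Q[s].min():
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Nudge Q toward the observed reward plus the discounted value of the next state.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 2))  # moving right should dominate in every state after training
```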

To address RL’s computational bottleneck, Microsoft researchers developed Path Predictive Elimination, a reinforcement learning method that is robust enough to remove noise from continuously changing environments. Also in 2022, a Microsoft team released MoCapAct, a library of pretrained simulated models to enable advanced research on artificial humanoid control at a fraction of the compute resources currently required.  

Researchers also developed a new method for using offline RL to augment human-designed strategies for making critical decisions. This team deployed game theory to design algorithms that can use existing data to learn policies that improve on current strategies.


Thank you for reading

2022 was an exciting year for research, and we look forward to the future breakthroughs our global research community will deliver. In the coming year, you can expect to hear more from us about our vision, and the impact we hope to achieve. We appreciate the opportunity to share our work with you, and we hope you will subscribe to the Microsoft Research Newsletter for the latest developments.

Writers and Editors
Elise Ballard
Kristina Dodge
Kate Forster
Chris Stetkiewicz
Larry West

Managing Editor
Amber Tingle

Project Manager
Amanda Melfi

Graphic Designer
Matt Sanderson

Editor in Chief
Matt Corwine
