Abstracts: Zero-shot models in single-cell biology with Alex Lu


Members of the research community at Microsoft work continuously to advance their respective fields. The Abstracts podcast brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

The success of foundation models like ChatGPT has sparked growing interest among scientific communities in using AI for discovery in areas like single-cell biology. In this episode, senior researcher Alex Lu joins host Gretchen Huizinga to talk about his work on the paper “Assessing the limits of zero-shot foundation models in single-cell biology,” in which researchers tested the zero-shot performance of proposed single-cell foundation models. The results showed limited efficacy compared to older, simpler methods and suggested the need for more rigorous evaluation and research.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. 

[MUSIC FADES]

On today’s episode, I’m talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called “Assessing the Limits of Zero-Shot Foundation Models in Single-Cell Biology.” Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts!


ALEX LU: Yeah, I’m really excited to be joining you today. 

HUIZINGA: So let’s start with a little background of your work. In just a few sentences, tell us about your study and more importantly, why it matters. 

LU: Absolutely. And before I dive in, I want to give a shout out to the MSR research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she’s the lead author on the study. But basically, in this research, we study single-cell foundation models, which have recently rocked the world of biology, because they claim to be able to use AI to unlock understanding about single-cell biology. For a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells to discovering new drugs for cancer, biologists will conduct experiments where they measure how much of every gene is expressed inside of just one single cell. So these experiments give us a powerful view into the cell’s internal state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes, so you get this really long chain of numbers that measures how much there is of each of those 20,000 genes. Deriving meaning from this really long chain of numbers is really difficult. And single-cell foundation models claim to be capable of unraveling deeper insights than ever before. So that’s the claim that these works have made. And in our recent paper, we showed that these models may actually not live up to these claims. Basically, we showed that single-cell foundation models perform worse, in settings that are fundamental to biological discovery, than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are still the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because they have implications for the toolkits that biologists use to understand their experiments. Our work suggests that single-cell foundation models may not be appropriate for practical use just yet, at least in the discovery applications that we cover.

HUIZINGA: Well, let’s go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they’re being used in novel applications, which is what you’re interested in, like single-cell biology. So I’m curious, just sort of as a foundation, what other research has already been done in this area, and how does this study illuminate or build on it? 

LU: Absolutely. Okay, so we were the first to notice and document this issue in single-cell foundation models specifically. And this is because we proposed evaluation methods that, while common in other areas of AI, had yet to be commonly used to evaluate single-cell foundation models. We performed something called zero-shot evaluation on these models. Prior to our work, most works evaluated single-cell foundation models with fine-tuning. And the way to understand this is that single-cell foundation models are trained in a way that tries to expose these models to millions of single cells. But because you’re exposing them to such a large amount of data, you can’t really rely upon this data being annotated or labeled in any particular fashion. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add a second training phase. We call this the fine-tuning phase, where you have a smaller number of single cells, but now they are actually labeled for the specialized task that you want the model to perform. So most people typically evaluate the performance of single-cell models after they fine-tune these models. However, what we noticed is that evaluating these fine-tuned models has several problems. First, it might not actually align with how these models are actually going to be used by biologists. A critical distinction in biology is that we’re not just trying to interact with an agent that has access to knowledge through its pre-training; we’re trying to extend these models to discover new biology beyond that sphere of knowledge. And so in many cases, the point of using these models, the point of the analysis, is to explore the data with the goal of potentially discovering something new about the single cells the biologists worked with that they weren’t aware of before. So in these kinds of cases, it is really tough to fine-tune a model. There’s a bit of a chicken-and-egg problem going on. If you don’t know, for example, that there’s a new kind of cell in the data, you can’t really instruct the model to help us identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes impossible. The second issue is that evaluations on fine-tuned models can sometimes mislead us in our ability to understand how these models are working. So for example, the claim behind single-cell foundation model papers is that these models learn a foundation of biological knowledge by being exposed to millions of single cells in their first training phase, right? But it’s possible that when you fine-tune a model, any performance increase you see is simply because you’re using a massive model that is really sophisticated, really large, and that model is going to do perfectly fine whether or not its pre-training exposed it to many cells at all. So going back to our paper, what’s really different about it is that we propose zero-shot evaluation for these models. What that means is that we do not fine-tune the model at all, and instead we keep the model frozen during the analysis step. How we specialize it to a downstream task instead is that we extract the model’s internal embedding of single-cell data, which is essentially a numerical vector that contains the information that the model is extracting and organizing from the input data.
So it’s essentially how the model perceives single-cell data and how it organizes that perception in its own internal state. So basically, this is the better way for us to test the claim that single-cell foundation models are learning foundational biological insights. Because if they actually are learning these insights, those insights should be present in the model’s embedding space even before we fine-tune the model.
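To make the zero-shot setup concrete, here is a minimal sketch of the extraction step Lu describes: the model stays frozen, and only its internal embeddings of the cells are collected. The model.encode method is a hypothetical placeholder for whatever API a given single-cell foundation model exposes.

```python
import numpy as np
import torch

@torch.no_grad()  # no gradients, no weight updates -- the model is never fine-tuned
def extract_zero_shot_embeddings(model, expression_matrix, batch_size=256):
    """Run cells through a frozen model and collect its internal embeddings.

    expression_matrix: (n_cells, n_genes) array of per-gene expression values.
    Returns an (n_cells, embedding_dim) array of zero-shot cell embeddings.
    """
    model.eval()  # freeze behavior such as dropout
    chunks = []
    for start in range(0, expression_matrix.shape[0], batch_size):
        batch = torch.as_tensor(
            expression_matrix[start:start + batch_size], dtype=torch.float32
        )
        chunks.append(model.encode(batch).cpu().numpy())  # hypothetical encode() method
    return np.concatenate(chunks, axis=0)
```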

HUIZINGA: Well, let’s talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single-cell biology. How did you go about evaluating these models? 

LU: Yes, so let’s dive deeper into how zero-shot evaluations are conducted, okay? The premise here is that if these models are truly learning foundational biological insights, then in the model’s internal representation, cells that are biologically similar should be close together, while cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single-cell foundation models, and importantly, we compared these models against older, reliable tools that biologists have used for exploratory analyses. These include simpler machine learning methods like scVI, statistical algorithms like Harmony, and even basic data pre-processing steps, like filtering your data down to a more robust subset of genes. So basically, we tested embeddings from our two single-cell foundation models against these baselines in a variety of settings. And we tested the hypothesis that biologically similar cells should end up close together under each of these methods, across multiple datasets.
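As a rough illustration of how such a comparison can be scored, here is a minimal sketch, assuming you already have one embedding matrix per method (the frozen foundation models, scVI, Harmony, highly-variable-gene selection, and so on) plus cell-type annotations for the same cells. The metrics below are illustrative choices, not the paper’s exact protocol.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def score_embedding(embedding: np.ndarray, cell_type_labels: np.ndarray) -> dict:
    """Higher is better: biologically similar cells should sit close together."""
    return {
        # Are cells of the same annotated type compact and separated from other types?
        "silhouette": silhouette_score(embedding, cell_type_labels),
        # Can a simple k-NN classifier recover cell types from local neighborhoods?
        "knn_accuracy": cross_val_score(
            KNeighborsClassifier(n_neighbors=15), embedding, cell_type_labels, cv=5
        ).mean(),
    }

# Score every method's embedding of the same cells and compare:
# methods = {"foundation_model": fm_emb, "scvi": scvi_emb, "hvg_pca": hvg_emb}
# results = {name: score_embedding(emb, labels) for name, emb in methods.items()}
```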

HUIZINGA: Well, and as you did the testing, you obviously were aiming toward research findings, which is my favorite part of a research paper, so tell us what you did find and what you feel the most important takeaways of this paper are.

LU: Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to the older methods. To contextualize why that is such a surprising result, there is a lot of hype around these methods. So basically, yeah, it’s a very surprising result, given how hyped these models are and how people were already adopting them. But our results basically caution that they shouldn’t really be adopted for these purposes just yet.

HUIZINGA: Yeah, so this is serious real-world impact here in terms of if models are being adopted and adapted in these applications, how reliable are they, et cetera? So given that, who would you say benefits most from what you’ve discovered in this paper and why? 

LU: Okay, so two ways, right? So I think this has at least immediate implications on the way that we do discovery in biology. And as I’ve discussed, these experiments are used for cases that have practical impact: drug discovery applications, investigations into basic biology. But let’s also talk about the impact for methodologists, people who are trying to improve these single-cell foundation models, right? I think at the base, they’re really exciting proposals. Because if you look at some of the prior, less sophisticated methods, they tended to be more bespoke. So the excitement of single-cell foundation models is that you have this general-purpose model that can be used for everything, and while they’re not living up to that purpose just yet, I think it’s important that we continue to bank on that vision, right? So if you look at our contributions in that area, single-cell foundation models are a really new proposal, so it makes sense that we may not know how to fully evaluate them just yet. You can view our work as basically being a step towards more rigorous evaluation of these models. Now that we’ve done this experiment, I think methodologists know to use this as a signal on how to improve the models and whether they’re going in the right direction. And in fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper. And so this essentially helps future computer scientists who are working on single-cell foundation models know how to train better models.

HUIZINGA: That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology, and what foundation might this paper lay for future research agendas in the field? 

LU: Yeah, absolutely. So now that we’ve shown single-cell foundation models don’t necessarily perform well, I think the natural question on everyone’s mind is, how do we actually train single-cell foundation models that live up to that vision, that can perform in helping us discover new biology? So in the short term, yeah, we’re actively investigating many hypotheses in this area. For example, my colleagues Lorin Crawford and Ava Amini, who were co-authors on the paper, recently put out a pre-print on how training data composition impacts model performance. And one of the surprising findings they had was that many of the training datasets people use to train single-cell foundation models are highly redundant, to the point that you can sample just a tiny fraction of the data and get basically the same performance. But you can also look forward to many other explorations in this area as we continue to develop this research. But also, zooming out into the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use, right? I mean, this is obvious for any AI method, but I think people have gotten just a little too used to taking methods that work for natural vision or natural language, maybe in the consumer domain, and then extrapolating these methods to biology and expecting that they will work in the same way, right? So for example, one reason why zero-shot evaluation was not routine practice for single-cell foundation models prior to our work, and we were the first to fully establish that as a practice for the field, was because I think people who have been working in AI for biology have been looking to these more mainstream AI domains to shape their work. And so with single-cell foundation models, many of these models are adapted from large language models in natural language processing, recycling the exact same architecture, the exact same code, basically just recycling practices from that field. When you look at practices in more mainstream domains, zero-shot evaluation is definitely explored in those domains, but it’s more of a niche instead of being considered central to model understanding. So again, because biology is different from mainstream language processing, it’s a scientific discipline, zero-shot evaluation becomes much more important, and you often have no choice but to use these models zero-shot. So in other words, I think we need to be thinking carefully about what it is that makes training a model for biology different from training a model, for example, for consumer purposes.

[MUSIC]

HUIZINGA: Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read it on the Genome Biology website. See you next time on Abstracts! 

[MUSIC FADES] 



Abstracts: Aurora with Megan Stanley and Wessel Bruinsma


Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode of Abstracts, Microsoft senior researchers Megan Stanley and Wessel Bruinsma join host Amber Tingle to discuss their groundbreaking work on environmental forecasting. Their new Nature publication, “A Foundation Model for the Earth System,” features Aurora, an AI model that redefines weather prediction and extends its capabilities to other environmental domains such as tropical cyclones and ocean wave forecasting.


Learn more

A foundation model for the Earth system
Nature | May 2025

Introducing Aurora: The first large-scale foundation model of the atmosphere
Microsoft Research Blog | June 2024

Project Aurora: The first large-scale foundation model of the atmosphere
Video | September 2024

A Foundation Model for the Earth System
Paper | November 2024

Aurora
Azure AI Foundry Labs

Transcript

[MUSIC]   

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

Our guests today are Megan Stanley and Wessel Bruinsma. They are both senior researchers within the Microsoft Research AI for Science initiative. They are also two of the coauthors on a new Nature publication called “A Foundation Model for the Earth System.”


This is such exciting work about environmental forecasting, so we’re happy to have the two of you join us today.  

Megan and Wessel, welcome. 

MEGAN STANLEY: Thank you. Thanks. Great to be here. 

WESSEL BRUINSMA: Thanks. 

TINGLE: Let’s jump right in. Wessel, share a bit about the problem your research addresses and why this work is so important. 

BRUINSMA: I think we’re all very much aware of the revolution that’s happening in the space of large language models, which have just become so strong. What’s perhaps less well known is that machine learning models have also started to revolutionize the field of weather prediction. Whereas traditional weather prediction models, based on physical laws, used to be the state of the art, these traditional models are now challenged and often even outperformed by AI models.

This advancement is super impressive and really a big deal. Mostly because AI weather forecasting models are computationally much more efficient and can even be more accurate. What’s unfortunate though, about this big step forward, is that these developments are mostly limited to the setting of weather forecasting.  

Weather forecasting is very important, obviously, but there are many other important environmental forecasting problems out there, such as air pollution forecasting or ocean wave forecasting. We have developed a model, named Aurora, which really kicks the AI revolution in weather forecasting into the next gear by extending these advancements to other environmental forecasting fields, too. With Aurora, we’re now able to produce state-of-the-art air pollution forecasts using an AI approach. And that wasn’t possible before! 

TINGLE: Megan, how does this approach differ from or build on work that’s already been done in the atmospheric sciences? 

STANLEY: Current approaches have really focused training very specifically on weather forecasting models. And in contrast, with Aurora, what we’ve attempted to do is train a so-called foundation model for the Earth system. In the first step, we train Aurora on a vast body of Earth system data. This is our pretraining step.  

And when I say a vast body of data, I really do mean a lot. And the purpose of this pretraining is to let Aurora, kind of, learn some general-purpose representation of the dynamics that govern the Earth system. But then once we’ve pretrained Aurora, and this really is the crux of it, the reason why we’re doing this project, the model can leverage this learned general-purpose representation and efficiently adapt to new tasks, new domains, new variables. And this is called fine-tuning.

The idea is that the model really uses the learned representation to perform this adaptation very efficiently, which basically means Aurora is a powerful, flexible model that can relatively cheaply be adapted to any environmental forecasting task.   
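As a rough sketch of the pretrain-then-fine-tune pattern Stanley describes (a pattern only; the layer sizes, the new task head, and the learning rates below are placeholders, not Aurora’s actual architecture or training code):

```python
import torch
import torch.nn as nn

# Stand-in for a large encoder whose weights were already learned during
# pretraining on a vast, diverse body of Earth-system data.
backbone = nn.Sequential(nn.Linear(1024, 512), nn.GELU(), nn.Linear(512, 512))

# Fine-tuning: attach a small head for a new domain (say, ocean-wave variables)
# and adapt the pretrained representation gently while training the head fully.
new_task_head = nn.Linear(512, 64)
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},       # small updates to pretrained weights
    {"params": new_task_head.parameters(), "lr": 1e-3},  # new head trained from scratch
])

def fine_tune_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(new_task_head(backbone(inputs)), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone already encodes general dynamics, only the small head and gentle backbone updates are needed, which is what makes adaptation to new tasks relatively cheap.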

TINGLE: Wessel, can you tell us about your methodology? How did you all conduct this research? 

BRUINSMA: Approaches so far have trained models primarily on one particular dataset. That one dataset is very large, which makes it possible to train very good models. But it does remain only one dataset, and that’s not very diverse. In the domain of environmental forecasting, we have really tried to push the limits of scaling to large data by training Aurora on not just this one large dataset, but on as many very large datasets as we could find.

These datasets are a combination of estimates of the historical state of the world, forecasts by other models, climate simulations, and more. We’ve been able to show that training on not just more data but more diverse data helps the model achieve even better performance. Showing this is difficult because there is just so much data.  

In addition to scaling to more and more diverse data, we also increased the size of the model as much as we could. Here we found that bigger models, despite being slower to run, make more efficient use of computational resources. It’s cheaper to train a good big model than a good small model. The mantra of this project was to really keep it simple and to scale to simultaneously very large and, more importantly, diverse data and large model size. 

TINGLE: So, Megan, what were your major findings? And we know they’re major because they’re in Nature. [LAUGHS] 

STANLEY: Yeah, [LAUGHS] I guess they really are. So the main outcome of this project is we were actually able to train a single foundation model that achieves state-of-the-art performance in four different domains. Air pollution forecasting. For example, predicting particulate matter near the surface or ozone in the atmosphere. Ocean wave forecasting, which is critical for planning shipping routes.  

Tropical cyclone track forecasting, so that means being able to predict where a hurricane or a typhoon is expected to go, which is obviously incredibly important, and very high-resolution weather forecasting.  

And I’ve, kind of, named these forecasting domains as if they’re just items in a list, but in every single one, Aurora really pushed the limits of what is possible with AI models. And we’re really proud of that.  

But perhaps, kind of, you know, to my mind, the key takeaway here is that the foundation model approach actually works. So what we have shown is it’s possible to actually train some kind of general model, a foundation model, and then adapt it to a wide variety of environmental tasks. Now we definitely do not claim that Aurora is some kind of ultimate environmental forecasting model. We are sure that the model and the pretraining procedure can actually be improved. But, nevertheless, we’ve shown that this approach works for environmental forecasting. It really holds massive promise, and that’s incredibly cool. 

TINGLE: Wessel, what do you think will be the real-world impact of this work? 

BRUINSMA: Well, for applications that we mentioned, which are air pollution forecasting, ocean wave forecasting, tropical cyclone track forecasting, and very high-resolution weather forecasting, Aurora could today be deployed in real-time systems to produce near real-time forecasts. And, you know, in fact, it already is. You can view real-time weather forecasts by the high-resolution version of the model on the website of ECMWF (European Centre for Medium-Range Weather Forecasts). 

But what’s remarkable is that every one of these applications took a small team of engineers about four to eight weeks to fully execute. You should compare this to a typical development timeline for more traditional models, which can be on the order of multiple years. Using the pretraining and fine-tuning approach that we used for Aurora, we might see significantly accelerated development cycles for environmental forecasting problems. And that’s exciting.

TINGLE: Megan, if our listeners only walk away from this conversation with one key talking point, what would you like that to be? What should we remember about this paper? 

STANLEY: The biggest takeaway is that the pretraining and fine-tuning paradigm really works for environmental forecasting, right? So you can train a foundation model, it learns some kind of general-purpose representation of the Earth system dynamics, and this representation boosts performance in a wide variety of forecasting tasks. But we really want to emphasize that Aurora only scratches the surface of what’s actually possible.

So there are many more applications to explore than the four we’ve mentioned. And undoubtedly, the model and pretraining procedure can actually be improved. So we’re really excited to see what the next few years will bring. 

TINGLE: Wessel, tell us more about those opportunities and unanswered questions. What’s next on the research agenda in environmental prediction? 

BRUINSMA: Well, Aurora has two main limitations. The first is that the model produces only deterministic predictions, by which I mean a single predicted value. For variables like temperature, this is mostly fine. But other variables, like precipitation, are inherently stochastic. For these variables, we really want to assign probabilities to different levels of precipitation rather than predicting only a single value.

An extension of Aurora to allow this sort of prediction would be a great next step.  

The second limitation is that Aurora depends on a procedure called assimilation. Assimilation attempts to create a starting point for the model from real-world observations, such as from weather stations and satellites. The model then takes the starting point and uses it to make predictions. Unfortunately, assimilation is super expensive, so it would be great if we could somehow circumvent the need for it. 

Finally, what we find really important is to make our advancements available to the community.

[MUSIC] 

TINGLE: Great. Megan and Wessel, thanks for joining us today on the Microsoft Research Podcast. 

BRUINSMA: Thanks for having us. 

STANLEY: Yeah, thank you. It’s been great. 

TINGLE: You can check out the Aurora model on Azure AI Foundry. You can read the entire paper, “A Foundation Model for the Earth System,” at aka.ms/abstracts. And you’ll certainly find it on the Nature website, too.  

Thank you so much for tuning in to Abstracts today. Until next time.  

[MUSIC FADES] 



Collaborators: Healthcare Innovation to Impact


Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

Amid the ongoing surge of AI research, healthcare is emerging as a leading area for real-world transformation. From driving efficiency gains for clinicians to improving patient outcomes, AI is beginning to make a tangible impact. Thousands of scientific papers have explored AI systems capable of analyzing medical documents and images with unprecedented accuracy. The latest work goes even further, showing how healthcare agents can collaborate—with each other and with human doctors—embedding AI directly into clinical workflows. 

In this discussion, we explore how teams across Microsoft are working together to generate advanced AI capabilities and solutions for developers and clinicians around the globe. Leading the conversation are Dr. Matthew Lungren, chief scientific officer for Microsoft Health and Life Sciences, and Jonathan Carlson, vice president and managing director of Microsoft Health Futures—two key leaders behind this collaboration. They’re joined by Smitha Saligrama, principal group engineering manager within Microsoft Health and Life Sciences, Will Guyman, group product manager within Microsoft Health and Life Sciences, and Cameron Runde, a senior strategy manager for Microsoft Research Health Futures—all of whom play crucial roles in turning AI breakthroughs into practical, life-saving innovations. 

Together, these experts examine how Microsoft is helping integrate cutting-edge AI into healthcare workflows—saving time today, and lives tomorrow.


Learn more

Developing next-generation cancer care management with multi-agent orchestration
Source Blog, May 2025

Healthcare Agent Orchestrator
GitHub

Azure AI Foundry Labs
Homepage

Transcript

[MUSIC] 

MATTHEW LUNGREN: You’re listening to Collaborators, a Microsoft Research podcast, showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. Despite the advancements in AI over the decades, generative AI exploded into view in 2022, when ChatGPT became the, sort of, internet browser for AI and became the fastest adopted consumer software application in history.


JONATHAN CARLSON: From the beginning, healthcare stood out to us as an important opportunity for general reasoners to improve the lives and experiences of patients and providers. Indeed, in the past two years, there’s been an explosion of scientific papers looking at the application first of text reasoners in medicine, then multi-modal reasoners that can interpret medical images, and now, most recently, healthcare agents that can reason with each other. But even more impressive than the pace of research has been the surprisingly rapid diffusion of this technology into real world clinical workflows.

LUNGREN: So today, we’ll talk about how our cross-company collaboration has shortened that gap and delivered advanced AI capabilities and solutions into the hands of developers and clinicians around the world, empowering everyone in health and life sciences to achieve more. I’m Doctor Matt Lungren, chief scientific officer for Microsoft Health and Life Sciences. 

CARLSON: And I’m Jonathan Carlson, vice president and managing director of Microsoft Health Futures. 

LUNGREN: And together we brought some key players leading in the space of AI and health care from across Microsoft. Our guests today are Smitha Saligrama, principal group engineering manager within Microsoft Health and Life Sciences, Will Guyman, group product manager within Microsoft Health and Life Sciences, and Cameron Runde, a senior strategy manager for Microsoft Health Futures. 

CARLSON: We’ve asked these brilliant folks to join us because each of them represents a mission critical group of cutting-edge stakeholders, scaling breakthroughs into purpose-built solutions and capabilities for health care. 

LUNGREN: We’ll hear today how generative AI capabilities can unlock reasoning across every data type in medicine: text, images, waveforms, genomics. And further, how multi-agent frameworks in healthcare can accelerate complex workflows, in some cases acting as a specialist team member, safely secured inside the Microsoft 365 tools used by hundreds of millions of healthcare enterprise users across the world. The opportunity to save time today and lives tomorrow with AI has never been larger.

[MUSIC FADES] 
 
MATTHEW LUNGREN: Jonathan. You know, it’s been really interesting kind of observing Microsoft Research over the decades. I’ve, you know, been watching you guys in my prior academic career. You are always on the front of innovation, particularly in health care. And I find it fascinating that, you know, millions of people are using the solutions that, you know, your team has developed over the years and yet you still find ways to stay cutting edge and state of the art, even in this accelerating time of technology and AI, particularly, how do you do that? [LAUGHS]

 JONATHAN CARLSON: I mean, it’s some of what’s in our DNA, I mean, we’ve been publishing in health and life sciences for two decades here. But when we launched Health Futures as a mission-focused lab about 7 or 8 years ago, we really started with the premise that the way to have impact was to really close the loop between, not just good ideas that get published, but good ideas that can actually be grounded in real problems that clinicians and scientists care about, that then allow us to actually go from that first proof of concept into an incubation, into getting real world feedback that allows us to close that loop. And now with, you know, the HLS organization here as a product group, we have the opportunity to work really closely with you all to not just prove what’s possible in the clinic or in the lab, but actually start scaling that into the broader community. 

CAMERON RUNDE: And one thing I’ll add here is that the problems that we’re trying to tackle in health care are extremely complex. And so, as Jonathan said, it’s really important that we come together and collaborate across disciplines as well as across the company of Microsoft and with our external collaborators, as well across the whole industry. 

CARLSON: So, Matt, back to you. What are you guys doing in the product group? How do you guys see these models getting into the clinic?

LUNGREN: You know, I think a lot of people, you know, think about AI is just, you know, maybe just even a few years old because of GPT and how that really captured the public’s consciousness. Right?

And so, you think about the speech-to-text technology of being able to dictate something, for a clinic note or for a visit, that was typically based on Nuance technology. And so there’s a lot of product understanding of the market, how to deliver something that clinicians will use, understanding the pain points and workflows and really that Health IT space, which is sometimes the third rail, I feel like with a lot of innovation in healthcare. 

But beyond that, I mean, I think now that we have this really powerful engine of Microsoft and the platform capabilities, we’re seeing innovations on the healthcare side for data storage and data interoperability with different types of medical data. You have new applications coming online, the ability, of course, to see generative AI now infused into the speech-to-text and becoming Dragon Copilot, which is something that has been, you know, tremendously well received by the community.

Physicians are able to now just have a conversation with a patient. They turn to their computer and the note is ready for them. There’s no more of this; we call it keyboard liberation. I don’t know if you’ve heard that before. And that’s just been tremendous. And there’s so much more coming from that side. And then there’s other parts of the workflow that we also get engaged in — the diagnostic workflow.

So medical imaging, sharing images across different hospital systems, the list goes on. And so now when you move into AI, we feel like there’s a huge opportunity to deliver capabilities into the clinical workflow via the products and solutions we already have. But, I mean, now that we’ve kind of expanded our team to involve Azure and platform, we’re really able to focus on the developers.

WILL GUYMAN: Yeah. And you’re always telling me as a doctor how frustrating it is to be spending time at the computer instead of with your patients. I think you told me, you know, 4,000 clicks a day for the typical doctor, which is tremendous. And something like Dragon Copilot can save that five minutes per patient. But it can also now take actions after the patient encounter so it can draft the after-visit summary. 

It can order labs and medications for the referral. And that’s incredible. And we want to keep building on that. There’s so many other use cases across the ecosystem. And so that’s why in Azure AI Foundry, we have translated a lot of the research from Microsoft Research and made that available to developers to build and customize for their own applications. 

SMITHA SALIGRAMA: Yeah. And as you were saying, in our transformation of moving from solutions to platforms, and in scaling solutions to multiple scenarios, as we put our models in AI Foundry, we provide these developer capabilities, like bringing your own data and fine-tuning these models, and then applying them to scenarios that we couldn’t even imagine. So that’s kind of the platform play we’re scaling now.

LUNGREN: Well, I want to do a reality check because, you know, I think to us that are now really focused on technology, it seems like I’ve heard this story before, right. I remember even in my academic clinical days, it felt like technology was always the quick answer, and there was maybe a disconnect between what my problems were, or what I thought needed to be done, versus the solutions that were created or offered to us. And I guess at some level, how, Jonathan, do you think about this? Because to do things well in the science space is one thing, but to also have it be something that actually drives health care innovation and practice and translation, that’s tricky, right?

CARLSON: Yeah. I mean, as you said, I think one of the core pathologies of Big Tech is we assume every problem is a technology problem. And that’s all it will take to solve the problem. And I think, look, I was trained as a computational biologist, and that sits in the awkward middle between biology and computation. And the thing that we always have to remember, the thing that we were very acutely aware of when we set out, was that we are not the experts. We do have, you know, you as an M.D., we have everybody on the team, we have biologists on the team. 

But this is a big space. And the only way we’re going to have real impact, the only way we’re even going to pick the right problems to work on is if we really partner deeply, with providers, with EHR (electronic health records) vendors, with scientists, and really understand what’s important and again, get that feedback loop. 

RUNDE: Yeah, I think we really need to ground the work that we do in the science itself. You need to understand the broader ecosystem and the broader landscape, across health care and life sciences, so that we can tackle the most important problems, not just the problems that we think are important. Because, as Jonathan said, we’re not the experts in health care and life sciences. And that’s really the secret sauce. When you have the clinical expertise come together with the technical expertise. That’s how you really accelerate health care.

CARLSON: When we really launched this, this mission, 7 or 8 years ago, we really came in with the premise of, if we decide to stop, we want to be sure the world cares. And the only way that’s going to be true is if we’re really deeply embedded with the people that matter–the patients, the providers and the scientists.

LUNGREN: And now it really feels like this collaborative effort, you know, really can help start to extend that mission. Right. I think, you know, Will and Smitha, that we definitely feel the passion and the innovation. And we certainly benefit from those collaborations, too. But then we have these other partners and even customers, right, that we can start to tap into and have that flywheel keep spinning. 

GUYMAN: Yeah. And the whole industry is an ecosystem. So, we have our own data sets at Microsoft Research that you’ve trained amazing AI models with. And those are in the catalog. But then you’ve also partnered with institutions like Providence or Paige.AI. And those models are in the catalog with their data. And then there are third parties like Nvidia that have their own specialized proprietary data sets, and their models are there too. So, we have this ecosystem of open source models. And maybe Smitha, you want to talk about how developers can actually customize these.

SALIGRAMA: Yeah. So we use the Azure AI Foundry ecosystem. Developers can feel at home if they’re using AI Foundry. So they can look at the model cards that we publish alongside the models, understand the use cases of these models, how to quickly bring up these APIs, look at different ways to apply them, and even fine-tune these models with their own data. Right. And then use them for specific tasks that we couldn’t have even imagined.

LUNGREN: Yeah it has been interesting to see we have these health care models in the catalog again, some that came from research, some that came from third parties and other product developers and Azure’s kind of becoming the home base, I think, for a lot of health and life science developers. They’re seeing all the different modalities, all the different capabilities. And then in combination with Azure OpenAI, which as we know, is incredibly competent in lots of different use cases. How are you looking at the use cases, and what are you seeing folks use these models for as they come to the catalog and start sharing their discoveries or products? 

GUYMAN: Well, the general-purpose large language models are amazing for medical general reasoning. So Microsoft Research has shown that they can perform super well on, for example, the United States Medical Licensing Examination; they can exceed doctor performance if they’re just picking between different multiple-choice questions. But real medicine, we know, is messier. It doesn’t always start with the whole patient context provided as text in the prompt. You have to get the source data, and that raw data is often non-text. The majority of it is non-text. It’s things like medical imaging, radiology, pathology, ophthalmology, dermatology. It goes on and on. And there’s endless signal data, lab data. And so all of this diverse data needs to be processed through specialized models because much of that data is not available on the public internet.

And that’s why we’re taking this partner approach, first party and third party models that can interpret all this kind of data and then connect them ultimately back to these general reasoners to reason over that. 

LUNGREN: So, you know, I’ve been at this company for a while and, you know, familiar with kind of how long it takes, generally to get, you know, a really good research paper, do all the studies, do all the data analysis, and then go through the process of publishing, right, which takes, as, you know, a long time and it’s, you know, very rigorous. 

And one of the things that struck me, last year, I think we, we started this big collaboration and, within a quarter, you had a Nature paper coming out from Microsoft Research, and that model that the Nature paper was describing was ready to be used by anyone on the Azure AI Foundry within that same quarter. It kind of blew my mind when I thought about it, you know, even though we were all, you know, working very hard to get that done. Any thoughts on that? I mean, has this ever happened in your career? And, you know, what’s the secret sauce to that? 

CARLSON: Yeah, I mean, the time scale from research to product has been massively compressed. And I’d push that even further, which is to say, the reason why it took a quarter was because we were laying the railroad tracks as we were driving the train. We have examples right after that where we were launching on Foundry the same day we were publishing the paper.

And frankly, the review times are becoming longer than it takes to actually productize the models. I think there are two things going on that are really converging. One is that the overall ecosystem is converging on a relatively small number of patterns, and that gives us, as a tech company, a reason to go off and really make those patterns hardened in a way that allows not just us, but third parties as well, to really have a nice workflow to publish these models.

But the other is actually, I think, a change in how we work, you know, and for most of our history as an industrial research lab, we would do research and then we’d go pitch it to somebody and try and throw it over the fence. We’ve really built a much more integrated team. In fact, if you look at that Nature paper or any of the other papers, there’s folks from product teams. Many of you are on the papers along with our clinical collaborators.

RUNDE: Yeah. I think one thing that’s really important to note is that there’s a ton of different ways that you can have impact, right? So I like to think about phasing. In Health Futures at least, I like to think about phasing the work that we do. So first we have research, which is really early innovation. And the impact there is getting our technology and our tools out there and really sharing the learnings that we’ve had. 

So that can be through publications like you mentioned. It can be through open-sourcing our models. And then you go to incubation. So, this is, I think, one of the newer spaces that we’re getting into, which is maybe that blurred line between research and product. Right. Which is, how do we take the tools and technologies that we’ve built and get them into the hands of users, typically through our partnerships?

Right. So, we partner very deeply and collaborate very deeply across the industry. And incubation is really important because we get that early feedback. We get an ability to pivot if we need to. And we also get the ability to see what types of impact our technology is having in the real world. And then lastly, when you think about scale, there’s tons of different ways that you can scale. We can scale third-party through our collaborators and really empower them to go to market to commercialize the things that we’ve built together. 

You can also think about scaling internally, which is why I’m so thankful that we’ve created this flywheel between research and product, and a lot of the models that we’ve built that have gone through research, have gone through incubation, have been able to scale on the Azure AI Foundry. But that scale piece is not really our expertise, right? Our piece is the research and incubation. Smitha, how do you think about scaling?

SALIGRAMA: So, there are several angles to scaling the models, the state-of-the-art models we see from the research team. The first angle is open sourcing, to get developer trust, with very generous commercial licenses so that they can use the models for their own use cases. The second is, we also allow them to customize these models, fine-tuning them with their own data. So there are a lot of different angles to how we provide support and scaling for the state-of-the-art models we get from the research org.

GUYMAN: And as one example, you know, University of Wisconsin Health, you know, which Matt knows well. They took one of our models, which is highly versatile. They customized it in Foundry and they optimized it to reliably identify abnormal chest X-rays, the most common imaging procedure, so they could improve their turnaround time and triage quickly. And that’s just one example. But we have other partners like Sectra who are doing more of the operations use cases, automatically routing imaging to the radiologists, setting them up to be efficient. And then Paige.AI is doing, you know, biomarker identification for diagnostics and new drug discovery. So, there are so many use cases that we have partners already building and customizing.

LUNGREN: The part that’s striking to me is just that, you know, we could all sit in a room and think about all the different ways someone might use these models on the catalog. And I’m still shocked at the stuff that people use them for and how effective they are. And I think part of that is, you know, again, we talk a lot about generative AI in healthcare and all the things you can do. Again, you know, in text, you referred to that earlier, and certainly off the shelf, there are really powerful applications. But there is, you know, kind of this tip-of-the-iceberg effect where, under the water, most of the data that we use to take care of our patients is not text. Right. It’s all the different other modalities. And I think that this has been an unlock, right, sort of taking these innovations from the community, putting them in this ecosystem kind of catalog, essentially. Right. And then allowing folks to kind of, you know, build and develop applications with all these different types of data. Again, I’ve been surprised at what I’m seeing.

CARLSON: This has been just one of the most profound shifts that’s happened in the last 12 months, really. I mean, two years ago we had general models in text that really shifted how we think about, I mean, natural language processing got totally upended by that. Turns out the same technology works for images as well. It doesn’t only allow you to automatically extract concepts from images, but allows you to align those image concepts with text concepts, which means that you can have a conversation with that image. And once you’re in that world now, you are a place where you can start stitching together these multimodal models that really change how you can interact with the data, and how you can start getting more information out of the raw primary data that is part of the patient journey.

LUNGREN: Well, and we’re going to get to that because I think you just touched on something, and I want to re-emphasize stitching these things together. There’s a lot of different ways to potentially do that, right? There’s ways that you can literally train the model end to end with adapters and all kinds of other early fusion approaches. All kinds of ways. But one of the things, the word of the year, I guess, is going to be agents, and an agent is a very interesting term to think about how you might abstract away some of the components or the tasks that you want the model to accomplish in the midst of a real human-to-model interaction. Can you talk a little bit more about how we’re thinking about agents in this platform approach?
 
GUYMAN: Well, this is our newest addition to the Azure AI Foundry. So there’s an agent catalog now where we have a set of pre-configured agents for health care. And then we also have a multi-agent orchestrator that can jump start the process of developers building their own multi-agent workflows to tackle some complex real-world tasks that clinicians have to deal with. And these agents basically combine a general reasoner, like a large language model, like a GPT-4o or an o-series model, with a specialized model, like a model that understands radiology or pathology, with domain-specific knowledge and tools. So the knowledge might be, you know, public guidelines or, you know, medical journals or your own private data from your EHR or medical imaging system, and then tools like a code interpreter to deal with all of the numeric data, or tools that the clinicians are using today, like PowerPoint, Word, Teams, etc. And so we’re allowing developers to build and customize each of these agents in Foundry and then deploy them into their workflows.
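As a language-level illustration of that pattern, a coordinator routing work to specialist agents that pair a reasoner with domain models and tools, here is a minimal sketch. It is not the healthcare agent orchestrator’s actual API; every class and name below is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Agent:
    name: str
    instructions: str                                          # domain guidance, e.g., clinical guidelines
    tools: Dict[str, Callable] = field(default_factory=dict)   # e.g., a code interpreter

    def handle(self, task: str) -> str:
        # A real agent would call a general reasoner (LLM) plus a specialized
        # model (radiology, pathology, ...) using its instructions and tools.
        return f"[{self.name}] would analyze: {task}"

class Orchestrator:
    def __init__(self, agents: Dict[str, Agent]):
        self.agents = agents

    def route(self, task: str) -> str:
        # Toy keyword routing; a production orchestrator would let an LLM pick
        # the right specialist and pass along the relevant patient context.
        for keyword, agent in self.agents.items():
            if keyword in task.lower():
                return agent.handle(task)
        return self.agents["general"].handle(task)

orchestrator = Orchestrator({
    "radiology": Agent("RadiologyAgent", "Interpret imaging findings."),
    "pathology": Agent("PathologyAgent", "Interpret tissue slides."),
    "general": Agent("GeneralAgent", "Summarize the patient record."),
})
print(orchestrator.route("Review the pathology slide for tumor grading"))
```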

LUNGREN: And I really like that concept because, you know, from the user personas, I think about myself as a user. How am I going to interact with these agents? Where does it naturally fit? And, you know, I’ve seen some of the demonstrations and some of the work that’s going on with Stanford in particular, showing that, literally in a Teams chat, I can have my clinician colleagues and I can have specialized health care agents that kind of interact, like I’m interacting with a human on a chat.

It is a completely mind-blowing thing for me. And it’s a light bulb moment for me, too. I wonder, what have we heard from folks that have, you know, tried out this health care agent orchestrator in this kind of deployment environment via Teams?

GUYMAN: Well, someone joked, you know, are you sure you’re not using Teams because you work at Microsoft? [LAUGHS] But then we actually were meeting with one of the radiologists at one of our partners, and they said that that morning they had just done a Teams meeting where they had met with other specialists to talk about a patient’s cancer case and come up with a treatment plan.

And that was the light bulb moment for us. We realized, actually, Teams is already being used by physicians as an internal communication tool, as a tool to get work done. And especially since the pandemic, a lot of the meetings moved to virtual and telemedicine. And so it’s a great distribution channel for AI, which has often struggled to actually get into the hands of clinicians. And so now we’re allowing developers to build and then deploy very easily and extend it into their own workflows.

CARLSON: I think that’s such an important point. I mean, if you think about it, one of the really important concepts in computer science is an application programming interface, like some set of rules that allow two applications to talk to each other. One of the big pushes, really important pushes, in medicine has been standards that allow us to actually have data standards and APIs that allow these to talk to each other, and yet still we end up with these silos. There’s silos of data. There’s silos of applications.

And just like when you and I work on our phone, we have to go back and forth between applications. One of the things that I think agents do is that it takes the idea that now you can use language to understand intent and effectively program an interface, and it creates a whole new abstraction layer that allows us to simplify the interaction between not just humans and the endpoint, but also for developers. 

It allows us to have this abstraction layer that lets different developers focus on different types of models, and yet stitch them all together in a very, very natural, way, not just for the users, but for the ability to actually deploy those models. 

SALIGRAMA: Just to add to what Jonathan was mentioning, the other cool thing about the Microsoft Teams user interface is it’s also enterprise ready.

RUNDE: And one important thing that we’re thinking about is exactly this, from the very early research through incubation and then to scale, obviously, right. And so early on in research, we are actively working with our partners and our collaborators to make sure that we have the right data privacy and consent in place. We’re doing this in incubation as well. And then obviously at scale. Yep.

LUNGREN: So, I think AI has always been thought of as a savior kind of technology. We talked a little bit about how there’s been some ups and downs in terms of the ability for technology to be effective in health care. At the same time, we’re seeing a lot of new innovations that are really making a difference. But then we kind of get, you know, we talked about agents a little bit. It feels like we’re maybe abstracting too far. Maybe things are going too fast, almost. What makes this different? I mean, in your mind is this truly a logical next step, or is it going to take some time?

CARLSON: I think there are a couple things that have happened. I think first, on just the pure technology side: what led to ChatGPT? I like to think of really three major breakthroughs.

The first was new mathematical concepts of attention, which really means that we now have a way that a machine can figure out which parts of the context it should actually focus on, just the way our brains do. Right? I mean, if you’re a clinician and somebody is talking to you, the majority of that conversation is not relevant for the diagnosis. But you know how to zoom in on the parts that matter. That’s a super powerful mathematical concept. The second one is this idea of self-supervision. So, I think one of the fundamental problems of machine learning has been that you have to train on labeled training data, and labels are expensive, which means datasets are small, which means the final models are very narrow and brittle. And the idea of self-supervision is that you can just get a model to automatically learn concepts; in language, that’s just predicting the next word. And what’s important about that is that it leads to models that can actually manipulate and understand really messy text and pull out what’s important about it, and then stitch that back together in interesting ways.

And the third concept, which came out of those first two, was just the observation about scale. And that’s that more is better: more data, more compute, bigger models. And that really gives a reason to keep investing and for these models to keep getting better. So that, as a groundwork, is what led to ChatGPT. That’s what led to our ability now to not just have rule-based systems or simple machine learning systems take a messy EHR record, say, and pull out a couple of concepts.

But to really feed the whole thing in and say, okay, I need you to figure out which concepts are in here, and whether this particular attribute is there, for example. That’s now led to the next breakthrough, which is that all those core ideas apply to images as well. They apply to proteins, to DNA. And so we’re starting to see models that understand images and the concepts of images, and can actually map those back to text as well. 

So, you can look at a pathology image and say, not just what’s at the cell level, but that there appears to be a certain sort of cancer in this particular tissue. And then you take those two things together and you layer on the fact that now you have a model, or a set of models, that can understand intent, can understand human concepts and biomedical concepts, and you can start stitching them together into specialized agents that can actually reason with each other, which at some level gives you an API as a developer to say, okay, I need to focus on a pathology model and get this really, really sound while somebody else is focusing on a radiology model, but now allows us to stitch these all together with a user interface that we can talk to through natural language. 

RUNDE: I’d like to double click a little bit on that medical abstraction piece that you mentioned. Just the amount of clinical data that there is for each individual patient. Let’s think about cancer patients for a second to make this real. Right. For every cancer patient, it could take a couple of hours to structure their information. And why is that important? Because you have to get that information in a structured way and abstract relevant information to be able to unlock precision health applications, right, for each patient. So, to be able to match them to a trial, right, someone has to sit there and go through all of the clinical notes from their entire patient care journey, from the beginning to the end. And that’s not scalable. And so one thing that we’ve been doing in an active project that we’ve been working on with a handful of our partners, but Providence specifically, I’ll call out, is using AI to actually abstract and curate that information. So that gives time back to the health care provider to spend with patients, instead of spending all their time curating this information. 

And this is super important because it sets the scene and the backbone for all those precision health applications. Like I mentioned, clinical trial matching, tumor boards is another really important example here. Maybe Matt, you can talk to that a little bit.

LUNGREN: It’s a great example. And you know, it’s so funny. We’ve talked about this use case, and you know, the health care agent orchestrator’s initial lighthouse use case was a tumor board setting. And I remember when we first started working with some of the partners on this, I think we were, you know, under a research kind of lens, thinking about what new diagnoses it could have come up with or what new insights it might have. And what was a really key moment for us, I think, was noticing that we had developed an agent that can take all of the multimodal data about a patient’s chart, organize it in a timeline, in chronological fashion, and then allow folks to click on different parts of the timeline to ground it back to the note. And just that, which doesn’t sound like a really interesting research paper, was mind blowing for clinicians who, again, as you said, spend a great deal of time, often outside of the typical work hours, trying to organize these patient records in order to go present to a tumor board.

And a tumor board is a critical meeting that happens at many cancer centers where specialists all get together, come with their perspective, and make a comment on what would be the best next step in treatment. But the background in preparing for that is, you know, again, organizing the data. But to your point, also, what are the clinical trials that are active? There are thousands of clinical trials. There’s hundreds added every day. How can anyone keep up with that? And these are the kinds of use cases that start to bubble up. And you realize that a technology that understands concepts and context and can reason over vast amounts of data with a language interface, that is a powerful tool. Even before we get to some of the, you know, unlocking of new insights and even precision medicine, this is that idea, to me, of saving time before saving lives. And there’s an enormous amount of undifferentiated heavy lifting that happens in health care that these agents and these kinds of workflows can start to unlock. 

GUYMAN: And we’ve packaged these agents. The manual abstraction work that, you know, manually takes hours, now we have an agent for. It’s in Foundry along with the clinical trial matching agent, which I think at Providence you showed could double the match rate over the baseline that they were using, by using the AI for multiple data sources. So, we have that, and then we have this orchestration that is using this really neat technology from Microsoft Research: Semantic Kernel, Magentic-One, OmniParser. These are technologies that are good at figuring out which agent to use for a given task. So a clinician who’s used to working with other specialists, like a radiologist, a pathologist, a surgeon, they can now also consult these specialist agents who are experts in their domain, and there’s shared memory across the agents. 

There’s turn taking, there’s negotiation between the agents. So, there’s this really interesting system that’s emerging. And again, this is all possible to be used through Teams. And there’s some great extensibility as well. We’ve been talking about that and working on some cool tools. 

SALIGRAMA: Yeah. Yeah. No, I think if I have to geek out a little bit on how all these agent orchestrations are coming together: I’ve been in software engineering for decades, and this is kind of the next version of distributed systems, where you have these services that talk to each other. It’s a more natural way, because LLMs give us natural language instead of structured APIs for conversing. We have these agents which can naturally understand how to talk to each other. Right. So this is like the next evolution of our systems. And the way we’re packaging all of this is in multiple ways, based on all the standards and innovation that’s happening in this space. So, first of all, we are building these agents that are very good at specific tasks, like Will was saying, a trial matching agent or patient timeline agents. 

So, we take all of these, and then we package them in a workflow and an orchestration. We use the standards, some of these coming from research: Semantic Kernel, Magentic-One. And then, all of this also allows us to extend these agents with custom agents that can be plugged in. So, we are open sourcing the entire agent orchestration in AI Foundry templates, so that developers can extend it with their own agents and make their own workflows out of it. So, a lot of cool innovation is happening to apply this technology to specific scenarios and workflows. 

LUNGREN: Well, I was going to ask you about that extension. So, folks can say, hey, I have maybe a really specific part of my workflow that I want to use some agents for, maybe one of the agents that can do PubMed literature search, for example. But then there are also agents that come in from the outside, you know; I can imagine a software company or AI company that has a built-in agent that plugs in as well. 

SALIGRAMA: Yeah. Yeah, absolutely. So, you can bring your own agent. And then we have these standard ways of communicating with agents and integrating with the orchestration language, so you can bring your own agent and extend this health care agent orchestrator to your own needs. 

LUNGREN: I can just think of, like, in a group chat, like a bunch of different specialist agents. And I really would want an orchestrator to help find the right tool, to your point earlier, because I’m guessing this ecosystem is going to expand quickly. Yeah. And I may not know which tool is best for which question. I just want to ask the question. Right. 

SALIGRAMA: Yeah. Yeah. 

CARLSON: Well, I think to that point, too, I mean, you made an important point here, which is tools, and these are not necessarily just AI tools. Right? I mean, we’ve known this for a while, right? LLMs are not very good at math, but you can have one use a calculator and then it works very well. And you know, you guys both brought up the universal medical abstraction a couple times. 

And one of the things that I find so powerful about that is we’ve long had this vision within the precision health community that we should be able to have a learning hospital system. We should be able to actually learn from the actual real clinical experiences that are happening every day, so that we can stop practicing medicine based off averages. 

There’s a lot of work that’s gone on for the last 20 years about how to actually do causal inference. That’s not an AI question. That’s a statistical question. The bottleneck, the reason why we haven’t been able to do that is because most of that information is locked up in unstructured text. And these other tools need essentially a table. 

And so now you can decompose this problem, say, well, what if I can use AI not to get to the causal answer, but to just structure the information. So now I can put it into the causal inference tool. And these sorts of patterns I think again become very, not just powerful for a programmer, but they start pulling together different specialties. And I think we’ll really see an acceleration, really, of collaboration across disciplines because of this. 

CARLSON: So, when I joined Microsoft Research 18 years ago, I was doing work in computational biology. And I would always have to answer the question: why is Microsoft in biomedicine? And I would always kind of joke, saying, well, it is. We sell Office and Windows to every health care system in the world. We’re already in the space. And it really struck me to now see that we’ve actually come full circle. And now you can actually connect in Teams, Word, PowerPoint, which are these tools that everybody uses every day, but they’re actually now specializable through these agents. Can you guys talk a little bit about what that looks like from a developer perspective? How can provider groups actually start playing with this and see this come to life? 

SALIGRAMA: A lot of healthcare organizations already use Microsoft productivity tools, as you mentioned. So, as developers build these agents and use our healthcare agent orchestrator to plug in these agents and expose them in these productivity tools, they get access to all these healthcare workers. The healthcare agent orchestrator we have today integrates with Microsoft Teams, and it showcases an example of how you can at (@) mention these agents and talk to them like you were talking to another person in a Teams chat. And then it also provides examples of how these agents can use these productivity tools. One of the examples we have there is how they can summarize the assessments of this whole chat into a Word doc, or even convert that into a PowerPoint presentation, for later on.

CARLSON: One of the things that has struck me is how easy it is to do. I mean, Will, I don’t know if you’ve worked with folks that have gone from 0 to 60, like, how fast? What does that look like? 

GUYMAN: Yeah, it’s funny; for us, the technology to transfer all this context into a Word document or PowerPoint presentation for a doctor to take to a meeting is relatively straightforward compared to the complicated clinical trial matching multimodal processing. The feedback has been tremendous in terms of, wow, that saves so much time to have this organized report that I can then show up to a meeting with, and the agents can come with me to that meeting because they’re literally having a Teams meeting, often with other human specialists. And the agents can be there and ask and answer questions and fact check and source all the right information on the fly. So, there’s a nice integration into these existing tools. 

LUNGREN: We worked with several different centers just to kind of understand, you know, where this might be useful. And, like, as I think we talked about before, among the ideas that we’ve come up with, this is a great one because it’s complex. It’s kind of hairy. There’s a lot happening under the hood that doesn’t necessarily require a medical license to do, right, to prepare for a tumor board and to organize data. But it’s fascinating, actually. So, you know, folks have come up with ideas of, could I have an agent that can operate an MRI machine, where I can ask the agent to change some parameters or redo a protocol. We thought that was a pretty powerful use case. We’ve had others that have said, you know, I really want a specific agent that’s able to act kind of like deep research does on the consumer side, but based on the context of my patient, so that it can search all the literature and pull the data in the papers that are relevant to this case. And the list goes on and on, from operations all the way to clinical, you know, sort of decision making at some level. And I think that the research community that’s going to sprout around this will help guide us to see what are the most high-impact use cases, where this is effective, and maybe where it’s not effective.

But to me, the part that makes me so excited about this is just that I don’t have to think about, okay, well, then we have to figure out health IT. Because we always have great ideas in research, and it always feels like there’s such a huge chasm to get them in front of the health care workers who might want to test them out. And it feels like, again, this productivity tool use case, with the enterprise security and the possibility for bringing in third parties to contribute, really does feel like it’s a new surface area for innovation.

CARLSON: Yeah, I love that. Look. Let me end by putting you all on the spot. So, in three years, multimodal agents will do what? Matt, I’ll start with you. 

LUNGREN: I am convinced that it’s going to save a massive amount of time before it saves many lives. 

RUNDE: I’ll focus on the patient care journey and diagnostic journey. I think it will kind of transform that process for the patient itself and shorten that process. 

GUYMAN: Yeah, I think we’ve seen already papers recently showing that different modalities surfaced complementary information. And so we’ll see kind of this AI and these agents becoming an essential companion to the physician, surfacing insights that would have been overlooked otherwise. 

SALIGRAMA: And similar to what you guys were saying, agents will become important assistants to healthcare workers, reducing a lot of the documentation and workflow excess work they have to do. 

CARLSON: I love that. And I guess for my part, I think really what we’re going to see is a massive unleash of creativity. We’ve had a lot of folks that have been innovating in this space, but they haven’t had a way to actually get it into the hands of early adopters. And I think we’re going to see that really lead to an explosion of creativity across the ecosystem. 

LUNGREN: So, where do we get started? Like where are the developers who are listening to this? The folks that are at, you know, labs, research labs and developing health care solutions. Where do they go to get started with the Foundry, the models we’ve talked about, the healthcare agent orchestrator. Where do they go?

GUYMAN: So AI.azure.com is the AI Foundry. It’s a website you can go as a developer. You can sign in with your Azure subscription, get your Azure account, your own VM, all that stuff. And you have an agent catalog, the model catalog. You can start from there. There is documentation and templates that you can then deploy to Teams or other applications. 

LUNGREN: And tutorials are coming. Right. We have recordings of tutorials. We’ll have Hackathons, some sessions and then more to come. Yeah, we’re really excited. 

[MUSIC] 

LUNGREN: Thank you so much, guys for joining us. 

CARLSON: Yes. Yeah. Thanks. 

SALIGRAMA: Thanks for having us. 

[MUSIC FADES] 

The post Collaborators: Healthcare Innovation to Impact appeared first on Microsoft Research.


Magentic-UI, an experimental human-centered web agent



Modern productivity is rooted in the web—from searching for information and filling in forms to navigating dashboards. Yet, many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for tasks that are action-oriented and require activities beyond just performing simple web searches.

Magentic-UI builds on Magentic-One (opens in new tab), a powerful multi-agent team we released last year, and is powered by AutoGen (opens in new tab), our leading agent framework. It is available under MIT license at https://github.com/microsoft/Magentic-UI (opens in new tab) and on Azure AI Foundry Labs (opens in new tab), the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample (opens in new tab).

Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:

  1. Collaborative planning with users (co-planning). Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions. 
  2. Collaborative execution with users (co-tasking). Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
  3. Safety with human-in-the-loop (action guards). Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
  4. Learning from experience (plan learning). Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks. 
Figure 1: Screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan and progress to accomplish a user’s complex goal. The right side shows the browser Magentic-UI is controlling. 

How is Magentic-UI human-centered?

While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design. 

Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.

For example, after a person specifies a task and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial as users may have expectations of how the task should be completed; communicating that information could significantly improve agent performance. We call this feature co-planning.

During execution, Magentic-UI shows in real time what specific actions it’s about to take. For example, whether it is about to click on a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.

Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent.
Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important. 

Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky (e.g., paying for an item), they can reject it. 
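To make the action-guard idea concrete, here is a minimal sketch of how an approval policy might gate agent actions. The classes and helper functions below are illustrative assumptions, not Magentic-UI's actual API.

```python
# Illustrative action-guard sketch: decide whether an action needs the user's
# approval before it runs. Names and policies here are hypothetical.
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class GuardPolicy(Enum):
    ALWAYS_ASK = "always_ask"                    # approval required for every action
    ASK_IF_IRREVERSIBLE = "ask_if_irreversible"
    NEVER_ASK = "never_ask"


@dataclass
class Action:
    description: str                             # e.g. "click the 'Place order' button"
    irreversible: bool                           # the agent flags side-effecting actions


def needs_approval(action: Action, policy: GuardPolicy) -> bool:
    if policy is GuardPolicy.ALWAYS_ASK:
        return True
    if policy is GuardPolicy.ASK_IF_IRREVERSIBLE:
        return action.irreversible
    return False


def run_action(action: Action, policy: GuardPolicy,
               ask_user: Callable[[Action], bool]) -> str:
    """Execute the action only if the guard policy and the user allow it."""
    if needs_approval(action, policy) and not ask_user(action):
        return f"rejected: {action.description}"
    return f"executed: {action.description}"


# Example: a payment step is rejected, while scrolling needs no approval.
deny_payments = lambda a: "pay" not in a.description
print(run_action(Action("pay for the item", True),
                 GuardPolicy.ASK_IF_IRREVERSIBLE, deny_payments))
print(run_action(Action("scroll the results page", False),
                 GuardPolicy.ASK_IF_IRREVERSIBLE, deny_payments))
```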

Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.

After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans for Magentic-UI to reuse in the future in a saved-plans gallery. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight. 

Combined, these four features—co-planning, co-tasking, action guards, and plan learning—enable users to collaborate effectively with Magentic-UI.

Architecture

Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:

  • Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
  • WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator.
  • Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
  • FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown (opens in new tab) package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.
Figure 6: System architecture diagram of Magentic-UI
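Since Magentic-UI is powered by AutoGen, a team along these lines can be assembled from AutoGen's agent-chat packages. The snippet below is a minimal sketch assuming the AutoGen 0.4-style MagenticOneGroupChat, MultimodalWebSurfer, and OpenAIChatCompletionClient classes; module paths and signatures may differ across versions, and this is not Magentic-UI's own code, so check the AutoGen documentation before relying on it.

```python
# Minimal sketch: a Magentic-One-style team with a single WebSurfer agent.
# Assumes the AutoGen 0.4 packages autogen-agentchat and autogen-ext; class
# names and signatures reflect my understanding and may differ by version.
import asyncio

from autogen_agentchat.teams import MagenticOneGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.agents.web_surfer import MultimodalWebSurfer
from autogen_ext.models.openai import OpenAIChatCompletionClient


async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    # Magentic-UI also adds Coder and FileSurfer agents; one agent keeps this short.
    web_surfer = MultimodalWebSurfer("WebSurfer", model_client=model_client)
    team = MagenticOneGroupChat([web_surfer], model_client=model_client)
    # The orchestrator plans, delegates steps to the WebSurfer, and reports back.
    await Console(team.run_stream(task="Find the price of a Seattle-Boston flight."))


if __name__ == "__main__":
    asyncio.run(main())
```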

To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, and regenerate steps, and write follow-up messages to iterate on the plan. While having the user edit the plan adds an upfront cost to the interaction, it can potentially save a significant amount of time when the agent executes the plan and increase its chance of success.

The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents (WebSurfer, Coder, FileSurfer) or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response. After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step.

Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate (for example, because a certain website is unreachable), the Orchestrator can replan with user permission and start executing a new plan.
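As a purely illustrative summary of that control flow, the sketch below loops over an approved plan and delegates each step; this is not Magentic-UI's actual code, and the data structures and helper names are hypothetical.

```python
# Illustrative sketch of an orchestrator loop over an approved plan.
# All names here are hypothetical; completion checks and replanning are
# only hinted at in comments.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlanStep:
    description: str
    assignee: str  # "WebSurfer", "Coder", "FileSurfer", or "user"


def run_plan(plan: List[PlanStep],
             agents: Dict[str, Callable[[str], str]],
             ask_user: Callable[[str], str]) -> str:
    """Execute each step with the chosen agent or the user, then summarize."""
    transcript = []
    for step in plan:
        if step.assignee == "user":
            response = ask_user(step.description)               # hand the step to the user
        else:
            response = agents[step.assignee](step.description)  # delegate to an agent
        transcript.append((step.description, response))
        # A real orchestrator would now ask an LLM whether the step is complete
        # and, if the plan looks inadequate, replan with the user's permission.
    return f"Completed {len(transcript)} steps."                # stand-in for the final answer


if __name__ == "__main__":
    demo_agents = {"WebSurfer": lambda t: f"visited pages for: {t}",
                   "Coder": lambda t: f"ran code for: {t}"}
    demo_plan = [PlanStep("search for the flight price", "WebSurfer"),
                 PlanStep("compute the fare difference", "Coder")]
    print(run_plan(demo_plan, demo_agents, ask_user=input))
```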

All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions (e.g., clicking a button) require approval.

Evaluating Magentic-UI

Magentic-UI innovates through its ability to integrate human feedback in its planning and execution of tasks. We performed a preliminary automated evaluation to showcase this ability on the GAIA benchmark (opens in new tab) for agents with a user-simulation experiment.

Evaluation with simulated users

Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the MAGUI agents, Magentic-UI with a simulated user that has access to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.

GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer. 

To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.

We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the (simulated) user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead.

The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we expect a human would act. The first type of simulated user relies on OpenAI’s o4-mini, which is more performant at many tasks than the model powering the Magentic-UI agents (GPT-4o). For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide this information to the second type of simulated user, which is powered by an LLM, so that it can mimic a knowledgeable user. Importantly, we tuned our simulated user so as not to reveal the ground-truth answer directly, as the answer is usually found inside the human-written plan. Instead, it is prompted to guide Magentic-UI indirectly. We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of tasks in which Magentic-UI provides a final answer. 
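As a rough illustration of that setup, the sketch below shows how a simulated user's system prompt might be assembled with or without side information. The wording and function name are assumptions for illustration, not the prompt used in these experiments.

```python
# Illustrative sketch of building a simulated user's system prompt.
# The phrasing is hypothetical and not the prompt used in the GAIA experiments.
from typing import Optional


def simulated_user_prompt(side_information: Optional[str] = None) -> str:
    prompt = (
        "You are simulating a human user collaborating with a web agent. "
        "Give brief feedback on its plan and help it when it asks or gets stuck."
    )
    if side_information is not None:
        prompt += (
            "\nYou privately know the following about the task:\n"
            f"{side_information}\n"
            "Guide the agent indirectly. Never state the final answer outright."
        )
    return prompt


# Example: the second type of simulated user receives the human-written plan.
print(simulated_user_prompt("1. Open the museum website. 2. Check the 2019 exhibit list."))
```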

On the validation subset of GAIA (162 tasks), we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode (without the simulated user), Magentic-UI with simulated user (1) (smarter model), Magentic-UI with simulated user (2) (side-information), and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM (GPT-4o) is used for Magentic-UI and Magentic-One.

Magentic-UI with the simulated user that has access to side information improves the accuracy of autonomous Magentic-UI by 71%, from a 30.3% task-completion rate to a 51.9% task-completion rate. Moreover, Magentic-UI only asks for help from the simulated user in 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. And in those tasks where it does ask for help, it asks for help on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6% where Magentic-UI asks for help in only 4.3% of tasks, asking for help an average of 1.7 times in those tasks. This demonstrates the potential of even lightweight human feedback for improving performance (e.g., task completion) of autonomous agents, especially at a fraction of the cost compared to people completing tasks entirely manually. 

Learning and reusing plans

As described above, once Magentic-UI completes a task, users have the option for Magentic-UI to learn a plan based on the execution of the task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.

The user can select a plan from the plan gallery, which is displayed by clicking on the Saved Plans button. Alternatively, as a user enters a task that closely matches a previous task, the saved plan will be displayed even before the user is done typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory (opens in new tab) to retrieve plans for any similar tasks. Our preliminary evaluations show that this retrieval is highly accurate, and recalling a saved plan can be around 3x faster than generating a new plan. Once a plan is recalled or generated, the user can always accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand. 
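For illustration only, the following sketch matches a new task against saved plans using simple word overlap. Magentic-UI itself relies on AutoGen's Task-Centric Memory for retrieval, so treat this stand-in matching logic, its names, and its threshold as assumptions.

```python
# Illustrative plan-gallery lookup using word overlap as a stand-in for
# Task-Centric Memory retrieval; names and threshold are hypothetical.
from typing import Dict, List, Optional


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two task descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def recall_plan(task: str, gallery: Dict[str, List[str]],
                threshold: float = 0.5) -> Optional[List[str]]:
    """Return the saved plan whose task best matches, or None to plan from scratch."""
    best = max(gallery, key=lambda t: similarity(task, t), default=None)
    if best is not None and similarity(task, best) >= threshold:
        return gallery[best]
    return None


gallery = {"check the price of a flight from Seattle to Boston":
           ["open the airline site", "enter the route and dates", "read the fare"]}
print(recall_plan("check the price of a flight from Seattle to Denver", gallery))
```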

Safety and control

Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:

  • Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface (a minimal sketch of such a check follows this list).
  • Anytime interruptions: At any point of Magentic-UI completing the task, the user can interrupt Magentic-UI and stop any pending code execution or web browsing.
  • Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI (Figure 6).
  • Detection and approval of irreversible agent actions: Users can configure an action-approval policy (action guards) to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action (e.g., any button click) needs explicit user approval. Users must press an “Accept” or “Deny” button for each action.
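Referring back to the allow-list item above, here is a minimal sketch of a hostname-based allow-list check. The helper name and the suffix-matching rule are illustrative assumptions, not Magentic-UI's actual implementation.

```python
# Minimal allow-list check: a URL is allowed if its host is an allowed domain
# or a subdomain of one. The helper is illustrative, not Magentic-UI's API.
from urllib.parse import urlparse


def is_allowed(url: str, allow_list: list[str]) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in allow_list)


allowed_sites = ["wikipedia.org", "github.com"]
print(is_allowed("https://en.wikipedia.org/wiki/Seattle", allowed_sites))  # True
print(is_allowed("https://example.com/login", allowed_sites))              # False
# When a URL falls outside the allow-list, Magentic-UI asks the user to
# explicitly approve the site through the interface before visiting it.
```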

In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge the security and safety of Magentic-UI. Such scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent (e.g., to execute risky code, access sensitive files, or perform actions on other websites). They also include scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information or granting permissions on impostor sites (e.g., a synthetic website that asks Magentic-UI to log in and enter Google credentials to read an article). In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is eventually unable to complete the request due to Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.

We have also released transparency notes, which can be found at: https://github.com/microsoft/magentic-ui/blob/main/TRANSPARENCY_NOTE.md (opens in new tab)

Open research questions 

Magentic-UI provides a tool for researchers to study critical questions in agentic systems, particularly around human-agent interaction. In a previous report (opens in new tab), we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how we enable humans to efficiently intervene and provide feedback to the agent while it is executing a task. Humans should not have to constantly watch the agent. Ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question is about safety. As agents interact with the live web, they may become prone to attacks from malicious actors. We need to study what safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help study. 

Conclusion

Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand in the scope of tasks they can complete, Magentic-UI’s design enables better transparency into agent actions and enables human control to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing human cost in completing tasks on aggregate. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains. 

The post Magentic-UI, an experimental human-centered web agent appeared first on Microsoft Research.


Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers



Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, Lee reunites with his coauthors Carey Goldberg (opens in new tab) and Dr. Zak Kohane (opens in new tab) to review the predictions they made and reflect on what has and hasn’t materialized based on discussions with the series’ early guests: frontline clinicians, patient/consumer advocates, technology developers, and policy and ethics thinkers. Together, the coauthors explore how generative AI is being used on the ground today—from clinical note-taking to empathetic patient communication—and discuss the ongoing tensions around safety, equity, and institutional adoption. The conversation also surfaces deeper questions about values embedded in AI systems and the future role of human clinicians.


Learn more

Compared with What? Measuring AI against the Health Care We Have (opens in new tab) (Kohane) 
Publication | October 2024 

Medical Artificial Intelligence and Human Values (opens in new tab) (Kohane) 
Publication | May 2024 

Managing Patient Use of Generative Health AI (opens in new tab) (Goldberg) 
Publication | December 2024 

Patient Portal — When Patients Take AI into Their Own Hands (opens in new tab) (Goldberg) 
Publication | April 2024 

To Do No Harm — and the Most Good — with AI in Health Care (opens in new tab) (Goldberg) 
Publication | February 2024 

This time, the hype about AI in medicine is warranted (opens in new tab) (Goldberg) 
Opinion article | April 2023 

The AI Revolution in Medicine: GPT-4 and Beyond   
Book | Peter Lee, Carey Goldberg, Isaac Kohane | April 2023

Transcript

[MUSIC]     

[BOOK PASSAGE]  

PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.” 

[END OF BOOK PASSAGE]  

[THEME MUSIC]     

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.     

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?      

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.


[THEME MUSIC FADES]  

The passage I read at the top is from the book’s prologue.   

When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.  

So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.  

So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.  

We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.  

[TRANSITION MUSIC] 

Carey, Zak, I’m really looking forward to this. 

CAREY GOLDBERG: It’s nice to see you, Peter.  

LEE: [LAUGHS] It’s great to see you, too. 

GOLDBERG: We missed you. 

ZAK KOHANE: The dynamic gang is back. [LAUGHTER] 

LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other. [LAUGHTER] 

In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.  

I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you? 

KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology—ARPA, Microsoft, and others—had made in the past. And yet, you were telling me a full science fiction compliant story [LAUGHTER] that something that we thought was 30 years away was happening now.  

LEE: Mm-hmm. 

KOHANE: And it was very hard for me to put together. And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible? 

LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have [LAUGHS] been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.  

After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”  

And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style [LAUGHTER] that no one would be able to read.  

So it was your idea to recruit Carey, I think, right? 

KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing. 

LEE: So yeah. So, Carey, we contact you. How did that go? 

GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then [LAUGHTER] we’ll actually get somewhere.”  
 
So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. [LAUGHTER] And it’s fun to think about it now. 

LEE: Yeah. I think ultimately, you know, recruiting Carey, you were [LAUGHS] so important because you basically went through every single page of this book and made sure … I remember, in fact, it’s affected my writing since because you were coaching us that every page has to be a page turner. There has to be something on every page that motivates people to want to turn the page and get to the next one. 

KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.  

GOLDBERG: [LAUGHTER] No way! Really?  

KOHANE: Yes way. Yes way. Yes way. 

GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak. [LAUGHS] 

LEE: All right. So I think we have to get down to work [LAUGHS] now.  

Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems. 

And, you know, it was fun to talk to them. And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.   

What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI. 

KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier. 

But there were also some nonobvious results, which is that on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there were any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.  

And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.  

LEE: Yeah. 

KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …  

LEE: Yeah. 

KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect. 

And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.  

LEE: Carey, what about you? 

GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids. 

So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions. 

LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI.

[TRANSITION MUSIC] 

So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year. 

LEE: [LAUGHS] So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now? 

GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. And in fact, you even wrote that it was helping you be a better person, right. [LAUGHS] So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.  

And I gather now we’re even entering a period when people are complaining of sycophantic models, [LAUGHS] where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be. 

LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.  

KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight. [LAUGHTER] I was bringing references. It was impressive. But in fact, it made me trust it more. 

LEE: Yeah. 

KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.   

And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s potentially more of a signaled virtue than an actual virtue. 

GOLDBERG: Nicely put. 

LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different. 

And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?  

One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book? 

GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient. 

So actually, Peter, I have a few questions for you … 

LEE: Yeah. Mm-hmm. 

GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare? 

LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.  

So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes. 

But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that? 

KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR [electronic health record] smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?  

And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. [LAUGHTER] And he had just proposed to me a few weeks before an expensive procedure. 

And so the question is what, effectively, is motivating these models?  

LEE: Yeah, yeah.  

KOHANE: And so with several colleagues, I published a paper, basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, you are a pediatric endocrinologist. 

Should this patient receive growth hormone? And it did a very good job explaining why the patient should receive growth hormone.  

GOLDBERG: Should. Should receive it.  

KOHANE: Should. And then we asked, in a separate session, you are working for the insurance company. Should this patient receive growth hormone? And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving kids who are not growth hormone deficient, growth hormone gives only a couple of inches over many, many years, has all sorts of other issues. But here’s the point, we had 180-degree change in decision-making because of the prompt. And for that patient, tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making.  

LEE: Hmm. Yeah. 

KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is absolutely central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose? 

LEE: Yeah, yeah. That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember? 

GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it. 

LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI.  

You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do. 

GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent, as far as I can tell, is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were a physician or a specialist and get answers, which you can’t do yet as far as I can tell. 

LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray. 

LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer [LAUGHS] for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey. 

GOLDBERG: It’s very. [LAUGHS] Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps? 

LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.  

Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way. 

LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests? 

KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not able to bill for the use of the AI, but this technology is so compelling to the doctors that, despite everything, it’s overtaking the traditional dictation-typing routine. 

LEE: Yeah. 

GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.  

And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information. 

LEE: Yeah. 

KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, the doctors were being incredibly meticulous. [LAUGHTER] So I think … this is a Hawthorne effect, because you know you’re being monitored. And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse. 

LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong.

I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.  

And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it. 

And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that an AI scribe would not have done that. 

GOLDBERG: Well, depending what values you programmed into it, right, Zak? [LAUGHS] 

KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …  

LEE: Right. 

KOHANE: what’s going to happen? And so I think this issue of values … you’re right. Right now, they’re actually much more impartial. But there are going to be business plans just around aligning these things towards healthcare. In many ways, this is why I think we wrote the book so that there should be a public discussion. And what kind of AI do we want to have? Whose values do we want it to represent? 

GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that? 

Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you? 

LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams? [LAUGHS]  

GOLDBERG: Right, right, right, yeah. 

LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think. 

Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.  

Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on? 

KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of Epic, even though it’s better in that respect than many EHRs, is the horrible user interface. 

The number of clicks that you have to go through to get to something. And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else has, it does it so you don’t have to do the clicks. Because it’s something really soul-sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?  

I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey? 

GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, [LAUGHTER] and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.  

And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone. 

LEE: Yeah. One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR [electronic health record] system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that? 

KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other [LAUGHTER] in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that.  

And that really happened. And I think that, yes, doctors do think, and I do think that we are not yet sufficiently using the artificial intelligence currently in ambient dictation mode as much more of an independent agent saying, did you think about that? 

I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists.  

LEE: Yeah. Mm-hmm. I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.  

LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.  

And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder. 

KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.  

And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter? 

LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.  

There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing. You go through a big database, and you figure it all out. And make a very nicely structured database the AI can then consult with later.  

And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine.   
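As a rough illustration of the retrieval step being discussed here, the minimal Python sketch below scores stored snippets against a query and prepends the best matches to a prompt. The documents, the query, and the simple word-overlap scoring are illustrative assumptions standing in for the embedding models and vector stores that real RAG pipelines use.

```python
# Minimal sketch of the retrieval step in RAG: score stored snippets against a
# query and prepend the best matches to the prompt. Word-overlap scoring here
# is an illustrative stand-in for the embedding models used in practice.
import re
from collections import Counter

documents = [
    "Patient has a documented penicillin allergy.",
    "Last HbA1c was 7.2 percent, measured in March.",
    "Family history of early cardiovascular disease.",
]

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> int:
    q, d = tokens(query), tokens(doc)
    return sum(min(q[w], d[w]) for w in q)   # simple word-overlap score

def retrieve(query: str, docs: list, k: int = 2) -> list:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

query = "What was the patient's most recent HbA1c?"
context = "\n".join(retrieve(query, documents))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```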

[TRANSITION MUSIC] 

Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.  

Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer. 

LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment—I actually took it as a complaint—that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this. 

GOLDBERG: [LAUGHTER] Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good.  

And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others.  

But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways like that any discharge instructions can be instantly beautifully translated into a patient’s native language and so on. But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for.

LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog. So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about? 

GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening. 

LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens. 

KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it. 

GOLDBERG: Yes, that’s true. 

KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago.  

We actually mentioned the problem. I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get one for my own faculty—residents at MGH [Massachusetts General Hospital] can’t get a primary care doctor. And so … 

LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible. 

KOHANE: Yeah, and so therefore … and I wrote an article about this in the NEJM [New England Journal of Medicine] that medicine is in such dire trouble that we have incredible technology, incredible cures, but where the rubber hits the road, which is at primary care, we don’t have very much.  

And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?”  
 
This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that. 

LEE: Yeah. 

GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health. 

KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a $5,000- to $10,000-a-year-minimum investment. That’s inequity. 

LEE: When we wrote our book, you know, the idea was that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better, and maybe it would be necessary to do that. But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask. 

KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG [electroencephalogram]. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. [LAUGHTER] Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine. 

GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.” [LAUGHS] 

So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models are actually able, I would think, to do that because they are broad. So what are we heading towards, do you think? What should the next book be … The end of doctors? [LAUGHS] 

KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions.  

And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in the way Hassabis says, God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those da Vinci machines, that stuff can be, I think, automated.  

And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors. 

But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions.  

LEE: Yeah. Yeah, I tend to think that, you know, there are going to be always a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it? There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust. 

And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy: 

LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do you think, Carey? 

GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer. 

LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places? 

KOHANE: So again, all these large healthcare systems are in tough shape. MGB [Mass General Brigham] would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.  

And so we’ve created what used to be a very successful beast, the health center. But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin.  

LEE: Yeah. 

KOHANE: And so I think we’re all waiting for one of two things to happen: either a new kind of healthcare delivery system gets generated, or ultimately one of these systems learns how to disrupt itself.  

LEE: Yeah. All right. I think we have to move on to Episode 4. And, you know, when it came to the question of regulation, I think this is … my read is when we were writing our book, this is the part that we struggled with the most.  

GOLDBERG: We punted. [LAUGHS] We totally punted to the AI. 

LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her. 

LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear. 

GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it? 

LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are [a] human being and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world.  

And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important. 

I don’t know. Am I wrong about that? 

KOHANE: So it’s compared with what issue? Because we know for a fact that doctors who do a procedure all the time, like high-risk deliveries, have better outcomes than ones who only do a few. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the … 

GOLDBERG: … the attending. [LAUGHTER] 

KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. [LAUGHTER] And so it might be the best outcome if, just like for human beings, we say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, just let it go. We just don’t have, practically speaking, the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be [in some sense] a lot easier. I don’t know. 

LEE: Yeah.  

GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that, thank goodness, the city of Cambridge put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here.  

If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would.  

LEE: Yeah. 

GOLDBERG: Zak. 

KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation.  

And so I’m mindful that a conference that both of you attended, the RAISE conference [Responsible AI for Social and Ethical Healthcare]. The Europeans in that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.  

And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.  

GOLDBERG: It’s true. 

KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”  

LEE: Yeah.  

KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet. 

GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast that something that makes sense today won’t in six months. So, what do you do about that? 

LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.  

But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.  

GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope. 

LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.  

And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school? 

And the answer is “doctor.” 

KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI.  

So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring.  

We haven’t put out the paper yet, but in computer science, there’s, well, in programming, we know very well the value of understanding how our computer systems work.  

And there was a guy by the name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually. 

LEE: Yes. 

KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, [we] can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors.  
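To make the MedLog idea a bit more concrete, here is a minimal Python sketch of what a single logged record might capture: the context in which a decision was made, the exact model version, the data it saw, and the decision. The field names and structure are illustrative assumptions, not a published specification.

```python
# Hypothetical sketch of a single MedLog record: capture enough context to
# reconstruct, after the fact, what the AI knew and what it recommended.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MedLogEntry:
    timestamp: str          # when the decision was made
    model_id: str           # exact model name and version
    prompt: str             # the question or order as posed to the AI
    context_summary: str    # the patient data the model actually saw
    recommendation: str     # what the model recommended
    clinician_action: str   # what the clinician ultimately did

def log_decision(entry: MedLogEntry, store: list) -> None:
    """Append an entry to an audit store (a list here; a database in practice)."""
    store.append(entry)

# Example usage with placeholder values.
audit_log: list = []
log_decision(MedLogEntry(
    timestamp=datetime.now(timezone.utc).isoformat(),
    model_id="example-model-2025-05",
    prompt="Suggest a dosing adjustment for drug X.",
    context_summary="Relevant labs and current medication list.",
    recommendation="Reduce dose by 20%.",
    clinician_action="Accepted recommendation.",
), audit_log)
```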

GOLDBERG: Sounds great. 

LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford.

LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. I think for me, I expected to see a lot more of that than we are. 

KOHANE: Yeah, it goes back to our conversation about Epic, or a competitor of Epic, doing that. I think we’re going to see oversight over all medical orders, all orders in the system, with real-time critique, where we’re aware of alert fatigue, so we don’t want too many false positives, while at the same time knowing which critical errors could immediately affect lives. I think that, driven by quality measures, is going to become a product. 

GOLDBERG: And I think word will spread among the general public that kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives is, well, who’s with them? Right?  

LEE: Yeah. Yup. 

GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical …  

KOHANE: By the way, that country is called the United States. 

GOLDBERG: Yes, that’s true. [LAUGHS] It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not? 

LEE: Yeah. Yeah. Last topic, just for this Episode 4. Roxana, of course, I think really made a name for herself in the AI era writing, actually just prior to ChatGPT, you know, writing some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts? 

KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible … 

LEE: Yeah, yeah.  

KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely. 

I think that we’re in a much better situation. 

GOLDBERG: Agree. 

LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak.  

KOHANE: There is. 

LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—she and her collaborators—showing AI’s ability to detect, to spot biased decision-making by others. Are we going to see more of that? 

KOHANE: Oh, yeah, I was very pleased when, in NEJM AI [New England Journal of Medicine Artificial Intelligence], we published a piece with Marzyeh Ghassemi, and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect.  

And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune and de-bias in a variety of ways, they can do a lot better and, in fact, I think are a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era. 

GOLDBERG: And thinking about Roxana’s dermatological work on how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right. 

LEE: Yeah. Yeah, yeah. 

GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward. 

LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. [LAUGHS] I don’t think any of us could really have predicted the extent of change in the world.  

[TRANSITION MUSIC] 

So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did. 

[THEME MUSIC] 

And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more.  

We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast or wherever you listen to your favorite podcasts.   

Until next time.  

[MUSIC FADES]


The post Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers appeared first on Microsoft Research.


Predicting and explaining AI model performance: A new approach to evaluation

Predicting and explaining AI model performance: A new approach to evaluation

The image shows a radar chart comparing the performance of different AI models across various metrics. The chart has a circular grid with labeled axes including VO, AS, CEc, CEe, CL, MCr, MCt, MCu, MS, QLI, QLqA, SNs, KNa, KNc, KNF, KNn, and AT. Different AI models are represented by various line styles: Babbage-002 (dotted line), Davinci-002 (dash-dotted line), GPT-3.5-Turbo (dashed line), GPT-4.0 (solid thin line), OpenAI ol-mini (solid thick line), and OpenAI o1 (solid bold line). There is a legend in the bottom left corner explaining the line styles for each model. The background transitions from blue on the left to green on the right.

With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluate AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.

In the paper, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model’s capabilities.

ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.

By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail—linking outcomes to specific strengths or limitations.

The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability. For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works—from rating task requirements to generating ability profiles.
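To make that comparison concrete, here is a minimal Python sketch, assuming a toy set of dimensions and a simple thresholding rule, of how a task’s demand profile can be checked against a model’s ability profile. The dimension names, the numbers, and the all-dimensions rule are illustrative assumptions; the paper’s actual predictor is more sophisticated.

```python
# Illustrative sketch: comparing a task demand profile (0-5 per dimension)
# with a model ability profile. Names, values, and the thresholding rule
# are assumptions for illustration, not the paper's exact method.

demands = {"reasoning": 3, "formal_knowledge": 1, "attention": 2}          # task requires
abilities = {"reasoning": 3.4, "formal_knowledge": 2.1, "attention": 4.0}  # model can reach

def predicted_success(demands: dict, abilities: dict) -> bool:
    """Predict success if the model's ability meets every demand level."""
    return all(abilities.get(dim, 0.0) >= level for dim, level in demands.items())

print(predicted_success(demands, abilities))  # True in this toy example
```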

Figure 1: This diagram presents a framework for explaining and predicting AI performance on new tasks using cognitive demand profiles. The System Process (top) evaluates an AI system on the ADeLe Battery—tasks annotated with DeLeAn rubrics—to create an ability profile with each dimension representing what level of demand the model can reach. The Task Process (bottom) applies the same rubrics to new tasks, generating demand profiles from annotated inputs. An optional assessor model can be trained to robustly predict how well the AI system will perform on these new tasks by matching system abilities to task demands.
Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.

Evaluation results 

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) Current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task. 

1. Revealing hidden flaws in AI testing methods 

Many popular AI tests either don’t measure what they claim or only cover a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, like specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, only includes medium-difficulty questions—missing both simple and complex challenges. 

2. Creating detailed AI ability profiles 

Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs. For each of the 18 abilities measured, they plotted “subject characteristic curves” to show how a model’s success rate changes with task difficulty.  

They then calculated a score for each ability—the difficulty level at which a model has a 50% chance of success—and used these results to generate radial plots showing each model’s strengths and weaknesses across the different scales and levels, illustrated in Figure 2.
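A minimal sketch of that calculation, using made-up outcomes and a plain logistic fit (an assumption for illustration, not the paper’s exact estimator): fit a characteristic curve of success versus demand level, then solve for the level at which the predicted success rate crosses 50%.

```python
# Sketch: estimate the demand level at which a model's success rate is 50%.
# Toy data and a plain logistic fit; assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Demand levels (0-5) of tasks and whether the model solved each one.
levels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]).reshape(-1, 1)
solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

clf = LogisticRegression().fit(levels, solved)

# For a logistic model P(success) = sigmoid(w*x + b), the 50% point is x = -b/w.
ability_score = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Ability score (50% success level): {ability_score:.2f}")
```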

Figure 2: The image consists of three radar charts showing ability profiles of 15 LLMs evaluated across 18 ability scales, ranged from 0 to infinity (the higher, the more capable the model is). Each chart has multiple axes labeled with various ability scales such as VO, AS, CEc, AT, CL, MCr, etc. The left chart shows ability for Babbage-002 (light red), Davinci-002 (orange), GPT-3.5-Turbo (red), GPT-4 (dark red), OpenAI o1-mini (gray), and OpenAI o1 (dark gray). The middle chart shows ability for LLaMA models: LLaMA-3.2-1B-Instruct (light blue), LLaMA-3.2-3B-Instruct (blue), LLaMA-3.2-11B-Instruct (dark blue), LLaMA-3.2-90B-Instruct (navy blue), and LLaMA-3.1-405B Instruct (very dark blue). The right chart shows ability for DeepSeek-R1-Dist-Qwen models: DeepSeek-R1-Dist-Qwen-1.5B (light green), DeepSeek-R1-Dist-Qwen-7B (green), DeepSeek-R1-Dist-Qwen-14B (dark green), DK-R1-Dist-Qwen-32B (very dark green). Each model's ability is represented by a colored polygon within the radar charts.
Figure 2. Ability profiles for the 15 LLMs evaluated.

This analysis revealed the following: 

  • When measured against human performance, AI systems show different strengths and weaknesses across the 18 ability scales. 
  • Newer LLMs generally outperform older ones, though not consistently across all abilities. 
  • Knowledge-related performance depends heavily on model size and training methods. 
  • Reasoning models show clear gains over non-reasoning models in logical thinking, learning and abstraction, and social capabilities, such as inferring the mental states of their users. 
  • Increasing the size of general-purpose models after a given threshold only leads to small performance gains. 

3. Predicting AI success and failure 

In addition to evaluation, the team created a practical prediction system based on demand-level measurements that forecasts whether a model will succeed on specific tasks, even unfamiliar ones.  

The system achieved approximately 88% accuracy in predicting the performance of popular models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to anticipate potential failures before deployment, adding the important step of reliability assessment for AI models.
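As a hedged sketch of what such an assessor could look like, the example below trains a simple classifier on (demand profile, pass/fail) pairs from previously seen tasks and uses it to forecast the outcome on a new task. The toy data, the success rule, and the model choice are placeholders; the 88% figure above comes from the paper’s far more careful setup.

```python
# Sketch of an assessor: predict pass/fail on new tasks from their demand
# profiles, using results on previously seen tasks as training data.
# Random toy data and a random forest are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_tasks, n_dims = 200, 18                 # 18 demand dimensions, as in ADeLe

train_demands = rng.integers(0, 6, size=(n_tasks, n_dims))       # 0-5 ratings
# Toy rule: the model tends to fail when the average demand is high.
train_outcomes = (train_demands.mean(axis=1) < 2.5).astype(int)

assessor = RandomForestClassifier(n_estimators=100, random_state=0)
assessor.fit(train_demands, train_outcomes)

new_task = rng.integers(0, 6, size=(1, n_dims))
print("Predicted probability of success:", assessor.predict_proba(new_task)[0, 1])
```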

Looking ahead

ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance. It aligns with the vision laid out in a previous Microsoft position paper on the promise of applying psychometrics to AI evaluation and a recent Societal AI white paper emphasizing the importance of AI evaluation.

As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment. The research team is working toward building a collaborative community to strengthen and expand this emerging field.

The post Predicting and explaining AI model performance: A new approach to evaluation appeared first on Microsoft Research.


Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv

Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv

Illustrated headshots of Hongxia Hao (left) and Bing Lv (right).

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, senior researcher Hongxia Hao and physics professor Bing Lv join host Gretchen Huizinga to talk about how they are using deep learning techniques to probe the upper limits of heat transfer in inorganic crystals, discover novel materials with exceptional thermal conductivity, and rewrite the rulebook for designing high-efficiency electronics and sustainable energy.

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers.

Today I’m talking to two researchers, Hongxia Hao, a senior researcher at Microsoft Research AI for Science, and Bing Lv, an associate professor in physics at the University of Texas at Dallas. Hongxia and Bing are co-authors of a paper called Probing the Limit of Heat Transfer in Inorganic Crystals with Deep Learning. I’m excited to learn more about this! Hongxia and Bing, it’s great to have you both on Abstracts!


HONGXIA HAO: Nice to be here.

BING LV: Nice to be here, too.

HUIZINGA: So Hongxia, let’s start with you and a brief overview of this paper. In just a few sentences. Tell us about the problem your research addresses and more importantly, why we should care about it.

HAO: Let me start with a very simple yet profound question. What’s the fastest heat can travel through a solid material? This is not just an academic curiosity; it’s a question that touches the foundation of how we build technologies around us. From the moment you tap your smartphone to the moment your laptop is turned on and functioning, heat is always flowing. So we’re trying to answer a century-old mystery: the upper limit of heat transfer in solids. We care about this not just because it’s a fundamental problem in physics and materials science, but because solving it could really rewrite the rulebook for designing high-efficiency electronics, sustainable energy, etc. And nowadays, with very cutting-edge nanometer chips and other advanced technologies, we are packing more computing power into a smaller space, but the faster and denser we build, the harder it becomes to remove the heat. So in many ways, thermal bottlenecks, not just transistor density, are now the ceiling of Moore’s Law. And the stakes are enormous. We really wish to bring more thermal solutions by finding more high-thermal-conductivity material choices from the perspective of materials discovery with the help of AI.

LV: So I think one of the biggest things, as Hongxia said, is that thermal solutions will eventually become a bottleneck for all types of heterogeneous integration of materials. From this perspective, previously thermal was the last problem people would solve. But now people more and more realize all these things have to be addressed upfront. This co-design, all these things, becomes very important. So I think what we are doing right now, integrating with AI, helping to identify the large space of materials and identify fundamentally what the limit of these materials will be, will become very important for society.

HUIZINGA: Hmm. Yeah. Hongxia, did you have anything to add to that?

HAO: Yes, so previously many people explored these materials science questions through the experimental tradition, and over the past few decades we have seen a new trend of computational materials discovery. For example, we do fundamental calculations, solving the Schrödinger equation with Density Functional Theory [DFT]. This actually brings us a lot of opportunities. The question here is, as the theory becomes more and more developed, it’s too expensive for us to apply at very large scale and to study tons of materials. Think about this: the bottleneck now is not just having a very good theory, it’s the scale. So this is where AI, specifically deep learning, comes into play.

HUIZINGA: Well, Hongxia, let’s stay with you for a minute and talk about methodology. How did you do this research and what was the methodology you employed?

HAO: So for this question, we built a pipeline that spans AI, quantum mechanics, and computational brute force with a blend of efficiency and accuracy. It begins with generating an enormous chemical and structural design space, inspired by Slack’s criteria. We focus first on simple crystals, since these are the systems most likely to have low anharmonicity, fewer phonon scattering events, and therefore potentially high thermal conductivity. But we didn’t stop there. We also included a huge pool of more complex and higher-energy structures to ensure diversity and avoid bias. For each candidate, we first run a structure relaxation using MatterSim, which is a deep learning foundation model for materials science that we use to characterize material properties, and we use that to screen for dynamic stability. About 200,000 structures passed this filter. Then came another real challenge: calculating the thermal conductivity. We solve this problem using the Boltzmann transport equation and the three-phonon scattering process. The twist here is that all of this was done not with traditional DFT solvers but with our deep learning model, MatterSim. It’s trained to predict energy, forces, and stress, and we can get second- and third-order interatomic force constants directly from it, which guarantees the accuracy of the solution. Finally, to validate the model’s predictions, we performed full DFT-based calculations on the top candidates that we found, some of which even included higher-order scattering mechanisms, electron-phonon coupling effects, etc. This rigorous validation gave us confidence in the speed-accuracy trade-offs and revealed a spectrum of materials that had either previously been overlooked or never before been conceived.
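To make the pipeline Hao describes more concrete, here is a minimal Python sketch of the screening loop: relax each candidate with a machine-learned potential, keep only dynamically stable structures, estimate lattice thermal conductivity from the Boltzmann transport equation, and pass the best candidates on to full DFT. The helpers `load_ml_calculator`, `is_dynamically_stable`, and `kappa_from_bte` are hypothetical placeholders standing in for MatterSim and a phonon/BTE solver; this illustrates the shape of the workflow, not the code used in the paper.

```python
from ase import Atoms
from ase.optimize import BFGS

# Hypothetical helpers standing in for the components described in the episode:
#   load_ml_calculator()     -> an ASE calculator backed by a model such as MatterSim
#   is_dynamically_stable()  -> phonon check (no imaginary modes) from ML force constants
#   kappa_from_bte()         -> lattice thermal conductivity from the Boltzmann transport
#                               equation with three-phonon scattering
from my_screening_helpers import load_ml_calculator, is_dynamically_stable, kappa_from_bte


def screen_candidates(candidates: list[Atoms], kappa_cutoff: float = 100.0) -> list[tuple[Atoms, float]]:
    """Relax, filter, and rank candidate crystals by predicted thermal conductivity (W/m/K)."""
    calc = load_ml_calculator()
    shortlist = []
    for structure in candidates:
        structure.calc = calc
        BFGS(structure, logfile=None).run(fmax=0.01)   # structure relaxation with the ML potential
        if not is_dynamically_stable(structure):       # discard dynamically unstable structures
            continue
        kappa = kappa_from_bte(structure, order=3)     # three-phonon BTE estimate
        if kappa >= kappa_cutoff:
            shortlist.append((structure, kappa))
    # The top of this list would then be re-checked with full DFT (and, for some candidates,
    # higher-order scattering and electron-phonon coupling), as described above.
    return sorted(shortlist, key=lambda pair: pair[1], reverse=True)
```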

HUIZINGA: So Bing, let’s talk about your research findings. How did things work out for you on this project and what did you find?

LV: I think one of the biggest things about this paper is that it creates a very large materials database. Basically, you can say it’s a smart database, which eventually will be made accessible to the public. I think that’s a big achievement, because people who need it can search the Microsoft database and find out, oh, this material does have this type of thermal property. The database contains about 230,000 materials. And one of the things we confirm is the highest-thermal-conductivity material: all the wisdom of the Slack criteria predicted that diamond would have the highest thermal conductivity, and we very solidly prove that diamond, at this stage, remains the material with the highest thermal conductivity. We also found a lot of new, exotic materials, which Hongxia can elaborate on a little more, with very exotic combinations of thermal and other properties, which could provide new insight for new physics, new materials development, and new devices. All of this combined will have a very profound impact on society.

HUIZINGA: Yeah, Hongxia, go a little deeper on that because that was an interesting part of the paper when you talked about diamond still being the sort of “gold standard,” to mix metaphors! But you’ve also found some other materials that are remarkable compared to silicon.

HAO: Yeah, yeah. Within this search space, even though we didn’t find something higher than diamond, we did discover more than twenty new materials with thermal conductivity exceeding that of silicon. And silicon is the benchmark we want to compare against because it’s the backbone of modern electronics. More interesting, I think, is the manganese vanadium compound. It shows some very interesting and surprising phenomena: it’s a metallic compound, but with very high lattice thermal conductivity. This is the first time it has been identified, through our search, and it’s something that could not easily have been discovered without AI. And I think Bing can explain more on this and show some interesting results.

HUIZINGA: Yeah, go ahead Bing.

LV: So this is actually very surprising to me as an experimentalist, because when Hongxia presented their theory work to me, I learned that this material, magnesium vanadium, was discovered back in 1938, almost 100 years ago, but there are no more than twenty papers talking about it! A lot of them were on theory, not even on the experimental part. We have actually done quite a bit of work on this. We are in the process of characterizing it and then moving forward to the thermal conductivity measurements. Hopefully that will add to the value of this work, showing, hey, AI does help to predict materials; it can really generate new materials with very good, high thermal conductivity.

HUIZINGA: Yeah, so Bing, stay with you for a minute. I want you to talk about some kind of real-world applications of this. I know you alluded to a couple of things, but how is this work significant in that respect, and who might be most excited about it, aside from the two of you? [LAUGHS]

LV: So as I mentioned before, the first thing is this database. I believe it’s the first-ever large materials database for thermal conductivity. And as I said, it has 230,000 materials with AI-predicted thermal conductivity. This will provide not only science but also engineering with a vastly expanded catalog of candidate materials for the future roadmap of materials integration. For all the bottlenecks we are talking about, the thermal solution for semiconductors, or even beyond semiconductor integration, people now have a database to look into. So this will become very important, and I believe over time it will have a very long-lasting impact on the research community and on society’s development.

HUIZINGA: Yeah. Hongxia, did you have anything to add to that one too?

HAO: Yeah, so this study reshapes how we think about limits. I like the saying that the only way to discover the limits of the possible is to go beyond them into the impossible. In this case we tried, but we didn’t break the diamond limit; instead, we proved it even more rigorously than ever before. In doing so, we also uncovered some uncharted peaks in the thermal conductivity landscape. This would not have happened without new AI capabilities for materials science. In the long run, I believe researchers could benefit from this kind of AI-driven design and shift the way they do materials research with AI.

HUIZINGA: Yeah, it’ll be interesting to see if anyone ever does break the diamond limit with the new tools that are available, but…

HAO: Yeah!

HUIZINGA: So this is the part of the Abstracts podcast where I like to ask for sort of a golden nugget, a one-sentence takeaway that listeners might get from this paper. If you had one, Hongxia, what would it be? And then I’ll ask Bing to maybe give his.

HAO: Yes. AI is no longer just a tool. It’s becoming a critical partner for us in scientific discovery. Our work shows that large-scale, data-driven science can now approach long-standing, fundamental questions with very fresh eyes. When trained well and guided with physical intuition, models like MatterSim can realize a full in-silico characterization of materials, not just simulating known materials but really trying to imagine what nature hasn’t yet revealed. Our work points to a path forward, not just to incrementally better materials, but to entirely new classes of high-performance compounds that we could never have guessed without AI.

HUIZINGA: Yeah. Bing, what’s your one takeaway?

LV: I think I want to add a few things on top of Hongxia’s comments, because she used some critical words I would like to emphasize. When we train AI well, when we guide it well, it can become a very useful partner. So all in all, our human intellectual merit is still going to play a significantly important role, okay? We are creating this AI; we should really train it and use our human intellect to guide it so that it’s useful for the advancement of human society. Now, with all these AI tools, I think it’s a golden time. Experimentalists can work very closely with theorists like Hongxia, who have very good intellectual merit, and then incorporate AI, combining all the pieces together. Hopefully we can really accelerate materials discovery at a much faster pace than ever, and the whole of society will eventually benefit from it.

HUIZINGA: Yeah. Well, as we close, Bing, I want you to go a little further and talk about what’s next then, research wise. What are the open questions or outstanding challenges that remain in this field and what’s on your research agenda to address them?

LV: So first of all, this paper addresses primarily crystalline, ordered, inorganic bulk materials, and the conditions we target are ambient pressure and room temperature, because that’s normally how instruments operate, right? But what about extreme conditions? We want to go to space, right? There we’ll have extreme conditions, sometimes very cold, sometimes very hot. We have some places with extremely high pressure, or conditions that are highly radioactive. Under those conditions, a new database could emerge. Can we do something beyond that? Another important point is that in this paper we targeted high thermal conductivity. What about extremely low thermal conductivity? Those questions will bring a very good challenge for theorists and also for the machine learning approach. I think that’s something Hongxia is probably very excited to work on. I know she’s ambitious; she wants to do something beyond what we have achieved so far.

HUIZINGA: Yeah, so Hongxia, how would you encapsulate what your dream research is next?

HAO: Yeah, so besides all of these exciting research directions, on my end another direction that is perhaps quite exciting is that we want to move from search to design. Right now we are fairly good at asking what exists, by doing forward prediction and brute force. But with generative AI, we can start asking what should exist. In the future, we can combine forward prediction with backward generative design to really tackle these questions: if you want materials with certain desired properties, how would you design them?

HUIZINGA: Well, it sounds like there’s a full plate of research agenda goodness going forward in this field, both with human brains and AI. So, Hongxia Hao and Bing Lv, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read a pre-print of it on arXiv. See you next time on Abstracts!

The post Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv appeared first on Microsoft Research.

Read More

Research Focus: Week of May 7, 2025

Research Focus: Week of May 7, 2025

In this issue:

New research on compound AI systems and causal verification of the Confidential Consortium Framework; release of Phi-4-reasoning; enriching tabular data with semantic structure, and more.

Research Focus: May 07, 2025

Towards Resource-Efficient Compound AI Systems

Unlike the current state-of-the-art, our vision is fungible workflows with high-level descriptions, managed jointly by the Workflow Orchestrator and Cluster Manager. This allows higher resource multiplexing between independent workflows to improve efficiency.

This research introduces Murakkab, a prototype system built on a declarative workflow that reimagines how compound AI systems are built and managed to significantly improve resource efficiency. Compound AI systems integrate multiple interacting components like language models, retrieval engines, and external tools. They are essential for addressing complex AI tasks. However, current implementations are often inefficient in their resource utilization: application logic is tightly coupled to execution details, the orchestration and resource management layers are poorly connected, and there are gaps between efficiency and quality.

Murakkab addresses critical inefficiencies in current AI architectures and offers a new approach that unifies workflow orchestration and cluster resource management for better performance and sustainability. In preliminary evaluations, it demonstrates speedups of up to ~3.4× in workflow completion times while delivering ~4.5× higher energy efficiency, showing promise in optimizing resources and advancing AI system design.
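For a sense of what a high-level, declarative workflow description could look like, here is a purely illustrative sketch; the `Component` and `Workflow` classes and the quality targets are hypothetical and are not Murakkab’s actual interface. The point it illustrates is the separation the paragraph describes: the application author states what the workflow needs, while an orchestrator and cluster manager jointly decide how to run it.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    kind: str                                            # e.g. "llm", "retriever", "tool"
    requirements: dict = field(default_factory=dict)     # quality/latency targets, not resource assignments

@dataclass
class Workflow:
    name: str
    components: list[Component]
    edges: list[tuple[str, str]]                         # dataflow between components

# The author declares *what* the workflow needs; a workflow orchestrator and cluster
# manager would jointly decide model variants, placement, batching, and hardware.
rag_workflow = Workflow(
    name="doc-qa",
    components=[
        Component("retrieve", "retriever", {"recall_at_10": 0.9}),
        Component("answer", "llm", {"max_latency_s": 2.0}),
    ],
    edges=[("retrieve", "answer")],
)
```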


Smart Casual Verification of the Confidential Consortium Framework

Diagram showing the components of the verification architecture for CCF's consensus. The diagram is discussed in detail in the paper.

This work presents a new, pragmatic verification technique that improves the trustworthiness of distributed systems like the Confidential Consortium Framework (CCF) and proves its effectiveness by catching critical bugs before deployment. Smart casual verification is a novel hybrid approach to validating CCF, an open-source platform for developing trustworthy and reliable cloud applications, which underpins Microsoft’s Azure Confidential Ledger service.

The researchers apply smart casual verification to validate the correctness of CCF’s novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. This hybrid approach combines the rigor of formal specification and model checking with the pragmatism of automated testing, specifically binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods are often one-off efforts by domain experts, the researchers have integrated smart casual verification into CCF’s continuous integration pipeline, allowing contributors to continuously validate CCF as it evolves. 


Phi-4-reasoning Technical Report


This report introduces Phi-4-reasoning, a 14-billion parameter model optimized for complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 using a carefully curated dataset of high-quality prompts and reasoning demonstrations generated by o3-mini. These prompts span diverse domains—including math, science, coding, and spatial reasoning—and are selected to challenge the base model near its capability boundaries.

Building on recent findings that reinforcement learning (RL) can further improve smaller models, the team developed Phi-4-reasoning-plus, which incorporates an additional outcome-based RL phase using verifiable math problems. This enhances the model’s ability to generate longer, more effective reasoning chains. 

Despite its smaller size, the Phi-4-reasoning family outperforms significantly larger open-weight models such as DeepSeekR1-Distill-Llama-70B and approaches the performance of full-scale frontier models like DeepSeek R1. It excels in tasks requiring multi-step problem solving, logical inference, and goal-directed planning.

The work highlights the combined value of supervised fine-tuning and reinforcement learning for building efficient, high-performing reasoning models. It also offers insights into training data design, methodology, and evaluation strategies. Phi-4-reasoning contributes to the growing class of reasoning-specialized language models and points toward more accessible, scalable AI for science, education, and technical domains.


TeCoFeS: Text Column Featurization using Semantic Analysis

The workflow diagram illustrates the various steps in the TeCoFeS approach. Step 0 is the embedding computation module, which calculates embeddings for all rows of text, setting the foundation for subsequent steps. Step 2, the smart sampler, captures diverse samples and feeds them into the labeling module (step 3), which generates labels. These labels are then utilized by the Extend Mapping module (step 4) to map the remaining unlabeled data.

This research introduces a practical, cost-effective solution for enriching tabular data with semantic structure, making it more useful for downstream analysis and insights—which is especially valuable in business intelligence, data cleaning, and automated analytics workflows. This approach outperforms baseline models and naive LLM applications on converted text classification benchmarks.

Extracting structured insights from free-text columns in tables—such as product reviews or user feedback—can be time-consuming and error-prone, especially when relying on traditional syntactic methods that often miss semantic meaning. This research introduces the semantic text column featurization problem, which aims to assign meaningful, context-aware labels to each entry in a text column.

The authors propose a scalable, efficient method that combines the power of LLMs with text embeddings. Instead of labeling an entire column manually or applying LLMs to every cell—an expensive process—this new method intelligently samples a diverse subset of entries, uses an LLM to generate semantic labels for just that subset, and then propagates those labels to the rest of the column using embedding similarity.
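As a rough illustration of that sample-then-propagate pattern, and not the authors’ implementation, the sketch below clusters row embeddings, asks an LLM to label only the row nearest each cluster center, and then gives every remaining row the label of its most similar labeled example. The helpers `embed_texts` and `llm_label` are hypothetical stand-ins for an embedding model and an LLM call.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins: embed_texts(texts) -> (n, d) array; llm_label(text) -> str label.
from my_llm_helpers import embed_texts, llm_label


def featurize_column(rows: list[str], n_samples: int = 20) -> list[str]:
    """Assign a semantic label to every row while calling the LLM only n_samples times."""
    embeddings = embed_texts(rows)                      # one embedding per row

    # 1. Smart sampling: pick a diverse subset by clustering embeddings
    #    and taking the row closest to each cluster center.
    kmeans = KMeans(n_clusters=n_samples, n_init=10).fit(embeddings)
    sample_idx = [
        int(np.argmax(cosine_similarity(embeddings, center.reshape(1, -1))))
        for center in kmeans.cluster_centers_
    ]

    # 2. Label only the sampled rows with the LLM.
    sample_labels = {i: llm_label(rows[i]) for i in sample_idx}

    # 3. Propagate: every row inherits the label of its most similar labeled row.
    sims = cosine_similarity(embeddings, embeddings[sample_idx])
    nearest = np.argmax(sims, axis=1)
    return [sample_labels[sample_idx[j]] for j in nearest]
```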


Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning


This work introduces ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a new paradigm for LLM reasoning that expands beyond traditional language-only inference. 

While LLMs have made considerable strides in complex reasoning tasks, they remain limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this research, ARTIST brings together agentic reasoning, reinforcement learning (RL), and tool integration in a framework designed to enable LLMs to autonomously decide when and how to invoke external tools within multi-turn reasoning chains. ARTIST leverages outcome-based reinforcement learning to learn robust strategies for tool use and environment interaction without requiring step-level supervision.
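The sketch below shows, in very rough form, the kind of multi-turn loop this describes: the model is free to emit a tool call mid-reasoning, the tool’s result is appended to the context, and generation continues until a final answer appears. The `generate` helper and the `<tool>`/`<answer>` tags are hypothetical conventions used for illustration; ARTIST’s actual prompt format, tool set, and outcome-based RL training are described in the paper.

```python
import re

# Hypothetical stand-in for an LLM call: takes the running context, returns the next segment.
from my_llm_helpers import generate

TOOLS = {
    # Toy, expression-only sandbox; a real system would use a proper code interpreter.
    "python": lambda code: str(eval(code, {"__builtins__": {}}, {})),
}


def agentic_answer(question: str, max_turns: int = 8) -> str:
    """Let the model interleave reasoning with tool calls until it emits a final answer."""
    context = question
    for _ in range(max_turns):
        segment = generate(context)          # model decides: keep reasoning, call a tool, or answer
        context += segment

        answer = re.search(r"<answer>(.*?)</answer>", segment, re.S)
        if answer:
            return answer.group(1).strip()

        call = re.search(r'<tool name="(\w+)">(.*?)</tool>', segment, re.S)
        if call:
            name, payload = call.group(1), call.group(2)
            result = TOOLS.get(name, lambda _: "unknown tool")(payload)
            context += f"\n<tool_result>{result}</tool_result>\n"
        # During RL training, only the correctness of the final answer would be rewarded
        # (outcome-based reward), with no step-level supervision.
    return context  # fell through without a final answer
```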

Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies show that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions.


Materialism Podcast: MatterGen

What if you could find materials with tailored properties without ever entering the lab? The Materialism Podcast, which is dedicated to exploring materials science and engineering, talks with Tian Xie from Microsoft Research to discuss MatterGen, an AI tool which accelerates materials science discovery. Tune in to hear a discussion of the new Azure AI Foundry, where MatterGen will interact with and support MatterSim, an advanced deep learning model designed to simulate the properties of materials across a wide range of elements, temperatures, and pressures.


IN THE NEWS: Highlights of recent media coverage of Microsoft Research


When ChatGPT Broke an Entire Field: An Oral History 

Quanta Magazine | April 30, 2025

Large language models are everywhere, igniting discovery, disruption and debate in whatever scientific community they touch. But the one they touched first — for better, worse and everything in between — was natural language processing. What did that impact feel like to the people experiencing it firsthand?

To tell that story, Quanta interviewed 19 NLP experts, including Kalika Bali, senior principal researcher at Microsoft Research. From researchers to students, tenured academics to startup founders, they describe a series of moments — dawning realizations, elated encounters and at least one “existential crisis” — that changed their world. And ours.

The post Research Focus: Week of May 7, 2025 appeared first on Microsoft Research.

Read More

Microsoft Fusion Summit explores how AI can accelerate fusion research

Microsoft Fusion Summit explores how AI can accelerate fusion research

Sir Steven Cowley, professor and director of the Princeton Plasma Physics Laboratory and former head of the UK Atomic Energy Authority, giving a presentation.

The pursuit of nuclear fusion as a limitless, clean energy source has long been one of humanity’s most ambitious scientific goals. Research labs and companies worldwide are working to replicate the fusion process that occurs at the sun’s core, where isotopes of hydrogen combine to form helium, releasing vast amounts of energy. While scalable fusion energy is still years away, researchers are now exploring how AI can help accelerate fusion research and bring this energy to the grid sooner. 

In March 2025, Microsoft Research held its inaugural Fusion Summit, a landmark event that brought together distinguished speakers and panelists from within and outside Microsoft Research to explore this question. 

Ashley Llorens, Corporate Vice President and Managing Director of Microsoft Research Accelerator, opened the Summit by outlining his vision for a self-reinforcing system that uses AI to drive sustainability. Steven Cowley, laboratory director of the U.S. Department of Energy’s Princeton Plasma Physics Laboratory, professor at Princeton University, and former head of the UK Atomic Energy Authority, followed with a keynote explaining the intricate science and engineering behind fusion reactors. His message was clear: advancing fusion will require international collaboration and the combined power of AI and high-performance computing to model potential fusion reactor designs. 

Applying AI to fusion research

North America’s largest fusion facility, DIII-D, operated by General Atomics and owned by the US Department of Energy (DOE), provides a unique platform for developing and testing AI applications for fusion research, thanks to its pioneering data and digital twin platform. 

Richard Buttery from DIII-D and Dave Humphreys from General Atomics demonstrated how the US DIII-D National Fusion Program is already applying AI to advance reactor design and operations, highlighting promising directions for future development. They provided examples of how to apply AI to active plasma control to avoid disruptive instabilities, using AI-controlled trajectories to avoid tearing modes, and implementing feedback control using machine learning-derived density limits for safer high-density operations. 

One persistent challenge in reactor design involves building the interior “first wall,” which must withstand extreme heat and particle bombardment. Zulfi Alam, corporate vice president of Microsoft Quantum, discussed the potential of using quantum computing in fusion, particularly for addressing material challenges like hydrogen diffusion in reactors.

He noted that silicon nitride shows promise as a barrier to hydrogen and vapor and explained the challenge of binding it to the reaction chamber. He emphasized the potential of quantum computing to improve material prediction and synthesis, enabling more efficient processes. He shared that his team is also investigating advanced silicon nitride materials to protect this critical component from neutron and alpha particle damage—an innovation that could make fusion commercially viable.



Exploring AI’s broader impact on fusion engineering

Lightning talks from Microsoft Research labs addressed the central question of AI’s potential to accelerate fusion research and engineering. Speakers covered a wide range of applications—from using gaming AI for plasma control and robotics for remote maintenance to physics-informed AI for simulating materials and plasma behavior. Closing the session, Archie Manoharan, Microsoft’s director of nuclear engineering for Cloud Operations and Infrastructure, emphasized the need for a comprehensive energy strategy, one that incorporates renewables, efficiency improvements, storage solutions, and carbon-free sources like fusion.

The Summit culminated in a thought-provoking panel discussion moderated by Ade Famoti, featuring Archie Manoharan, Richard Buttery, Steven Cowley, and Chris Bishop, Microsoft Technical Fellow and director of Microsoft Research AI for Science. Their wide-ranging conversation explored the key challenges and opportunities shaping the field of fusion. 

The panel highlighted several themes: the role of new regulatory frameworks that balance innovation with safety and public trust; the importance of materials discovery in developing durable fusion reactor walls; and the game-changing role AI could play in plasma optimization and surrogate modelling of fusion’s underlying physics.

They also examined the importance of global research collaboration, citing projects like the International Thermonuclear Experimental Reactor (ITER), the world’s largest experimental fusion device under construction in southern France, as testbeds for shared progress. One persistent challenge, however, is data scarcity. This prompted a discussion of using physics-informed neural networks as a potential approach to supplement limited experimental data. 

Global collaboration and next steps

Microsoft is collaborating with ITER, using Microsoft 365 Copilot, Azure OpenAI Service, Visual Studio, and GitHub, to help advance the technologies and infrastructure needed to achieve fusion ignition: the critical point where a self-sustaining fusion reaction begins. Microsoft Research is also working with ITER to identify where AI can be applied to model future experiments and optimize the facility’s design and operations. 

Now Microsoft Research has signed a Memorandum of Understanding with the Princeton Plasma Physics Laboratory (PPPL) to foster collaboration through knowledge exchange, workshops, and joint research projects. This effort aims to address key challenges in fusion, materials, plasma control, digital twins, and experiment optimization. Together, Microsoft Research and PPPL will work to drive innovation and advances in these critical areas.

Fusion is a scientific challenge unlike any other and could be key to sustainable energy in the future. We’re excited about the role AI can play in helping make that vision a reality. To learn more, visit the Fusion Summit event page, or connect with us by email at FusionResearch@microsoft.com.

The post Microsoft Fusion Summit explores how AI can accelerate fusion research appeared first on Microsoft Research.

Read More

Abstracts: Societal AI with Xing Xie

Abstracts: Societal AI with Xing Xie

Xing Xie illustrated headshot

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Partner Research Manager Xing Xie joins host Gretchen Huizinga to talk about his work on a white paper called Societal AI: Research Challenges and Opportunities. Part of a larger effort to understand the cultural impact of AI systems, this white paper is a result of a series of global conversations and collaborations on how AI systems interact with and influence human societies. 


Learn more:

Societal AI: Building human-centered AI systems
Microsoft Research Blog, May 2024

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. 

I’m here today with Xing Xie, a partner research manager at Microsoft Research and co-author of a white paper called Societal AI: Research Challenges and Opportunities. This white paper is a result of a series of global conversations and collaborations on how AI systems interact with and impact human societies. Xing Xie, great to have you back on the podcast. Welcome to Abstracts! 


XING XIE: Thank you for having me. 

HUIZINGA: So let’s start with a brief overview of the background for this white paper on Societal AI. In just a few sentences, tell us how the idea came about and what key principles drove the work. 

XIE: The idea for this white paper emerged in response to the shift we are witnessing in the AI landscape. Particularly since the release of ChatGPT in late 2022, these models didn’t just change the pace of AI research, they began reshaping our society, education, economy, and yeah, even the way we understand ourselves. At Microsoft Research Asia, we felt a strong urgency to better understand these changes. Over the past 30 months, we have been actively exploring this frontier in partnership with experts from psychology, sociology, law, and philosophy. This white paper serves three main purposes. First, to document what we have learned. Second, to guide future research directions. And last, to open up an effective communication channel with collaborators across different disciplines. 

HUIZINGA: Research on responsible AI is a relatively new discipline and it’s profoundly multidisciplinary. So tell us about the work that you drew on as you convened this series of workshops and summer schools, research collaborations and interdisciplinary dialogues. What kinds of people did you bring to the table and for what reason? 

XIE: Yeah. Responsible AI actually has been evolving within Microsoft for like about a decade. But with the rise of large language models, the scope and urgency of these challenges have grown exponentially. That’s why we have leaned heavily on interdisciplinary collaboration. For instance, in the Value Compass Project, we worked with philosophers to frame human values in a scientifically actionable way, something essential for aligning AI behavior. In our AI evaluation efforts, we drew from psychometrics to create more principled ways of assessing these systems. And with the sociologists, we have examined how AI affects education and social systems. This joint effort has been central to the work we share in this white paper. 

HUIZINGA: So white papers differ from typical research papers in that they don’t rely on a particular research methodology per se, but you did set, as a backdrop for your work, ten questions for consideration. So how did you decide on these questions and how or by what means did you attempt to answer them? 

XIE: Rather than follow a traditional research methodology, we built this white paper around ten fundamental, foundational research questions. These came from extensive dialogue, not only with social scientists, but also computer scientists working at the technical front of AI. These questions span both directions. First, how AI impacts society, and second, how social science can help solve technical challenges like alignment and safety. They reflect a dynamic agenda that we hope to evolve continuously through real-world engagement and deeper collaboration. 

HUIZINGA: Can you elaborate on… a little bit more on the questions that you chose to investigate as a group or groups in this? 

XIE: Sure, I think I can use the Value Compass Project as one example. In that project, our main goal is to study how we can better align the values of AI models with our human values. Here, one fundamental question is how we define our own human values. There is actually a lot of debate and discussion on this. Fortunately, philosophy and sociology have studied this for hundreds of years. They have defined frameworks such as the theory of basic human values and moral foundations theory. We can borrow that expertise. Actually, we have worked with sociologists and philosophers to borrow this expertise and define a framework that could be usable for AI. And we have worked on developing some initial frameworks and evaluation methods for this. 

HUIZINGA: So one thing that you just said was to frame philosophical issues in a scientifically actionable way. How hard was that? 

XIE: Yeah, it is actually not easy. I think that first of all, social scientists and AI researchers, we… usually we speak different languages. 

HUIZINGA: Right! 

XIE: Our research is at a very different pace. So at the very beginning, we had to find out the best way to talk to each other. We have workshops, we have joint research projects, we have them visit us, and we have supervised some joint interns. Those are all ways we try to find common ground to work together. More specifically, for this value framework, we have tried to understand the latest progress on their side and how to adapt it to an AI context. So, I mean, it’s not easy, but it’s an enjoyable and exciting journey! 

HUIZINGA: Yeah, yeah, yeah. And I want to push in on one other question that I thought was really interesting, which you asked, which was how can we ensure AI systems are safe, reliable, controllable, especially as they become more autonomous? I think this is a big question for a lot of people. What kind of framework did you use to look at that? 

XIE: Yeah, there are many different aspects. Alignment definitely is one. That means making sure we have a way to truly and deeply embed our values into the AI model; even after we define our values, we still need a way to make sure they are actually embedded. And evaluation, I think, is another topic. Even if the AI looks safe and appears to behave well, how can we evaluate that? How can we make sure it is actually doing the right thing? So we also have some collaboration with psychometrics people to define a more scientific evaluation framework for this purpose as well. 

HUIZINGA: Yeah, I remember talking to you about your psychometrics in the previous podcast… 

XIE: Yeah! 

HUIZINGA: …you were on and that was fascinating to me. And I hope… at some point I would love to have a bigger conversation on where you are now with that because I know it’s an evolving field. 

XIE: It’s evolving! 

HUIZINGA: Yeah, amazing! Well, let’s get back to this paper. White papers aren’t designed to produce traditional research findings, as it were, but there are still many important outcomes. So what would you say the most important takeaways or contributions of this paper are? 

XIE: Yeah, the key takeaway, I believe, is AI is no longer just a technical tool. It’s becoming a social actor. 

HUIZINGA: Mmm. 

XIE: So it must be studied as a dynamic evolving system that intersects with human values, cognition, culture, and governance. So we argue that interdisciplinary collaboration is no longer optional. It’s essential. Social sciences offer tools to understand the complexity, bias, and trust, concepts that are critical for AI’s safe and equitable deployment. So the synergy between technical and social perspectives is what will help us move from reactive fixes to proactive design. 

HUIZINGA: Let’s talk a little bit about the impact that a paper like this can have. And it’s more of a thought leadership piece, but who would you say will benefit most from the work that you’ve done in this white paper and why? 

XIE: We hope this work speaks to both AI and social science communities. For AI researchers, this white paper provides frameworks and real-world examples, like value evaluation systems and cross-cultural model training that can inspire new directions. And for social scientists, it opens doors to new tools and collaborative methods for studying human behavior, cognition, and institutions. And beyond academia, we believe policymakers and industry leaders can also benefit as the paper outlines practical governance questions and highlights emerging risks that demand timely attention. 

HUIZINGA: Finally, Xing, what would you say the outstanding challenges are for Societal AI, as you framed it, and how does this paper lay a foundation for future research agendas? Specifically, what kinds of research agendas might you see coming out of this foundational paper? 

XIE: We believe this white paper is not a conclusion; it’s a starting point. While the ten research questions are a strong foundation, they also expose deeper challenges. For example, how do we build a truly interdisciplinary field? How can we reconcile the different timelines, methods, and cultures of AI and social science? And how do we nurture talent that can work fluently across both domains? We hope this white paper encourages others to take on these questions with us. Whether you are a researcher, student, policymaker, or technologist, there is a role for you in shaping AI that not only works but works for society. So yeah, I look forward to the conversation with everyone. 

HUIZINGA: Well, Xing Xie, it’s always fun to talk to you. Thanks for joining us today and to our listeners, thanks for tuning in. If you want to read this white paper, and I highly recommend that you do, you can find a link at aka.ms/Abstracts, or you can find a link in our show notes that will take you to the Microsoft Research website. See you next time on Abstracts!

 

The post Abstracts: Societal AI with Xing Xie appeared first on Microsoft Research.

Read More