Collaborators: Healthcare Innovation to Impact

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

Amid the ongoing surge of AI research, healthcare is emerging as a leading area for real-world transformation. From driving efficiency gains for clinicians to improving patient outcomes, AI is beginning to make a tangible impact. Thousands of scientific papers have explored AI systems capable of analyzing medical documents and images with unprecedented accuracy. The latest work goes even further, showing how healthcare agents can collaborate—with each other and with human doctors—embedding AI directly into clinical workflows. 

In this discussion, we explore how teams across Microsoft are working together to generate advanced AI capabilities and solutions for developers and clinicians around the globe. Leading the conversation are Dr. Matthew Lungren, chief scientific officer for Microsoft Health and Life Sciences, and Jonathan Carlson, vice president and managing director of Microsoft Health Futures—two key leaders behind this collaboration. They’re joined by Smitha Saligrama, principal group engineering manager within Microsoft Health and Life Sciences, Will Guyman, group product manager within Microsoft Health and Life Sciences, and Cameron Runde, a senior strategy manager for Microsoft Research Health Futures—all of whom play crucial roles in turning AI breakthroughs into practical, life-saving innovations. 

Together, these experts examine how Microsoft is helping integrate cutting-edge AI into healthcare workflows—saving time today, and lives tomorrow.


Learn more

Developing next-generation cancer care management with multi-agent orchestration
Source Blog, May 2025

Healthcare Agent Orchestrator
GitHub

Azure AI Foundry Labs
Homepage

Transcript

[MUSIC] 

MATTHEW LUNGREN: You’re listening to Collaborators, a Microsoft Research podcast, showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. Despite the advancements in AI over the decades, generative AI exploded into view in 2022, when ChatGPT became the, sort of, internet browser for AI and became the fastest-adopted consumer software application in history.


JONATHAN CARLSON: From the beginning, healthcare stood out to us as an important opportunity for general reasoners to improve the lives and experiences of patients and providers. Indeed, in the past two years, there’s been an explosion of scientific papers looking at the application first of text reasoners in medicine, then multi-modal reasoners that can interpret medical images, and now, most recently, healthcare agents that can reason with each other. But even more impressive than the pace of research has been the surprisingly rapid diffusion of this technology into real-world clinical workflows. 

LUNGREN: So today, we’ll talk about how our cross-company collaboration has shortened that gap and delivered advanced AI capabilities and solutions into the hands of developers and clinicians around the world, empowering everyone in health and life sciences to achieve more. I’m Doctor Matt Lungren, chief scientific officer for Microsoft Health and Life Sciences. 

CARLSON: And I’m Jonathan Carlson, vice president and managing director of Microsoft Health Futures. 

LUNGREN: And together we brought some key players leading in the space of AI and health care from across Microsoft. Our guests today are Smitha Saligrama, principal group engineering manager within Microsoft Health and Life Sciences, Will Guyman, group product manager within Microsoft Health and Life Sciences, and Cameron Runde, a senior strategy manager for Microsoft Health Futures. 

CARLSON: We’ve asked these brilliant folks to join us because each of them represents a mission-critical group of cutting-edge stakeholders, scaling breakthroughs into purpose-built solutions and capabilities for health care. 

LUNGREN: We’ll hear today how generative AI capabilities can unlock reasoning across every data type in medicine: text, images, waveforms, genomics. And further, how multi-agent frameworks in healthcare can accelerate complex workflows, in some cases acting as a specialist team member, safely secured inside the Microsoft 365 tools used by hundreds of millions of healthcare enterprise users across the world. The opportunity to save time today and lives tomorrow with AI has never been larger.

[MUSIC FADES] 
 
MATTHEW LUNGREN: Jonathan. You know, it’s been really interesting kind of observing Microsoft Research over the decades. I’ve, you know, been watching you guys in my prior academic career. You are always on the front of innovation, particularly in health care. And I find it fascinating that, you know, millions of people are using the solutions that, you know, your team has developed over the years and yet you still find ways to stay cutting edge and state of the art, even in this accelerating time of technology and AI, particularly, how do you do that? [LAUGHS]

 JONATHAN CARLSON: I mean, it’s some of what’s in our DNA, I mean, we’ve been publishing in health and life sciences for two decades here. But when we launched Health Futures as a mission-focused lab about 7 or 8 years ago, we really started with the premise that the way to have impact was to really close the loop between, not just good ideas that get published, but good ideas that can actually be grounded in real problems that clinicians and scientists care about, that then allow us to actually go from that first proof of concept into an incubation, into getting real world feedback that allows us to close that loop. And now with, you know, the HLS organization here as a product group, we have the opportunity to work really closely with you all to not just prove what’s possible in the clinic or in the lab, but actually start scaling that into the broader community. 

CAMERON RUNDE: And one thing I’ll add here is that the problems that we’re trying to tackle in health care are extremely complex. And so, as Jonathan said, it’s really important that we come together and collaborate across disciplines, across the company at Microsoft, and with our external collaborators, as well as across the whole industry. 

CARLSON: So, Matt, back to you. What are you guys doing in the product group? How do you guys see these models getting into the clinic?

LUNGREN: You know, I think a lot of people, you know, think of AI as, you know, maybe just a few years old because of GPT and how that really captured the public’s consciousness. Right?

And so, you think about the speech-to-text technology of being able to dictate something for a clinic note or for a visit; that was typically based on Nuance technology. And so there’s a lot of product understanding of the market, how to deliver something that clinicians will use, understanding the pain points and workflows, and really that health IT space, which is sometimes the third rail, I feel, with a lot of innovation in healthcare. 

But beyond that, I mean, I think now that we have this really powerful engine of Microsoft and the platform capabilities, we’re seeing innovations on the healthcare side for data storage and data interoperability with different types of medical data. You have new applications coming online, the ability, of course, to see generative AI now infused into the speech-to-text and becoming Dragon Copilot, which is something that has been, you know, tremendously well received by the community. 

Physicians are able to now just have a conversation with a patient. They turn to their computer and the note is ready for them. There’s no more of this; we call it keyboard liberation. I don’t know if you’ve heard that before. And that’s just been tremendous. And there’s so much more coming from that side. And then there’s other parts of the workflow that we also get engaged in — the diagnostic workflow.

So medical imaging, sharing images across different hospital systems, the list goes on. And so now when you move into AI, we feel like there’s a huge opportunity to deliver capabilities into the clinical workflow via the products and solutions we already have. But, I mean, well, now that we’ve kind of expanded our team to involve Azure and platform, we’re really able to focus on the developers.

WILL GUYMAN: Yeah. And you’re always telling me as a doctor how frustrating it is to be spending time at the computer instead of with your patients. I think you told me, you know, 4,000 clicks a day for the typical doctor, which is tremendous. And something like Dragon Copilot can save that five minutes per patient. But it can also now take actions after the patient encounter so it can draft the after-visit summary. 

It can order labs and medications for the referral. And that’s incredible. And we want to keep building on that. There’s so many other use cases across the ecosystem. And so that’s why in Azure AI Foundry, we have translated a lot of the research from Microsoft Research and made that available to developers to build and customize for their own applications. 

SMITHA SALIGRAMA: Yeah. And as you were saying, in our transformation from solutions to platforms, and in scaling solutions to multiple other scenarios, as we put our models in AI Foundry, we provide these developer capabilities, like bringing your own data and fine-tuning these models, and then applying them to scenarios that we couldn’t even imagine. So that’s kind of the platform play we’re scaling now. 

LUNGREN: Well, I want to do a reality check because, you know, I think to those of us now really focused on technology, it seems like I’ve heard this story before, right. I remember even in my academic clinical days, it felt like technology was always the quick answer, and there was maybe a disconnect between what my problems were, or what I thought needed to be done, versus the solutions that were created or offered to us. And I guess at some level, how, Jonathan, do you think about this? Because to do things well in the science space is one thing, but to then also have it be something that actually drives health care innovation and practice and translation, it’s tricky, right?

CARLSON: Yeah. I mean, as you said, I think one of the core pathologies of Big Tech is that we assume every problem is a technology problem, and that that’s all it will take to solve the problem. And I think, look, I was trained as a computational biologist, and that sits in the awkward middle between biology and computation. And the thing that we always have to remember, the thing that we were very acutely aware of when we set out, was that we are not the experts. We do have, you know, you as an M.D., we have everybody on the team, we have biologists on the team. 

But this is a big space. And the only way we’re going to have real impact, the only way we’re even going to pick the right problems to work on is if we really partner deeply, with providers, with EHR (electronic health records) vendors, with scientists, and really understand what’s important and again, get that feedback loop. 

RUNDE: Yeah, I think we really need to ground the work that we do in the science itself. You need to understand the broader ecosystem and the broader landscape across health care and life sciences, so that we can tackle the most important problems, not just the problems that we think are important. Because, as Jonathan said, we’re not the experts in health care and life sciences. And that’s really the secret sauce: when you have the clinical expertise come together with the technical expertise, that’s how you really accelerate health care.

CARLSON: When we really launched this, this mission, 7 or 8 years ago, we really came in with the premise of, if we decide to stop, we want to be sure the world cares. And the only way that’s going to be true is if we’re really deeply embedded with the people that matter–the patients, the providers and the scientists.

LUNGREN: And now it really feels like this collaborative effort, you know, really can help start to extend that mission. Right. I think, you know, Will and Smitha, that we definitely feel the passion and the innovation. And we certainly benefit from those collaborations, too. But then we have these other partners and even customers, right, that we can start to tap into and have that flywheel keep spinning. 

GUYMAN: Yeah. And the whole industry is an ecosystem. So, we have our own data sets at Microsoft Research that you’ve trained amazing AI models with. And those are in the catalog. But then you’ve also partnered with institutions like Providence or Paige AI. And those models are in the catalog with their data. And then there are third parties like Nvidia that have their own specialized proprietary data sets, and their models are there too. So, we have this ecosystem of open-source models. And maybe Smitha, you want to talk about how developers can actually customize these. 

SALIGRAMA: Yeah. So we use the Azure AI Foundry ecosystem. Developers can feel at home if they’re using AI Foundry. They can look at the model cards we publish with our models, understand the use cases of these models, see how to quickly bring up these APIs, look at different ways to apply them, and even fine-tune these models with their own data. Right. And then use them for specific tasks that we couldn’t have even imagined. 

LUNGREN: Yeah, it has been interesting to see. We have these health care models in the catalog, again, some that came from research, some that came from third parties and other product developers, and Azure is kind of becoming the home base, I think, for a lot of health and life science developers. They’re seeing all the different modalities, all the different capabilities. And then in combination with Azure OpenAI, which, as we know, is incredibly competent in lots of different use cases. How are you looking at the use cases, and what are you seeing folks use these models for as they come to the catalog and start sharing their discoveries or products? 

GUYMAN: Well, the general-purpose large language models are amazing for general medical reasoning. So Microsoft Research has shown that they can perform super well on, for example, the United States Medical Licensing Exam; they can exceed doctor performance if they’re just picking between different multiple-choice questions. But real medicine, we know, is messier. It doesn’t always start with the whole patient context provided as text in the prompt. You have to get the source data, and that raw data is often non-text. The majority of it is non-text. It’s things like medical imaging, radiology, pathology, ophthalmology, dermatology; it goes on and on. And there’s endless signal data, lab data. And so all of these diverse data types need to be processed through specialized models, because much of that data is not available on the public internet. 

And that’s why we’re taking this partner approach, first party and third party models that can interpret all this kind of data and then connect them ultimately back to these general reasoners to reason over that. 

LUNGREN: So, you know, I’ve been at this company for a while and, you know, familiar with kind of how long it takes, generally, to get, you know, a really good research paper, do all the studies, do all the data analysis, and then go through the process of publishing, right, which takes, as you know, a long time, and it’s, you know, very rigorous. 

And one of the things that struck me, last year, I think we, we started this big collaboration and, within a quarter, you had a Nature paper coming out from Microsoft Research, and that model that the Nature paper was describing was ready to be used by anyone on the Azure AI Foundry within that same quarter. It kind of blew my mind when I thought about it, you know, even though we were all, you know, working very hard to get that done. Any thoughts on that? I mean, has this ever happened in your career? And, you know, what’s the secret sauce to that? 

CARLSON: Yeah, I mean, the time scale from research to product has been massively compressed. And I’d push that even further, which is to say, the reason why it took a quarter was because we were laying the railroad tracks as we’re driving the train. We have examples right after that when we are launching on Foundry the same day we were publishing the paper. 

And frankly, the review times are becoming longer than it takes to actually productize the models. I think there are two things going on that are really converging. One is that the overall ecosystem is converging on a relatively small number of patterns, and that gives us, as a tech company, a reason to go off and really harden those patterns in a way that allows not just us, but third parties as well, to have a nice workflow to publish these models. 

But the other is actually, I think, a change in how we work, you know, and for most of our history as an industrial research lab, we would do research and then we’d go pitch it to somebody and try and throw it over the fence. We’ve really built a much more integrated team. In fact, if you look at that Nature paper or any of the other papers, there’s folks from product teams. Many of you are on the papers along with our clinical collaborators.

RUNDE: Yeah. I think one thing that’s really important to note is that there’s a ton of different ways that you can have impact, right? So I like to think about phasing. In Health Futures at least, I like to think about phasing the work that we do. So first we have research, which is really early innovation. And the impact there is getting our technology and our tools out there and really sharing the learnings that we’ve had. 

So that can be through publications, like you mentioned. It can be through open-sourcing our models. And then you go to incubation. This is, I think, one of the newer spaces that we’re getting into, which is maybe that blurred line between research and product. Right. Which is, how do we take the tools and technologies that we’ve built and get them into the hands of users, typically through our partnerships? 

Right. So, we partner very deeply and collaborate very deeply across the industry. And incubation is really important because we get that early feedback. We get an ability to pivot if we need to. And we also get the ability to see what types of impact our technology is having in the real world. And then lastly, when you think about scale, there’s tons of different ways that you can scale. We can scale third-party through our collaborators and really empower them to go to market to commercialize the things that we’ve built together. 

You can also think about scaling internally, which is why I’m so thankful that we’ve created this flywheel between research and product, and a lot of the models that we’ve built, that have gone through research and through incubation, have been able to scale on the Azure AI Foundry. But that scale piece is not really our expertise in research; ours is research and incubation. Smitha, how do you think about scaling? 

SALIGRAMA: So, there are several angles to scaling the state-of-the-art models we see from the research team. The first angle is open sourcing, to get developer trust, with very generous commercial licenses so that developers can use them for their own use cases. The second is, we also allow them to customize these models, fine-tuning them with their own data. So there are a lot of different angles to how we provide support and scaling for the state-of-the-art models we get from the research org.

GUYMAN: And as one example, you know, University of Wisconsin Health, which Matt knows well. They took one of our models, which is highly versatile, customized it in Foundry, and optimized it to reliably identify abnormal chest X-rays, the most common imaging procedure, so they could improve their turnaround time and triage quickly. And that’s just one example. But we have other partners like Sectra, who are doing more operations use cases, automatically routing imaging to the radiologists, setting them up to be efficient. And then Paige AI is doing biomarker identification for diagnostics and new drug discovery. So, there are so many use cases that partners are already building and customizing for.

LUNGREN: The part that’s striking to me is just that, you know, we could all sit in a room and think about all the different ways someone might use these models on the catalog, and I’m still shocked at the stuff that people use them for and how effective they are. And I think part of that is, you know, again, we talk a lot about generative AI in healthcare and all the things you can do with text, you referred to that earlier, and certainly off the shelf there are really powerful applications. But there is, you know, kind of this tip-of-the-iceberg effect where, under the water, most of the data that we use to take care of our patients is not text. Right. It’s all the different other modalities. And I think that this has been an unlock, right, sort of taking these innovations from the community, putting them in this ecosystem, this catalog, essentially, and then allowing folks to build and develop applications with all these different types of data. Again, I’ve been surprised at what I’m seeing. 

CARLSON: This has been just one of the most profound shifts that’s happened in the last 12 months, really. I mean, two years ago we had general models in text that really shifted how we think about… I mean, natural language processing got totally upended by that. It turns out the same technology works for images as well. It not only allows you to automatically extract concepts from images, but allows you to align those image concepts with text concepts, which means that you can have a conversation with that image. And once you’re in that world, you are in a place where you can start stitching together these multimodal models that really change how you can interact with the data, and how you can start getting more information out of the raw primary data that is part of the patient journey.

LUNGREN: Well, and we’re going to get to that, because I think you just touched on something, and I want to re-emphasize stitching these things together. There are a lot of different ways to potentially do that. Right? There are ways that you can literally train the model end to end with adapters and all kinds of other early fusion approaches. All kinds of ways. But the word of the year, I guess, is going to be agents, and an agent is a very interesting term to think about how you might abstract away some of the components or the tasks that you want the model to accomplish in the midst of a real human-to-model interaction. Can you talk a little bit more about how we’re thinking about agents in this platform approach? 
 
GUYMAN: Well, this is our newest addition to the Azure AI Foundry. So there’s an agent catalog now, where we have a set of pre-configured agents for health care. And then we also have a multi-agent orchestrator that can jump-start the process of developers building their own multi-agent workflows to tackle some complex real-world tasks that clinicians have to deal with. And these agents basically combine a general reasoner, like a large language model, such as GPT-4o or an o-series model, with a specialized model, like a model that understands radiology or pathology, with domain-specific knowledge and tools. So the knowledge might be, you know, public guidelines or medical journals, or your own private data from your EHR or medical imaging system, and then tools like a code interpreter to deal with all of the numeric data, or the tools that clinicians are using today, like PowerPoint, Word, Teams, etc. And so we’re allowing developers to build and customize each of these agents in Foundry and then deploy them into their workflows. 

LUNGREN: And I really like that concept because, you know, from the user personas, I think about myself as a user. How am I going to interact with these agents? Where does it naturally fit? And I’ve seen some of the demonstrations and some of the work that’s going on with Stanford in particular, showing that, you know, literally in a Teams chat, I can have my clinician colleagues and I can have specialized health care agents that kind of interact, like I’m interacting with a human on a chat.

It is a completely mind-blowing thing for me, and it’s a light bulb moment for me too. I wonder, what have we heard from folks that have, you know, tried out this health care agent orchestrator in this kind of deployment environment via Teams?

GUYMAN: Well, someone joked, you know, are you sure you’re not using Teams because you work at Microsoft? [LAUGHS] But then we actually were meeting with one of the radiologists at one of our partners, and they said that that morning they had just done a Teams meeting, where they had met with other specialists to talk about a patient’s cancer case and come up with a treatment plan. 

And that was the light bulb moment for us. We realized, actually, Teams is already being used by physicians as an internal communication tool, as a tool to get work done. And especially since the pandemic, a lot of the meetings moved to virtual and telemedicine. And so it’s a great distribution channel for AI, which has often been a struggle, actually getting AI into the hands of clinicians. And so now we’re allowing developers to build and then deploy very easily and extend it into their own workflows. 

CARLSON: I think that’s such an important point. I mean, one of the really important concepts in computer science is an application programming interface, some set of rules that allow two applications to talk to each other. One of the big pushes, really important pushes, in medicine has been standards, data standards and APIs, that allow these systems to talk to each other, and yet still we end up with these silos. There are silos of data. There are silos of applications.

And just like when you and I work on our phone, we have to go back and forth between applications. One of the things that I think agents do is that it takes the idea that now you can use language to understand intent and effectively program an interface, and it creates a whole new abstraction layer that allows us to simplify the interaction between not just humans and the endpoint, but also for developers. 

It allows us to have this abstraction layer that lets different developers focus on different types of models, and yet stitch them all together in a very, very natural way, not just for the users, but for the ability to actually deploy those models. 

SALIGRAMA: Just to add to what Jonathan was mentioning, the other cool thing about the Microsoft Teams user interface is it’s also enterprise ready.

RUNDE: And one important thing that we’re thinking about is exactly this, from the very early research through incubation and then to scale, obviously. Right. And so early on in research, we are actively working with our partners and our collaborators to make sure that we have the right data privacy and consent in place. We’re doing this in incubation as well, and then obviously in scale. Yep. 

LUNGREN: So, I think AI has always been thought of as a savior kind of technology. We talked a little bit about how there’s been some ups and downs in terms of the ability for technology to be effective in health care. At the same time, we’re seeing a lot of new innovations that are really making a difference. But then we kind of get, you know, we talked about agents a little bit. It feels like we’re maybe abstracting too far. Maybe it’s things are going too fast, almost. What makes this different? I mean, in your mind is this truly a logical next step or is it going to take some time? 

CARLSON: I think there are a couple of things that have happened. First, on just the pure technology: what led to ChatGPT? I like to think of really three major breakthroughs.

The first was new mathematical concepts of attention, which really means that we now have a way that a machine can figure out which parts of the context it should actually focus on, just the way our brains do. Right? I mean, if you’re a clinician and somebody is talking to you, the majority of that conversation is not relevant for the diagnosis, but you know how to zoom in on the parts that matter. That’s a super powerful mathematical concept. The second one is this idea of self-supervision. I think one of the fundamental problems of machine learning has been that you have to train on labeled training data, and labels are expensive, which means data sets are small, which means the final models are very narrow and brittle. The idea of self-supervision is that you can just get a model to automatically learn concepts, and in language that’s just predicting the next word. What’s important about that is that it leads to models that can actually manipulate and understand really messy text, pull out what’s important, and then stitch that back together in interesting ways.

And the third concept, which came out of those first two, was just the observation about scale: more is better, more data, more compute, bigger models. And that really gives a reason to keep investing and for these models to keep getting better. So that, as a groundwork, is what led to ChatGPT. That’s what led to our ability now to not just have rule-based systems, or simple machine learning based systems, that take a messy EHR record, say, and pull out a couple of concepts.

But to really feed the whole thing in and say, okay, I need you to figure out which concepts are in here, and is this particular attribute there, for example. That’s now led to the next breakthrough, which is that all those core ideas apply to images as well. They apply to proteins, to DNA. And so we’re starting to see models that understand images and the concepts of images, and can actually map those back to text as well. 

So, you can look at a pathology image and say, not just that there’s a cell, but that it appears there’s a certain sort of cancer in this particular tissue. And then you take those two things together and you layer on the fact that now you have a model, or a set of models, that can understand intent, can understand human concepts and biomedical concepts, and you can start stitching them together into specialized agents that can actually reason with each other. Which at some level gives you an API as a developer to say, okay, I need to focus on a pathology model and get this really, really sound, while somebody else is focusing on a radiology model, but now allows us to stitch these all together with a user interface that we can talk to through natural language. 

RUNDE: I’d like to double click a little bit on that medical abstraction piece that you mentioned. Just the amount of clinical data that there is for each individual patient. Let’s think about cancer patients for a second to make this real. For every cancer patient, it could take a couple of hours to structure their information. And why is that important? Because you have to get that information in a structured way and abstract relevant information to be able to unlock precision health applications, right, for each patient. So, to be able to match them to a trial, someone has to sit there and go through all of the clinical notes from their entire patient care journey, from the beginning to the end. And that’s not scalable. And so one thing that we’ve been doing in an active project that we’ve been working on with a handful of our partners, but Providence specifically I’ll call out, is using AI to actually abstract and curate that information. So that gives time back to the health care provider to spend with patients, instead of spending all their time curating this information. 

And this is super important because it sets the scene and the backbone for all those precision health applications. Like I mentioned, clinical trial matching, tumor boards is another really important example here. Maybe Matt, you can talk to that a little bit.

LUNGREN: It’s a great example. And you know, it’s so funny, we’ve talked about this use case, and, you know, the health care agent orchestrator’s initial lighthouse use case was a tumor board setting. And I remember when we first started working with some of the partners on this, I think we were, you know, under a research kind of lens, thinking about what new diagnoses it could come up with or what new insights it might have. And what was a really key moment for us, I think, was noticing that we had developed an agent that can take all of the multimodal data about a patient’s chart, organize it in a timeline, in chronological fashion, and then allow folks to click on different parts of the timeline to ground it back to the note. And just that, which doesn’t sound like a really interesting research paper, was mind blowing for clinicians who, again, as you said, spend a great deal of time, often outside of the typical work hours, trying to organize these patient records in order to go present to a tumor board.

And a tumor board is a critical meeting that happens at many cancer centers where specialists all get together, come with their perspective, and make a comment on what would be the best next step in treatment. But the background in preparing for that is, you know, again, organizing the data. But to your point, also, what are the clinical trials that are active? There are thousands of clinical trials, with hundreds added every day. How can anyone keep up with that? And these are the kinds of use cases that start to bubble up. And you realize that a technology that understands concepts and context and can reason over vast amounts of data with a language interface, that is a powerful tool. Even before we get to some of the, you know, unlocking of new insights and even precision medicine, this is that idea of saving time before saving lives, to me. And there’s an enormous amount of undifferentiated heavy lifting that happens in health care that these agents and these kinds of workflows can start to unlock. 

GUYMAN: And we’ve packaged these agents. The manual abstraction work that, you know, takes hours, we now have an agent for. It’s in Foundry, along with the clinical trial matching agent, which I think at Providence you showed could double the match rate over the baseline that they were using, by using the AI across multiple data sources. So, we have that, and then we have this orchestration that is using this really neat technology from Microsoft Research: Semantic Kernel, Magentic-One, OmniParser. These are technologies that are good at figuring out which agent to use for a given task. So a clinician who’s used to working with other specialists, like a radiologist, a pathologist, a surgeon, can now also consult these specialist agents who are experts in their domain, and there’s shared memory across the agents. 

There’s turn taking, there’s negotiation between the agents. So, there’s this really interesting system that’s emerging. And again, this is all possible to be used through Teams. And there’s some great extensibility as well. We’ve been talking about that and working on some cool tools. 

SALIGRAMA: Yeah. Yeah. No, if I have to geek out a little bit on how all these agentic orchestrations are coming up: I’ve been in software engineering for decades, and it’s kind of the next version of distributed systems, where you have these services that talk to each other. It’s a more natural way, because LLMs give us natural language instead of structured APIs for conversing. We have these agents which can naturally understand how to talk to each other. Right. So this is like the next evolution of our systems. And we’re packaging all of this in multiple ways, based on all the standards and innovation that’s happening in this space. So, first of all, we are building these agents that are very good at specific tasks, like, as Will was saying, a trial matching agent or patient timeline agents. 

So, we take all of these, and then we package them in a workflow and an orchestration. We use the standards, some of these coming from research: Semantic Kernel, Magentic-One. And then, all of these also allow us to extend the orchestration with custom agents that can be plugged in. So, we are open sourcing the entire agent orchestration in AI Foundry templates, so that developers can extend it with their own agents and make their own workflows out of it. So, a lot of cool innovation is happening to apply this technology to specific scenarios and workflows. 

LUNGREN: Well, I was going to ask you, as part of that extension, folks can say, hey, I have maybe a really specific part of my workflow that I want to use some agents for, maybe one of the agents that can do PubMed literature search, for example. But then there are also agents that come in from the outside, you know; I can imagine a software company or AI company that has a built-in agent that plugs in as well. 

SALIGRAMA: Yeah. Yeah, absolutely. So, you can bring your own agent. And then we have these standard ways of communicating with agents and integrating with the orchestration language, so you can bring your own agent and extend this health care agent orchestrator to your own needs. 

LUNGREN: I can just think of, like, in a group chat, like a bunch of different specialist agents. And I really would want an orchestrator to help find the right tool, to your point earlier, because I’m guessing this ecosystem is going to expand quickly. Yeah. And I may not know which tool is best for which question. I just want to ask the question. Right. 

SALIGRAMA: Yeah. Yeah. 

CARLSON: Well, I think to that point too, I mean, you said an important thing here, which is tools, and these are not necessarily just AI tools. Right? I mean, we’ve known this for a while, right? LLMs are not very good at math, but you can have one use a calculator and then it works very well. And you know, you guys have both brought up the universal medical abstraction a couple of times. 

And one of the things that I find so powerful about that is we’ve long had this vision within the precision health community that we should be able to have a learning hospital system. We should be able to actually learn from the actual real clinical experiences that are happening every day, so that we can stop practicing medicine based off averages. 

There’s a lot of work that’s gone on for the last 20 years about how to actually do causal inference. That’s not an AI question. That’s a statistical question. The bottleneck, the reason why we haven’t been able to do that is because most of that information is locked up in unstructured text. And these other tools need essentially a table. 

And so now you can decompose this problem and say, well, what if I can use AI not to get to the causal answer, but to just structure the information, so now I can put it into the causal inference tool. And these sorts of patterns, I think, become not just powerful for a programmer; they start pulling together different specialties. And I think we’ll really see an acceleration of collaboration across disciplines because of this. 

CARLSON: So, when I joined Microsoft Research 18 years ago, I was doing work in computational biology. And I would always have to answer the question: why is Microsoft in biomedicine? And I would always kind of joke, saying, well, we sell Office and Windows to every health care system in the world; we’re already in the space. And it really struck me to now see that we’ve actually come full circle. And now you can actually connect in Teams, Word, PowerPoint, which are these tools that everybody uses every day, but they’re actually now specializable through these agents. Can you guys talk a little bit about what that looks like from a developer perspective? How can provider groups actually start playing with this and see this come to life? 

SALIGRAMA: A lot of healthcare organizations already use Microsoft productivity tools, as you mentioned. So, as developers build these agents and use our healthcare orchestration to plug them in and expose them in these productivity tools, they get access to all these healthcare workers. The healthcare agent orchestrator we have today integrates with Microsoft Teams, and it showcases an example of how you can at (@) mention these agents and talk to them like you were talking to another person in a Teams chat. And then it also provides examples of how these agents can use these productivity tools. One of the examples we have there is how they can summarize the assessments of a whole chat into a Word doc, or even convert that into a PowerPoint presentation, for later on.

CARLSON: One of the things that has struck me is how easy it is to do. I mean, Will, I don’t know if you’ve worked with folks that have gone from 0 to 60, like, how fast? What does that look like? 

GUYMAN: Yeah, it’s funny; for us, the technology to transfer all this context into a Word document or PowerPoint presentation for a doctor to take to a meeting is relatively straightforward compared to the complicated multimodal processing for clinical trial matching. The feedback has been tremendous, in terms of, wow, that saves so much time to have this organized report that I can show up to a meeting with. And the agents can come with me to that meeting, because it’s literally a Teams meeting, often with other human specialists, and the agents can be there and ask and answer questions, fact check, and source all the right information on the fly. So, there’s a nice integration into these existing tools. 

LUNGREN: We worked with several different centers just to kind of understand, you know, where this might be useful. And, like, as I think we talked about before, of the ideas that we’ve come up with, this is a great one because it’s complex. It’s kind of hairy. There are a lot of things happening under the hood that don’t necessarily require a medical license to do, right, to prepare for a tumor board and to organize data. But it’s fascinating, actually. So, you know, folks have come up with ideas of, could I have an agent that can operate an MRI machine, so I can ask the agent to change some parameters or redo a protocol. We thought that was a pretty powerful use case. We’ve had others who have said, you know, I really want to have a specific agent that’s able to kind of act like deep research does for the consumer side, but based on the context of my patient, so that it can search all the literature and pull the data and the papers that are relevant to this case. And the list goes on and on, from operations all the way to clinical, you know, decision making at some level. And I think that the research community that’s going to sprout around this will help guide us to see what are the most high-impact use cases, where this is effective, and maybe where it’s not effective.

But to me, the part that makes me so, I guess, excited about this is just that I don’t have to think about, okay, well, then we have to figure out health IT. Because, you know, we always have great ideas in research, and it always feels like there’s such a huge chasm to get them in front of the health care workers who might want to test them out. And it feels like, again, this productivity tool use case, with the enterprise security and the possibility of bringing in third parties to contribute, really does feel like a new surface area for innovation.

CARLSON: Yeah, I love that. Look. Let me end by putting you all on the spot. So, in three years, multimodal agents will do what? Matt, I’ll start with you. 

LUNGREN: I am convinced that it’s going to save a massive amount of time before it saves many lives. 

RUNDE: I’ll focus on the patient care journey and diagnostic journey. I think it will kind of transform that process for the patient itself and shorten that process. 

GUYMAN: Yeah, I think we’ve already seen papers recently showing that different modalities surface complementary information. And so we’ll see kind of this AI and these agents becoming an essential companion to the physician, surfacing insights that would have been overlooked otherwise. 

SALIGRAMA: And similar to what you guys were saying, agents will become important assistants to healthcare workers, reducing a lot of the documentation and excess workflow they have to do. 

CARLSON: I love that. And I guess for my part, I think really what we’re going to see is a massive unleash of creativity. We’ve had a lot of folks that have been innovating in this space, but they haven’t had a way to actually get it into the hands of early adopters. And I think we’re going to see that really lead to an explosion of creativity across the ecosystem. 

LUNGREN: So, where do we get started? Like where are the developers who are listening to this? The folks that are at, you know, labs, research labs and developing health care solutions. Where do they go to get started with the Foundry, the models we’ve talked about, the healthcare agent orchestrator. Where do they go?

GUYMAN: So, AI.azure.com is the AI Foundry. It’s a website you can go to as a developer. You can sign in with your Azure subscription, get your Azure account, your own VM, all that stuff. And you have the agent catalog, the model catalog. You can start from there. There’s documentation, and there are templates that you can then deploy to Teams or other applications. 

LUNGREN: And tutorials are coming, right? We have recordings of tutorials. We’ll have hackathons, some sessions, and then more to come. Yeah, we’re really excited. 

[MUSIC] 

LUNGREN: Thank you so much, guys for joining us. 

CARLSON: Yes. Yeah. Thanks. 

SALIGRAMA: Thanks for having us. 

[MUSIC FADES] 


Magentic-UI, an experimental human-centered web agent


Modern productivity is rooted in the web—from searching for information and filling in forms to navigating dashboards. Yet, many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for tasks that are action-oriented and require activities beyond just performing simple web searches.

Magentic-UI builds on Magentic-One, a powerful multi-agent team we released last year, and is powered by AutoGen, our leading agent framework. It is available under MIT license at https://github.com/microsoft/Magentic-UI and on Azure AI Foundry Labs, the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample.

Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:

  1. Collaborative planning with users (co-planning). Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions. 
  2. Collaborative execution with users (co-tasking). Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
  3. Safety with human-in-the-loop (action guards). Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
  4. Learning from experience (plan learning). Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks. 
Figure 1: Screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan and progress to accomplish a user’s complex goal. The right side shows the browser Magentic-UI is controlling. 

How is Magentic-UI human-centered?

While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design. 

Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.

For example, after a person specifies a task and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial, as users may have expectations of how the task should be completed; communicating that information could significantly improve agent performance. We call this feature co-planning.
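As a concrete way to picture co-planning, the hypothetical sketch below models a plan as a list of editable steps behind an approval flag: the agent proposes, the user edits, and nothing executes until approval is given. The class and field names here (Plan, PlanStep, approved) are illustrative only and do not reflect Magentic-UI's actual data structures.

```python
# Hypothetical illustration of co-planning; not Magentic-UI's real plan format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanStep:
    description: str
    assigned_to: str = "WebSurfer"  # which agent would carry out the step


@dataclass
class Plan:
    task: str
    steps: List[PlanStep] = field(default_factory=list)
    approved: bool = False  # execution stays blocked until this is True


def propose_plan(task: str) -> Plan:
    # In Magentic-UI the Orchestrator's LLM drafts the plan; here it is canned.
    return Plan(task=task, steps=[
        PlanStep("Search for the item on the retailer's website"),
        PlanStep("Compare prices across the first page of results"),
        PlanStep("Summarize the three cheapest options", assigned_to="Coder"),
    ])


plan = propose_plan("Find the cheapest USB-C hub")
# The user edits the plan before anything executes (co-planning).
plan.steps.insert(1, PlanStep("Filter to hubs with at least two USB-A ports"))
plan.approved = True  # final user approval; only now would execution begin
```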

During execution, Magentic-UI shows in real time what specific actions it’s about to take. For example, whether it is about to click on a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.

Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent.
Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important. 

Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky (e.g., paying for an item), they can reject it. 
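Conceptually, an action guard is just a policy check that sits between the agent deciding on an action and the browser executing it. The sketch below is a minimal illustration of that idea under assumed names (ApprovalPolicy, ActionGuard, ask_user); it is not Magentic-UI's API.

```python
# Minimal, hypothetical illustration of an action guard; not Magentic-UI's API.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class ApprovalPolicy(Enum):
    ALWAYS_ASK = "always_ask"                # ask before every action
    IRREVERSIBLE_ONLY = "irreversible_only"  # ask only for consequential actions


@dataclass
class ActionGuard:
    policy: ApprovalPolicy

    def needs_approval(self, action: Dict) -> bool:
        if self.policy is ApprovalPolicy.ALWAYS_ASK:
            return True
        # e.g., closing a tab, submitting a form, or paying for an item
        return bool(action.get("irreversible", False))


def perform(action: Dict) -> None:
    # Placeholder for the browser automation that would actually click or type.
    print(f"performing: {action['description']}")


def execute(action: Dict, guard: ActionGuard, ask_user: Callable[[Dict], bool]) -> bool:
    """Run the action only if the guard's policy and the user allow it."""
    if guard.needs_approval(action) and not ask_user(action):
        return False  # user rejected the action (e.g., a payment)
    perform(action)
    return True


guard = ActionGuard(ApprovalPolicy.IRREVERSIBLE_ONLY)
execute({"description": "click 'Place order'", "irreversible": True},
        guard, ask_user=lambda a: False)  # a risky action the user rejects
```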

Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.

After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans for Magentic-UI to reuse in the future in a saved-plans gallery. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight. 

Combined, these four features—co-planning, co-tasking, action guards, and plan learning—enable users to collaborate effectively with Magentic-UI.

Architecture

Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:

  • Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
  • WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator.
  • Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
  • FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.
Figure 6: System architecture diagram of Magentic-UI
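Because these agents are adapted from AutoGen's Magentic-One, the AutoGen APIs give a rough sense of how such a team is wired together. The sketch below is illustrative only, based on the autogen-agentchat and autogen-ext packages as documented at the time of writing; it is not Magentic-UI's source, and class names or signatures may differ in current releases.

```python
# Illustrative only: a Magentic-One-style team assembled with AutoGen
# (autogen-agentchat / autogen-ext), not Magentic-UI's actual implementation.
import asyncio

from autogen_agentchat.teams import MagenticOneGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.agents.web_surfer import MultimodalWebSurfer
from autogen_ext.models.openai import OpenAIChatCompletionClient


async def main() -> None:
    # LLM client shared by the orchestrator and the worker agent.
    model_client = OpenAIChatCompletionClient(model="gpt-4o")

    # A browser-controlling agent, analogous to Magentic-UI's WebSurfer.
    web_surfer = MultimodalWebSurfer("WebSurfer", model_client=model_client)

    # The Magentic-One orchestrator plans, delegates steps, and tracks progress.
    team = MagenticOneGroupChat([web_surfer], model_client=model_client)

    # Stream the team's intermediate steps to the console while it works.
    await Console(team.run_stream(
        task="Find the current weather forecast for Seattle this weekend."))


if __name__ == "__main__":
    asyncio.run(main())
```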

To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, and regenerate steps, and write follow-up messages to iterate on the plan. While having the user edit the plan adds an upfront cost to the interaction, it can potentially save a significant amount of time during execution and increase the plan's chance of success.

The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents (WebSurfer, Coder, FileSurfer) or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response. After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step.

Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate (for example, because a certain website is unreachable), the Orchestrator can replan with user permission and start executing a new plan.
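
To make this control flow concrete, here is a minimal illustrative sketch of the Orchestrator loop described above. It is not the actual Magentic-UI implementation: the agent names mirror the ones listed earlier, but the Step class, the llm_judge helper, and the ask_user callback are hypothetical stand-ins.

# Illustrative sketch of the Orchestrator loop; not the actual Magentic-UI code.
# The Step class, llm_judge helper, and ask_user callback are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    assignee: str  # "WebSurfer", "Coder", "FileSurfer", or "user"

def run_plan(plan, agents, ask_user, llm_judge, max_replans=3):
    """Execute a plan step by step, delegating each step to an agent or the user,
    checking completion after every response, and replanning with user permission
    if the plan turns out to be inadequate."""
    for _ in range(max_replans):
        for step in plan:
            # Delegate the step to a specialized agent or to the human user.
            if step.assignee == "user":
                response = ask_user(f"Please complete this step: {step.description}")
            else:
                response = agents[step.assignee].run(step.description)

            # The Orchestrator decides whether the step is complete before moving on.
            if not llm_judge.is_step_complete(step.description, response):
                # The plan looks inadequate (e.g., a website is unreachable):
                # ask the user for permission to replan and start over.
                if ask_user("The current plan seems inadequate. Replan?") == "yes":
                    plan = llm_judge.replan(plan, response)
                    break  # restart with the new plan
                return None  # user declined; stop without a final answer
        else:
            # All steps completed: generate the final answer for the user.
            return llm_judge.final_answer(plan)
    return None

In Magentic-UI itself, this loop is also interleaved with the pause, feedback, and action-approval mechanisms described below.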

All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions (e.g., clicking a button) require approval.

Evaluating Magentic-UI

Magentic-UI innovates through its ability to integrate human feedback into its planning and execution of tasks. To showcase this ability, we performed a preliminary automated evaluation on the GAIA benchmark (opens in new tab) for agents, using a user-simulation experiment.

Evaluation with simulated users

Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the Magentic-UI agents, Magentic-UI with a simulated user that has access to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.

GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer. 

To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.

We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the (simulated) user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead.

The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we expect a human would act. The first type of simulated user relies on OpenAI’s o4-mini, which is more performant at many tasks than the model powering the Magentic-UI agents (GPT-4o). For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the simulated user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide it to the second type of simulated user so that it can mimic a knowledgeable user. Importantly, we tuned the simulated user so that it does not reveal the ground-truth answer directly, as the answer is usually found inside the human-written plan; instead, it is prompted to guide Magentic-UI indirectly. We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of the tasks in which Magentic-UI provides a final answer.
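
As a rough illustration of this setup, the snippet below sketches how the second type of simulated user might be prompted. The prompt wording, function name, and message format are assumptions made for illustration; they are not the exact configuration used in these experiments.

# Illustrative sketch of the simulated-user prompt; not the exact experimental setup.
SIMULATED_USER_SYSTEM_PROMPT = (
    "You are simulating a knowledgeable human user collaborating with a web agent. "
    "You have side information about the task in the form of a human-written plan. "
    "Use it to give feedback on the agent's plan and answer its questions, but never "
    "state the final answer directly; guide the agent indirectly instead."
)

def build_simulated_user_messages(task: str, side_information: str, agent_message: str):
    """Assemble the chat messages sent to the simulated-user LLM (GPT-4o in this case)."""
    return [
        {"role": "system", "content": SIMULATED_USER_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                f"Task: {task}\n"
                f"Human-written plan (side information): {side_information}\n"
                f"The agent says: {agent_message}\n"
                "Reply the way a knowledgeable user would."
            ),
        },
    ]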

On the validation subset of GAIA (162 tasks), we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode (without the simulated user), Magentic-UI with simulated user (1) (smarter model), Magentic-UI with simulated user (2) (side-information), and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM (GPT-4o) is used for Magentic-UI and Magentic-One.

Magentic-UI with the simulated user that has access to side information improves the accuracy of autonomous Magentic-UI by 71% in relative terms, from a 30.3% task-completion rate to a 51.9% task-completion rate ((51.9 - 30.3) / 30.3 ≈ 0.71). Moreover, Magentic-UI asks for help from the simulated user in only 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. In those tasks where it does ask for help, it asks for help on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6%, asking for help in only 4.3% of tasks and, in those tasks, asking for help an average of 1.7 times. This demonstrates the potential of even lightweight human feedback for improving the performance (e.g., task completion) of autonomous agents, especially at a fraction of the cost of people completing tasks entirely manually.

Learning and reusing plans

As described above, once Magentic-UI completes a task, users can ask Magentic-UI to learn a plan from the execution of that task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.

The user can select a plan from the plan gallery, which is displayed by clicking the Saved Plans button. Alternatively, as the user enters a task that closely matches a previous task, the saved plan is displayed even before the user has finished typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory (opens in new tab) to retrieve plans for similar tasks. Our preliminary evaluations show that this retrieval is highly accurate and that recalling a saved plan can be around 3x faster than generating a new one. Once a plan is recalled or generated, the user can accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand.
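
Conceptually, the retrieval step is a nearest-neighbor lookup over saved plans. The sketch below uses a generic embedding-similarity approach purely for illustration; Magentic-UI itself relies on AutoGen's Task-Centric Memory for this, and the embed function here is a stand-in.

# Generic illustration of plan retrieval by similarity; Magentic-UI itself uses
# AutoGen's Task-Centric Memory rather than this stand-in code.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_plan(new_task, saved_plans, embed, threshold=0.8):
    """Return the saved plan most similar to the new task, or None if nothing is
    similar enough (in which case Magentic-UI would generate a new plan)."""
    query = embed(new_task)
    best_plan, best_score = None, 0.0
    for task_text, plan in saved_plans.items():
        score = cosine_similarity(query, embed(task_text))
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan if best_score >= threshold else None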

Safety and control

Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:

  • Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface.
  • Anytime interruptions: At any point while Magentic-UI is completing a task, the user can interrupt it and stop any pending code execution or web browsing.
  • Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI (Figure 6).
  • Detection and approval of irreversible agent actions: Users can configure an action-approval policy (action guards) to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action (e.g., any button click) needs explicit user approval. Users must press an “Accept” or “Deny” button for each action. A minimal sketch of such a policy check follows this list.
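
Below is a minimal sketch of what an action-approval policy check could look like. The policy names, the set of irreversible actions, and the ask_user_approval callback are hypothetical; they illustrate the action-guard idea rather than Magentic-UI's actual configuration or API.

# Hypothetical illustration of an action guard; not Magentic-UI's actual API.
from enum import Enum

class ApprovalPolicy(Enum):
    NEVER = "never"                  # fully autonomous: no approvals requested
    IRREVERSIBLE = "irreversible"    # approve only potentially irreversible actions
    ALWAYS = "always"                # every action (even a button click) needs approval

IRREVERSIBLE_ACTIONS = {"submit_form", "make_purchase", "delete_file", "send_email"}

def action_allowed(action, policy, ask_user_approval):
    """Decide whether the agent may perform the action, asking the user when required.
    ask_user_approval presents Accept/Deny buttons and returns True on Accept."""
    if policy is ApprovalPolicy.ALWAYS:
        return ask_user_approval(action)
    if policy is ApprovalPolicy.IRREVERSIBLE and action in IRREVERSIBLE_ACTIONS:
        return ask_user_approval(action)
    return True  # no approval required under the current policy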

In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge its security and safety. These scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent (e.g., to execute risky code, access sensitive files, or perform actions on other websites). They also include scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information or granting permissions on impostor sites (e.g., a synthetic website that asks Magentic-UI to log in and enter Google credentials to read an article). In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is unable to complete the request because of Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.

We have also released transparency notes, which can be found at: https://github.com/microsoft/magentic-ui/blob/main/TRANSPARENCY_NOTE.md (opens in new tab)

Open research questions 

Magentic-UI provides a tool for researchers to study critical questions in agentic systems, particularly around human-agent interaction. In a previous report (opens in new tab), we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how we enable humans to efficiently intervene and provide feedback to the agent while it is executing a task. Humans should not have to constantly watch the agent; ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question concerns safety. As agents interact with the live web, they may become targets of attacks from malicious actors. We need to study what safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help study.

Conclusion

Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand the scope of tasks they can complete, Magentic-UI’s design provides better transparency into agent actions and enables human control to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing the overall human cost of completing tasks. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open, with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains.


Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers

AI Revolution podcast | Episode 5 - Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers | outline illustration of Carey Goldberg, Peter Lee, and Dr. Isaac (Zak) Kohane

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, Lee reunites with his coauthors Carey Goldberg (opens in new tab) and Dr. Zak Kohane (opens in new tab) to review the predictions they made and reflect on what has and hasn’t materialized based on discussions with the series’ early guests: frontline clinicians, patient/consumer advocates, technology developers, and policy and ethics thinkers. Together, the coauthors explore how generative AI is being used on the ground today—from clinical note-taking to empathetic patient communication—and discuss the ongoing tensions around safety, equity, and institutional adoption. The conversation also surfaces deeper questions about values embedded in AI systems and the future role of human clinicians.


Learn more

Compared with What? Measuring AI against the Health Care We Have (opens in new tab) (Kohane) 
Publication | October 2024 

Medical Artificial Intelligence and Human Values (opens in new tab) (Kohane) 
Publication | May 2024 

Managing Patient Use of Generative Health AI (opens in new tab) (Goldberg) 
Publication | December 2024 

Patient Portal — When Patients Take AI into Their Own Hands (opens in new tab) (Goldberg) 
Publication | April 2024 

To Do No Harm — and the Most Good — with AI in Health Care (opens in new tab) (Goldberg) 
Publication | February 2024 

This time, the hype about AI in medicine is warranted (opens in new tab) (Goldberg) 
Opinion article | April 2023 

The AI Revolution in Medicine: GPT-4 and Beyond   
Book | Peter Lee, Carey Goldberg, Isaac Kohane | April 2023

Transcript

[MUSIC]     

[BOOK PASSAGE]  

PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.” 

[END OF BOOK PASSAGE]  

[THEME MUSIC]     

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.     

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?      

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.


[THEME MUSIC FADES]  

The passage I read at the top is from the book’s prologue.   

When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.  

So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.  

So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.  

We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.  

[TRANSITION MUSIC] 

Carey, Zak, I’m really looking forward to this. 

CAREY GOLDBERG: It’s nice to see you, Peter.  

LEE: [LAUGHS] It’s great to see you, too. 

GOLDBERG: We missed you. 

ZAK KOHANE: The dynamic gang is back. [LAUGHTER] 

LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other. [LAUGHTER] 

In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.  

I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism in—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you? 

KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology—ARPA, Microsoft, and others—had made in the past. And yet, you were telling me a full science fiction compliant story [LAUGHTER] that something that we thought was 30 years away was happening now.  

LEE: Mm-hmm. 

KOHANE: And it was very hard for me to put together. And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible? 

LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have [LAUGHS] been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.  

After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”  

And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style [LAUGHTER] that no one would be able to read.  

So it was your idea to recruit Carey, I think, right? 

KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing. 

LEE: So yeah. So, Carey, we contact you. How did that go? 

GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then [LAUGHTER] we’ll actually get somewhere.”  
 
So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. [LAUGHTER] And it’s fun to think about it now. 

LEE: Yeah. I think ultimately, you know, recruiting Carey, you were [LAUGHS] so important because you basically went through every single page of this book and made sure … I remember, in fact, it’s affected my writing since because you were coaching us that every page has to be a page turner. There has to be something on every page that motivates people to want to turn the page and get to the next one. 

KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.  

GOLDBERG: [LAUGHTER] No way! Really?  

KOHANE: Yes way. Yes way. Yes way. 

GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak. [LAUGHS] 

LEE: All right. So I think we have to get down to work [LAUGHS] now.  

Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems. 

And, you know, it was fun to talk to them. And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.   

What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI. 

KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier. 

But there were also some nonobvious results, which is on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there was any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.  

And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.  

LEE: Yeah. 

KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …  

LEE: Yeah. 

KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect. 

And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.  

LEE: Carey, what about you? 

GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids. 

So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions. 

LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI.

[TRANSITION MUSIC] 

So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year. 

LEE: [LAUGHS] So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now? 

GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. And in fact, you even wrote that it was helping you be a better person, right. [LAUGHS] So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.  

And I gather now we’re even entering a period when people are complaining of sycophantic models, [LAUGHS] where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be. 

LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.  

KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight. [LAUGHTER] I was bringing references. It was impressive. But in fact, it made me trust it more. 

LEE: Yeah. 

KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.   

And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s more of a, potentially, more of a signaled virtue than an actual virtue. 

GOLDBERG: Nicely put. 

LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different. 

And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?  

One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book? 

GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient. 

So actually, Peter, I have a few questions for you … 

LEE: Yeah. Mm-hmm. 

GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare? 

LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.  

So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes. 

But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that? 

KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR [electronic health record] smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?  

And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. [LAUGHTER] And he had just proposed to me a few weeks before an expensive procedure. 

And so the question is what, effectively, is motivating these models?  

LEE: Yeah, yeah.  

KOHANE: And so with several colleagues, I published a paper (opens in new tab), basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, you are a pediatric endocrinologist. 

Should this patient receive growth hormone? And it did a very good job explaining why the patient should receive growth hormone.  

GOLDBERG: Should. Should receive it.  

KOHANE: Should. And then we asked, in a separate session, you are working for the insurance company. Should this patient receive growth hormone? And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving kids who are not growth hormone deficient, growth hormone gives only a couple of inches over many, many years, has all sorts of other issues. But here’s the point, we had 180-degree change in decision-making because of the prompt. And for that patient, tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making.  

LEE: Hmm. Yeah. 

KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is aptly central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose? 

LEE: Yeah, yeah. That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember? 

GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it. 

LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, it never, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI.  

You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do. 

GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent as far as I can tell is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were a physician or a specialist and get answers, which you can’t do yet as far as I can tell. 

LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray. 

LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer [LAUGHS] for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey. 

GOLDBERG: It’s very. [LAUGHS] Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps? 

LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.  

Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way. 

LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests? 

KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not being able to bill for the use of the AI, but this technology is so compelling to the doctors that despite everything, it’s overtaking the traditional dictation-typing routine. 

LEE: Yeah. 

GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.  

And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information. 

LEE: Yeah. 

KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, the doctors were being incredibly meticulous. [LAUGHTER] So I think … this is a Hawthorne effect, because you know you’re being monitored. And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse. 

LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong.

I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.  

And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it. 

And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that AI scribes, an AI scribe would not have done that. 

GOLDBERG: Well, depending what values you programmed into it, right, Zak? [LAUGHS] 

KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …  

LEE: Right. 

KOHANE: what’s going to happen? And so I think this issue of values … you’re right. Right now, they’re actually much more impartial. But there are going to be business plans just around aligning these things towards healthcare. In many ways, this is why I think we wrote the book so that there should be a public discussion. And what kind of AI do we want to have? Whose values do we want it to represent? 

GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that? 

Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you? 

LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams? [LAUGHS]  

GOLDBERG: Right, right, right, yeah. 

LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think. 

Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.  

Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on? 

KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of the Epic, even though it’s better in that respect than many EHRs, is horrible user interface. 

The number of clicks that you have to go to get to something. And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else, it does it so you don’t have to do the clicks. Because it’s something really soul sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?  

I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey? 

GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, [LAUGHTER] and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.  

And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone. 

LEE: Yeah. One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR [electronic health record] system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that? 

KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other [LAUGHTER] in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that.  

And that really happened. And I think that, yes, doctors do think, and I do think that we are insufficiently using yet the artificial intelligence currently in the ambient dictation mode as much more of an independent agent saying, did you think about that? 

I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists.  

LEE: Yeah. Mm-hmm. I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.  

LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.  

And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder. 

KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.  

And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter? 

LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.  

There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing. You go through a big database, and you figure it all out. And make a very nicely structured database the AI can then consult with later.  

And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine.   

[TRANSITION MUSIC] 

Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.  

Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer. 

LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment—I actually took it as a complaint—that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this. 

GOLDBERG: [LAUGHTER] Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good.  

And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others.  

But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways like that any discharge instructions can be instantly beautifully translated into a patient’s native language and so on. But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for.

LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog (opens in new tab). So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about? 

GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening. 

LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens. 

KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it. 

GOLDBERG: Yes, that’s true. 

KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago.  

We actually mentioned the problem. I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get for my own faculty—residents at MGH [Massachusetts General Hospital] can’t get a primary care doctor. And so … 

LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible. 

KOHANE: Yeah, and so therefore … and I wrote an article about this in the NEJM [New England Journal of Medicine] (opens in new tab) that medicine is in such dire trouble that we have incredible technology, incredible cures, but where the rubber hits the road, which is at primary care, we don’t have very much.  

And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?”  
 
This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that. 

LEE: Yeah. 

GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health. 

KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a $5,000- to $10,000-a-year-minimum investment. That’s inequity. 

LEE: When we wrote our book, you know, the idea that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better and maybe would be necessary to do that. But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask. 

KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG [electroencephalogram]. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. [LAUGHTER] Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine. 

GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.” [LAUGHS] 

So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models are actually able, I would think, are able to do that because they are broad. So what are we heading towards do you think? What should the next book be … The end of doctors? [LAUGHS] 

KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions.  

And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in the way Hassabis says, God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those da Vinci machines, that stuff can be, I think, automated.  

And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors. 

But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions.  

LEE: Yeah. Yeah, I tend to think that, you know, there are going to be always a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it? There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust. 

And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy: 

LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do you think, Carey? 

GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer. 

LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places? 

KOHANE: So again, all these large healthcare systems are in tough shape. MGB [Mass General Brigham] would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.  

And so we’ve created what used to be a very successful beast, the health center. But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin.  

LEE: Yeah. 

KOHANE: And so I think we’re all waiting for one of the two things to happen, either a new kind of healthcare delivery system being generated or ultimately one of these systems learns how to disrupt itself.  

LEE: Yeah. All right. I think we have to move on to Episode 4. And, you know, when it came to the question of regulation, I think this is … my read is when we were writing our book, this is the part that we struggled with the most.  

GOLDBERG: We punted. [LAUGHS] We totally punted to the AI. 

LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her. 

LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear. 

GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it? 

LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are [a] human being and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world.  

And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three years of medical or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important. 

I don’t know. Am I wrong about that? 

KOHANE: So it’s compared with what issue? Because we know for a fact that doctors who do a lot of a procedure, like do this procedure, like high-risk deliveries all the time, have better outcomes than ones who only do a few high risk. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the … 

GOLDBERG: … the attending. [LAUGHTER] 

KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. [LAUGHTER] And so it might be the best outcome if we actually, just like for human beings, we say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, just let it go. We just don’t have practically speaking, we don’t have the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be [in some sense] a lot easier. I don’t know. 

LEE: Yeah.  

GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that thank goodness Cambridge put out, the city of Cambridge, put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here.  

If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would.  

LEE: Yeah. 

GOLDBERG: Zak. 

KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation.  

And so I’m mindful of a conference that both of you attended, the RAISE conference [Responsible AI for Social and Ethical Healthcare] (opens in new tab). The Europeans at that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.  

And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.  

GOLDBERG: It’s true. 

KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”  

LEE: Yeah.  

KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet. 

GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast that something that makes sense today won’t in six months. So, what do you do about that?

LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.  

But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.  

GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope. 

LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.  

And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school? 

And the answer is “doctor.” 

KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI.  

So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring.  

We haven’t put out the paper yet, but in computer science, there’s, well, in programming, we know very well the value for understanding how our computer systems work.  

And there was a guy by the name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually. 

LEE: Yes. 

KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, [we] can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors.  

GOLDBERG: Sounds great. 

LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford.

LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. I think, for me, I expected to see a lot more of that than we are seeing. 

KOHANE: Yeah, it goes back to our conversation about Epic or competitor Epic doing that. I think we’re going to see that having oversight over all medical orders, all orders in the system, critique, real-time critique, where we’re both aware of alert fatigue. So we don’t want to have too many false positives. At the same time, knowing what are critical errors which could immediately affect lives. I think that is going to become in terms of—and driven by quality measures—a product. 

GOLDBERG: And I think word will spread among the general public that kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives are, well, who’s with them? Right?  

LEE: Yeah. Yup. 

GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical …  

KOHANE: By the way, that country is called the United States. 

GOLDBERG: Yes, that’s true. [LAUGHS] It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not? 

LEE: Yeah. Yeah. Last topic, just for this Episode 4. Roxana, of course, I think really made a name for herself in the AI era writing, actually just prior to ChatGPT, you know, writing some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts? 

KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible … 

LEE: Yeah, yeah.  

KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely. 

I think that we’re in a much better situation. 

GOLDBERG: Agree. 

LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak.  

KOHANE: There is. 

LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—she and her collaborators—showing AI’s ability to detect, to spot biased decision-making by others. Are we going to see more of that? 

KOHANE: Oh, yeah, I was very pleased when, in NEJM AI [New England Journal of Medicine Artificial Intelligence], we published a piece with Marzyeh Ghassemi (opens in new tab), and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect.  

And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune and de-bias in a variety of ways, they can do a lot better and, in fact, I think are much more … a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era. 

GOLDBERG: And thinking about Roxana’s dermatological work on how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right. 

LEE: Yeah. Yeah, yeah. 

GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward. 

LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. [LAUGHS] I don’t think any of us could really have predicted really the extent of change in the world.  

[TRANSITION MUSIC] 

So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did. 

[THEME MUSIC] 

And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more.  

We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast (opens in new tab) or wherever you listen to your favorite podcasts.   

Until next time.  

[MUSIC FADES]


The post Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers appeared first on Microsoft Research.

Read More

Predicting and explaining AI model performance: A new approach to evaluation

Predicting and explaining AI model performance: A new approach to evaluation

Radar chart comparing the performance of AI models (Babbage-002, Davinci-002, GPT-3.5-Turbo, GPT-4.0, OpenAI o1-mini, and OpenAI o1) across ability scales including VO, AS, CEc, CEe, CL, MCr, MCt, MCu, MS, QLI, QLqA, SNs, KNa, KNc, KNF, KNn, and AT.

With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.

In the paper, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model’s capabilities.

ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.

By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail—linking outcomes to specific strengths or limitations.

The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability. For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works—from rating task requirements to generating ability profiles.

Figure 1: This diagram presents a framework for explaining and predicting AI performance on new tasks using cognitive demand profiles. The System Process (top) evaluates an AI system on the ADeLe Battery—tasks annotated with DeLeAn rubrics—to create an ability profile with each dimension representing what level of demand the model can reach. The Task Process (bottom) applies the same rubrics to new tasks, generating demand profiles from annotated inputs. An optional assessor model can be trained to robustly predict how well the AI system will perform on these new tasks by matching system abilities to task demands.
Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.
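
As a rough sketch of how the matching described above could work (this is an illustration, not the authors' implementation), the snippet below combines a task's 0–5 demand ratings with a model's ability profile and maps the worst ability-minus-demand margin through a logistic link to estimate the probability of success. The dimension names, the worst-margin rule, and the slope are all assumptions.

```python
import math

# Hypothetical 0-5 demand ratings for a task on a few of the 18 ADeLe dimensions.
task_demands = {"reasoning": 4, "formal_knowledge": 2, "attention": 3}

# Hypothetical ability profile: the demand level at which the model succeeds
# about 50% of the time on each dimension.
model_abilities = {"reasoning": 3.5, "formal_knowledge": 4.0, "attention": 4.5}

def success_probability(demands, abilities, slope=1.5):
    """Estimate P(success) from the worst ability-minus-demand margin.

    Assumes the most under-supported dimension dominates and maps its margin
    through a logistic link; both choices are illustrative, not from the paper.
    """
    margins = [abilities[dim] - level for dim, level in demands.items()]
    worst = min(margins)
    return 1.0 / (1.0 + math.exp(-slope * worst))

print(f"Estimated P(success) = {success_probability(task_demands, model_abilities):.2f}")
```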

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.

Evaluation results 

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) Current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task. 

1. Revealing hidden flaws in AI testing methods 

Many popular AI tests either don’t measure what they claim or only cover a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, like specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, only includes medium-difficulty questions—missing both simple and complex challenges. 

2. Creating detailed AI ability profiles 

Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs. For each of the 18 abilities measured, they plotted “subject characteristic curves” to show how a model’s success rate changes with task difficulty.  

They then calculated a score for each ability—the difficulty level at which a model has a 50% chance of success—and used these results to generate radial plots showing each model’s strengths and weaknesses across the different scales and levels, illustrated in Figure 2.

Figure 2: Three radar charts showing the ability profiles of the 15 LLMs evaluated across the 18 ability scales (higher values indicate a more capable model). The left chart covers Babbage-002, Davinci-002, GPT-3.5-Turbo, GPT-4, OpenAI o1-mini, and OpenAI o1; the middle chart covers the LLaMA models (LLaMA-3.2-1B-Instruct, LLaMA-3.2-3B-Instruct, LLaMA-3.2-11B-Instruct, LLaMA-3.2-90B-Instruct, and LLaMA-3.1-405B-Instruct); the right chart covers the DeepSeek-R1-Dist-Qwen models (1.5B, 7B, 14B, and 32B).
Figure 2. Ability profiles for the 15 LLMs evaluated.
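
To make the ability-score computation above concrete, here is a minimal sketch, under illustrative assumptions, of fitting a logistic "subject characteristic curve" to synthetic (difficulty, success) outcomes for one ability and reading off the difficulty at which the success rate crosses 50%. The synthetic data and the use of scikit-learn's logistic regression are choices for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic outcomes for one ability dimension: task difficulty (0-5)
# and whether the model answered correctly (1) or not (0).
difficulty = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]).reshape(-1, 1)
correct = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Fit a logistic curve: P(correct) as a function of difficulty.
clf = LogisticRegression().fit(difficulty, correct)

# The ability score is the difficulty where P(correct) = 0.5,
# i.e., where the logit crosses zero: -(intercept / coefficient).
ability_score = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Ability score (50% success threshold): {ability_score:.2f}")
```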

This analysis revealed the following: 

  • When measured against human performance, AI systems show different strengths and weaknesses across the 18 ability scales. 
  • Newer LLMs generally outperform older ones, though not consistently across all abilities. 
  • Knowledge-related performance depends heavily on model size and training methods. 
  • Reasoning models show clear gains over non-reasoning models in logical thinking, learning and abstraction, and social capabilities, such as inferring the mental states of their users. 
  • Increasing the size of general-purpose models after a given threshold only leads to small performance gains. 

3. Predicting AI success and failure 

In addition to evaluation, the team created a practical prediction system based on demand-level measurements that forecasts whether a model will succeed on specific tasks, even unfamiliar ones.  

The system achieved approximately 88% accuracy in predicting the performance of popular models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to anticipate potential failures before deployment, adding the important step of reliability assessment for AI models.

Looking ahead

ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance. It aligns with the vision laid out in a previous Microsoft position paper on the promise of applying psychometrics to AI evaluation and a recent Societal AI white paper emphasizing the importance of AI evaluation.

As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment. The research team is working toward building a collaborative community to strengthen and expand this emerging field.

The post Predicting and explaining AI model performance: A new approach to evaluation appeared first on Microsoft Research.

Read More

Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv

Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv

Illustrated headshots of Hongxia Hao (left) and Bing Lv (right).

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, senior researcher Hongxia Hao (opens in new tab) and physics professor Bing Lv (opens in new tab) join host Gretchen Huizinga to talk about how they are using deep learning techniques to probe the upper limits of heat transfer in inorganic crystals, discover novel materials with exceptional thermal conductivity, and rewrite the rulebook for designing high-efficiency electronics and sustainable energy.

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers.

Today I’m talking to two researchers, Hongxia Hao, a senior researcher at Microsoft Research AI for Science, and Bing Lv, an associate professor in physics at the University of Texas at Dallas. Hongxia and Bing are co-authors of a paper called Probing the Limit of Heat Transfer in Inorganic Crystals with Deep Learning. I’m excited to learn more about this! Hongxia and Bing, it’s great to have you both on Abstracts!


HONGXIA HAO: Nice to be here.

BING LV: Nice to be here, too.

HUIZINGA: So Hongxia, let’s start with you and a brief overview of this paper. In just a few sentences. Tell us about the problem your research addresses and more importantly, why we should care about it.

HAO: Let me start with a very simple yet profound question. What’s the fastest the heat can travel through a solid material? This is not just an academic curiosity, but a question that touches the foundation of how we build the technologies around us. From the moment when you tap your smartphone, and the moment when the laptop is turned on and functioning, heat is always flowing. So we’re trying to answer the century-old mystery of the upper limit of heat transfer in solids. We care about this not just because it’s a fundamental problem in physics and material science, but because solving it could really rewrite the rulebook for designing high-efficiency electronics, sustainable energy, and more. And nowadays, with very cutting-edge nanometer chips and very fancy technologies, we are packing more computing power into smaller spaces, but the faster and denser we build, the harder it becomes to remove the heat. So in many ways, thermal bottlenecks, not just transistor density, are now the ceiling of Moore’s Law. And the stakes are enormous. We really wish to bring more thermal solutions by finding more high-thermal-conductivity candidates from the perspective of materials discovery with the help of AI.

LV: So I think one of the biggest things, as Hongxia said, is that thermal solutions will eventually become a bottleneck for all types of heterogeneous integration of materials. From this perspective, previously thermal was always the last problem to solve. But now people more and more realize that these things have to be considered upfront. This co-design becomes very important. So I think what we are doing right now, integrating AI to help identify the large space of materials and to identify fundamentally what the limit of these materials will be, will become very important for society.

HUIZINGA: Hmm. Yeah. Hongxia, did you have anything to add to that?

HAO: Yes, so previously many people are working on exploring these material science questions through experimental tradition and the past few decades people see a new trend using computational materials discovery. Like for example, we do the fundamental solving of the Schrödinger equation using Density Functional Theory [DFT]. Actually, this brings us a lot of opportunities. The question here is, as the theory is getting more and more developed, it’s too expensive for us to make it very large scale and to study tons of materials. Think about this. The bottleneck here, now, is not just about having a very good theory, it’s about the scale. So, this is where AI, specifically now we are using deep learning, comes into play.

HUIZINGA: Well, Hongxia, let’s stay with you for a minute and talk about methodology. How did you do this research and what was the methodology you employed?

HAO: So here, for this question, we built a pipeline that spans AI, quantum mechanics, and computational brute force with a blend of efficiency and accuracy. It begins with generating an enormous chemical and structural design space, inspired by Slack’s principle. We focus first on simple crystals, since these are the systems most likely to have low anharmonicity, fewer phonon scattering events, and therefore potentially high thermal conductivities. But we didn’t stop there. We also included a huge pool of more complex and higher-energy structures to ensure diversity and avoid bias. And for each candidate, we first run a structure relaxation using MatterSim, which is a deep learning foundation model for materials science that we use to characterize the properties of materials. And we use that to screen for dynamic stability; about 200K structures passed this filter. And then came another real challenge: calculating the thermal conductivity. We solve this problem using the Boltzmann transport equation and the three-phonon scattering process. The twist here is that all of this was done not by traditional DFT solvers but with our deep learning model, MatterSim. It’s trained to predict energy, force, and stress, and we can get second- and third-order interatomic force constants directly from it, which guarantees the accuracy of the solution. And finally, to validate the model’s predictions, we performed full DFT-based calculations on the top candidates that we found, some of which even include higher-order scattering mechanisms, electron-phonon coupling effects, etc. And this rigorous validation gave us confidence in the speed and accuracy trade-offs and revealed a spectrum of materials that had either previously been overlooked or were never before conceived.

HUIZINGA: So Bing, let’s talk about your research findings. How did things work out for you on this project and what did you find?

LV: I think one of the biggest things about this paper is that it creates a very large material database. Basically, you can say it’s a smart database, which eventually will be made accessible to the public. I think that’s a big achievement, because people who have to look into it can actually go search the Microsoft database and find out, oh, this material does have this type of thermal property. This database contains about 230,000 materials. And one of the things we confirm is the highest-thermal-conductivity material: based on all the wisdom of the Slack criteria, diamond was predicted to have the highest thermal conductivity, and we more or less very solidly prove that diamond, at this stage, will remain the material with the highest thermal conductivity. We have a lot of new materials, exotic materials, which Hongxia can elaborate on a little bit more. Some of them have very exotic combinations of properties, thermal together with other properties, which could provide new insight for new physics development, new material development, and new device perspectives. All of this combined will actually have a very profound impact on society.

HUIZINGA: Yeah, Hongxia, go a little deeper on that because that was an interesting part of the paper when you talked about diamond still being the sort of “gold standard,” to mix metaphors! But you’ve also found some other materials that are remarkable compared to silicon.

HAO: Yeah, yeah. Among this search space, even though we didn’t find something higher than diamond, we did discover more than twenty new materials with thermal conductivity exceeding that of silicon. And silicon is the benchmark we want to compare with because it’s the backbone of modern electronics. More interestingly, I think, is the manganese vanadium compound. It shows some very interesting and surprising phenomena: it’s a metallic compound, but with very high lattice thermal conductivity. And this is the first time it has been discovered, through our search, and it’s something that could not easily be discovered without the help of AI. And right now, I think Bing can explain more on this and show some interesting results.

HUIZINGA: Yeah, go ahead Bing.

LV: So this is actually very surprising to me as an experimentalist, because when Hongxia presented their theory work to me, this material, magnesium vanadium, was discovered back in 1938, almost 100 years ago, but there are no more than twenty papers talking about it! A lot of them were on theory, okay, not even on the experimental part. We actually did quite a bit of work on this. We are in the process of characterizing it and then moving forward to the thermal conductivity measurements. So that will hopefully add to the value of these things, showing that, hey, AI does help predict materials and could really generate new materials with very good, high thermal conductivity.

HUIZINGA: Yeah, so Bing, stay with you for a minute. I want you to talk about some kind of real-world applications of this. I know you alluded to a couple of things, but how is this work significant in that respect, and who might be most excited about it, aside from the two of you? [LAUGHS]

LV: So I think, as I mentioned before, the first thing is this database. I believe that’s the first-ever large material database regarding thermal conductivity. And it has, as I said, 230,000 materials with AI-predicted thermal conductivity. This will provide not only science but engineering with a vastly expanded catalog of candidate materials for the future roadmap of material integration and all the bottlenecks we are talking about, the thermal solutions for semiconductors or even beyond semiconductor integration. People will actually have a database to look through. So this will become very important, and I believe over the long term it will have a lasting impact on the research community and on the development of society.

HUIZINGA: Yeah. Hongxia, did you have anything to add to that one too?

HAO: Yeah, so this study reshapes how we think about limits. I like the sentence that the only way to discover the limits of possible is to go beyond them into the impossible. In this case, we tried, but we didn’t break the diamond limit. But we proved it even more rigorously than ever before. In doing so, we also uncovered some uncharted peaks in the thermal conductivity landscape. This would not happen without new AI capabilities for material science. I think in the long run, I believe researchers could benefit from using this AI design and shift their way on how to do materials research with AI.

HUIZINGA: Yeah, it’ll be interesting to see if anyone ever does break the diamond limit with the new tools that are available, but…

HAO: Yeah!

HUIZINGA: So this is the part of the abstracts podcast where I like to ask for sort of a golden nugget, a one sentence takeaway that listeners might get from this paper. If you had one Hongxia, what would it be? And then I’ll ask Bing to maybe give his.

HAO: Yes. AI is no longer just a tool. It’s becoming a critical partner for us in scientific discovery. So our work proved that the large-scale data-driven science can now approach long-standing and fundamental questions with very fresh eyes. When trained well, and guided with physical intuition, models like MatterSim can really realize a full in-silico characterization for materials and don’t just simulate some known materials, but really trying to imagine what nature hasn’t yet revealed. Our work points to a path forward, not just incrementally better materials, but entirely new class of high-performance compounds where we could never have guessed without AI.

HUIZINGA: Yeah. Bing, what’s your one takeaway?

LV: I think I want to add a few things on top of Hongxia’s comments, because I think Hongxia used some very good, critical words I would like to emphasize. When we train the AI well, if we guide the AI well, it can be very useful and become our partner. So, all in all, our human intellectual merit here is still going to play a significantly important role, okay? We are generating this AI; we should really train it, and we should be using our human intellectual merit to guide it to be useful for the advancement of human society. Now, with all these AI tools, I think it’s a very golden time right now. Experimentalists can work very closely with theorists like Hongxia, who has very good intellectual merit, and now, incorporating AI and combining all the pieces together, hopefully we’re really able to accelerate material discovery at a much faster pace than ever, which the whole society will eventually benefit from.

HUIZINGA: Yeah. Well, as we close, Bing, I want you to go a little further and talk about what’s next then, research wise. What are the open questions or outstanding challenges that remain in this field and what’s on your research agenda to address them?

LV: So first of all, I think this paper primarily addresses crystalline, ordered, inorganic bulk materials, and with the conditions we are targeting, ambient pressure and room temperature, because that’s normally how instruments work, right? But what if we go to extreme conditions? We want to go to space, right? There we’ll have extreme conditions, sometimes very cold, sometimes very hot. We have some places with quite high pressure, or conditions that are highly radioactive. Under those conditions, a new database could emerge. Can we do something beyond that? Another important thing is that this paper targets high thermal conductivity. What about extremely low thermal conductivity? Those questions will bring a very good challenge for theorists and also for the machine learning approach. I think that’s something Hongxia is probably very excited to work on in that direction. I know, since she’s ambitious, she wants to do something beyond what we have achieved so far.

HUIZINGA: Yeah, so Hongxia, how would you encapsulate what your dream research is next?

HAO: Yeah, so I think besides all of these exciting research directions, on my end another direction that is perhaps kind of exciting is that we want to move from search to design. Right now we are kind of good at asking what exists, by just doing forward prediction and brute force. But with generative AI, we can start asking what should exist. In the future, we can combine forward prediction and backward generative design to really tackle these questions: if you want materials with certain desired properties, how would you design them?

HUIZINGA: Well, it sounds like there’s a full plate of research agenda goodness going forward in this field, both with human brains and AI. So, Hongxia Hao and Bing Lv, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read a pre-print of it on arXiv. See you next time on Abstracts!

The post Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv appeared first on Microsoft Research.

Read More

Research Focus: Week of May 7, 2025

Research Focus: Week of May 7, 2025

In this issue:

New research on compound AI systems and smart casual verification of the Confidential Consortium Framework; the release of Phi-4-reasoning; enriching tabular data with semantic structure; and more.

Research Focus: May 07, 2025

Towards Resource-Efficient Compound AI Systems

Unlike the current state-of-the-art, our vision is fungible workflows with high-level descriptions, managed jointly by the Workflow Orchestrator and Cluster Manager. This allows higher resource multiplexing between independent workflows to improve efficiency.

This research introduces Murakkab, a prototype system built on a declarative workflow that reimagines how compound AI systems are built and managed to significantly improve resource efficiency. Compound AI systems integrate multiple interacting components like language models, retrieval engines, and external tools. They are essential for addressing complex AI tasks. However, current implementations fall short on resource efficiency: application logic is tightly coupled to execution details, the orchestration and resource-management layers are poorly connected, and there are gaps between efficiency and quality.

Murakkab addresses critical inefficiencies in current AI architectures and offers a new approach that unifies workflow orchestration and cluster resource management for better performance and sustainability. In preliminary evaluations, it demonstrates speedups up to ∼ 3.4× in workflow completion times while delivering ∼ 4.5× higher energy efficiency, showing promise in optimizing resources and advancing AI system design.


Smart Casual Verification of the Confidential Consortium Framework

Diagram showing the components of the verification architecture for CCF's consensus. The diagram is discussed in detail in the paper.

This work presents a new, pragmatic verification technique that improves the trustworthiness of distributed systems like the Confidential Consortium Framework (CCF) and proves its effectiveness by catching critical bugs before deployment. Smart casual verification is a novel hybrid verification approach to validating CCF, an open-source platform for developing trustworthy and reliable cloud applications which underpins Microsoft’s Azure Confidential Ledger service. 

The researchers apply smart casual verification to validate the correctness of CCF’s novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. This hybrid approach combines the rigor of formal specification and model checking with the pragmatism of automated testing, specifically binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods are often one-off efforts by domain experts, the researchers have integrated smart casual verification into CCF’s continuous integration pipeline, allowing contributors to continuously validate CCF as it evolves. 
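
The idea of binding a specification to a running implementation can be illustrated with a small trace-checking sketch. This is not CCF's actual tooling; it only shows, under simplifying assumptions, how execution traces emitted by an implementation could be replayed against the allowed transitions of a toy state-machine model inside a CI test.

```python
# A toy state-machine "specification": which state transitions are allowed.
ALLOWED_TRANSITIONS = {
    ("follower", "request_vote"): "candidate",
    ("candidate", "win_election"): "leader",
    ("candidate", "lose_election"): "follower",
    ("leader", "step_down"): "follower",
}

def check_trace(trace, initial_state="follower"):
    """Replay an implementation trace against the spec.

    `trace` is a list of event names emitted by the implementation.
    Returns (ok, final_state); ok is False at the first disallowed event.
    """
    state = initial_state
    for event in trace:
        key = (state, event)
        if key not in ALLOWED_TRANSITIONS:
            return False, state
        state = ALLOWED_TRANSITIONS[key]
    return True, state

# Example: a trace the spec accepts, and one it rejects.
print(check_trace(["request_vote", "win_election", "step_down"]))  # (True, 'follower')
print(check_trace(["win_election"]))                               # (False, 'follower')
```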


Phi-4-reasoning Technical Report


This report introduces Phi-4-reasoning (opens in new tab), a 14-billion parameter model optimized for complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 using a carefully curated dataset of high-quality prompts and reasoning demonstrations generated by o3-mini. These prompts span diverse domains—including math, science, coding, and spatial reasoning—and are selected to challenge the base model near its capability boundaries.

Building on recent findings that reinforcement learning (RL) can further improve smaller models, the team developed Phi-4-reasoning-plus, which incorporates an additional outcome-based RL phase using verifiable math problems. This enhances the model’s ability to generate longer, more effective reasoning chains. 
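
To give a flavor of what an outcome-based reward on verifiable math problems can look like, here is a minimal sketch of a reward function that extracts a final boxed answer from a generated reasoning chain and compares it to the reference solution. The \boxed{} convention and the binary scoring are assumptions for illustration and are not taken from the technical report.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a reasoning chain."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Binary outcome-based reward: 1.0 if the final answer matches, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference_answer.strip() else 0.0

# Example usage with a hypothetical model completion.
completion = "First compute 12 * 7 = 84, then subtract 4. \\boxed{80}"
print(outcome_reward(completion, "80"))  # 1.0
```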

Despite its smaller size, the Phi-4-reasoning family outperforms significantly larger open-weight models such as DeepSeekR1-Distill-Llama-70B and approaches the performance of full-scale frontier models like DeepSeek R1. It excels in tasks requiring multi-step problem solving, logical inference, and goal-directed planning.

The work highlights the combined value of supervised fine-tuning and reinforcement learning for building efficient, high-performing reasoning models. It also offers insights into training data design, methodology, and evaluation strategies. Phi-4-reasoning contributes to the growing class of reasoning-specialized language models and points toward more accessible, scalable AI for science, education, and technical domains.


TeCoFeS: Text Column Featurization using Semantic Analysis

The workflow diagram illustrates the various steps in the TECOFES approach. Step 0 is the embedding computation module, which calculates embeddings for all rows of text, setting the foundation for subsequent steps. Step 2, the smart sampler, captures diverse samples and feeds them into the labeling module (step 3), which generates labels. These labels are then utilized by the Extend Mapping module (step 4) to map the remaining unlabeled data.

This research introduces a practical, cost-effective solution for enriching tabular data with semantic structure, making it more useful for downstream analysis and insights—which is especially valuable in business intelligence, data cleaning, and automated analytics workflows. This approach outperforms baseline models and naive LLM applications on converted text classification benchmarks.

Extracting structured insights from free-text columns in tables—such as product reviews or user feedback—can be time-consuming and error-prone, especially when relying on traditional syntactic methods that often miss semantic meaning. This research introduces the semantic text column featurization problem, which aims to assign meaningful, context-aware labels to each entry in a text column.

The authors propose a scalable, efficient method that combines the power of LLMs with text embeddings. Instead of labeling an entire column manually or applying LLMs to every cell—an expensive process—this new method intelligently samples a diverse subset of entries, uses an LLM to generate semantic labels for just that subset, and then propagates those labels to the rest of the column using embedding similarity.
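
A minimal sketch of that sample-label-propagate pattern is shown below. The embedding function and the LLM labeling call are stand-ins (any embedding model and chat endpoint could fill those roles), and k-means-based sampling plus nearest-labeled-neighbor propagation are illustrative choices rather than the paper's exact method.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):       # stand-in: any text-embedding model would work here
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 16))

def llm_label(text):    # stand-in: a call to an LLM with a labeling prompt
    return "positive" if "good" in text.lower() or "great" in text.lower() else "negative"

def featurize_column(texts, n_samples=3):
    vectors = embed(texts)

    # 1. Sample a diverse subset: the row closest to each k-means centroid.
    km = KMeans(n_clusters=n_samples, random_state=0).fit(vectors)
    sampled = [int(np.argmin(np.linalg.norm(vectors - c, axis=1))) for c in km.cluster_centers_]

    # 2. Label only the sampled rows with the LLM.
    labels = {i: llm_label(texts[i]) for i in sampled}

    # 3. Propagate: every row takes the label of its nearest labeled neighbor.
    sampled_vecs = vectors[sampled]
    return [labels[sampled[int(np.argmin(np.linalg.norm(sampled_vecs - v, axis=1)))]]
            for v in vectors]

print(featurize_column(["good battery life", "bad screen", "good value",
                        "poor support", "great camera"]))
```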


Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning


This work introduces ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a new paradigm for LLM reasoning that expands beyond traditional language-only inference. 

While LLMs have made considerable strides in complex reasoning tasks, they remain limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this research, ARTIST brings together agentic reasoning, reinforcement learning (RL), and tool integration, and is designed to enable LLMs to autonomously decide when and how to invoke external tools within multi-turn reasoning chains. ARTIST leverages outcome-based reinforcement learning to learn robust strategies for tool use and environment interaction without requiring step-level supervision.
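
The multi-turn loop that interleaves model reasoning with tool calls can be sketched as follows. The tag format, the generate stub, and the single calculator tool are assumptions for illustration; ARTIST's actual rollout format, tool set, and training loop may differ.

```python
import re

TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}  # toy tool, trusted input only

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would query the policy model here."""
    if "<result>" not in prompt:
        return 'I need to compute this. <tool name="calculator">12*7+6</tool>'
    return "The answer is <answer>90</answer>"

def agentic_rollout(question: str, max_turns: int = 4) -> str:
    context = question
    for _ in range(max_turns):
        output = generate(context)
        context += "\n" + output
        call = re.search(r'<tool name="(\w+)">(.*?)</tool>', output)
        if call:                                       # model chose to invoke a tool
            name, args = call.group(1), call.group(2)
            result = TOOLS[name](args)
            context += f"\n<result>{result}</result>"  # feed the observation back
            continue
        answer = re.search(r"<answer>(.*?)</answer>", output)
        if answer:             # terminal answer; an outcome-based reward would be
            return answer.group(1)  # computed against it here during RL training
    return ""

print(agentic_rollout("What is 12 * 7 + 6?"))  # 90
```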

Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies show that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions.


Materialism Podcast: MatterGen (opens in new tab)

What if you could find materials with tailored properties without ever entering the lab? The Materialism Podcast, which is dedicated to exploring materials science and engineering, talks with Tian Xie from Microsoft Research to discuss MatterGen, an AI tool which accelerates materials science discovery. Tune in to hear a discussion of the new Azure AI Foundry, where MatterGen will interact with and support MatterSim, an advanced deep learning model designed to simulate the properties of materials across a wide range of elements, temperatures, and pressures.


IN THE NEWS: Highlights of recent media coverage of Microsoft Research


When ChatGPT Broke an Entire Field: An Oral History 

Quanta Magazine | April 30, 2025

Large language models are everywhere, igniting discovery, disruption and debate in whatever scientific community they touch. But the one they touched first — for better, worse and everything in between — was natural language processing. What did that impact feel like to the people experiencing it firsthand?

To tell that story, Quanta interviewed 19 NLP experts, including Kalika Bali, senior principal researcher at Microsoft Research. From researchers to students, tenured academics to startup founders, they describe a series of moments — dawning realizations, elated encounters and at least one “existential crisis” — that changed their world. And ours.

The post Research Focus: Week of May 7, 2025 appeared first on Microsoft Research.

Read More

Microsoft Fusion Summit explores how AI can accelerate fusion research

Microsoft Fusion Summit explores how AI can accelerate fusion research

Sir Steven Cowley, professor and director of the Princeton Plasma Physics Laboratory and former head of the UK Atomic Energy Authority, giving a presentation.

The pursuit of nuclear fusion as a limitless, clean energy source has long been one of humanity’s most ambitious scientific goals. Research labs and companies worldwide are working to replicate the fusion process that occurs at the sun’s core, where isotopes of hydrogen combine to form helium, releasing vast amounts of energy. While scalable fusion energy is still years away, researchers are now exploring how AI can help accelerate fusion research and bring this energy to the grid sooner. 

In March 2025, Microsoft Research held its inaugural Fusion Summit, a landmark event that brought together distinguished speakers and panelists from within and outside Microsoft Research to explore this question. 

Ashley Llorens, Corporate Vice President and Managing Director of Microsoft Research Accelerator, opened the Summit by outlining his vision for a self-reinforcing system that uses AI to drive sustainability. Steven Cowley, laboratory director of the U.S. Department of Energy’s Princeton Plasma Physics Laboratory (opens in new tab), professor at Princeton University, and former head of the UK Atomic Energy Authority, followed with a keynote explaining the intricate science and engineering behind fusion reactors. His message was clear: advancing fusion will require international collaboration and the combined power of AI and high-performance computing to model potential fusion reactor designs. 

Applying AI to fusion research

North America’s largest fusion facility, DIII-D (opens in new tab), operated by General Atomics and owned by the US Department of Energy (DOE), provides a unique platform for developing and testing AI applications for fusion research, thanks to its pioneering data and digital twin platform. 

Richard Buttery (opens in new tab) from DIII-D and Dave Humphreys (opens in new tab) from General Atomics demonstrated how the US DIII-D National Fusion Program (opens in new tab) is already applying AI to advance reactor design and operations, highlighting promising directions for future development. They provided examples of applying AI to active plasma control to avoid disruptive instabilities, including using AI-controlled trajectories to avoid tearing modes and implementing feedback control with machine learning-derived density limits for safer high-density operations. 

One persistent challenge in reactor design involves building the interior “first wall,” which must withstand extreme heat and particle bombardment. Zulfi Alam, corporate vice president of Microsoft Quantum (opens in new tab), discussed the potential of using quantum computing in fusion, particularly for addressing material challenges like hydrogen diffusion in reactors.

He noted that silicon nitride shows promise as a barrier to hydrogen and vapor and explained the challenge of binding it to the reaction chamber. He emphasized the potential of quantum computing to improve material prediction and synthesis, enabling more efficient processes. He shared that his team is also investigating advanced silicon nitride materials to protect this critical component from neutron and alpha particle damage—an innovation that could make fusion commercially viable.


Exploring AI’s broader impact on fusion engineering

Lightning talks from Microsoft Research labs addressed the central question of AI’s potential to accelerate fusion research and engineering. Speakers covered a wide range of applications—from using gaming AI for plasma control and robotics for remote maintenance to physics-informed AI for simulating materials and plasma behavior. Closing the session, Archie Manoharan, Microsoft’s director of nuclear engineering for Cloud Operations and Infrastructure, emphasized the need for a comprehensive energy strategy, one that incorporates renewables, efficiency improvements, storage solutions, and carbon-free sources like fusion.

The Summit culminated in a thought-provoking panel discussion moderated by Ade Famoti, featuring Archie Manoharan, Richard Buttery, Steven Cowley, and Chris Bishop, Microsoft Technical Fellow and director of Microsoft Research AI for Science. Their wide-ranging conversation explored the key challenges and opportunities shaping the field of fusion. 

The panel highlighted several themes: the role of new regulatory frameworks that balance innovation with safety and public trust; the importance of materials discovery in developing durable fusion reactor walls; and the game-changing role AI could play in plasma optimization and surrogate modeling of fusion’s underlying physics.

They also examined the importance of global research collaboration, citing projects like the International Thermonuclear Experimental Reactor (opens in new tab) (ITER), the world’s largest experimental fusion device under construction in southern France, as testbeds for shared progress. One persistent challenge, however, is data scarcity. This prompted a discussion of using physics-informed neural networks as a potential approach to supplement limited experimental data. 
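
As a rough illustration of the physics-informed neural network idea raised in the panel, the sketch below fits a toy ordinary differential equation, du/dt = -u with u(0) = 1, using a small network trained on a few synthetic “measurements” plus a physics-residual penalty at collocation points. It is a generic example written against PyTorch under stated assumptions, not fusion code or anything shown at the Summit.

```python
# Minimal physics-informed neural network (PINN) sketch for the toy ODE
# du/dt = -u, u(0) = 1. Sparse "data" and a physics residual are combined
# so the network stays consistent with the equation where data are scarce.
import torch

torch.manual_seed(0)

# Small fully connected network approximating u(t).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

# A handful of synthetic "measurements" taken from the known solution exp(-t).
t_data = torch.tensor([[0.0], [0.5], [2.0]])
u_data = torch.exp(-t_data)

# Dense collocation points where only the physics residual is enforced.
t_phys = torch.linspace(0.0, 3.0, 100).reshape(-1, 1).requires_grad_(True)

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(3000):
    optimizer.zero_grad()

    # Data loss: match the few available measurements.
    loss_data = torch.mean((net(t_data) - u_data) ** 2)

    # Physics loss: residual of du/dt + u = 0 at the collocation points.
    u = net(t_phys)
    du_dt = torch.autograd.grad(u, t_phys, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    loss_phys = torch.mean((du_dt + u) ** 2)

    (loss_data + loss_phys).backward()
    optimizer.step()

# The trained network now interpolates consistently with the ODE even where
# no measurements exist, e.g. u(1.0) should be close to exp(-1) ≈ 0.368.
print(net(torch.tensor([[1.0]])).item())
```

The same pattern, a data term plus a physics-residual term, is what makes physics-informed approaches attractive when experimental data are limited.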

Global collaboration and next steps

Microsoft is collaborating with ITER (opens in new tab) to help advance the technologies and infrastructure needed to achieve fusion ignition—the critical point where a self-sustaining fusion reaction begins. That collaboration draws on Microsoft 365 Copilot, Azure OpenAI Service, Visual Studio, and GitHub (opens in new tab). Microsoft Research is now working with ITER to identify where AI can be applied to model future experiments and to optimize the facility’s design and operations. 

Microsoft Research has also signed a Memorandum of Understanding with the Princeton Plasma Physics Laboratory (PPPL) (opens in new tab) to foster collaboration through knowledge exchange, workshops, and joint research projects. This effort aims to address key challenges in fusion, materials, plasma control, digital twins, and experiment optimization. Together, Microsoft Research and PPPL will work to drive innovation and advances in these critical areas.

Fusion is a scientific challenge unlike any other and could be key to sustainable energy in the future. We’re excited about the role AI can play in helping make that vision a reality. To learn more, visit the Fusion Summit event page, or connect with us by email at FusionResearch@microsoft.com.

The post Microsoft Fusion Summit explores how AI can accelerate fusion research appeared first on Microsoft Research.


Abstracts: Societal AI with Xing Xie

Abstracts: Societal AI with Xing Xie

Xing Xie illustrated headshot

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Partner Research Manager Xing Xie joins host Gretchen Huizinga to talk about his work on a white paper called Societal AI: Research Challenges and Opportunities. Part of a larger effort to understand the cultural impact of AI systems, this white paper is a result of a series of global conversations and collaborations on how AI systems interact with and influence human societies. 


Learn more:

Societal AI: Building human-centered AI systems
Microsoft Research Blog, May 2024

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. 

I’m here today with Xing Xie, a partner research manager at Microsoft Research and co-author of a white paper called Societal AI: Research Challenges and Opportunities. This white paper is a result of a series of global conversations and collaborations on how AI systems interact with and impact human societies. Xing Xie, great to have you back on the podcast. Welcome to Abstracts! 


XING XIE: Thank you for having me. 

HUIZINGA: So let’s start with a brief overview of the background for this white paper on Societal AI. In just a few sentences, tell us how the idea came about and what key principles drove the work. 

XIE: The idea for this white paper emerged in response to the shift we are witnessing in the AI landscape. Particularly since the release of ChatGPT in late 2022, these models didn’t just change the pace of AI research, they began reshaping our society, education, economy, and yeah, even the way we understand ourselves. At Microsoft Research Asia, we felt a strong urgency to better understand these changes. Over the past 30 months, we have been actively exploring this frontier in partnership with experts from psychology, sociology, law, and philosophy. This white paper serves three main purposes. First, to document what we have learned. Second, to guide future research directions. And last, to open up an effective communication channel with collaborators across different disciplines. 

HUIZINGA: Research on responsible AI is a relatively new discipline and it’s profoundly multidisciplinary. So tell us about the work that you drew on as you convened this series of workshops and summer schools, research collaborations and interdisciplinary dialogues. What kinds of people did you bring to the table and for what reason? 

XIE: Yeah. Responsible AI actually has been evolving within Microsoft for like about a decade. But with the rise of large language models, the scope and urgency of these challenges have grown exponentially. That’s why we have leaned heavily on interdisciplinary collaboration. For instance, in the Value Compass Project, we worked with philosophers to frame human values in a scientifically actionable way, something essential for aligning AI behavior. In our AI evaluation efforts, we drew from psychometrics to create more principled ways of assessing these systems. And with the sociologists, we have examined how AI affects education and social systems. This joint effort has been central to the work we share in this white paper. 

HUIZINGA: So white papers differ from typical research papers in that they don’t rely on a particular research methodology per se, but you did set, as a backdrop for your work, ten questions for consideration. So how did you decide on these questions and how or by what means did you attempt to answer them? 

XIE: Rather than follow a traditional research methodology, we built this white paper around ten fundamental, foundational research questions. These came from extensive dialogue, not only with social scientists, but also computer scientists working at the technical front of AI. These questions span both directions. First, how AI impacts society, and second, how social science can help solve technical challenges like alignment and safety. They reflect a dynamic agenda that we hope to evolve continuously through real-world engagement and deeper collaboration. 

HUIZINGA: Can you elaborate on… a little bit more on the questions that you chose to investigate as a group or groups in this? 

XIE: Sure, I think I can use the Value Compass Project as one example. In that project, our main goal is to try to study how we can better align the values of AI models with our human values. Here, one fundamental question is how we define our own human values. There actually is a lot of debate and discussion on this. Fortunately, we see that philosophy and sociology have studied this for years, like, for hundreds of years. They have defined frameworks such as the basic human values framework and moral foundations theory. We can borrow that expertise. Actually, we have worked with sociologists and philosophers to try to borrow this expertise and define a framework that could be usable for AI. We have worked on developing some initial frameworks and evaluation methods for this. 

HUIZINGA: So one thing that you just said was to frame philosophical issues in a scientifically actionable way. How hard was that? 

XIE: Yeah, it is actually not easy. I think that first of all, social scientists and AI researchers, we… usually we speak different languages. 

HUIZINGA: Right! 

XIE: Our research is at a very different pace. So at the very beginning, I think we should find out what’s the best way to talk to each other. So we have workshops, have joint research projects, we have them visit us, and also, we have supervised some joint interns. So that’s all the ways we try to find some common ground to work together. More specifically for this value framework, we have tried to understand what’s the latest program from their source and also try how to adapt them to an AI context. So that’s, I mean, it’s not easy, but it’s like enjoyable and exciting journey! 

HUIZINGA: Yeah, yeah, yeah. And I want to push in on one other question that I thought was really interesting, which you asked, which was how can we ensure AI systems are safe, reliable, controllable, especially as they become more autonomous? I think this is a big question for a lot of people. What kind of framework did you use to look at that? 

XIE: Yeah, there are many different aspects. I think alignment definitely is an aspect. That means how we can make sure we can have a way to truly and deeply embed our values into the AI model. Even after we define our value, we still need a way to make sure that it’s actually embedded in. And also evaluation I think is another topic. Even we have this AI…. looks safe and looks behavior good, but how we can evaluate that, how we can make sure it is actually doing the right thing. So we also have some collaboration with psychometrics people to define a more scientific evaluation framework for this purpose as well. 

HUIZINGA: Yeah, I remember talking to you about your psychometrics in the previous podcast… 

XIE: Yeah! 

HUIZINGA: …you were on and that was fascinating to me. And I hope… at some point I would love to have a bigger conversation on where you are now with that because I know it’s an evolving field. 

XIE: It’s evolving! 

HUIZINGA: Yeah, amazing! Well, let’s get back to this paper. White papers aren’t designed to produce traditional research findings, as it were, but there are still many important outcomes. So what would you say the most important takeaways or contributions of this paper are? 

XIE: Yeah, the key takeaway, I believe, is AI is no longer just a technical tool. It’s becoming a social actor. 

HUIZINGA: Mmm. 

XIE: So it must be studied as a dynamic evolving system that intersects with human values, cognition, culture, and governance. So we argue that interdisciplinary collaboration is no longer optional. It’s essential. Social sciences offer tools to understand the complexity, bias, and trust, concepts that are critical for AI’s safe and equitable deployment. So the synergy between technical and social perspectives is what will help us move from reactive fixes to proactive design. 

HUIZINGA: Let’s talk a little bit about the impact that a paper like this can have. And it’s more of a thought leadership piece, but who would you say will benefit most from the work that you’ve done in this white paper and why? 

XIE: We hope this work speaks to both AI and social science communities. For AI researchers, this white paper provides frameworks and real-world examples, like value evaluation systems and cross-cultural model training that can inspire new directions. And for social scientists, it opens doors to new tools and collaborative methods for studying human behavior, cognition, and institutions. And beyond academia, we believe policymakers and industry leaders can also benefit as the paper outlines practical governance questions and highlights emerging risks that demand timely attention. 

HUIZINGA: Finally, Xing, what would you say the outstanding challenges are for Societal AI, as you framed it, and how does this paper lay a foundation for future research agendas? Specifically, what kinds of research agendas might you see coming out of this foundational paper? 

XIE: We believe this white paper is not a conclusion, it’s a starting point. While the ten research questions are a strong foundation, they also expose deeper challenges. For example, how do we build a truly interdisciplinary field? How can we reconcile the different timelines, methods, and cultures of AI and social science? And how do we nurture talents who can work fluently across those both domains? We hope this white paper encourages others to take on these questions with us. Whether you are researcher, student, policymaker, or technologist, there is a role for you in shaping AI that not only works but works for society. So yeah, I look forward to the conversation with everyone. 

HUIZINGA: Well, Xing Xie, it’s always fun to talk to you. Thanks for joining us today and to our listeners, thanks for tuning in. If you want to read this white paper, and I highly recommend that you do, you can find a link at aka.ms/Abstracts, or you can find a link in our show notes that will take you to the Microsoft Research website. See you next time on Abstracts!

 

The post Abstracts: Societal AI with Xing Xie appeared first on Microsoft Research.


Societal AI: Building human-centered AI systems

Societal AI: Building human-centered AI systems

Societal AI surrounded by a circle with two directional arrows in the center of a rectangle with Computer Science and a computer icon on the left with a directional arrow pointing to Social Science on the right with two avatar icons.

In October 2022, Microsoft Research Asia hosted a workshop that brought together experts in computer science, psychology, sociology, and law as part of Microsoft’s commitment to responsible AI (opens in new tab). The event led to ongoing collaborations exploring AI’s societal implications, including the Value Compass (opens in new tab) project.

As these efforts grew, researchers focused on how AI systems could be designed to meet the needs of people and institutions in areas like healthcare, education, and public services. This work culminated in Societal AI: Research Challenges and Opportunities, a white paper that explores how AI can better align with societal needs. 

What is Societal AI?

Societal AI is an emerging interdisciplinary area of study that examines how AI intersects with social systems and public life. It focuses on two main areas: (1) the impact of AI technologies on fields like education, labor, and governance; and (2) the challenges posed by these systems, such as evaluation, accountability, and alignment with human values. The goal is to guide AI development in ways that respond to real-world needs.

The white paper offers a framework for understanding these dynamics and provides recommendations for integrating AI responsibly into society. This post highlights the paper’s key insights and what they mean for future research.

Tracing the development of Societal AI

Societal AI began nearly a decade ago at Microsoft Research Asia, where early work on personalized recommendation systems uncovered risks like echo chambers, where users are repeatedly exposed to similar viewpoints, and polarization, which can deepen divisions between groups. Those findings led to deeper investigations into privacy, fairness, and transparency, helping inform Microsoft’s broader approach to responsible AI.

The rapid rise of large-scale AI models in recent years has made these concerns more urgent. Today, researchers across disciplines are working to define shared priorities and guide AI development in ways that reflect social needs and values.

Key insights

The white paper outlines several important considerations for the field:

Interdisciplinary framework: Bridges technical AI research with the social sciences, humanities, policy studies, and ethics to address AI’s far-reaching societal effects.

Actionable research agenda: Identifies ten research questions that offer a roadmap for researchers, policymakers, and industry leaders.

Global perspective: Highlights the importance of different cultural perspectives and international cooperation in shaping the dialogue around responsible AI development.

Practical insights: Balances theory with real-world applications, drawing from collaborative research projects.

“AI’s impact extends beyond algorithms and computation—it challenges us to rethink fundamental concepts like trust, creativity, agency, and value systems,” says Lidong Zhou, managing director of Microsoft Research Asia. “It recognizes that developing more powerful AI models is not enough; we must examine how AI interacts with human values, institutions, and diverse cultural contexts.”

Figure 1. The Societal AI research agenda: ten research areas that require cross-disciplinary collaboration between computer scientists (contributing expertise in machine learning, natural language processing, human-computer interaction, and social computing) and social scientists (from psychology, law, sociology, and philosophy). The ten areas are AI safety and reliability; AI fairness and inclusiveness; AI value alignment; AI capability evaluation; human-AI collaboration; AI interpretability and transparency; AI’s impact on scientific discoveries; AI’s impact on labor and global business; AI’s impact on human cognition and creativity; and regulatory and governance frameworks for AI.

Guiding principles for responsible integration

 The research agenda is grounded in three key principles: 

  • Harmony: AI should minimize conflict and build trust to support acceptance. 
  • Synergy: AI should complement human capabilities, enabling outcomes that neither humans nor machines could achieve alone.  
  • Resilience: AI should be robust and adaptable as social and technological conditions evolve.  

Ten critical questions

These questions span both technical and societal concerns:  

  1. How can AI be aligned with diverse human values and ethical principles?
  2. How can AI systems be designed to ensure fairness and inclusivity across different cultures, regions, and demographic groups?
  3. How can we ensure AI systems are safe, reliable, and controllable, especially as they become more autonomous?
  4. How can human-AI collaboration be optimized to enhance human abilities?
  5. How can we effectively evaluate AI’s capabilities and performance in new, unforeseen tasks and environments?
  6. How can we enhance AI interpretability to ensure transparency in its decision-making processes?
  7. How will AI reshape human cognition, learning, and creativity, and what new capabilities might it unlock?
  8. How will AI redefine the nature of work, collaboration, and the future of global business models?
  9. How will AI transform research methodologies in the social sciences, and what new insights might it enable?
  10. How should regulatory frameworks evolve to govern AI development responsibly and foster global cooperation?

This list will evolve alongside AI’s developing societal impact, ensuring the agenda remains relevant over time. Building on these questions, the white paper underscores the importance of sustained, cross-disciplinary collaboration to guide AI development in ways that reflect societal priorities and public interest.

“This thoughtful and comprehensive white paper from Microsoft Research Asia represents an important early step forward in anticipating and addressing the societal implications of AI, particularly large language models (LLMs), as they enter the world in greater numbers and for a widening range of purposes,” says research collaborator James A. Evans (opens in new tab), professor of sociology at the University of Chicago.

Looking ahead

Microsoft is committed to fostering collaboration and invites others to take part in developing governance systems. As new challenges arise, the responsible use of AI for the public good will remain central to our research.

We hope the white paper serves as both a guide and a call to action, emphasizing the need for engagement across research, policy, industry, and the public.

For more information, and to access the full white paper, visit the Microsoft Research Societal AI page. Listen to the author discuss more about the research in this podcast.

Acknowledgments

We are grateful for the contributions of the researchers, collaborators, and reviewers who helped shape this white paper.

The post Societal AI: Building human-centered AI systems appeared first on Microsoft Research.


Laws, norms, and ethics for AI in health

Laws, norms, and ethics for AI in health

Peter Lee, Vardit Ravitsky, Laura Adams, and Dr. Roxana Daneshjou illustrated headshots.

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, Laura Adams (opens in new tab), Vardit Ravitsky (opens in new tab), and Dr. Roxana Daneshjou (opens in new tab), experts at the intersection of healthcare, ethics, and technology, join Lee to discuss the responsible implementation of AI in healthcare. Adams, a strategic advisor at the National Academy of Medicine leading the development of a national AI code of conduct, shares her initial curiosity and skepticism of generative AI and then her recognition of the technology as a transformative tool requiring new governance approaches. Ravitsky, bioethicist and president and CEO of The Hastings Center for Bioethics, examines how AI is reshaping healthcare relationships and the need for bioethics to proactively guide implementation. Daneshjou, a Stanford physician-scientist bridging dermatology, biomedical data science, and AI, discusses her work on identifying, understanding, and mitigating bias in AI systems and also leveraging AI to better serve patient needs.


Learn more

Health Care Artificial Intelligence Code of Conduct (opens in new tab) (Adams) 
Project homepage | National Academy of Medicine 

Artificial Intelligence in Health, Health Care, and Biomedical Science: An AI Code of Conduct Principles and Commitments Discussion Draft (opens in new tab) (Adams) 
National Academy of Medicine commentary paper | April 2024 

Ethics of AI in Health and Biomedical Research (opens in new tab) (Ravitsky) 
The Hastings Center for Bioethics 

Ethics in Patient Preferences for Artificial Intelligence–Drafted Responses to Electronic Messages (opens in new tab) (Ravitsky) 
Publication | March 2025 

Daneshjou Lab (opens in new tab) (Daneshjou) 
Lab homepage 

Red teaming ChatGPT in medicine to yield real-world insights on model behavior (opens in new tab) (Daneshjou) 
Publication | March 2025 

Dermatologists’ Perspectives and Usage of Large Language Models in Practice: An Exploratory Survey (opens in new tab) (Daneshjou) 
Publication | October 2024 

Deep learning-aided decision support for diagnosis of skin disease across skin tones (opens in new tab) (Daneshjou) 
Publication | February 2024 

Large language models propagate race-based medicine (opens in new tab) (Daneshjou) 
Publication | October 2023 

Disparities in dermatology AI performance on a diverse, curated clinical image set (opens in new tab) (Daneshjou) 
Publication | August 2022

Transcript

[MUSIC]    

[BOOK PASSAGE]  

PETER LEE: “… This is the moment for broad, thoughtful consideration of how to ensure maximal safety and also maximum access. Like any medical tool, AI needs those guardrails to keep patients as safe as possible. But it’s a tricky balance: those safety measures must not mean that the great advantages that we document in this book end up unavailable to many who could benefit from them. One of the most exciting aspects of this moment is that the new AI could accelerate healthcare in a direction that is better for patients, all patients, and providers as well—if they have access.” 

[END OF BOOK PASSAGE]    

[THEME MUSIC]    

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.    

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?     

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. 


[THEME MUSIC FADES] 

The passage I read at the top there is from Chapter 9, “Safety First.” 

One needs only to look at examples such as laws mandating seatbelts in cars and, more recently, internet regulation to know that policy and oversight are often playing catch-up with emerging technologies. When we were writing our book, Carey, Zak, and I didn’t claim that putting frameworks in place to allow for innovation and adoption while prioritizing inclusiveness and protecting patients from hallucination and other harms would be easy. In fact, in our writing, we posed more questions than answers in the hopes of highlighting the complexities at hand and supporting constructive discussion and action in this space.  

In this episode, I’m pleased to welcome three experts who have been thinking deeply about these matters: Laura Adams, Vardit Ravitsky, and Dr. Roxana Daneshjou.  

Laura is an expert in AI, digital health, and human-centered care. As a senior advisor at the National Academy of Medicine, or NAM, she guides strategy for the academy’s science and technology portfolio and leads the Artificial Intelligence Code of Conduct national initiative.  

Vardit is president and CEO of The Hastings Center for Bioethics, a bioethics and health policy institute. She leads research projects funded by the National Institutes of Health, is a member of the committee developing the National Academy of Medicine’s AI Code of Conduct, and is a senior lecturer at Harvard Medical School.  

Roxana is a board-certified dermatologist and an assistant professor of both dermatology and biomedical data science at Stanford University. Roxana is among the world’s thought leaders in AI, healthcare, and medicine, thanks in part to groundbreaking work on AI biases and trustworthiness. 

One of the good fortunes I’ve had in my career is the chance to work with both Laura and Vardit, mainly through our joint work with the National Academy of Medicine. They’re both incredibly thoughtful and decisive leaders working very hard to help the world of healthcare—and healthcare regulators—come to grips with generative AI. And over the past few years, I’ve become an avid reader of all of Roxana’s research papers. Her work is highly technical, super influential but also informative in a way that spans computer science, medicine, bioethics, and law.  

These three leaders—one from the medical establishment, one from the bioethics field, and the third from clinical research—provide insights into three incredibly important dimensions of the issues surrounding regulations, norms, and ethics of AI in medicine. 

[TRANSITION MUSIC] 

Here is my interview with Laura Adams: 

LEE: Laura, I’m just incredibly honored and excited that you’re joining us here today, so welcome. 

ADAMS: Thank you, Peter, my pleasure. Excited to be here. 

LEE: So, Laura, you know, I’ve been working with you at the NAM for a while, and you are a strategic advisor at the NAM. But I think a lot of our listeners might not know too much about the National Academy of Medicine and then, within the National Academy of Medicine, what a strategic advisor does.  

So why don’t we start there. You know, how would you explain to a person’s mother or father what the National Academy of Medicine is? 

ADAMS: Sure. National Academy was formed more than 50 years ago. It was formed by the federal government, but it is not the federal government. It was formed as an independent body to advise the nation and the federal government on issues of science and primarily technology-related issues, as well.  

So with that 50 years, some probably know of the National Academy of Medicine when it was the Institute of Medicine and produced such publications as To Err is Human (opens in new tab) and Crossing the Quality Chasm (opens in new tab), both of which were seminal publications that I think had a dramatic impact on quality, safety, and how we saw our healthcare system and what we saw in terms of its potential. 

LEE: So now, for your role within NAM, what does the senior advisor do? What do you do?  

ADAMS: What I do there is in the course of leading the AI Code of Conduct project, my role there was in framing the vision for that project, really understanding what did we want it to do, what impact did we want it to make.  

So for example, some thought that it might be that we wanted everyone to use our code of conduct. And my advice on that was let’s use this as a touchstone. We want people to think about their own codes of conduct for their use of AI. That’s a valuable exercise, to decide what you value, what your aspirations are.  

I also then do a lot of the field alignment around that work. So I probably did 50 talks last year—conference presentations, webinars, different things—where the code of conduct was presented so that the awareness could be raised around it so people could see the practicality of using that tool.   
 
Especially the six commitments that were based on the idea of complex adaptive systems simple rules, where we could recall those in the heat of decision-making around AI, in the heat of application, or even in the planning and strategic thinking around it.  

LEE: All right, we’re going to want to really break into a lot of details here.  

But I would just like to rewind the clock a little bit and talk about your first encounters with AI. And there’s sort of, I guess, two eras. There’s the era of AI and machine learning before ChatGPT, before the generative AI era, and then afterwards.  

Before the era of generative AI, what was your relationship with the idea of artificial intelligence? Was it a big part of your role and something you thought about, or was it just one of many technologies that you considered? 

ADAMS: It was one of many.  

LEE: Yeah.  

ADAMS: Watching it help us evolve from predictive analytics to predictive AI, which of course I was fascinated by the fact that it could use structured and unstructured data, that it could learn from its own processes. These things were really quite remarkable, but my sense about it was that it was one of many.  

We were looking at telemedicine. We were looking at [a] variety of other things, particularly wearables and things that were affecting and empowering patients to take better care of themselves and take more … have more agency around their own care. So I saw it as one of many.  

And then the world changed in 2022, changed dramatically. 

LEE: [LAUGHS] OK. Right. OK, so November 2022, ChatGPT. Later in the spring of 2023, GPT-4. And so, you know, what were your first encounters, and what were you feeling? What were you experiencing? 

ADAMS: At the time, I was curious, and I thought, I think I’m seeing four things here that make this way different.  

And one was, and it proved to be true over time, the speed with which this evolved. And I was watching it evolve very, very quickly and thinking, this is almost, this is kind of mind blowing how fast this is getting better.  

And then this idea that, you know, we could scale this. As we were watching the early work with ambient listening, I was working with a group of physicians that were lamenting the cost and the unavailability of scribes. They wanted to use scribes. And I’m thinking, We don’t have to incur the cost of that. We don’t have to struggle with the unavailability of that type of … someone in the workforce.   

And then I started watching the ubiquity, and I thought, Oh, my gosh, this is unlike any other technology that we’ve seen. Because with electronic health records, for example, it’s had its place, but it was over here. We had another digital technology, maybe telehealth, over here. This was one, and I thought, there will be no aspect of healthcare that will be left untouched by AI. That blew my mind.  

LEE: Yeah. 

ADAMS: And then I think the last thing was the democratization. And I realized: Wow, anyone with a smartphone has access to the most powerful large language models in the world.  
 
And I thought, This, to me, is a revolution in cheap expertise. Those were the things that really began to stun me, and I just knew that we were in a way different era. 

LEE: It’s interesting that you first talked about ambient listening. Why was that of particular initial interest to you specifically? 

ADAMS: It was because one of the things that we were putting together in our code of conduct, which began pre-generative AI, was the idea that we wanted to renew the moral well-being and the sense of shared purpose to the healthcare workforce. That’s one of the six principles.  

And I knew that the cognitive burden was becoming unbearable. When we came out of COVID, it was such a huge wake-up call to understand exactly what was going on at that point of care and how challenging it had become because information overload is astonishing in and of itself. And that idea that we have so much in the way of documentation that needed to be done and how much of a clinician’s time was taken up doing that rather than doing the thing that they went into the profession to do. And that was interact with people, that was to heal, that was to develop human connection that had a healing effect, and they just … so much of the time was taken away from that activity.  

I also looked at it and because I studied diffusion of innovations theory and understand what causes something to move rapidly across a social system and get adopted, it has to have a clear relative advantage. It has to be compatible with the way that processes work. 

So I didn’t see that this was going to be a hugely disruptive activity to workflow, which is a challenge of most digital tools, is that they’re designed without that sense of, how does this impact the workflow? And then I just thought that it was going to be a front runner in adoption, and it might then start to create that tsunami, that wave of interest in this, and I don’t think I was wrong. 

LEE: I have to ask you, because I’ve been asking every guest, there must have been moments early on in the encounter with generative AI where you felt doubt or skepticism. Is that true, or did you immediately think, Wow, this is something very important? 

ADAMS: No, I did feel doubt and skepticism.  

My understanding told me from the very beginning that this is trained on the internet with all of its flaws. When we think about AI, we think about it being very futuristic, but it’s trained on data from the past. I’m well aware of how flawed that data is, how biased that data is: mostly men, mostly white men, within a certain age group.  

So I knew that we had inherent massive flaws in the training data and that concerned me. I saw other things about it that also concerned me. I saw that … its difficulty in beginning to use it and govern it effectively.  

You really do have to put a good governance system in if you’re going to put this into a care delivery system. And I began to worry about widening a digital divide that already was a chasm. And that was between those well-resourced, usually urban, hospitals and health systems that are serving the well-resourced, and the inner-city hospital in Chicago or the rural hospital in Nebraska or the Mississippi community health center.  

LEE: Yes. So I think this skepticism about technology, new technologies, in healthcare is very well-earned. So you’ve zeroed in on this issue of technology where oftentimes we hope it’ll reduce or eliminate biases but actually seems to oftentimes have the opposite effect.  

And maybe this is a good segue then into this really super-important national effort that you’re leading on the AI code of conduct. Because in a way, I think those failures of the past and even just the idea—the promise—that technology should make a doctor or a nurse’s job easier, not harder, even that oftentimes seems not to have panned out in the way that we hope.  

And then there’s, of course, the well-known issue of hallucinations or of mistakes being made. You know, how did those things inform this effort around a code of conduct, and why a code of conduct? 

ADAMS: Those things weighed heavily on me as the initiative leader because I had been deeply involved in the spread of electronic health records, not really knowing and understanding that electronic health records were going to have the effect that they had on the providers that use them.  

Looking back now, I think that there could have been design changes, but we probably didn’t have as much involvement of providers in the design. And in some cases, we did. We just didn’t understand what it would take to work it into their workflows.  

So I wanted to be sure that the code of conduct took into consideration and made explicit some of the things that I believe would have helped us had we had those guardrails or those guidelines explicit for us.  

And those are things like our first one is to protect and advance human health and connection.  

We also wanted to see things about openly sharing and monitoring because we know that for this particular technology, it’s emergent. We’re going to have to do a much better job at understanding whether what we’re doing works and works in the real world.  

So the reason for a code of conduct was we knew that … the good news, when the “here comes AI and it’s barreling toward us,” the good news was that everybody was putting together guidelines, frameworks, and principle sets. The bad news was the same: that everybody was putting together their own guideline, principle, and framework set.  

And I thought back to how much I struggled when I worked in the world of health information exchange and built a statewide health information exchange and then turned to try to exchange that data across the nation and realized that we had a patchwork of privacy laws and regulations across the states; it was extremely costly to try to move data.  

And I thought we actually need, in addition to data interoperability, we need governance interoperability, where we can begin to agree on a core set of principles that will more easily allow us to move ahead and achieve some of the potential and the vision that we have for AI if we are not working with a patchwork of different guidelines, principles, and frameworks.  

So that was the impetus behind it. Of course, we again want it to be used as a touchstone, not everybody wholesale adopt what we’ve said.  

LEE: Right. 

ADAMS: We want people to think about this and think deeply about it. 

LEE: Yeah, Laura, I always am impressed with just how humble you are. You were indeed, you know, one of the prime instigators of the digitization of health records leading to electronic health record systems. And I don’t think you need to feel bad about that. That was a tremendous advance. I mean, moving a fifth of the US economy to be digital, I think, is significant.  

Also, our listeners might want to know that you led something called the Rhode Island Quality Institute, which was really, I think, maybe the, arguably, the most important early kind of examples that set a pattern for how and why health data might actually lead very directly to improvements in human health at a statewide level or at a population level. And so I think your struggles and frustrations on, you know, how to expand that nationwide, I think, are really, really informative.  

So let’s get into what these principles are, you know, what’s in the code of conduct.  

ADAMS: Yeah, the six simple rules were derived out of a larger set of principles that we pulled together. And the origin of all of this was we did a fairly extensive landscape review. We looked at least at 60 different sets of these principles, guidelines, and frameworks. We looked for areas of real convergence. We looked for areas where there was inconsistencies. And we looked for out-and-out gaps.  

The out-and-out gaps that we saw at the time were things like a dearth of references to advancing human health as the priority. Also monitoring post-implementation. So at the time, we were watching these evolve and we thought these are very significant gaps. Also, the impact on the environment was a significant gap, as well. And so when we pull that together, we developed a set of principles and cross-walked those with learning health system principles (opens in new tab).  

And then once we got that, we again wanted to distill that down into a set of commitments which we knew that people could find accessible. And we published that draft set of principles (opens in new tab) last year. And we have a new publication that will be coming out in the coming months that will be the revised set of principles and code commitments that we got because we took this out publicly. 

So we opened it up for public comment once we did the draft last year. Again, many of those times that I spoke about this, almost all of those times came with an invitation for feedback, and conversations that we had with people shaped it. And it is in no way, shape, or form a final code of conduct, this set of principles and commitments, because we see this as dynamic. But what we also knew about this was that we wanted to build this with a super solid foundation, a set of immutables, the things that don’t change at some vicissitudes or the whims of this or the whims of that. We wanted those things that were absolutely foundational.  

LEE: Yeah, so we’ll provide a link to the documents that describe the current state of this, but can we give an example of one or two of these principles and one or two of the commitments? 

ADAMS: Sure. I’ve mentioned the “protect and advance human health and connection” as the primary aim. We also want to ensure the equitable distribution of risks and benefits, and that equitable distribution of risks and benefits is something that I was referring to earlier about when I see well-resourced organizations. And one that’s particularly important to me is engaging people as partners with agency at every stage of the AI lifecycle.  

That one matters because this one talks about and speaks to the idea that we want to begin bringing in those that are affected by AI, those on whom AI is used, into the early development and conceptualization of what we want this new tool, this new application, to do. So that includes the providers that use it, the patients. And we find that when we include them—the ethicists that come along with that—we develop much better applications, much more targeted applications that do what we intend them to do in a more precise way.  

The other thing about that engaging with agency, by agency we mean that person, that participant can affect the decisions and they can affect the outcome. So it isn’t that they’re sort of a token person coming into the table and we’ll allow you to tell your story or so, but this is an active participant.  

We practiced what we preached when we developed the code of conduct, and we brought patient advocates in to work with us on the development of this, work with us on our applications, that first layer down of what the applications would look like, which is coming out in this new paper.  

We really wanted that component of this because I’m also seeing that patients are definitely not passive users of this, and they’re having an agency moment, let’s say, with generative AI because they’ve discovered a new capacity to gain information, to—in many ways—claim some autonomy in all of this.   

And I think that there is a disruption underway right now, a big one that has been in the works for many years, but it feels to me like AI may be the tipping point for that disruption of the delivery system as we know it.  

LEE: Right. I think it just exudes sort of inclusivity and thoughtfulness in the whole process. During this process, were there surprises, things that you didn’t expect? Things about AI technology itself that surprised you? 

ADAMS: The surprises that came out of this process for me, one of them was I surprised myself. We were working on the commentary paper, and Steven Lin from Stanford had strong input into that paper. And when we looked at what we thought were missing, he said, “Let’s make sure we have the environmental impact.” And I said, “Oh, really, Steven, we really want to think about things that are more directly aligned with health,” which I couldn’t believe came out of my own mouth. [LAUGHTER] 

And Steven, without saying, “Do you hear yourself?” I mean, I think he could have said that. But he was more diplomatic than that. And he persisted a bit longer and said, “I think it’s actually the greatest threat to human health.” And I said, “Of course, you’re right.” [LAUGHS] 

But that was surprising and embarrassing for me. But it was eye-opening in that even when I thought that I had understood the gaps and the idea of using this as a touchstone, there was still learning taking place, and that learning was happening rapidly among the people involved in this.  

The other thing that was surprising for me was the degree at which patients became vastly facile with using it to the extent that it helped them begin to, again, build their own capacity.  

The #PatientsUseAI from Dave deBronkart—watch that one. This is more revolutionary than we think. And so I watched that, the swell of that happening, and it sort of shocked me because I was envisioning this as, again, a tool for use in the clinical setting.  

LEE: Yeah. OK, so we’re running now towards the end of our time together. And I always like to end our conversations with a more provocative topic. And I thought for you, I’d like to use the very difficult word regulation.  

And when I think about the book that Carey, Zak, and I wrote, we have a chapter on regulation, but honestly, we didn’t have ideas. We couldn’t understand how this would be regulated. And so we just defaulted to publishing a conversation about regulation with GPT-4. And in a way, I think … I don’t know that I or my coauthors were satisfied with that.  

In your mind, where do we stand two years later now when we think about the need or not to regulate AI, particularly in its application to healthcare, and where has the thinking evolved to? 

ADAMS: There are two big differences that I see in that time that has elapsed. And the first one is we have understood the insufficiency of simply making sure that AI-enabled devices are safe prior to going out into implementation settings.  

We recognize now that there’s got to be this whole other aspect of regulation and assurance that these things are functioning as intended and we have the capacity to do that in the point of care type of setting. So that’s been one of the major ones. The other thing is how wickedly challenging it is to regulate generative AI.  

I think one of the most provocative and exciting articles (opens in new tab) that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider?  

Should it be treated in the sense that it’s got to have a certain amount of training and a foundation that’s got to pass certain tests? It has to demonstrate that it’s improving and keeping up with current literature. Does it … be responsible for mistakes that it makes in some way, shape, or form? Does it have to report its performance?  

And I’m thinking, what a provocative idea …  

LEE: Right. 

ADAMS: … but it’s worth considering. I chair the Global Opportunities Group for a regulatory and innovation AI sandbox in the UK. And we’re hard at work thinking about, how do you regulate something as unfamiliar and untamed, really, as generative AI?  

So I’d like to see us think more about this idea of sandboxes, more this idea of should we be just completely rethinking the way that we regulate. To me, that’s where the new ideas will come because the danger, of course, in regulating in the old way … first of all, we haven’t kept up over time, even with predictive AI; even with pre-generative AI, we haven’t kept up.  

And what worries me about continuing on in that same vein is that we will stifle innovation … 

LEE: Yes. 

ADAMS: … and we won’t protect from potential harms. Nobody wants an AI Chernobyl, nobody.  

LEE: Right 
 
ADAMS: But I worry that if we use those old tools on the new applications that we will not only not regulate, then we’ll stifle innovation. And when I see all of the promise coming out of this for things that we thought were unimaginable, then that would be a tragedy. 

LEE: You know, I think the other reflection I’ve had on this is the consumer aspect of it, because I think a lot of our current regulatory frameworks are geared towards experts using the technology.  

ADAMS: Yes. 

LEE: So when you have a medical device, you know you have a trained, board-certified doctor or licensed nurse using the technology. But when you’re putting things in the hands of a consumer, I think somehow the surface area of risk seems wider to me. And so I think that’s another thing that somehow our current regulatory concepts aren’t really ready for. 

ADAMS: I would agree with that. I think a few things to consider, vis-a-vis that, is that this revolution of patients using it is unstoppable. So it will happen. But we’re considering a project here at the National Academy about patients using AI and thinking about: let’s explore all the different facets of that. Let’s understand, what does safe usage look like? What might we do to help this new development enhance the patient-provider relationship and not erode it as we saw, “Don’t confuse your Google search with my medical degree” type of approach.  

Thinking about: how does it change the identity of the provider? How does it … what can we do to safely build a container in which patients can use this without giving them the sense that it’s being taken away, or that … because I just don’t see that happening. I don’t think they’re going to let it happen.  

That, to me, feels extremely important for us to explore all the dimensions of that. And that is one project that I hope to be following on to the AI Code of Conduct and applying the code of conduct principles with that project.  

LEE: Well, Laura, thank you again for joining us. And thank you even more for your tremendous national, even international, leadership on really helping mobilize the greatest institutions in a diverse way to fully confront the realities of AI in healthcare. I think it’s tremendously important work. 

ADAMS: Peter, thank you for having me. This has been an absolute pleasure. 

[TRANSITION MUSIC]  

I’ve had the opportunity to watch Laura in action as she leads a national effort to define an AI code of conduct. And our conversation today has only heightened my admiration for her as a national leader.  

What impresses me is Laura’s recognition that technology adoption in healthcare has had a checkered history and, furthermore, has oftentimes not accommodated equally the huge diversity of stakeholders that are affected. 

The concept of an AI code of conduct seems straightforward in some ways, but you can tell that every word in the emerging code has been chosen carefully. And Laura’s tireless engagement traveling to virtually every corner of the United States, as well as to several other countries, shows real dedication. 

And now here’s my conversation with Vardit Ravitsky:

LEE: Vardit, thank you so much for joining. 

RAVITSKY: It’s a real pleasure. I’m honored that you invited me. 

LEE: You know, we’ve been lucky. We’ve had a few chances to interact and work together within the National Academy of Medicine and so on. But I think for many of the normal subscribers to the Microsoft Research Podcast, they might not know what The Hastings Center for Bioethics is and then what you as the leader of The Hastings Center do every day. So I’d like to start there, first off, with what is The Hastings Center? 

RAVITSKY: Mostly, we’re a research center. We’ve been around for more than 55 years. And we’re considered one of the organizations that actually founded the field known today as bioethics, which is the field that explores the policy implications, the ethical, social issues in biomedicine. So we look at how biotechnology is rolled out; we look at issues of equity, of access to care. We look at issues at the end of life, the beginning of life, how our natural environment impacts our health. Any aspect of the delivery of healthcare, the design of the healthcare system, and biomedical research leading to all this. Any aspect that has an ethical implication is something that we’re happy to explore.  

We try to have broad conversations with many, many stakeholders, people from different disciplines, in order to come up with guidelines and recommendations that would actually help patients, families, communities.  

We also have an editorial department. We publish academic journals. We publish a blog. And we do a lot of public engagement activities—webinars, in-person events. So, you know, we just try to promote the thinking of the public and of experts on the ethical aspects of health and healthcare.  

LEE: One thing I’ve been impressed with, with your work and the work of The Hastings Center is it really confronts big questions but also gets into a lot of practical detail. And so we’ll get there. But before that just a little bit about you then. The way I like to ask this question is: how do you explain to your parents what you do every day? [LAUGHS] 

RAVITSKY: Funny that you brought my parents into this, Peter, because I come from a family of philosophers. Everybody in my family is in humanities, in academia. When I was 18, I thought that that was the only profession [LAUGHTER] and that I absolutely had to become a philosopher, or else what else can you do with your life? 

I think being a bioethicist is really about, on one hand, keeping an eye constantly on the science as it evolves. When a new scientific development occurs, you have to understand what’s happening so that you can translate that outside of science. So if we can now make a gamete from a skin cell so that babies will be created differently, you have to understand how that’s done, what that means, and how to talk about it.  

The second eye you keep on the ethics literature. What ethical frameworks, theories, principles have we developed over the last decades that are now relevant to this technology. So you’re really a bridge between science, biomedicine on one hand and humanities on the other hand.  

LEE: OK. So let’s shift to AI. And here I’d like to start with a kind of an origin story because I’m assuming before generative AI and ChatGPT became widely known and available, you must have had some contact with ideas in data science, in machine learning, and, you know, in the concept of AI before ChatGPT. Is that true? And, you know, what were some of those early encounters like for you? 

RAVITSKY: The earlier issues that I heard people talk about in the field were really around diagnostics and reading images and, Ooh, it looks like machines could perform better than radiologists. And, Oh, what if women preferred that their mammographies be read by these algorithms? And, Does that threaten us clinicians? Because it sort of highlights our limitations and weaknesses as, you know, the weakness of the human eye and the human brain.  

So there were early concerns about, will this outperform the human and potentially take away our jobs? Will it impact our authority with patients? What about de-skilling clinicians or radiologists or any type of diagnostician losing the ability … some abilities that they’ve had historically because machines take over? So those were the early-day reflections and interestingly some of them remain even now with generative AI.  

All those issues of the standing of a clinician, and what sets us apart, and will a machine ever be able to perform completely autonomously, and what about empathy, and what about relationships? Much of that translated later on into the, you know, more advanced technology. 

LEE: I find it interesting that you use words like our and we to implicitly refer to humans, homo sapiens, to human beings. And so do you see a fundamental distinction, a hard distinction that separates humans from machines? Or, you know, how … if there are replacements of some human capabilities or some things that human beings do by machines, you know, how do you think about that? 

RAVITSKY: Ooh, you’re really pushing hard on the philosopher in me here. [LAUGHTER] I’ve read books and heard lectures by those who think that the line is blurred, and I don’t buy that. I think there’s a clear line between human and machine.  

I think the issue of AGI—of artificial general intelligence—and will that amount to consciousness … again, it’s such a profound, deep philosophical challenge that I think it would take a lot of conceptual work to get there. So how do we define consciousness? How do we define morality? The way it stands now, I look into the future without being a technologist, without being an AI developer, and I think, maybe I hope, that the line will remain clear. That there’s something about humanity that is irreplaceable.  

But I’m also remembering that Immanuel Kant, the famous German philosopher, when he talked about what it means to be a part of the moral universe, what it means to be a moral agent, he talked about rationality and the ability to implement what he called the categorical imperative. And he said that would apply to any creature, not just humans. 

And that’s so interesting. It’s always fascinated me that so many centuries ago, he said such a progressive thing. 

LEE: That’s amazing, yeah. 

RAVITSKY: It is amazing because I often, as an ethicist, I don’t just ask myself, What makes us human? I ask myself, What makes us worthy of moral respect? What makes us holders of rights? What gives us special status in the universe that other creatures don’t have? And I know this has been challenged by people like Peter Singer who say [that] some animals should have the same respect. “And what about fetuses and what about people in a coma?” I know the landscape is very fraught.  

But the notion of what makes humans deserving of special moral treatment to me is the core question of ethics. And if we think that it’s some capacities that give us this respect, that make us hold that status, then maybe it goes beyond human. So it doesn’t mean that the machine is human, but maybe at [a] certain point, these machines will deserve a certain type of moral respect that … it’s hard for us right now to think of a machine as deserving that respect. That I can see. 

But completely collapsing the distinction between human and machine? I don’t think so, and I hope not. 

LEE: Yeah. Well, you know, in a way I think it’s easier to entertain this type of conversation post-ChatGPT. And so now, you know, what was your first personal encounter with what we now call generative AI, and what went through your mind as you had that first encounter? 

RAVITSKY: No one’s ever asked me this before, Peter. It almost feels exposing to share your first encounter. [LAUGHTER]  

So I just logged on, and I asked a simple question, but it was an ethical question. I framed an ethical dilemma because I thought, if I ask it to plan a trip, like all my friends already did, it’s less interesting to me.  

And within seconds, a pretty thoughtful, surprisingly nuanced analysis was kind of trickling down my screen, and I was shocked. I was really taken aback. I was almost sad because I think my whole life I was hoping that only humans can generate this kind of thinking using moral and ethical terms.  

And then I started tweaking my question, and I asked for specific philosophical approaches to this. And it just kept surprising me in how well it performed.  

So I literally had to catch my breath and, you know, sit down and go, OK, this is a new world, something very important and potentially scary is happening here. How is this going to impact my teaching? How is this going to impact my writing? How is this going to impact health? Like, it was really a moment of shock. 

LEE: I think the first time I had the privilege of meeting you, I heard you speak and share some of your initial framing of how, you know, how to think about the potential ethical implications of AI and the human impacts of AI in the future. Keeping in mind that people listening to this podcast will tend to be technologists and computer scientists as well as some medical educators and practicing clinicians, you know, what would you like them to know or understand most about your thoughts? 

RAVITSKY: I think from early on, Peter, I’ve been an advocate in favor of bioethics as a field positioning itself to be a facilitator of implementing AI. I think on one hand, if we remain the naysayers as we have been regarding other technologies, we will become irrelevant. Because it’s happening, it’s happening fast, we have to keep our eye on the ball, and not ask, “Should we do it?” But rather ask, “How should we do it?” 

And one of the reasons that bioethics is going to be such a critical player is that the stakes are so high. The risk of making a mistake in diagnostics is literally life and death; the risk of breaches of privacy that would lead to patients losing trust and refusing to use these tools; the risk of clinicians feeling overwhelmed and replaceable. The risks are just too high.  

And therefore, creating guardrails, creating frameworks with principles that sensitize us to the ethical aspects, that is critically important for AI and health to succeed. And I’m saying it as someone who wants it very badly to succeed. 

LEE: You are actually seeing a lot of healthcare organizations adopting and deploying AI. Has any aspect of that been surprising to you? Have you expected it to be happening faster or slower or unfolding in a different way? 

RAVITSKY: One thing that surprises me is how it seems to be isolated. Different systems, different institutions making their own, you know, decisions about what to acquire and how to implement. I’m not seeing consistency. And I’m not even seeing anybody at a higher level collecting all the information about who’s buying and implementing what under what types of principles and what are their outcomes? What are they seeing?  

It seems to be just siloed and happening everywhere. And I wish we collected all this data, even about how the decision is made at the executive level to buy a certain tool, to implement it, where, why, by whom. So that’s one thing that surprised me.  

The speed is not surprising to me because it really solves problems that healthcare systems have been struggling with. What seems to be one of the more popular uses, and again, you know this better than I do, is the help with scribes taking notes, ambient recording. This seems to be really desired because of burnout that clinicians face around this whole issue of note taking.  

And it’s also seen as a way to allow clinicians to do more human interaction, you know, … 

LEE: Right.  

RAVITSKY: … look at the patient, talk to the patient, … 

LEE: Yep. 

RAVITSKY: … listen, rather than focus on the screen. We’ve all sat across the desk with a doctor that never looks at us because they only look at the screen. So there’s a real problem here, and there’s a real solution and therefore it’s hitting the ground quickly.  

But what’s surprising to me is how many places don’t think that it’s their responsibility to inform patients that this is happening. So some places do; some places don’t. And to me, this is a fundamental ethical issue of patient autonomy and empowerment. And it’s also pragmatically the fear of a crisis of trust. People don’t like being recorded without their consent. Surprise, surprise.  

LEE: Mm-hmm. Yeah, yeah. 

RAVITSKY: People worry about such a recording of a very private conversation that they consider to be confidential, such a recording ending up in the wrong hands or being shared externally or going to a commercial entity. People care; patients care.  

So what is our ethical responsibility to tell them? And what is the institutional responsibility to implement these wonderful tools? I’m not against them, I’m totally in favor—implement these great tools in a way that respects long-standing ethical principles of informed consent, transparency, accountability for, you know, change in practice? And, you know, bottom line: patients’ right to know what’s happening in their care.  

LEE: You actually recently had a paper in a medical journal (opens in new tab) that touched on an aspect of this, which I think was not with scribes, but with notes, you know, … 

RAVITSKY: Yep. 

LEE: … that doctors would send to patients. And in fact, in previous episodes of this podcast, we actually talked to both the technology developers of that type of feature as well as doctors who were using that feature. And in fact, even in those previous conversations, there was the question, “Well, what does the patient need to know about how this note was put together?” So you and your coauthors had a very interesting recent paper about this. 

RAVITSKY: Yeah, so the trigger for the paper was that patients seemed to really like being able to send electronic messages to clinicians.  

LEE: Yes. 

RAVITSKY: We email and text all day long. Why not in health, right? People are used to communicating in that way. It’s efficient; it’s fast.   

So we asked ourselves, “Wait, what if an AI tool writes the response?” Because again, this is a huge burden on clinicians, and it’s a real issue of burnout.  

We surveyed hundreds of respondents, and basically what we discovered is that there was a statistically significant difference in their level of satisfaction between when they got an answer from a human clinician and when they got an answer, again an electronic message, from AI.  

And it turns out that they preferred the messages written by AI. They were longer, more detailed, even conveyed more empathy. You know, AI has all the time in the world [LAUGHS] to write you a text. It’s not rushing to the next one. 

But then when we disclosed who wrote the message, they were less satisfied when they were told it was AI.  

So the ethical question that that raises is the following: if your only metric is patient satisfaction, the solution is to respond using AI but not tell them that. 

Now when we compared telling them that it was AI or human or not telling them anything, their satisfaction remained high, which means that if they were not told anything, they probably assumed that it was a human clinician writing, because their satisfaction for human clinician or no disclosure was the same. 

So basically, if we say nothing and just send back an AI-generated response, they will be more satisfied because the response is nicer, but they won’t be put off by the fact that it was written by AI. And therefore, hey presto, optimal satisfaction. But we challenge that, and we say, it’s not just about satisfaction. 

It’s about long-term trust. It’s about your right to know. It’s about empowering you to make decisions about how you want to communicate.  

So we push back against this notion that we’re just there to optimize patient satisfaction, and we bring in broader ethical considerations that say, “No, patients need to know.” If it’s not the norm yet to get your message from AI, … 

LEE: Yeah. 

RAVITSKY: … they should know that this is happening. And I think, Peter, that maybe we’re in a transition period. 

It could be that in two years, maybe less than that, most of our communication will come back from AI, and we will just take it for granted …  

LEE: Right. 

RAVITSKY: … that that’s the case. And at that point, maybe disclosure is not necessary because many, many surveys will show us that patients assume that, and therefore they are informed. But at this point in time, when it’s transition and it’s not the norm yet, I firmly think that ethics requires that we inform patients. 

LEE: Let me push on this a little bit because I think this final point that you just made is, I think is so interesting. Does it matter what kind of information is coming from a human or AI? Is there a time when patients will have different expectations for different types of information from their doctors? 

RAVITSKY: I think, Peter, that you’re asking the right question because it’s more nuanced. And these are the kinds of empirical questions that we will be exploring in the coming months and years. Our recent paper showed that there was no difference regarding the content. If the message was about what we call the “serious” matter or a less “serious” matter, the preferences were the same. But we didn’t go deep enough into that. That would require a different type of design of study. And you just said, you know, there are different types of information. We need to categorize them.  

LEE: Yeah.  

RAVITSKY: What types of information and what degree of impact on your life? Is it a life-and-death piece of information? Is it a quality-of-life piece of information? How central is it to your care and to your thinking? So all of that would have to be mapped out so that we can design these studies.  

But you know, you pushed back in that way, and I want to push back in a different direction that to me is more fundamental and philosophical. How much do we know now? You know, I keep saying, oh, patients deserve a chance for informed consent, … 

LEE: Right.  

RAVITSKY: … and they need to be empowered to make decisions. And if they don’t want that tool used in their care, then they should be able to say, “No.” Really? Is that the world we live in now? [LAUGHTER] Do I have access to the black box that is my doctor’s brain? Do I know how they performed on this procedure in the last year? 

Do I know whether they’re tired? Do I know if they’re up to speed on the literature with this? We already deal with black boxes, except they’re not AI. And I think the evidence is emerging that AI outperforms humans in so many of these tasks.  

So my pushback is, are we seeing AI exceptionalism in the sense that if it’s AI, Huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool, versus, you know, the rate of human error in medicine is awful. People don’t know the numbers. The annual number of deaths attributed to medical error is outrageous.  

So why are we so focused on informed consent and empowerment regarding the implementation of AI and less in other contexts? Is it just because it’s new? Is it because it is some sort of existential threat, … 

LEE: Yep, yeah. 

RAVITSKY: … not just a matter of risk? I don’t know the answer, but I don’t want us to suffer from AI exceptionalism, and I don’t want us to hold AI to such a high standard that we won’t be able to benefit from it. Whereas, again, we’re dealing with black boxes already in medicine. 

LEE: Just to stay on this topic, though, one more question, which is, maybe, almost silly in how hypothetical it is. If instead of email, it were a Teams call or a Zoom call, doctor-patient, except that the doctor is not the real doctor, but it’s a perfect replica of the doctor designed to basically fool the patient that this is the real human being and having that interaction. Does that change the bioethical considerations at all? 

RAVITSKY: I think it does because it’s always a question of, are we really misleading? Now if you get a text message in an environment that, you know, people know AI is already integrated to some extent, maybe not your grandmother, but the younger generation is aware of this implementation, then maybe you can say, “Hmm. It was implied. I didn’t mean to mislead the patient.” 

If the patient thinks they’re talking to a clinician, and they’re seeing, like—what if it’s not you now, Peter? What if I’m talking to an avatar [LAUGHS] or some representation of you? Would I feel that I was misled in recording this podcast? Yeah, I would. Because you really gave me good reasons to assume that it was you.  

So it’s something deeper about trust, I think. And it touches on the notion of empathy. A lot of literature is being developed now on the issue of: what will remain the purview of the human clinician? What are humans good for [LAUGHS] when AI is so successful and especially in medicine?  

And if we see that the text messages are read as conveying more empathy and more care and more attention, and if we then move to a visual representation, facial expressions that convey more empathy, we really need to take a hard look at what we mean by care. What about then the robots, right, that we can touch, that can hug us?  

I think we’re really pushing the frontier of what we mean by human interaction, human connectedness, care, and empathy. This will be a lot of material for philosophers to ask themselves the fundamental question you asked me at first: what does it mean to be human?  

But this time, what does it mean to be two humans together and to have a connection?  

And if we can really be replaced in the sense that patients will feel more satisfied, more heard, more cared for, do we have ethical grounds for resisting that? And if so, why?  

You’re really going deep here into the conceptual questions, but bioethics is already looking at that. 

LEE: Vardit, it’s just always amazing to talk to you. The incredible span of what you think about from those fundamental philosophical questions all the way to the actual nitty gritty, like, you know, what parts of an email from a doctor to a patient should be marked as AI. I think that span is just incredible and incredibly needed and useful today. So thank you for all that you do. 

RAVITSKY: Thank you so much for inviting me. 

[TRANSITION MUSIC] 

The field of bioethics, and this is my take, is all about the adoption of disruptive new technologies into biomedical research and healthcare. And Vardit is able to explain this with such clarity. I think one of the reasons that AI has been challenging for many people is that its use spans the gamut from the nuts and bolts of how and when to disclose to patients that AI is being used to craft an email, all the way to, what does it mean to be a human being caring for another human?  

What I learned from the conversation with Vardit is that bioethicists are confronting head-on the issue of AI in medicine and not with an eye towards restricting it, but recognizing that the technology is real, it’s arrived, and needs to be harnessed now for maximum benefit.  

And so now, here’s my conversation with Dr. Roxana Daneshjou: 

LEE: Roxana, I’m just thrilled to have this chance to chat with you. 

ROXANA DANESHJOU: Thank you so much for having me on today. I’m looking forward to our conversation. 

LEE: In Microsoft Research, of course, you know, we think about healthcare and biomedicine a lot, but I think there’s less of an understanding from our audience what people actually do in their day-to-day work. And of course, you have such a broad background, both on the science side with a PhD and on the medical side. So what’s your typical work week like? 

DANESHJOU: I spend basically 90% of my time working on running my research lab (opens in new tab) and doing research on how AI interacts with medicine, how we can implement it to fix the pain points in medicine, and how we can do that in a fair and responsible way. And 10% of my time, I am in clinic. So I am a practicing dermatologist at Stanford, and I see patients half a day a week. 

LEE: And your background, it’s very interesting. There have always been these MD-PhDs in the world, but somehow, especially right now with what’s happening in AI, people like you have suddenly become extremely important because it has become so important to be able to combine these two things. Did you have any inkling about that when you started, let’s say, on your PhD work? 

DANESHJOU: So I would say that during my—[LAUGHS] because I was in training for so long—during … my PhD was in computational genomics, and I still have a significant interest in precision medicine, and I think AI is going to be central to that.  

But I think the reason I became interested in AI initially is that I was thinking about how we associate genetic factors with patient phenotypes. Patient phenotypes being, How does the disease present? What does the disease look like? And I thought, you know, AI might be a good way to standardize phenotypes from images of, say, skin disease, because I was interested in dermatology at that time. And, you know, the part about phenotyping disease was a huge bottleneck because you would have humans sort of doing the phenotyping.  

And so in my head, when I was getting into the space, I was thinking I’ll bring together, you know, computer vision and genetics to try to, you know, make new discoveries about how genetics impacts human disease. And then when I actually started my postdoc to learn computer vision, I went down this very huge rabbit hole, which I am still, I guess, falling down, where I realized, you know, about biases in computer vision and how much work needed to be done for generalizability. 

And then after that, large language models came out, and, like, everybody else became incredibly interested in how this could help in healthcare and now also vision language models and multimodal models. So, you know, we’re just tumbling down the rabbit hole.  

LEE: Indeed, I think you really made a name for yourself by looking at the issues of biases, for example, in training datasets. And that was well before large language models were a thing. Maybe our audience would like to hear a little bit more about that earlier work.  

DANESHJOU: So as I mentioned, my PhD was in computational genetics. In genetics, what has happened during the genetic revolution is these large-scale studies to discover how genetic variation impacts human disease and human response to medication, so that’s what pharmacogenomics is: human response to medications. And as I got, you know, entrenched in that world, I came to realize that I wasn’t really represented in the data. And it was because the majority of these genetic studies were on European ancestry individuals. You weren’t represented either. 

LEE: Right, yeah. 

DANESHJOU: Many diverse global populations were completely excluded from these studies, and genetic variation is quite diverse across the globe. And so you’re leaving out a large portion of genetic variation from these research studies. Now things have improved. It still needs work in genetics. But definitely there has been many amazing researchers, you know, sounding the alarm in that space. And so during my PhD, I actually focused on doing studies of pharmacogenomics in non-European ancestry populations. So when I came to computer vision and in particular dermatology, where there were a lot of papers being published about how AI models perform at diagnosing skin disease and several papers essentially saying, oh, it’s equivalent to a dermatologist—of course, that’s not completely true because it’s a very sort of contrived, you know, setting of diagnosis— … 

LEE: Right, right. 

DANESHJOU: but my first inkling was, well, are these models going to be able to perform well across skin tones? And one of our, you know, landmark papers, which was in Science Advances (opens in new tab), showed … we created a diverse dataset, our own diverse benchmark of skin disease, and showed that the models performed significantly worse on brown and black skin. And I think the key here is we also showed that it was an addressable problem because when we fine-tuned on diverse skin tones, you could make that bias go away. So it was really, in this case, about what data was going into the training of these computer vision models. 
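
For technically minded listeners, here is a minimal sketch of the kind of mitigation Roxana describes: fine-tuning a pretrained image classifier on a more diverse dermatology dataset and then checking performance by skin-tone group. It is not the actual pipeline from the Science Advances paper; the dataset folder, backbone, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Hypothetical sketch only: the data path, model choice, and settings are
# placeholders, not the setup used in the study discussed above.
device = "cuda" if torch.cuda.is_available() else "cpu"

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Images organized as data/diverse_train/<diagnosis>/*.jpg, assumed to span
# a wide range of skin tones.
train_set = datasets.ImageFolder("data/diverse_train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Start from a generic pretrained backbone and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a few passes over the diverse data, for illustration
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# In practice, accuracy would then be reported separately per skin-tone group
# (for example, Fitzpatrick types) to check whether the performance gap narrowed.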

LEE: Yeah, and I think if you’re listening to this conversation, if you haven’t read that paper, I think it’s really must reading. Roxana, it wasn’t only a landmark scientifically and medically, but it also sort of crossed the chasm and really became a subject of public discourse and debate, as well. And I think you really changed the public discourse around AI.  

So now I want to get into generative AI. I always like to ask, what was your first encounter with generative AI personally? And what went through your head? You know, what was that experience like for you? 

DANESHJOU: Yeah, I mean, I actually tell this story a lot because I think it’s a fun story. So I would say that I had played with, you know, GPT-3 prior and wasn’t particularly, you know, impressed … 

LEE: Yeah. 

DANESHJOU: … by how it was doing. And I was at NeurIPS [Conference on Neural Information Processing Systems] in New Orleans, and I was … we were walking back from a dinner. I was with Andrew Beam from Harvard. I was with his group. 

And we were just, sort of, you know, walking along, enjoying the sites of New Orleans, chatting. And one of his students said, “Hey, OpenAI just released this thing called ChatGPT.”  

LEE: So this would be New Orleans in December … 

DANESHJOU: 2022.  

LEE: 2022, right? Yes. Uh-huh. OK. 

DANESHJOU: So I went back to my hotel room. I was very tired. But I, you know, went to the website to see, OK, like, what is this thing? And I started to ask it medical questions, and I started all of a sudden thinking, “Uh-oh, we have made … we have made a leap here; something has changed.”  

LEE: So it must have been very intense for you from then because months later, you had another incredibly impactful, or landmark, paper basically looking at biases, race-based medicine in large language models (opens in new tab). So can you say more about that? 

DANESHJOU: Yes. I work with a very diverse team, and we have thought about bias in medicine, not just with technology but also with humans. Humans have biases, too. And there’s this whole debate around, is the technology going to be more biased than the humans? How do we do that? But at the same time, like, the technology actually encodes the biases of humans.  

And there was a paper in the Proceedings of the National Academy of Sciences (opens in new tab), which did not look at technology at all but actually looked at race-based beliefs held by medical trainees, beliefs that were untrue and harmful in that they perpetuated racist ideas in medicine.  

And so we thought, if medical trainees and humans have these biases, why don’t we see if the models carry them forward? And we added a few more questions that we, sort of, brainstormed as a team, and we started asking the models those questions. And … 

LEE: And by this time, it was GPT-4?  

DANESHJOU: We did include GPT-4 because GPT-4 came out, as well. And we also included other models, such as Claude, because we wanted to look across the board. And what we found is that all of the models had instances of perpetuating race-based medicine. And actually, the GPT models had, I think, one of the most egregious responses—and, again, this is 3.5 and 4; we haven’t, you know, fully checked to see what things look like, because there have been newer models—in that they said that we should use race in calculating kidney function because there were differences in muscle mass between the races. And this is sort of a racist trope in medicine that is not true because race is not based on biology; it’s a social construct.   

So, yeah, that was that study. And that one did spur a lot of public conversation. 

LEE: Your work there even had the issue of bias overtake hallucination, you know, as really the most central and most difficult issue. So how do you think about bias in LLMs, and does that in your mind disqualify the use of large language models from particular uses in medicine?  

DANESHJOU: Yeah, I think that the hallucinations are an issue, too. And in some senses, they might even go with one another, right. Like, if it’s hallucinating information that’s not true but also, like, biased.  

So I think these are issues that we have to address with the use of LLMs in healthcare. But at the same time, things are moving very fast in this space. I mean, we have a secure instance of several large language models within our healthcare system at Stanford so that you could actually put secure patient information in there.  

So while I acknowledge that bias and hallucinations are a huge issue, I also acknowledge that the healthcare system is quite broken and needs to be improved, needs to be streamlined. Physicians are burned out; patients are not getting access to care in the appropriate ways. And I have a really great story about that, which I can share with you later.  

So in 2024, we did a study asking dermatologists, are they using large language models (opens in new tab) in their clinical practice? And I think this percentage has probably gone up since then: 65% of dermatologists reported using large language models in their practices on tasks such as, you know, writing insurance authorization letters, you know, writing information sheets for the patient, even, sort of, using them to educate themselves, which makes me a little nervous because in my mind, the best use of large language models right now is in cases where you can verify facts easily.  

So, for example, I did show and teach my nurse how to use our secure large language model in our healthcare system to write rebuttal letters to the insurance. I told her, “Hey, you put in these bullet points that you want to make, and you ask it to write the letter, and you can verify that the letter contains the facts which you want and which are true.” 

LEE: Yes. 
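
As a concrete illustration of the workflow Roxana describes, bullet points in, a draft letter out, and a human verifying every fact before anything is sent, here is a rough sketch against the OpenAI chat API. The model name, prompt wording, and example bullet points are assumptions for illustration; a clinic would use its own secure, compliant deployment rather than a public endpoint.

from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; illustrative only

def draft_rebuttal_letter(bullet_points: list[str]) -> str:
    """Turn clinician-supplied bullet points into a draft insurance rebuttal letter."""
    prompt = (
        "Write a concise, professional letter appealing an insurance denial. "
        "Use only the facts listed below; do not add clinical details that are not listed.\n\n"
        + "\n".join(f"- {point}" for point in bullet_points)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical example; the clinician verifies every stated fact before sending.
letter = draft_rebuttal_letter([
    "Patient has severe plaque psoriasis covering more than 10% of body surface area",
    "Six months of topical therapy and phototherapy did not control the disease",
    "Requesting coverage for biologic therapy",
])
print(letter)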

DANESHJOU: And we have also done a lot of work to try to stress test models because we want them to be better. And so we held this red-teaming event at Stanford where we brought together 80 physicians, computer scientists, and engineers and had them write scenarios and real questions that they might ask on a day-to-day basis or tasks that they might actually ask AI to do. 

And then we had them grade the performance. And we did this with the GPT models. At the time, we were doing it with GPT-3.5, 4, and 4 with internet. But before the paper (opens in new tab) came out, we then ran the dataset on newer models.  

And we made the dataset public (opens in new tab) because I’m a big believer in public data. So we made the dataset public so that others could use this dataset, and we labeled what the issues were in the responses, whether it was bias, hallucination, like, a privacy issue, those sort of things. 
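
To show what working with a labeled red-teaming dataset like this could look like, here is a small sketch that tallies how often graders flagged each issue type. The file name and column names are assumptions for illustration, not the schema of the released dataset.

import csv
from collections import Counter

ISSUE_TYPES = ("bias", "hallucination", "privacy")

def summarize_issue_rates(path: str) -> None:
    """Count how often graders flagged each issue type in model responses."""
    counts = Counter()
    total = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            for issue in ISSUE_TYPES:
                if row.get(issue, "").strip().lower() in {"1", "true", "yes"}:
                    counts[issue] += 1
    for issue in ISSUE_TYPES:
        rate = counts[issue] / total if total else 0.0
        print(f"{issue}: {counts[issue]}/{total} responses flagged ({rate:.1%})")

summarize_issue_rates("redteam_labels.csv")  # hypothetical export of graded responses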

LEE: If I think about the hits or misses in our book, you know, we actually wrote a little bit, not too much, about noticing biases. I think we underestimated the magnitude of the issue in our book. And another element that we wrote about in our book is that we noticed that the language model, if presented with some biased decision-making, more often than not was able to spot that the decision-making was possibly being influenced by some bias. What do you make of that?  

DANESHJOU: So funny enough, I think we had … we had a—before I moved from Twitter to Bluesky—but we had a little back and forth on Twitter about this, which actually inspired us to look into this as a research question, and we have a preprint up on it of actually using other large language models to identify bias and then to write critiques that the original model can incorporate to improve its answer. 

I mean, we’re moving towards this sort of agentic systems framework rather than a singular large language model, and people, of course, are talking about also retrieval-augmented generation, where you sort of have this corpus of, you know, text that you trust and find trustworthy and have that incorporated into the response of the model.  

And so you could build systems essentially where you do have other models saying, “Hey, specifically look for bias.” And then it will sort of focus on that task. And you can even, you know, give it examples of what bias is with in-context learning now. So I do think that we are going to be improving in this space. And actually, my team is … most recently, we’re working on building patient-facing chatbots. That’s where my, like, patient story comes in. But we’re building patient-facing chatbots. And we’re using, you know, we’re using prompt-engineering tools. We’re using automated eval tools. We’re building all of these things to try to make it more accurate and less biased. So it’s not just like one LLM spitting out an answer. It’s a whole system. 
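
Here is a rough sketch of the critique-and-revise pattern Roxana mentions, where a second model pass looks specifically for bias and the draft answer is revised with that critique. It is a simplified illustration, not her team’s system; the model name and prompts are placeholders.

from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; illustrative only

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_with_bias_check(question: str) -> str:
    draft = ask(f"Answer this clinical question:\n{question}")
    critique = ask(
        "Review the answer below for race-based medicine, stereotypes, or other bias. "
        "List specific problems, or reply 'none found'.\n\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    if "none found" in critique.lower():
        return draft
    # Feed the critique back so the original draft can be revised.
    return ask(
        "Revise the answer to address the critique while staying medically accurate.\n\n"
        f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}"
    )

print(answer_with_bias_check("How should kidney function (eGFR) be estimated?"))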

LEE: All right. So let’s get to your patient-facing story.  

DANESHJOU: Oh, of course. Over the summer, my 6-year-old fell off the monkey bars and broke her arm. And I picked her up from school. She’s crying so badly. And I just look at her, and I know that we’re in trouble. 

And I said, OK, you know, we’re going straight to the emergency room. And we went straight to the emergency room. She’s crying the whole time. I’m almost crying because it’s just, like, she doesn’t even want to go into the hospital. And so then my husband shows up, and we also had the baby, and the baby wasn’t allowed in the emergency room, so I had to step out. 

And thanks to the [21st Century] Cures Act (opens in new tab), I’m getting, like, all the information, you know, as it’s happening. Like, I’m getting the x-ray results, and I’m looking at it. And I can tell there’s a fracture, but I can’t, you know, tell, like, how bad it is. Like, is this something that’s going to need surgery?  

And I’m desperately texting, like, all the orthopedic folks I know, the pediatricians I know. [LAUGHTER] “Hey, what does this mean?” Like, getting real-time information. And later in the process, there was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake.  

I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong. So again, I—as a physician with all these resources—have difficulty kind of navigating the healthcare system; understanding what’s going on in x-ray results that are showing up on my phone; can personally identify medication dose mistakes, but you know, most people probably couldn’t. And it could be very … I actually, you know, emailed the team and let them know, to give feedback.  

So we have a healthcare system that is broken in so many ways, and it’s so difficult to navigate. So I get it. And so that’s been, you know, a big impetus for me to work in this space and try to make things better. 

LEE: That’s an incredible story. It’s also validating because, you know, one of the examples in our book was the use of an LLM to spot a medication error that a doctor or a nurse might make. You know, interestingly, we’re finding no sort of formalized use of AI right now in the field. But anecdotes like this are everywhere. So it’s very interesting.  

All right. So we’re starting to run short on time. So I want to ask you a few quick questions and a couple that might be a little provocative.  

DANESHJOU: Oh boy. [LAUGHTER] Well, I don’t run away … I don’t run away from controversy. 

LEE: So, of course, with that story you just told, I can see that you use AI yourself. When you are actually in clinic, when you are being a dermatologist … 

DANESHJOU: Yeah.  

LEE: … and seeing patients, are you using generative AI? 

DANESHJOU: So I do not use it in clinic except for the situation of the insurance authorization letters. I was even offered, you know, sort of an AI-based scribe, which many people are using. There have been some studies that show that they can make mistakes. I have a human scribe. To me, writing the notes is actually part of the thinking process. So when I write my notes at the end of the day, there have been times that I’ve all of a sudden had an epiphany, particularly on a complex case. But I have used it to write, you know, sort of these insurance authorization letters. I’ve also used it in grant writing. So as a scientist, I have used it a lot more.  

LEE: Right. So I don’t know of anyone who has a more nuanced and deeper understanding of the issues of biases in AI in medicine than you. Do you think [these] biases can be repaired in AI, and if not, what are the implications? 

DANESHJOU: So I think there are several things here, and I just want to be thoughtful about it. One, I think, the bias in the technology comes from bias in the system and bias in medicine, which very much exists and is incredibly problematic. And so I always tell people, like, it doesn’t matter if you have the most perfect, fair AI. If you have a biased human and you add those together, because you’re going to have this human-AI interaction, you’re still going to have a problem.  

And there is a paper that I’m on with Dr. Matt Groh (opens in new tab), which looked at dermatology diagnosis across skin tones and then with, like, AI assistance. And we found there’s a bias gap, you know, even with physicians. So it’s not just an AI problem; humans have the problem, too. And… 

LEE: Hmm. Yup. 

DANESHJOU: … we also looked at when you have the human-AI system, how that impacts the gap because you want to see the gap close. And it was kind of a mixed result in the sense that there were actually situations where, like, the accuracy increased in both groups, but the gap actually also increased because they were actually, even though they knew it was a fair AI, for some reason, they were relying upon the AI more often when … or they were trusting it more often on diagnoses on white skin—maybe they’d read my papers, who knows? [LAUGHTER]—even though we had told them, you know, it was a fair model.  

So I think for me, the important thing is understanding how the AI model works with the physician at the task. And what I would like to see is it improve the overall bias and disparities with that unit.  

And at the same time, I tell human physicians, we have to work on ourselves. We have to work on our system, you know, our medical system that has systemic issues of access to care or how patients get treated based on what they might look like or other parts of their background.  

LEE: All right, final question. So we started off with your stories about imaging in dermatology. And of course, Geoff Hinton, Turing winner and one of the grandfathers of the AI revolution, famously had predicted many years ago that by 2018 or something like that, we wouldn’t need human radiologists because of AI. 

That hasn’t come to pass, but since you work in a field that also is very dependent on imaging technologies, do you see a future when radiologists or, for that matter, dermatologists might be largely replaced by machines? 

DANESHJOU: I think that’s a complex question. Let’s say you have the most perfect AI systems. I think there’s still a lot of nuance in how these, you know, things get done. I’m not a radiologist, so I don’t want to speak to what happens in radiology. But in dermatology, it ends up being quite complex, the process. 

LEE: Yeah. 

DANESHJOU: You know, I don’t just look at lesions and make diagnoses. Like, I do skin exams to first identify the lesions of concern. So maybe if we had total-body photography that could help, like, catch which lesions would be of concern, which people have worked on, that would be step, sort of, step one.  

And then the second thing is, you know, it’s … I have to do the biopsy. So, you know, the robot’s not going to be doing the biopsy. [LAUGHTER]  

And then the pathology for skin cancer is sometimes very clear, but there’s also, like, intermediate-type lesions where we have to make a decision bringing all that information together. For rashes, it can be quite complex. And then we have to kind of think about what other tests we’re going to order, what therapeutics we might try first, that sort of stuff.  

So, you know, there is a thought that you might have AI that could reason through all of those steps maybe, but I just don’t feel like we’re anywhere close to that at all. I think the other thing is AI does a lot better on sort of, you know, tasks that are well defined. And a lot of things in medicine, like, it would be hard to train the model on because it’s not well defined. Even human physicians would disagree on the next best step.  

LEE: Well, Roxana, for whatever it’s worth, I can’t even begin to imagine anything replacing you. I think your work has been just so—I think you used the word, and I agree with it—landmark, and multiple times. So thank you for all that you’re doing and thank you so much for joining this podcast.  

DANESHJOU: Thanks for having me. This was very fun. 

[TRANSITION MUSIC] 

The issue of bias in AI has been the subject of truly landmark work by Roxana and her collaborators. And this includes biases in large language models.  

This was something that in our writing of the book, Carey, Zak, and I recognized and wrote about. But in fairness, I don’t think Carey, Zak, or I really understood the full implications of it. And this is where Roxana’s work has been so illuminating and important. 

Roxana’s practical prescriptions around red teaming have proven to be important in practice, and equally important were Roxana’s insights into how AI might always be guilty of the same biases, not only those of individuals but also those of whole healthcare organizations. But at the same time, AI might also be a potentially powerful tool to detect and help mitigate against such biases. 

When I think about the book that Carey, Zak, and I wrote, I think when we talked about laws, norms, ethics, and regulations, it’s the place that we struggled the most. And in fact, we actually relied on a conversation with GPT-4 in order to tease out some of the core issues. Well, moving on from that conversation with an AI to a conversation with three deep experts who have dedicated their careers to making sure that we can harness all of the goodness while mitigating against the risks of AI, it’s been fulfilling, very interesting, and a great learning experience. 

[THEME MUSIC]   

I’d like to say thank you again to Laura, Vardit, and Roxana for sharing their stories and insights. And to our listeners, thank you for joining us. We have some really great conversations planned for the coming episodes, including an examination of the broader economic impact of AI in health and a discussion on AI drug discovery. We hope you’ll continue to tune in.  

Until next time. 

[MUSIC FADES] 

