Microsoft AI – Page 2

What AI’s impact on individuals means for the health workforce and industry

May 29, 2025

by Peter Lee, Ethan Mollick, Azeem Azhar Microsoft AI

Illustrated headshots of Azeem Azhar, Peter Lee, and Ethan Mollick.

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

In this episode, Ethan Mollick (opens in new tab) and Azeem Azhar (opens in new tab), thought leaders at the forefront of AI’s impact on work, education, and society, join Lee to discuss how generative AI is reshaping healthcare and organizational systems. Mollick, professor at the Wharton School, discusses the conflicting emotions that come with navigating AI’s effect on the tasks we enjoy and those we don’t; the systemic challenges in AI adoption; and the need for organizations to actively experiment with AI rather than wait for top-down solutions. Azhar, a technology analyst and writer who explores the intersection of AI, economics, and society, explores how generative AI is transforming healthcare through applications like medical scribing, clinician support, and consumer health monitoring.

Learn more:

Co-Intelligence: Living and Working with AI (opens in new tab) (Mollick)
Book | April 2024
One Useful Thing (opens in new tab) (Mollick)
Substack blog/newsletter
The Exponential Age: How Accelerating Technology is Transforming Business, Politics and Society (opens in new tab) (Azhar)
Book | September 2021
Exponential View (opens in new tab) (Azhar)
Substack blog/newsletter
The AI Revolution in Medicine: GPT-4 and Beyond  
Book | Peter Lee, Carey Goldberg, Isaac Kohane | April 2023

Transcript

[MUSIC]  

[BOOK PASSAGE] 

PETER LEE: “In American primary care, the missing workforce is stunning in magnitude, the shortfall estimated to reach up to 48,000 doctors within the next dozen years. China and other countries with aging populations can expect drastic shortfalls, as well. Just last month, I asked a respected colleague retiring from primary care who he would recommend as a replacement; he told me bluntly that, other than expensive concierge care practices, he could not think of anyone, even for himself. This mismatch between need and supply will only grow, and the US is far from alone among developed countries in facing it.”

[END OF BOOK PASSAGE]  

[THEME MUSIC]  

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.  

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?   

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.    

[THEME MUSIC FADES]

The book passage I read at the top is from “Chapter 4: Trust but Verify,” which was written by Zak.

You know, it’s no secret that in the US and elsewhere shortages in medical staff and the rise of clinician burnout are affecting the quality of patient care for the worse. In our book, we predicted that generative AI would be something that might help address these issues.

So in this episode, we’ll delve into how individual performance gains that our previous guests have described might affect the healthcare workforce as a whole, and on the patient side, we’ll look into the influence of generative AI on the consumerization of healthcare. Now, since all of this consumes such a huge fraction of the overall economy, we’ll also get into what a general-purpose technology as disruptive as generative AI might mean in the context of labor markets and beyond.

To help us do that, I’m pleased to welcome Ethan Mollick and Azeem Azhar.

Ethan Mollick is the Ralph J. Roberts Distinguished Faculty Scholar, a Rowan Fellow, and an associate professor at the Wharton School of the University of Pennsylvania. His research into the effects of AI on work, entrepreneurship, and education is applied by organizations around the world, leading him to be named one of Time magazine’s most influential people in AI for 2024. He’s also the author of the New York Times best-selling book Co-Intelligence.

Azeem Azhar is an author, founder, investor, and one of the most thoughtful and influential voices on the interplay between disruptive emerging technologies and business and society. In his best-selling book, The Exponential Age, and in his highly regarded newsletter and podcast, Exponential View, he explores how technologies like AI are reshaping everything from healthcare to geopolitics.

Ethan and Azeem are two leading thinkers on the ways that disruptive technologies—and especially AI—affect our work, our jobs, our business enterprises, and whole industries. As economists, they are trying to work out whether we are in the midst of an economic revolution as profound as the shift from an agrarian to an industrial society.

[TRANSITION MUSIC]

Here is my interview with Ethan Mollick:

LEE: Ethan, welcome.

ETHAN MOLLICK: So happy to be here, thank you.

LEE: I described you as a professor at Wharton, which I think most of the people who listen to this podcast series know of as an elite business school. So it might surprise some people that you study AI. And beyond that, you know, that I would seek you out to talk about AI in medicine. [LAUGHTER] So to get started, how and why did it happen that you’ve become one of the leading experts on AI?

MOLLICK: It’s actually an interesting story. I’ve been AI-adjacent my whole career. When I was [getting] my PhD at MIT, I worked with Marvin Minsky (opens in new tab) and the MIT [Massachusetts Institute of Technology] Media Labs AI group. But I was never the technical AI guy. I was the person who was trying to explain AI to everybody else who didn’t understand it.

And then I became very interested in, how do you train and teach? And AI was always a part of that. I was building games for teaching, teaching tools that were used in hospitals and elsewhere, simulations. So when LLMs burst into the scene, I had already been using them and had a good sense of what they could do. And between that and, kind of, being practically oriented and getting some of the first research projects underway, especially under education and AI and performance, I became sort of a go-to person in the field.

And once you’re in a field where nobody knows what’s going on and we’re all making it up as we go along—I thought it’s funny that you led with the idea that you have a couple of months head start for GPT-4, right. Like that’s all we have at this point, is a few months’ head start. [LAUGHTER] So being a few months ahead is good enough to be an expert at this point. Whether it should be or not is a different question.

LEE: Well, if I understand correctly, leading AI companies like OpenAI, Anthropic, and others have now sought you out as someone who should get early access to really start to do early assessments and gauge early reactions. How has that been?

MOLLICK: So, I mean, I think the bigger picture is less about me than about two things that tells us about the state of AI right now.

One, nobody really knows what’s going on, right. So in a lot of ways, if it wasn’t for your work, Peter, like, I don’t think people would be thinking about medicine as much because these systems weren’t built for medicine. They weren’t built to change education. They weren’t built to write memos. They, like, they weren’t built to do any of these things. They weren’t really built to do anything in particular. It turns out they’re just good at many things.

And to the extent that the labs work on them, they care about their coding ability above everything else and maybe math and science secondarily. They don’t think about the fact that it expresses high empathy. They don’t think about its accuracy and diagnosis or where it’s inaccurate. They don’t think about how it’s changing education forever.

So one part of this is the fact that they go to my Twitter feed or ask me for advice is an indicator of where they are, too, which is they’re not thinking about this. And the fact that a few months’ head start continues to give you a lead tells you that we are at the very cutting edge. These labs aren’t sitting on projects for two years and then releasing them. Months after a project is complete or sooner, it’s out the door. Like, there’s very little delay. So we’re kind of all in the same boat here, which is a very unusual space for a new technology.

LEE: And I, you know, explained that you’re at Wharton. Are you an odd fit as a faculty member at Wharton, or is this a trend now even in business schools that AI experts are becoming key members of the faculty?

MOLLICK: I mean, it’s a little of both, right. It’s faculty, so everybody does everything. I’m a professor of innovation-entrepreneurship. I’ve launched startups before and working on that and education means I think about, how do organizations redesign themselves? How do they take advantage of these kinds of problems? So medicine’s always been very central to that, right. A lot of people in my MBA class have been MDs either switching, you know, careers or else looking to advance from being sort of individual contributors to running teams. So I don’t think that’s that bad a fit. But I also think this is general-purpose technology; it’s going to touch everything. The focus on this is medicine, but Microsoft does far more than medicine, right. It’s … there’s transformation happening in literally every field, in every country. This is a widespread effect.

So I don’t think we should be surprised that business schools matter on this because we care about management. There’s a long tradition of management and medicine going together. There’s actually a great academic paper that shows that teaching hospitals that also have MBA programs associated with them have higher management scores and perform better (opens in new tab). So I think that these are not as foreign concepts, especially as medicine continues to get more complicated.

LEE: Yeah. Well, in fact, I want to dive a little deeper on these issues of management, of entrepreneurship, um, education. But before doing that, if I could just stay focused on you. There is always something interesting to hear from people about their first encounters with AI. And throughout this entire series, I’ve been doing that both pre-generative AI and post-generative AI. So you, sort of, hinted at the pre-generative AI. You were in Minsky’s lab. Can you say a little bit more about that early encounter? And then tell us about your first encounters with generative AI.

MOLLICK: Yeah. Those are great questions. So first of all, when I was at the media lab, that was pre-the current boom in sort of, you know, even in the old-school machine learning kind of space. So there was a lot of potential directions to head in. While I was there, there were projects underway, for example, to record every interaction small children had. One of the professors was recording everything their baby interacted with in the hope that maybe that would give them a hint about how to build an AI system.

There was a bunch of projects underway that were about labeling every concept and how they relate to other concepts. So, like, it was very much Wild West of, like, how do we make an AI work—which has been this repeated problem in AI, which is, what is this thing?

The fact that it was just like brute force over the corpus of all human knowledge turns out to be a little bit of like a, you know, it’s a miracle and a little bit of a disappointment in some ways [LAUGHTER] compared to how elaborate some of this was. So, you know, I think that, that was sort of my first encounters in sort of the intellectual way.

The generative AI encounters actually started with the original, sort of, GPT-3, or, you know, earlier versions. And it was actually game-based. So I played games like AI Dungeon. And as an educator, I realized, oh my gosh, this stuff could write essays at a fourth-grade level. That’s really going to change the way, like, middle school works, was my thinking at the time. And I was posting about that back in, you know, 2021 that this is a big deal. But I think everybody was taken surprise, including the AI companies themselves, by, you know, ChatGPT, by GPT-3.5. The difference in degree turned out to be a difference in kind.

LEE: Yeah, you know, if I think back, even with GPT-3, and certainly this was the case with GPT-2, it was, at least, you know, from where I was sitting, it was hard to get people to really take this seriously and pay attention.

MOLLICK: Yes.

LEE: You know, it’s remarkable. Within Microsoft, I think a turning point was the use of GPT-3 to do code completions. And that was actually productized as GitHub Copilot (opens in new tab), the very first version. That, I think, is where there was widespread belief. But, you know, in a way, I think there is, even for me early on, a sense of denial and skepticism. Did you have those initially at any point?

MOLLICK: Yeah, I mean, it still happens today, right. Like, this is a weird technology. You know, the original denial and skepticism was, I couldn’t see where this was going. It didn’t seem like a miracle because, you know, of course computers can complete code for you. Like, what else are they supposed to do? Of course, computers can give you answers to questions and write fun things. So there’s difference of moving into a world of generative AI. I think a lot of people just thought that’s what computers could do. So it made the conversations a little weird. But even today, faced with these, you know, with very strong reasoner models that operate at the level of PhD students, I think a lot of people have issues with it, right.

I mean, first of all, they seem intuitive to use, but they’re not always intuitive to use because the first use case that everyone puts AI to, it fails at because they use it like Google or some other use case. And then it’s genuinely upsetting in a lot of ways. I think, you know, I write in my book about the idea of three sleepless nights. That hasn’t changed. Like, you have to have an intellectual crisis to some extent, you know, and I think people do a lot to avoid having that existential angst of like, “Oh my god, what does it mean that a machine could think—apparently think—like a person?”

So, I mean, I see resistance now. I saw resistance then. And then on top of all of that, there’s the fact that the curve of the technology is quite great. I mean, the price of GPT-4 level intelligence from, you know, when it was released has dropped 99.97% at this point, right.

LEE: Yes. Mm-hmm.

MOLLICK: I mean, I could run a GPT-4 class system basically on my phone. Microsoft’s releasing things that can almost run on like, you know, like it fits in almost no space, that are almost as good as the original GPT-4 models. I mean, I don’t think people have a sense of how fast the trajectory is moving either.

LEE: Yeah, you know, there’s something that I think about often. There is this existential dread, or will this technology replace me? But I think the first people to feel that are researchers—people encountering this for the first time. You know, if you were working, let’s say, in Bayesian reasoning or in traditional, let’s say, Gaussian mixture model based, you know, speech recognition, you do get this feeling, Oh, my god, this technology has just solved the problem that I’ve dedicated my life to. And there is this really difficult period where you have to cope with that. And I think this is going to be spreading, you know, in more and more walks of life. And so this … at what point does that sort of sense of dread hit you, if ever?

MOLLICK: I mean, you know, it’s not even dread as much as like, you know, Tyler Cowen wrote that it’s impossible to not feel a little bit of sadness as you use these AI systems, too. Because, like, I was talking to a friend, just as the most minor example, and his talent that he was very proud of was he was very good at writing limericks for birthday cards. He’d write these limericks. Everyone was always amused by them. [LAUGHTER]

And now, you know, GPT-4 and GPT-4.5, they made limericks obsolete. Like, anyone can write a good limerick, right. So this was a talent, and it was a little sad. Like, this thing that you cared about mattered.

You know, as academics, we’re a little used to dead ends, right, and like, you know, some getting the lap. But the idea that entire fields are hitting that way. Like in medicine, there’s a lot of support systems that are now obsolete. And the question is how quickly you change that. In education, a lot of our techniques are obsolete.

What do you do to change that? You know, it’s like the fact that this brute force technology is good enough to solve so many problems is weird, right. And it’s not just the end of, you know, of our research angles that matter, too. Like, for example, I ran this, you know, 14-person-plus, multimillion-dollar effort at Wharton to build these teaching simulations, and we’re very proud of them. It took years of work to build one.

Now we’ve built a system that can build teaching simulations on demand by you talking to it with one team member. And, you know, you literally can create any simulation by having a discussion with the AI. I mean, you know, there’s a switch to a new form of excitement, but there is a little bit of like, this mattered to me, and, you know, now I have to change how I do things. I mean, adjustment happens. But if you haven’t had that displacement, I think that’s a good indicator that you haven’t really faced AI yet.

LEE: Yeah, what’s so interesting just listening to you is you use words like sadness, and yet I can see the—and hear the—excitement in your voice and your body language. So, you know, that’s also kind of an interesting aspect of all of this.

MOLLICK: Yeah, I mean, I think there’s something on the other side, right. But, like, I can’t say that I haven’t had moments where like, ughhhh, but then there’s joy and basically like also, you know, freeing stuff up. I mean, I think about doctors or professors, right. These are jobs that bundle together lots of different tasks that you would never have put together, right. If you’re a doctor, you would never have expected the same person to be good at keeping up with the research and being a good diagnostician and being a good manager and being good with people and being good with hand skills.

Like, who would ever want that kind of bundle? That’s not something you’re all good at, right. And a lot of our stress of our job comes from the fact that we suck at some of it. And so to the extent that AI steps in for that, you kind of feel bad about some of the stuff that it’s doing that you wanted to do. But it’s much more uplifting to be like, I don’t have to do this stuff I’m bad anymore, or I get the support to make myself good at it. And the stuff that I really care about, I can focus on more. Well, because we are at kind of a unique moment where whatever you’re best at, you’re still better than AI. And I think it’s an ongoing question about how long that lasts. But for right now, like you’re not going to say, OK, AI replaces me entirely in my job in medicine. It’s very unlikely.

But you will say it replaces these 17 things I’m bad at, but I never liked that anyway. So it’s a period of both excitement and a little anxiety.

LEE: Yeah, I’m going to want to get back to this question about in what ways AI may or may not replace doctors or some of what doctors and nurses and other clinicians do. But before that, let’s get into, I think, the real meat of this conversation. In previous episodes of this podcast, we talked to clinicians and healthcare administrators and technology developers that are very rapidly injecting AI today to do various forms of workforce automation, you know, automatically writing a clinical encounter note, automatically filling out a referral letter or request for prior authorization for some reimbursement to an insurance company.

And so these sorts of things are intended not only to make things more efficient and lower costs but also to reduce various forms of drudgery, cognitive burden on frontline health workers. So how do you think about the impact of AI on that aspect of workforce, and, you know, what would you expect will happen over the next few years in terms of impact on efficiency and costs?

MOLLICK: So I mean, this is a case where I think we’re facing the big bright problem in AI in a lot of ways, which is that this is … at the individual level, there’s lots of performance gains to be gained, right. The problem, though, is that we as individuals fit into systems, in medicine as much as anywhere else or more so, right. Which is that you could individually boost your performance, but it’s also about systems that fit along with this, right.

So, you know, if you could automatically, you know, record an encounter, if you could automatically make notes, does that change what you should be expecting for notes or the value of those notes or what they’re for? How do we take what one person does and validate it across the organization and roll it out for everybody without making it a 10-year process that it feels like IT in medicine often is? Like, so we’re in this really interesting period where there’s incredible amounts of individual innovation in productivity and performance improvements in this field, like very high levels of it, but not necessarily seeing that same thing translate to organizational efficiency or gains.

And one of my big concerns is seeing that happen. We’re seeing that in nonmedical problems, the same kind of thing, which is, you know, we’ve got research showing 20 and 40% performance improvements, like not uncommon to see those things. But then the organization doesn’t capture it; the system doesn’t capture it. Because the individuals are doing their own work and the systems don’t have the ability to, kind of, learn or adapt as a result.

LEE: You know, where are those productivity gains going, then, when you get to the organizational level?

MOLLICK: Well, they’re dying for a few reasons. One is, there’s a tendency for individual contributors to underestimate the power of management, right.

Practices associated with good management increase happiness, decrease, you know, issues, increase success rates. In the same way, about 40%, as far as we can tell, of the US advantage over other companies, of US firms, has to do with management ability. Like, management is a big deal. Organizing is a big deal. Thinking about how you coordinate is a big deal.

At the individual level, when things get stuck there, right, you can’t start bringing them up to how systems work together. It becomes, How do I deal with a doctor that has a 60% performance improvement? We really only have one thing in our playbook for doing that right now, which is, OK, we could fire 40% of the other doctors and still have a performance gain, which is not the answer you want to see happen.

So because of that, people are hiding their use. They’re actually hiding their use for lots of reasons.

And it’s a weird case because the people who are able to figure out best how to use these systems, for a lot of use cases, they’re actually clinicians themselves because they’re experimenting all the time. Like, they have to take those encounter notes. And if they figure out a better way to do it, they figure that out. You don’t want to wait for, you know, a med tech company to figure that out and then sell that back to you when it can be done by the physicians themselves.

So we’re just not used to a period where everybody’s innovating and where the management structure isn’t in place to take advantage of that. And so we’re seeing things stalled at the individual level, and people are often, especially in risk-averse organizations or organizations where there’s lots of regulatory hurdles, people are so afraid of the regulatory piece that they don’t even bother trying to make change.

LEE: If you are, you know, the leader of a hospital or a clinic or a whole health system, how should you approach this? You know, how should you be trying to extract positive success out of AI?

MOLLICK: So I think that you need to embrace the right kind of risk, right. We don’t want to put risk on our patients … like, we don’t want to put uninformed risk. But innovation involves risk to how organizations operate. They involve change. So I think part of this is embracing the idea that R&D has to happen in organizations again.

What’s happened over the last 20 years or so has been organizations giving that up. Partially, that’s a trend to focus on what you’re good at and not try and do this other stuff. Partially, it’s because it’s outsourced now to software companies that, like, Salesforce tells you how to organize your sales team. Workforce tells you how to organize your organization. Consultants come in and will tell you how to make change based on the average of what other people are doing in your field.

So companies and organizations and hospital systems have all started to give up their ability to create their own organizational change. And when I talk to organizations, I often say they have to have two approaches. They have to think about the crowd and the lab.

So the crowd is the idea of how to empower clinicians and administrators and supporter networks to start using AI and experimenting in ethical, legal ways and then sharing that information with each other. And the lab is, how are we doing R&D about the approach of how to [get] AI to work, not just in direct patient care, right. But also fundamentally, like, what paperwork can you cut out? How can we better explain procedures? Like, what management role can this fill?

And we need to be doing active experimentation on that. We can’t just wait for, you know, Microsoft to solve the problems. It has to be at the level of the organizations themselves.

LEE: So let’s shift a little bit to the patient. You know, one of the things that we see, and I think everyone is seeing, is that people are turning to chatbots, like ChatGPT, actually to seek healthcare information for, you know, their own health or the health of their loved ones.

And there was already, prior to all of this, a trend towards, let’s call it, consumerization of healthcare. So just in the business of healthcare delivery, do you think AI is going to hasten these kinds of trends, or from the consumer’s perspective, what … ?

MOLLICK: I mean, absolutely, right. Like, all the early data that we have suggests that for most common medical problems, you should just consult AI, too, right. In fact, there is a real question to ask: at what point does it become unethical for doctors themselves to not ask for a second opinion from the AI because it’s cheap, right? You could overrule it or whatever you want, but like not asking seems foolish.

I think the two places where there’s a burning almost, you know, moral imperative is … let’s say, you know, I’m in Philadelphia, I’m a professor, I have access to really good healthcare through the Hospital University of Pennsylvania system. I know doctors. You know, I’m lucky. I’m well connected. If, you know, something goes wrong, I have friends who I can talk to. I have specialists. I’m, you know, pretty well educated in this space.

But for most people on the planet, they don’t have access to good medical care, they don’t have good health. It feels like it’s absolutely imperative to say when should you use AI and when not. Are there blind spots? What are those things?

And I worry that, like, to me, that would be the crash project I’d be invoking because I’m doing the same thing in education, which is this system is not as good as being in a room with a great teacher who also uses AI to help you, but it’s better than not getting an, you know, to the level of education people get in many cases. Where should we be using it? How do we guide usage in the right way? Because the AI labs aren’t thinking about this. We have to.

So, to me, there is a burning need here to understand this. And I worry that people will say, you know, everything that’s true—AI can hallucinate, AI can be biased. All of these things are absolutely true, but people are going to use it. The early indications are that it is quite useful. And unless we take the active role of saying, here’s when to use it, here’s when not to use it, we don’t have a right to say, don’t use this system. And I think, you know, we have to be exploring that.

LEE: What do people need to understand about AI? And what should schools, universities, and so on be teaching?

MOLLICK: Those are, kind of, two separate questions in lot of ways. I think a lot of people want to teach AI skills, and I will tell you, as somebody who works in this space a lot, there isn’t like an easy, sort of, AI skill, right. I could teach you prompt engineering in two to three classes, but every indication we have is that for most people under most circumstances, the value of prompting, you know, any one case is probably not that useful.

A lot of the tricks are disappearing because the AI systems are just starting to use them themselves. So asking good questions, being a good manager, being a good thinker tend to be important, but like magic tricks around making, you know, the AI do something because you use the right phrase used to be something that was real but is rapidly disappearing.

So I worry when people say teach AI skills. No one’s been able to articulate to me as somebody who knows AI very well and teaches classes on AI, what those AI skills that everyone should learn are, right.

I mean, there’s value in learning a little bit how the models work. There’s a value in working with these systems. A lot of it’s just hands on keyboard kind of work. But, like, we don’t have an easy slam dunk “this is what you learn in the world of AI” because the systems are getting better, and as they get better, they get less sensitive to these prompting techniques. They get better prompting themselves. They solve problems spontaneously and start being agentic. So it’s a hard problem to ask about, like, what do you train someone on? I think getting people experience in hands-on-keyboards, getting them to … there’s like four things I could teach you about AI, and two of them are already starting to disappear.

But, like, one is be direct. Like, tell the AI exactly what you want. That’s very helpful. Second, provide as much context as possible. That can include things like acting as a doctor, but also all the information you have. The third is give it step-by-step directions—that’s becoming less important. And the fourth is good and bad examples of the kind of output you want. Those four, that’s like, that’s it as far as the research telling you what to do, and the rest is building intuition.

LEE: I’m really impressed that you didn’t give the answer, “Well, everyone should be teaching my book, Co-Intelligence.” [LAUGHS]

MOLLICK: Oh, no, sorry! Everybody should be teaching my book Co-Intelligence. I apologize. [LAUGHTER]

LEE: It’s good to chuckle about that, but actually, I can’t think of a better book, like, if you were to assign a textbook in any professional education space, I think Co-Intelligence would be number one on my list. Are there other things that you think are essential reading?

MOLLICK: That’s a really good question. I think that a lot of things are evolving very quickly. I happen to, kind of, hit a sweet spot with Co-Intelligence to some degree because I talk about how I used it, and I was, sort of, an advanced user of these systems.

So, like, it’s, sort of, like my Twitter feed, my online newsletter. I’m just trying to, kind of, in some ways, it’s about trying to make people aware of what these systems can do by just showing a lot, right. Rather than picking one thing, and, like, this is a general-purpose technology. Let’s use it for this. And, like, everybody gets a light bulb for a different reason. So more than reading, it is using, you know, and that can be Copilot or whatever your favorite tool is.

But using it. Voice modes help a lot. In terms of readings, I mean, I think that there is a couple of good guides to understanding AI that were originally blog posts. I think Tim Lee has one called Understanding AI (opens in new tab), and it had a good overview …

LEE: Yeah, that’s a great one.

MOLLICK: … of that topic that I think explains how transformers work, which can give you some mental sense. I think [Andrej] Karpathy (opens in new tab) has some really nice videos of use that I would recommend.

Like on the medical side, I think the book that you did, if you’re in medicine, you should read that. I think that that’s very valuable. But like all we can offer are hints in some ways. Like there isn’t … if you’re looking for the instruction manual, I think it can be very frustrating because it’s like you want the best practices and procedures laid out, and we cannot do that, right. That’s not how a system like this works.

LEE: Yeah.

MOLLICK: It’s not a person, but thinking about it like a person can be helpful, right.

LEE: One of the things that has been sort of a fun project for me for the last few years is I have been a founding board member of a new medical school at Kaiser Permanente. And, you know, that medical school curriculum is being formed in this era. But it’s been perplexing to understand, you know, what this means for a medical school curriculum. And maybe even more perplexing for me, at least, is the accrediting bodies, which are extremely important in US medical schools; how accreditors should think about what’s necessary here.

Besides the things that you’ve … the, kind of, four key ideas you mentioned, if you were talking to the board of directors of the LCME [Liaison Committee on Medical Education] accrediting body, what’s the one thing you would want them to really internalize?

MOLLICK: This is both a fast-moving and vital area. This can’t be viewed like a usual change, which [is], “Let’s see how this works.” Because it’s, like, the things that make medical technologies hard to do, which is like unclear results, limited, you know, expensive use cases where it rolls out slowly. So one or two, you know, advanced medical facilities get access to, you know, proton beams or something else at multi-billion dollars of cost, and that takes a while to diffuse out. That’s not happening here. This is all happening at the same time, all at once. This is now … AI is part of medicine.

I mean, there’s a minor point that I’d make that actually is a really important one, which is large language models, generative AI overall, work incredibly differently than other forms of AI. So the other worry I have with some of these accreditors is they blend together algorithmic forms of AI, which medicine has been trying for long time—decision support, algorithmic methods, like, medicine more so than other places has been thinking about those issues. Generative AI, even though it uses the same underlying techniques, is a completely different beast.

So, like, even just take the most simple thing of algorithmic aversion, which is a well-understood problem in medicine, right. Which is, so you have a tool that could tell you as a radiologist, you know, the chance of this being cancer; you don’t like it, you overrule it, right.

We don’t find algorithmic aversion happening with LLMs in the same way. People actually enjoy using them because it’s more like working with a person. The flaws are different. The approach is different. So you need to both view this as universal applicable today, which makes it urgent, but also as something that is not the same as your other form of AI, and your AI working group that is thinking about how to solve this problem is not the right people here.

LEE: You know, I think the world has been trained because of the magic of web search to view computers as question-answering machines. Ask a question, get an answer.

MOLLICK: Yes. Yes.

LEE: Write a query, get results. And as I have interacted with medical professionals, you can see that medical professionals have that model of a machine in mind. And I think that’s partly, I think psychologically, why hallucination is so alarming. Because you have a mental model of a computer as a machine that has absolutely rock-solid perfect memory recall.

But the thing that was so powerful in Co-Intelligence, and we tried to get at this in our book also, is that’s not the sweet spot. It’s this sort of deeper interaction, more of a collaboration. And I thought your use of the term Co-Intelligence really just even in the title of the book tried to capture this. When I think about education, it seems like that’s the first step, to get past this concept of a machine being just a question-answering machine. Do you have a reaction to that idea?

MOLLICK: I think that’s very powerful. You know, we’ve been trained over so many years at both using computers but also in science fiction, right. Computers are about cold logic, right. They will give you the right answer, but if you ask it what love is, they explode, right. Like that’s the classic way you defeat the evil robot in Star Trek, right. “Love does not compute.” [LAUGHTER]

Instead, we have a system that makes mistakes, is warm, beats doctors in empathy in almost every controlled study on the subject, right. Like, absolutely can outwrite you in a sonnet but will absolutely struggle with giving you the right answer every time. And I think our mental models are just broken for this. And I think you’re absolutely right. And that’s part of what I thought your book does get at really well is, like, this is a different thing. It’s also generally applicable. Again, the model in your head should be kind of like a person even though it isn’t, right.

There’s a lot of warnings and caveats to it, but if you start from person, smart person you’re talking to, your mental model will be more accurate than smart machine, even though both are flawed examples, right. So it will make mistakes; it will make errors. The question is, what do you trust it on? What do you not trust it? As you get to know a model, you’ll get to understand, like, I totally don’t trust it for this, but I absolutely trust it for that, right.

LEE: All right. So we’re getting to the end of the time we have together. And so I’d just like to get now into something a little bit more provocative. And I get the question all the time. You know, will AI replace doctors? In medicine and other advanced knowledge work, project out five to 10 years. What do think happens?

MOLLICK: OK, so first of all, let’s acknowledge systems change much more slowly than individual use. You know, doctors are not individual actors; they’re part of systems, right. So not just the system of a patient who like may or may not want to talk to a machine instead of a person but also legal systems and administrative systems and systems that allocate labor and systems that train people.

So, like, it’s hard to imagine that in five to 10 years medicine being so upended that even if AI was better than doctors at every single thing doctors do, that we’d actually see as radical a change in medicine as you might in other fields. I think you will see faster changes happen in consulting and law and, you know, coding, other spaces than medicine.

But I do think that there is good reason to suspect that AI will outperform people while still having flaws, right. That’s the difference. We’re already seeing that for common medical questions in enough randomized controlled trials that, you know, best doctors beat AI, but the AI beats the mean doctor, right. Like, that’s just something we should acknowledge is happening at this point.

Now, will that work in your specialty? No. Will that work with all the contingent social knowledge that you have in your space? Probably not.

Like, these are vignettes, right. But, like, that’s kind of where things are. So let’s assume, right … you’re asking two questions. One is, how good will AI get?

LEE: Yeah.

MOLLICK: And we don’t know the answer to that question. I will tell you that your colleagues at Microsoft and increasingly the labs, the AI labs themselves, are all saying they think they’ll have a machine smarter than a human at every intellectual task in the next two to three years. If that doesn’t happen, that makes it easier to assume the future, but let’s just assume that that’s the case. I think medicine starts to change with the idea that people feel obligated to use this to help for everything.

Your patients will be using it, and it will be your advisor and helper at the beginning phases, right. And I think that I expect people to be better at empathy. I expect better bedside manner. I expect management tasks to become easier. I think administrative burden might lighten if we handle this right way or much worse if we handle it badly. Diagnostic accuracy will increase, right.

And then there’s a set of discovery pieces happening, too, right. One of the core goals of all the AI companies is to accelerate medical research. How does that happen and how does that affect us is a, kind of, unknown question. So I think clinicians are in both the eye of the storm and surrounded by it, right. Like, they can resist AI use for longer than most other fields, but everything around them is going to be affected by it.

LEE: Well, Ethan, this has been really a fantastic conversation. And, you know, I think in contrast to all the other conversations we’ve had, this one gives especially the leaders in healthcare, you know, people actually trying to lead their organizations into the future, whether it’s in education or in delivery, a lot to think about. So I really appreciate you joining.

MOLLICK: Thank you.

[TRANSITION MUSIC] 

I’m a computing researcher who works with people who are right in the middle of today’s bleeding-edge developments in AI. And because of that, I often lose sight of how to talk to a broader audience about what it’s all about. And so I think one of Ethan’s superpowers is that he has this knack for explaining complex topics in AI in a really accessible way, getting right to the most important points without making it so simple as to be useless. That’s why I rarely miss an opportunity to read up on his latest work.

One of the first things I learned from Ethan is the intuition that you can, sort of, think of AI as a very knowledgeable intern. In other words, think of it as a persona that you can interact with, but you also need to be a manager for it and to always assess the work that it does.

In our discussion, Ethan went further to stress that there is, because of that, a serious education gap. You know, over the last decade or two, we’ve all been trained, mainly by search engines, to think of computers as question-answering machines. In medicine, in fact, there’s a question-answering application that is really popular called UpToDate (opens in new tab). Doctors use it all the time. But generative AI systems like ChatGPT are different. There’s therefore a challenge in how to break out of the old-fashioned mindset of search to get the full value out of generative AI.

The other big takeaway for me was that Ethan pointed out while it’s easy to see productivity gains from AI at the individual level, those same gains, at least today, don’t often translate automatically to organization-wide or system-wide gains. And one, of course, has to conclude that it takes more than just making individuals more productive; the whole system also has to adjust to the realities of AI.

Here’s now my interview with Azeem Azhar:

LEE: Azeem, welcome.

AZEEM AZHAR: Peter, thank you so much for having me.

LEE: You know, I think you’re extremely well known in the world. But still, some of the listeners of this podcast series might not have encountered you before.

And so one of the ways I like to ask people to introduce themselves is, how do you explain to your parents what you do every day?

AZHAR: Well, I’m very lucky in that way because my mother was the person who got me into computers more than 40 years ago. And I still have that first computer, a ZX81 with a Z80 chip …

LEE: Oh wow.

AZHAR: … to this day. It sits in my study, all seven and a half thousand transistors and Bakelite plastic that it is. And my parents were both economists, and economics is deeply connected with technology in some sense. And I grew up in the late ’70s and the early ’80s. And that was a time of tremendous optimism around technology. It was space opera, science fiction, robots, and of course, the personal computer and, you know, Bill Gates and Steve Jobs. So that’s where I started.

And so, in a way, my mother and my dad, who passed away a few years ago, had always known me as someone who was fiddling with computers but also thinking about economics and society. And so, in a way, it’s easier to explain to them because they’re the ones who nurtured the environment that allowed me to research technology and AI and think about what it means to firms and to the economy at large.

LEE: I always like to understand the origin story. And what I mean by that is, you know, what was your first encounter with generative AI? And what was that like? What did you go through?

AZHAR: The first real moment was when Midjourney and Stable Diffusion emerged in that summer of 2022. I’d been away on vacation, and I came back—and I’d been off grid, in fact—and the world had really changed.

Now, I’d been aware of GPT-3 and GPT-2, which I played around with and with BERT, the original transformer paper about seven or eight years ago, but it was the moment where I could talk to my computer, and it could produce these images, and it could be refined in natural language that really made me think we’ve crossed into a new domain. We’ve gone from AI being highly discriminative to AI that’s able to explore the world in particular ways. And then it was a few months later that ChatGPT came out—November, the 30th.

And I think it was the next day or the day after that I said to my team, everyone has to use this, and we have to meet every morning and discuss how we experimented the day before. And we did that for three or four months. And, you know, it was really clear to me in that interface at that point that, you know, we’d absolutely pass some kind of threshold.

LEE: And who’s the we that you were experimenting with?

AZHAR: So I have a team of four who support me. They’re mostly researchers of different types. I mean, it’s almost like one of those jokes. You know, I have a sociologist, an economist, and an astrophysicist. And, you know, they walk into the bar, [LAUGHTER] or they walk into our virtual team room, and we try to solve problems.

LEE: Well, so let’s get now into brass tacks here. And I think I want to start maybe just with an exploration of the economics of all this and economic realities. Because I think in a lot of your work—for example, in your book—you look pretty deeply at how automation generally and AI specifically are transforming certain sectors like finance, manufacturing, and you have a really, kind of, insightful focus on what this means for productivity and which ways, you know, efficiencies are found.

And then you, sort of, balance that with risks, things that can and do go wrong. And so as you take that background and looking at all those other sectors, in what ways are the same patterns playing out or likely to play out in healthcare and medicine?

AZHAR: I’m sure we will see really remarkable parallels but also new things going on. I mean, medicine has a particular quality compared to other sectors in the sense that it’s highly regulated, market structure is very different country to country, and it’s an incredibly broad field. I mean, just think about taking a Tylenol and going through laparoscopic surgery. Having an MRI and seeing a physio. I mean, this is all medicine. I mean, it’s hard to imagine a sector that is [LAUGHS] more broad than that.

So I think we can start to break it down, and, you know, where we’re seeing things with generative AI will be that the, sort of, softest entry point, which is the medical scribing. And I’m sure many of us have been with clinicians who have a medical scribe running alongside—they’re all on Surface Pros I noticed, right? [LAUGHTER] They’re on the tablet computers, and they’re scribing away.

And what that’s doing is, in the words of my friend Eric Topol, it’s giving the clinician time back (opens in new tab), right. They have time back from days that are extremely busy and, you know, full of administrative overload. So I think you can obviously do a great deal with reducing that overload.

And within my team, we have a view, which is if you do something five times in a week, you should be writing an automation for it. And if you’re a doctor, you’re probably reviewing your notes, writing the prescriptions, and so on several times a day. So those are things that can clearly be automated, and the human can be in the loop. But I think there are so many other ways just within the clinic that things can help.

So, one of my friends, my friend from my junior school—I’ve known him since I was 9—is an oncologist who’s also deeply into machine learning, and he’s in Cambridge in the UK. And he built with Microsoft Research a suite of imaging AI tools from his own discipline, which they then open sourced.

So that’s another way that you have an impact, which is that you actually enable the, you know, generalist, specialist, polymath, whatever they are in health systems to be able to get this technology, to tune it to their requirements, to use it, to encourage some grassroots adoption in a system that’s often been very, very heavily centralized.

LEE: Yeah.

AZHAR: And then I think there are some other things that are going on that I find really, really exciting. So one is the consumerization of healthcare. So I have one of those sleep tracking rings, the Oura (opens in new tab).

LEE: Yup.

AZHAR: That is building a data stream that we’ll be able to apply more and more AI to. I mean, right now, it’s applying traditional, I suspect, machine learning, but you can imagine that as we start to get more data, we start to get more used to measuring ourselves, we create this sort of pot, a personal asset that we can turn AI to.

And there’s still another category. And that other category is one of the completely novel ways in which we can enable patient care and patient pathway. And there’s a fantastic startup in the UK called Neko Health (opens in new tab), which, I mean, does physicals, MRI scans, and blood tests, and so on.

It’s hard to imagine Neko existing without the sort of advanced data, machine learning, AI that we’ve seen emerge over the last decade. So, I mean, I think that there are so many ways in which the temperature is slowly being turned up to encourage a phase change within the healthcare sector.

And last but not least, I do think that these tools can also be very, very supportive of a clinician’s life cycle. I think we, as patients, we’re a bit … I don’t know if we’re as grateful as we should be for our clinicians who are putting in 90-hour weeks. [LAUGHTER] But you can imagine a world where AI is able to support not just the clinicians’ workload but also their sense of stress, their sense of burnout.

So just in those five areas, Peter, I sort of imagine we could start to fundamentally transform over the course of many years, of course, the way in which people think about their health and their interactions with healthcare systems

LEE: I love how you break that down. And I want to press on a couple of things.

You also touched on the fact that medicine is, at least in most of the world, is a highly regulated industry. I guess finance is the same way, but they also feel different because the, like, finance sector has to be very responsive to consumers, and consumers are sensitive to, you know, an abundance of choice; they are sensitive to price. Is there something unique about medicine besides being regulated?

AZHAR: I mean, there absolutely is. And in finance, as well, you have much clearer end states. So if you’re not in the consumer space, but you’re in the, you know, asset management space, you have to essentially deliver returns against the volatility or risk boundary, right. That’s what you have to go out and do. And I think if you’re in the consumer industry, you can come back to very, very clear measures, net promoter score being a very good example.

In the case of medicine and healthcare, it is much more complicated because as far as the clinician is concerned, people are individuals, and we have our own parts and our own responses. If we didn’t, there would never be a need for a differential diagnosis. There’d never be a need for, you know, Let’s try azithromycin first, and then if that doesn’t work, we’ll go to vancomycin, or, you know, whatever it happens to be. You would just know. But ultimately, you know, people are quite different. The symptoms that they’re showing are quite different, and also their compliance is really, really different.

I had a back problem that had to be dealt with by, you know, a physio and extremely boring exercises four times a week, but I was ruthless in complying, and my physio was incredibly surprised. He’d say well no one ever does this, and I said, well you know the thing is that I kind of just want to get this thing to go away.

LEE: Yeah.

AZHAR: And I think that that’s why medicine is and healthcare is so different and more complex. But I also think that’s why AI can be really, really helpful. I mean, we didn’t talk about, you know, AI in its ability to potentially do this, which is to extend the clinician’s presence throughout the week.

LEE: Right. Yeah.

AZHAR: The idea that maybe some part of what the clinician would do if you could talk to them on Wednesday, Thursday, and Friday could be delivered through an app or a chatbot just as a way of encouraging the compliance, which is often, especially with older patients, one reason why conditions, you know, linger on for longer.

LEE: You know, just staying on the regulatory thing, as I’ve thought about this, the one regulated sector that I think seems to have some parallels to healthcare is energy delivery, energy distribution.

Because like healthcare, as a consumer, I don’t have choice in who delivers electricity to my house. And even though I care about it being cheap or at least not being overcharged, I don’t have an abundance of choice. I can’t do price comparisons.

And there’s something about that, just speaking as a consumer of both energy and a consumer of healthcare, that feels similar. Whereas other regulated industries, you know, somehow, as a consumer, I feel like I have a lot more direct influence and power. Does that make any sense to someone, you know, like you, who’s really much more expert in how economic systems work?

AZHAR: I mean, in a sense, one part of that is very, very true. You have a limited panel of energy providers you can go to, and in the US, there may be places where you have no choice.

I think the area where it’s slightly different is that as a consumer or a patient, you can actually make meaningful choices and changes yourself using these technologies, and people used to joke about you know asking Dr. Google. But Dr. Google is not terrible, particularly if you go to WebMD. And, you know, when I look at long-range change, many of the regulations that exist around healthcare delivery were formed at a point before people had access to good quality information at the touch of their fingertips or when educational levels in general were much, much lower. And many regulations existed because of the incumbent power of particular professional sectors.

I’ll give you an example from the United Kingdom. So I have had asthma all of my life. That means I’ve been taking my inhaler, Ventolin, and maybe a steroid inhaler for nearly 50 years. That means that I know … actually, I’ve got more experience, and I—in some sense—know more about it than a general practitioner.

LEE: Yeah.

AZHAR: And until a few years ago, I would have to go to a general practitioner to get this drug that I’ve been taking for five decades, and there they are, age 30 or whatever it is. And a few years ago, the regulations changed. And now pharmacies can … or pharmacists can prescribe those types of drugs under certain conditions directly.

LEE: Right.

AZHAR: That was not to do with technology. That was to do with incumbent lock-in. So when we look at the medical industry, the healthcare space, there are some parallels with energy, but there are a few little things that the ability that the consumer has to put in some effort to learn about their condition, but also the fact that some of the regulations that exist just exist because certain professions are powerful.

LEE: Yeah, one last question while we’re still on economics. There seems to be a conundrum about productivity and efficiency in healthcare delivery because I’ve never encountered a doctor or a nurse that wants to be able to handle even more patients than they’re doing on a daily basis.

And so, you know, if productivity means simply, well, your rounds can now handle 16 patients instead of eight patients, that doesn’t seem necessarily to be a desirable thing. So how can we or should we be thinking about efficiency and productivity since obviously costs are, in most of the developed world, are a huge, huge problem?

AZHAR: Yes, and when you described doubling the number of patients on the round, I imagined you buying them all roller skates so they could just whizz around [LAUGHTER] the hospital faster and faster than ever before.

We can learn from what happened with the introduction of electricity. Electricity emerged at the end of the 19th century, around the same time that cars were emerging as a product, and car makers were very small and very artisanal. And in the early 1900s, some really smart car makers figured out that electricity was going to be important. And they bought into this technology by putting pendant lights in their workshops so they could “visit more patients.” Right?

LEE: Yeah, yeah.

AZHAR: They could effectively spend more hours working, and that was a productivity enhancement, and it was noticeable. But, of course, electricity fundamentally changed the productivity by orders of magnitude of people who made cars starting with Henry Ford because he was able to reorganize his factories around the electrical delivery of power and to therefore have the moving assembly line, which 10xed the productivity of that system.

So when we think about how AI will affect the clinician, the nurse, the doctor, it’s much easier for us to imagine it as the pendant light that just has them working later …

LEE: Right.

AZHAR: … than it is to imagine a reconceptualization of the relationship between the clinician and the people they care for.

And I’m not sure. I don’t think anybody knows what that looks like. But, you know, I do think that there will be a way that this changes, and you can see that scale out factor. And it may be, Peter, that what we end up doing is we end up saying, OK, because we have these brilliant AIs, there’s a lower level of training and cost and expense that’s required for a broader range of conditions that need treating. And that expands the market, right. That expands the market hugely. It’s what has happened in the market for taxis or ride sharing. The introduction of Uber and the GPS system …

LEE: Yup.

AZHAR: … has meant many more people now earn their living driving people around in their cars. And at least in London, you had to be reasonably highly trained to do that.

So I can see a reorganization is possible. Of course, entrenched interests, the economic flow … and there are many entrenched interests, particularly in the US between the health systems and the, you know, professional bodies that might slow things down. But I think a reimagining is possible.

And if I may, I’ll give you one example of that, which is, if you go to countries outside of the US where there are many more sick people per doctor, they have incentives to change the way they deliver their healthcare. And well before there was AI of this quality around, there was a few cases of health systems in India—Aravind Eye Care (opens in new tab) was one, and Narayana Hrudayalaya [now known as Narayana Health (opens in new tab)] was another. And in the latter, they were a cardiac care unit where you couldn’t get enough heart surgeons.

LEE: Yeah, yep.

AZHAR: So specially trained nurses would operate under the supervision of a single surgeon who would supervise many in parallel. So there are ways of increasing the quality of care, reducing the cost, but it does require a systems change. And we can’t expect a single bright algorithm to do it on its own.

LEE: Yeah, really, really interesting. So now let’s get into regulation. And let me start with this question. You know, there are several startup companies I’m aware of that are pushing on, I think, a near-term future possibility that a medical AI for consumer might be allowed, say, to prescribe a medication for you, something that would normally require a doctor or a pharmacist, you know, that is certified in some way, licensed to do. Do you think we’ll get to a point where for certain regulated activities, humans are more or less cut out of the loop?

AZHAR: Well, humans would have been in the loop because they would have provided the training data, they would have done the oversight, the quality control. But to your question in general, would we delegate an important decision entirely to a tested set of algorithms? I’m sure we will. We already do that. I delegate less important decisions like, What time should I leave for the airport to Waze. I delegate more important decisions to the automated braking in my car. We will do this at certain levels of risk and threshold.

If I come back to my example of prescribing Ventolin. It’s really unclear to me that the prescription of Ventolin, this incredibly benign bronchodilator that is only used by people who’ve been through the asthma process, needs to be prescribed by someone who’s gone through 10 years or 12 years of medical training. And why that couldn’t be prescribed by an algorithm or an AI system.

LEE: Right. Yep. Yep.

AZHAR: So, you know, I absolutely think that that will be the case and could be the case. I can’t really see what the objections are. And the real issue is where do you draw the line of where you say, “Listen, this is too important,” or “The cost is too great,” or “The side effects are too high,” and therefore this is a point at which we want to have some, you know, human taking personal responsibility, having a liability framework in place, having a sense that there is a person with legal agency who signed off on this decision. And that line I suspect will start fairly low, and what we’d expect to see would be that that would rise progressively over time.

LEE: What you just said, that scenario of your personal asthma medication, is really interesting because your personal AI might have the benefit of 50 years of your own experience with that medication. So, in a way, there is at least the data potential for, let’s say, the next prescription to be more personalized and more tailored specifically for you.

AZHAR: Yes. Well, let’s dig into this because I think this is super interesting, and we can look at how things have changed. So 15 years ago, if I had a bad asthma attack, which I might have once a year, I would have needed to go and see my general physician.

In the UK, it’s very difficult to get an appointment. I would have had to see someone privately who didn’t know me at all because I’ve just walked in off the street, and I would explain my situation. It would take me half a day. Productivity lost. I’ve been miserable for a couple of days with severe wheezing. Then a few years ago the system changed, a protocol changed, and now I have a thing called a rescue pack, which includes prednisolone steroids. It includes something else I’ve just forgotten, and an antibiotic in case I get an upper respiratory tract infection, and I have an “algorithm.” It’s called a protocol. It’s printed out. It’s a flowchart

I answer various questions, and then I say, “I’m going to prescribe this to myself.” You know, UK doctors don’t prescribe prednisolone, or prednisone as you may call it in the US, at the drop of a hat, right. It’s a powerful steroid. I can self-administer, and I can now get that repeat prescription without seeing a physician a couple of times a year. And the algorithm, the “AI” is, it’s obviously been done in PowerPoint naturally, and it’s a bunch of arrows. [LAUGHS]

Surely, surely, an AI system is going to be more sophisticated, more nuanced, and give me more assurance that I’m making the right decision around something like that.

LEE: Yeah. Well, at a minimum, the AI should be able to make that PowerPoint the next time. [LAUGHS]

AZHAR: Yeah, yeah. Thank god for Clippy. Yes.

LEE: So, you know, I think in our book, we had a lot of certainty about most of the things we’ve discussed here, but one chapter where I felt we really sort of ran out of ideas, frankly, was on regulation. And, you know, what we ended up doing for that chapter is … I can’t remember if it was Carey’s or Zak’s idea, but we asked GPT-4 to have a conversation, a debate with itself [LAUGHS], about regulation. And we made some minor commentary on that.

And really, I think we took that approach because we just didn’t have much to offer. By the way, in our defense, I don’t think anyone else had any better ideas anyway.

AZHAR: Right.

LEE: And so now two years later, do we have better ideas about the need for regulation, the frameworks around which those regulations should be developed, and, you know, what should this look like?

AZHAR: So regulation is going to be in some cases very helpful because it provides certainty for the clinician that they’re doing the right thing, that they are still insured for what they’re doing, and it provides some degree of confidence for the patient. And we need to make sure that the claims that are made stand up to quite rigorous levels, where ideally there are RCTs [randomized control trials], and there are the classic set of processes you go through.

You do also want to be able to experiment, and so the question is: as a regulator, how can you enable conditions for there to be experimentation? And what is experimentation? Experimentation is learning so that every element of the system can learn from this experience.

So finding that space where there can be bit of experimentation, I think, becomes very, very important. And a lot of this is about experience, so I think the first digital therapeutics have received FDA approval, which means there are now people within the FDA who understand how you go about running an approvals process for that, and what that ends up looking like—and of course what we’re very good at doing in this sort of modern hyper-connected world—is we can share that expertise, that knowledge, that experience very, very quickly.

So you go from one approval a year to a hundred approvals a year to a thousand approvals a year. So we will then actually, I suspect, need to think about what is it to approve digital therapeutics because, unlike big biological molecules, we can generate these digital therapeutics at the rate of knots [very rapidly].

LEE: Yes.

AZHAR: Every road in Hayes Valley in San Francisco, right, is churning out new startups who will want to do things like this. So then, I think about, what does it mean to get approved if indeed it gets approved? But we can also go really far with things that don’t require approval.

I come back to my sleep tracking ring. So I’ve been wearing this for a few years, and when I go and see my doctor or I have my annual checkup, one of the first things that he asks is how have I been sleeping. And in fact, I even sync my sleep tracking data to their medical record system, so he’s saying … hearing what I’m saying, but he’s actually pulling up the real data going, This patient’s lying to me again. Of course, I’m very truthful with my doctor, as we should all be. [LAUGHTER]

LEE: You know, actually, that brings up a point that consumer-facing health AI has to deal with pop science, bad science, you know, weird stuff that you hear on Reddit. And because one of the things that consumers want to know always is, you know, what’s the truth?

AZHAR: Right.

LEE: What can I rely on? And I think that somehow feels different than an AI that you actually put in the hands of, let’s say, a licensed practitioner. And so the regulatory issues seem very, very different for these two cases somehow.

AZHAR: I agree, they’re very different. And I think for a lot of areas, you will want to build AI systems that are first and foremost for the clinician, even if they have patient extensions, that idea that the clinician can still be with a patient during the week.

And you’ll do that anyway because you need the data, and you also need a little bit of a liability shield to have like a sensible person who’s been trained around that. And I think that’s going to be a very important pathway for many AI medical crossovers. We’re going to go through the clinician.

LEE: Yeah.

AZHAR: But I also do recognize what you say about the, kind of, kooky quackery that exists on Reddit. Although on Creatine, Reddit may yet prove to have been right. [LAUGHTER]

LEE: Yeah, that’s right. Yes, yeah, absolutely. Yeah.

AZHAR: Sometimes it’s right. And I think that it serves a really good role as a field of extreme experimentation. So if you’re somebody who makes a continuous glucose monitor traditionally given to diabetics but now lots of people will wear them—and sports people will wear them—you probably gathered a lot of extreme tail distribution data by reading the Reddit/biohackers …

LEE: Yes.

AZHAR: … for the last few years, where people were doing things that you would never want them to really do with the CGM [continuous glucose monitor]. And so I think we shouldn’t understate how important that petri dish can be for helping us learn what could happen next.

LEE: Oh, I think it’s absolutely going to be essential and a bigger thing in the future. So I think I just want to close here then with one last question. And I always try to be a little bit provocative with this.

And so as you look ahead to what doctors and nurses and patients might be doing two years from now, five years from now, 10 years from now, do you have any kind of firm predictions?

AZHAR: I’m going to push the boat out, and I’m going to go further out than closer in.

LEE: OK. [LAUGHS]

AZHAR: As patients, we will have many, many more touch points and interaction with our biomarkers and our health. We’ll be reading how well we feel through an array of things. And some of them we’ll be wearing directly, like sleep trackers and watches.

And so we’ll have a better sense of what’s happening in our lives. It’s like the moment you go from paper bank statements that arrive every month to being able to see your account in real time.

LEE: Yes.

AZHAR: And I suspect we’ll have … we’ll still have interactions with clinicians because societies that get richer see doctors more, societies that get older see doctors more, and we’re going to be doing both of those over the coming 10 years. But there will be a sense, I think, of continuous health engagement, not in an overbearing way, but just in a sense that we know it’s there, we can check in with it, it’s likely to be data that is compiled on our behalf somewhere centrally and delivered through a user experience that reinforces agency rather than anxiety.

And we’re learning how to do that slowly. I don’t think the health apps on our phones and devices have yet quite got that right. And that could help us personalize problems before they arise, and again, I use my experience for things that I’ve tracked really, really well. And I know from my data and from how I’m feeling when I’m on the verge of one of those severe asthma attacks that hits me once a year, and I can take a little bit of preemptive measure, so I think that that will become progressively more common and that sense that we will know our baselines.

I mean, when you think about being an athlete, which is something I think about, but I could never ever do, [LAUGHTER] but what happens is you start with your detailed baselines, and that’s what your health coach looks at every three or four months. For most of us, we have no idea of our baselines. You we get our blood pressure measured once a year. We will have baselines, and that will help us on an ongoing basis to better understand and be in control of our health. And then if the product designers get it right, it will be done in a way that doesn’t feel invasive, but it’ll be done in a way that feels enabling. We’ll still be engaging with clinicians augmented by AI systems more and more because they will also have gone up the stack. They won’t be spending their time on just “take two Tylenol and have a lie down” type of engagements because that will be dealt with earlier on in the system. And so we will be there in a very, very different set of relationships. And they will feel that they have different ways of looking after our health.

LEE: Azeem, it’s so comforting to hear such a wonderfully optimistic picture of the future of healthcare. And I actually agree with everything you’ve said.

Let me just thank you again for joining this conversation. I think it’s been really fascinating. And I think somehow the systemic issues, the systemic issues that you tend to just see with such clarity, I think are going to be the most, kind of, profound drivers of change in the future. So thank you so much.

AZHAR: Well, thank you, it’s been my pleasure, Peter, thank you.

[TRANSITION MUSIC] 

I always think of Azeem as a systems thinker. He’s always able to take the experiences of new technologies at an individual level and then project out to what this could mean for whole organizations and whole societies.

In our conversation, I felt that Azeem really connected some of what we learned in a previous episode—for example, from Chrissy Farr—on the evolving consumerization of healthcare to the broader workforce and economic impacts that we’ve heard about from Ethan Mollick.

Azeem’s personal story about managing his asthma was also a great example. You know, he imagines a future, as do I, where personal AI might assist and remember decades of personal experience with a condition like asthma and thereby know more than any human being could possibly know in a deeply personalized and effective way, leading to better care. Azeem’s relentless optimism about our AI future was also so heartening to hear.

Both of these conversations leave me really optimistic about the future of AI in medicine. At the same time, it is pretty sobering to realize just how much we’ll all need to change in pretty fundamental and maybe even in radical ways. I think a big insight I got from these conversations is how we interact with machines is going to have to be altered not only at the individual level, but at the company level and maybe even at the societal level.

Since my conversation with Ethan and Azeem, there have been some pretty important developments that speak directly to this. Just last week at Build (opens in new tab), which is Microsoft’s yearly developer conference, we announced a slew of AI agent technologies. Our CEO, Satya Nadella, in fact, started his keynote by going online in a GitHub developer environment and then assigning a coding task to an AI agent, basically treating that AI as a full-fledged member of a development team. Other agents, for example, a meeting facilitator, a data analyst, a business researcher, travel agent, and more were also shown during the conference.

But pertinent to healthcare specifically, what really blew me away was the demonstration of a healthcare orchestrator agent. And the specific thing here was in Stanford’s cancer treatment center, when they are trying to decide on potentially experimental treatments for cancer patients, they convene a meeting of experts. That is typically called a tumor board. And so this AI healthcare orchestrator agent actually participated as a full-fledged member of a tumor board meeting to help bring data together, make sure that the latest medical knowledge was brought to bear, and to assist in the decision-making around a patient’s cancer treatment. It was pretty amazing.

[THEME MUSIC]

A big thank-you again to Ethan and Azeem for sharing their knowledge and understanding of the dynamics between AI and society more broadly. And to our listeners, thank you for joining us. I’m really excited for the upcoming episodes, including discussions on medical students’ experiences with AI and AI’s influence on the operation of health systems and public health departments. We hope you’ll continue to tune in.

Until next time.

[MUSIC FADES]

The post What AI’s impact on individuals means for the health workforce and industry appeared first on Microsoft Research.

FrodoKEM: A conservative quantum-safe cryptographic algorithm

May 27, 2025

by Patrick Longa Microsoft AI

The image features a gradient background transitioning from blue on the left to pink on the right. In the center, there are three white icons. On the left is a microchip icon that represents quantum computing, in the middle is a shield, and on the right is another microchip with a padlock symbol inside it.

In this post, we describe FrodoKEM, a key encapsulation protocol that offers a simple design and provides strong security guarantees even in a future with powerful quantum computers.

The quantum threat to cryptography

For decades, modern cryptography has relied on mathematical problems that are practically impossible for classical computers to solve without a secret key. Cryptosystems like RSA, Diffie-Hellman key-exchange, and elliptic curve-based schemes—which rely on the hardness of the integer factorization and (elliptic curve) discrete logarithm problems—secure communications on the internet, banking transactions, and even national security systems. However, the emergence of quantum computing poses a significant threat to these cryptographic schemes.

Quantum computers leverage the principles of quantum mechanics to perform certain calculations exponentially faster than classical computers. Their ability to solve complex problems, such as simulating molecular interactions, optimizing large-scale systems, and accelerating machine learning, is expected to have profound and beneficial implications for fields ranging from chemistry and material science to artificial intelligence.

At the same time, quantum computing is poised to disrupt cryptography. In particular, Shor’s algorithm, a quantum algorithm developed in 1994, can efficiently factor large numbers and compute discrete logarithms—the very problems that underpin the security of RSA, Diffie-Hellman, and elliptic curve cryptography. This means that once large-scale, fault-tolerant quantum computers become available, public-key protocols based on RSA, ECC, and Diffie-Hellman will become insecure, breaking a sizable portion of the cryptographic backbone of today’s digital world. Recent advances in quantum computing, such as Microsoft’s Majorana 1 (opens in new tab), the first quantum processor powered by topological qubits, represent major steps toward practical quantum computing and underscore the urgency of transitioning to quantum-resistant cryptographic systems.

To address this looming security crisis, cryptographers and government agencies have been working on post-quantum cryptography (PQC)—new cryptographic algorithms that can resist attacks from both classical and quantum computers.

The NIST Post-Quantum Cryptography Standardization effort

In 2017, the U.S. National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography Standardization project (opens in new tab) to evaluate and select cryptographic algorithms capable of withstanding quantum attacks. As part of this initiative, NIST sought proposals for two types of cryptographic primitives: key encapsulation mechanisms (KEMs)—which enable two parties to securely derive a shared key to establish an encrypted connection, similar to traditional key exchange schemes—and digital signature schemes.

This initiative attracted submissions from cryptographers worldwide, and after multiple evaluation rounds, NIST selected CRYSTALS-Kyber, a KEM based on structured lattices, and standardized it as ML-KEM (opens in new tab). Additionally, NIST selected three digital signature schemes: CRYSTALS-Dilithium, now called ML-DSA; SPHINCS⁺, now called SLH-DSA; and Falcon, now called FN-DSA.

While ML-KEM provides great overall security and efficiency, some governments and cryptographic researchers advocate for the inclusion and standardization of alternative algorithms that minimize reliance on algebraic structure. Reducing algebraic structure might prevent potential vulnerabilities and, hence, can be considered a more conservative design choice. One such algorithm is FrodoKEM.

International standardization of post-quantum cryptography

Beyond NIST, other international standardization bodies have been actively working on quantum-resistant cryptographic solutions. The International Organization for Standardization (ISO) is leading a global effort to standardize additional PQC algorithms. Notably, European government agencies—including Germany’s BSI (opens in new tab), the Netherlands’ NLNCSA and AIVD (opens in new tab), and France’s ANSSI (opens in new tab)—have shown strong support for FrodoKEM, recognizing it as a conservative alternative to structured lattice-based schemes.

As a result, FrodoKEM is undergoing standardization at ISO. Additionally, ISO is standardizing ML-KEM and a conservative code-based KEM called Classic McEliece. These three algorithms are planned for inclusion in ISO/IEC 18033-2:2006 as Amendment 2 (opens in new tab).

What is FrodoKEM?

FrodoKEM is a key encapsulation mechanism (KEM) based on the Learning with Errors (LWE) problem, a cornerstone of lattice-based cryptography. Unlike structured lattice-based schemes such as ML-KEM, FrodoKEM is built on generic, unstructured lattices, i.e., it is based on the plain LWE problem.

Why unstructured lattices?

Structured lattice-based schemes introduce additional algebraic properties that could potentially be exploited in future cryptanalytic attacks. By using unstructured lattices, FrodoKEM eliminates these concerns, making it a safer choice in the long run, albeit at the cost of larger key sizes and lower efficiency.

It is important to emphasize that no particular cryptanalytic weaknesses are currently known for recommended parameterizations of structured lattice schemes in comparison to plain LWE. However, our current understanding of the security of these schemes could potentially change in the future with cryptanalytic advances.

Lattices and the Learning with Errors (LWE) problem

Lattice-based cryptography relies on the mathematical structure of lattices, which are regular arrangements of points in multidimensional space. A lattice is defined as the set of all integer linear combinations of a set of basis vectors. The difficulty of certain computational problems on lattices, such as the Shortest Vector Problem (SVP) and the Learning with Errors (LWE) problem, forms the basis of lattice-based schemes.

The Learning with Errors (LWE) problem

The LWE problem is a fundamental hard problem in lattice-based cryptography. It involves solving a system of linear equations where some small random error has been added to each equation, making it extremely difficult to recover the original secret values. This added error ensures that the problem remains computationally infeasible, even for quantum computers. Figure 1 below illustrates the LWE problem, specifically, the search version of the problem.

As can be seen in Figure 1, for the setup of the problem we need a dimension (n) that defines the size of matrices, a modulus (q) that defines the value range of the matrix coefficients, and a certain error distribution (chi) from which we sample (textit{“small”}) matrices. We sample two matrices from (chi), a small matrix (text{s}) and an error matrix (text{e}) (for simplicity in the explanation, we assume that both have only one column); sample an (n times n) matrix (text{A}) uniformly at random; and compute (text{b} = text{A} times text{s} + text{e}). In the illustration, each matrix coefficient is represented by a colored square, and the “legend of coefficients” gives an idea of the size of the respective coefficients, e.g., orange squares represent the small coefficients of matrix (text{s}) (small relative to the modulus (q)). Finally, given (text{A}) and (text{b}), the search LWE problem consists in finding (text{s}). This problem is believed to be hard for suitably chosen parameters (e.g., for dimension (n) sufficiently large) and is used at the core of FrodoKEM.

In comparison, the LWE variant used in ML-KEM—called Module-LWE (M-LWE)—has additional symmetries, adding mathematical structure that helps improve efficiency. In a setting similar to that of the search LWE problem above, the matrix (text{A}) can be represented by just a single row of coefficients.

**FIGURE 1:** Visualization of the (search) LWE problem.

LWE is conjectured to be quantum-resistant, and FrodoKEM’s security is directly tied to its hardness. In other words, cryptanalysts and quantum researchers have not been able to devise an efficient quantum algorithm capable of solving the LWE problem and, hence, FrodoKEM. In cryptography, absolute security can never be guaranteed; instead, confidence in a problem’s hardness comes from extensive scrutiny and its resilience against attacks over time.

How FrodoKEM Works

FrodoKEM follows the standard paradigm of a KEM, which consists of three main operations—key generation, encapsulation, and decapsulation—performed interactively between a sender and a recipient with the goal of establishing a shared secret key:

Key generation (KeyGen), computed by the recipient
- Generates a public key and a secret key.
- The public key is sent to the sender, while the private key remains secret.
Encapsulation (Encapsulate), computed by the sender
- Generates a random session key.
- Encrypts the session key using the recipient’s public key to produce a ciphertext.
- Produces a shared key using the session key and the ciphertext.
- The ciphertext is sent to the recipient.
Decapsulation (Decapsulate), computed by the recipient
- Decrypts the ciphertext using their secret key to recover the original session key.
- Reproduces the shared key using the decrypted session key and the ciphertext.

The shared key generated by the sender and reconstructed by the recipient can then be used to establish secure symmetric-key encryption for further communication between the two parties.

Figure 2 below shows a simplified view of the FrodoKEM protocol. As highlighted in red, FrodoKEM uses at its core LWE operations of the form “(text{b} = text{A} times text{s} + text{e})”, which are directly applied within the KEM paradigm.

**FIGURE 2:** Simplified overview of FrodoKEM.

Performance: Strong security has a cost

Not relying on additional algebraic structure certainly comes at a cost for FrodoKEM in the form of increased protocol runtime and bandwidth. The table below compares the performance and key sizes corresponding to the FrodoKEM level 1 parameter set (variant called “FrodoKEM-640-AES”) and the respective parameter set of ML-KEM (variant called “ML-KEM-512”). These parameter sets are intended to match or exceed the brute force security of AES-128. As can be seen, the difference in speed and key sizes between FrodoKEM and ML-KEM is more than an order of magnitude. Nevertheless, the runtime of the FrodoKEM protocol remains reasonable for most applications. For example, on our benchmarking platform clocked at 3.2GHz, the measured runtimes are 0.97 ms, 1.9 ms, and 3.2 ms for security levels 1, 2, and 3, respectively.

For security-sensitive applications, a more relevant comparison is with Classic McEliece, a post-quantum code-based scheme also considered for standardization. In this case, FrodoKEM offers several efficiency advantages. Classic McEliece’s public keys are significantly larger—well over an order of magnitude greater than FrodoKEM’s—and its key generation is substantially more computationally expensive. Nonetheless, Classic McEliece provides an advantage in certain static key-exchange scenarios, where its high key generation cost can be amortized across multiple key encapsulation executions.

**TABLE 1:** Comparison of key sizes and performance on an x86-64 processor for NIST level 1 parameter sets.

A holistic design made with security in mind

FrodoKEM’s design principles support security beyond its reliance on generic, unstructured lattices to minimize the attack surface of potential future cryptanalytic threats. Its parameters have been carefully chosen with additional security margins to withstand advancements in known attacks. Furthermore, FrodoKEM is designed with simplicity in mind—its internal operations are based on straightforward matrix-vector arithmetic using integer coefficients reduced modulo a power of two. These design decisions facilitate simple, compact and secure implementations that are also easier to maintain and to protect against side-channel attacks.

Conclusion

After years of research and analysis, the next generation of post-quantum cryptographic algorithms has arrived. NIST has chosen strong PQC protocols that we believe will serve Microsoft and its customers well in many applications. For security-sensitive applications, FrodoKEM offers a secure yet practical approach for post-quantum cryptography. While its reliance on unstructured lattices results in larger key sizes and higher computational overhead compared to structured lattice-based alternatives, it provides strong security assurances against potential future attacks. Given the ongoing standardization efforts and its endorsement by multiple governmental agencies, FrodoKEM is well-positioned as a viable alternative for organizations seeking long-term cryptographic resilience in a post-quantum world.

Learn more:

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers.

[MUSIC FADES]

On today’s episode, I’m talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called Assessing the Limits of Zero Shot Foundation Models in Single-cell Biology. Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts!

ALEX LU: Yeah, I’m really excited to be joining you today.

HUIZINGA: So let’s start with a little background of your work. In just a few sentences, tell us about your study and more importantly, why it matters.

LU: Absolutely. And before I dive in, I want to give a shout out to the MSR research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she’s the lead author on the study. But basically, in this research, we study single-cell foundation models, which have really recently rocked the world of biology, because they basically claim to be able to use AI to unlock understanding about single-cell biology. Biologists for a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells, to discovering new drugs for cancer, will conduct experiments where they measure how much of every gene is expressed inside of just one single cell. So these experiments give us a powerful view into the cell’s internal state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes. So you get this really long chain of numbers that measure how much there is of 20,000 different genes. So deriving meaning from this really long chain of numbers is really difficult. And single-cell foundation models claim to be capable of unraveling deeper insights than ever before. So that’s the claim that these works have made. And in our recent paper, we showed that these models may actually not live up to these claims. Basically, we showed that single-cell foundation models perform worse in settings that are fundamental to biological discovery than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because it has implications on the toolkits that biologists use to understand their experiments. Our work suggests that single-cell foundation models may not be appropriate for practical use just yet, at least in the discovery applications that we cover.

HUIZINGA: Well, let’s go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they’re being used in novel applications, which is what you’re interested in, like single-cell biology. So I’m curious, just sort of as a foundation, what other research has already been done in this area, and how does this study illuminate or build on it?

LU: Absolutely. Okay, so we were the first to notice and document this issue in single-cell foundation models, specifically. And this is because that we have proposed evaluation methods that, while are common in other areas of AI, have yet to be commonly used to evaluate single-cell foundation models. We performed something called zero-shot evaluation on these models. Prior to our work, most works evaluated single-cell foundation models with fine tuning. And the way to understand this is because single-cell foundation models are trained in a way that tries to expose these models to millions of single-cells. But because you’re exposing them to a large amount of data, you can’t really rely upon this data being annotated or like labeled in any particular fashion then. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add on a second training phase. We call this the fine-tuning phase, where you have a smaller number of single cells, but now they are actually labeled with the specialized tasks that you want the model to perform. So most people, they typically evaluate the performance of single-cell models after they fine-tune these models. However, what we noticed is that this evaluating these fine-tuned models has several problems. First, it might not actually align with how these models are actually going to be used by biologists then. A critical distinction in biology is that we’re not just trying to interact with an agent that has access to knowledge through its pre-training, we’re trying to extend these models to discover new biology beyond the sphere of influence then. And so in many cases, the point of using these models, the point of analysis, is to explore the data with the goal of potentially discovering something new about the single cell that the biologists worked with that they weren’t aware of before. So in these kinds of cases, it is really tough to fine-tune a model. There’s a bit of a chicken and egg problem going on. If you don’t know, for example, there’s a new kind of cell in the data, you can’t really instruct the model to help us identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes impossible then. So the second issue is that evaluations on fine-tuned models can sometimes mislead us in our ability to understand how these models are working. So for example, the claim behind single-cell foundation model papers is that these models learn a foundation of biological knowledge by being exposed to millions of single cells in its first training phase, right? But it’s possible when you fine-tune a model, it may just be that any performance increases that you see using the model is simply because that you’re using a massive model that is really sophisticated, really large. And even if there’s any exposure to any cells at all then, that model is going to do perfectly fine then. So going back to our paper, what’s really different about this paper is that we propose zero-shot evaluation for these models. What that means is that we do not fine-tune the model at all, and instead we keep the model frozen during the analysis step. So how we specialize it to be a downstream task instead is that we extract the model’s internal embedding of single-cell data, which is essentially a numerical vector that contains information that the model is extracting and organizing from input data. So it’s essentially how the model perceives single-cell data and how it’s organizing in its own internal state. So basically, this is the better way for us to test the claim that single-cell foundation models are learning foundational biological insights. Because if they actually are learning these insights, they should be present in the models embedding space even before we fine-tune the model.

HUIZINGA: Well, let’s talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single-cell biology. How did you go about evaluating these models?

LU: Yes, so let’s dive deeper into how zero-shot evaluations are conducted, okay? So the premise here is that we’re relying upon the fact that if these models are fully learning foundational biological insights, if we take the model’s internal representation of cells, then cells that are biologically similar should be close in that internal representation, where cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single-cell foundation models and importantly, we compared these models against older and reliable tools that biologists have used for exploratory analyses. So these include simpler machine learning methods like scVI, statistical algorithms like Harmony, and even basic data pre-processing steps, just like filtering your data down to a more robust subset of genes, then. So basically, we tested embeddings from our two single-cell foundation models against this baseline in a variety of settings. And we tested the hypothesis that biologically similar cells should be similar across these distinct methods across these datasets.

HUIZINGA: Well, and as you as you did the testing, you obviously were aiming towards research findings, which is my favorite part of a research paper, so tell us what you did find and what you feel the most important takeaways of this paper are.

LU: Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to older methods then. So to contextualize why that is such a surprising result, there is a lot of hype around these methods. So basically, I think that,yeah, it’s a very surprising result, given how hyped these models are and how people were already adopting them. But our results basically caution that these shouldn’t really be adopted for these use purposes.

HUIZINGA: Yeah, so this is serious real-world impact here in terms of if models are being adopted and adapted in these applications, how reliable are they, et cetera? So given that, who would you say benefits most from what you’ve discovered in this paper and why?

LU: Okay, so two ways, right? So I think this has at least immediate implications on the way that we do discovery in biology. And as I’ve discussed, these experiments are used for cases that have practical impact, drug discovery applications, investigations into basic biology, then. But let’s also talk about the impact for methodologists, people who are trying to improve these single-cell foundation models, right? I think at the base, they’re really excited proposals. Because if you look at what some of the prior and less sophisticated methods couldn’t do, they tended to be more bespoke. So the excitement of single-cell foundation models is that you have this general-purpose model that can be used for everything and while they’re not living up to that purpose just now, just currently, I think that it’s important that we continue to bank onto that vision, right? So if you look at our contributions in that area, where single-cell foundation models are a really new proposal, so it makes sense that we may not know how to fully evaluate them just yet then. So you can view our work as basically being a step towards more rigorous evaluation of these models. Now that we did this experiment, I think the methodologists know to use this as a signal on how to improve the models and if they’re going in the right direction. And in fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper then. And so this essentially helps future computer scientists that are working on single-cell foundation models know how to train better models.

HUIZINGA: That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology, and what foundation might this paper lay for future research agendas in the field?

LU: Yeah, absolutely. So now that we’ve shown single-cell foundation models don’t necessarily perform well, I think the natural question on everyone’s mind is how do we actually train single-cell foundation models that live up to that vision, that can perform in helping us discover new biology then? So I think in the short term, yeah, we’re actively investigating many hypotheses in this area. So for example, my colleagues, Lorin Crawford and Ava Amini, who were co-authors in the paper, recently put out a pre-print understanding how training data composition impacts model performance. And so one of the surprising findings that they had was that many of the training data sets that people used to train single-cell foundation models are highly redundant, to the point that you can even sample just a tiny fraction of the data and get basically the same performance then. But you can also look forward to many other explorations in this area as we continue to develop this research at the end of the day. But also zooming out into the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use, right? I mean, this is obvious for any AI method then, but I think people have gotten just too used to taking methods that work out there for natural vision or natural language maybe in the consumer domain and then extrapolating these methods to biology and expecting that they will work in the same way then, right? So for example, one reason why zero-shot evaluation was not routine practice for single-cell foundation models prior to our work, I mean, we were the first to fully establish that as a practice for the field, was because I think people who have been working in AI for biology have been looking to these more mainstream AI domains to shape their work then. And so with single-cell foundation models, many of these models are adopted from large language models with natural language processing, recycling the exact same architecture, the exact same code, basically just recycling practices in that field then. So when you look at like practices in like more mainstream domains, zero-shot evaluation is definitely explored in those domains, but it’s more of like a niche instead of being considered central to model understanding. So again, because biology is different from mainstream language processing, it’s a scientific discipline, zero-shot evaluation becomes much more important, and you have no choice but to use these models, zero-shot then. So in other words, I think that we need to be thinking carefully about what it is that makes training a model for biology different from training a model, for example, for consumer purposes.

[MUSIC]

HUIZINGA: Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read it on the Genome Biology website. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: Zero-shot models in single-cell biology with Alex Lu appeared first on Microsoft Research.

Abstracts: Aurora with Megan Stanley and Wessel Bruinsma

May 21, 2025

by Amber Tingle, Megan Stanley, Wessel Bruinsma Microsoft AI

Abstracts podcast | Aurora with Megan Stanley and Wessel Bruinsma

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode of Abstracts, Microsoft senior researchers Megan Stanley and Wessel Bruinsma join host Amber Tingle to discuss their groundbreaking work on environmental forecasting. Their new Nature publication, “A Foundation Model for the Earth System,” (opens in new tab) features Aurora, an AI model that redefines weather prediction and extends its capabilities to other environmental domains such as tropical cyclones and ocean wave forecasting.

Read the paper

Learn more

A foundation model for the Earth system (opens in new tab)
Nature | May 2025

Introducing Aurora: The first large-scale foundation model of the atmosphere
Microsoft Research Blog | June 2024

Project Aurora: The first large-scale foundation model of the atmosphere
Video | September 2024

A Foundation Model for the Earth System
Paper | November 2024

Aurora (opens in new tab)
Azure AI Foundry Labs

Transcript

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Our guests today are Megan Stanley and Wessel Bruinsma. They are both senior researchers within the Microsoft Research AI for Science initiative. They are also two of the coauthors on a new Nature publication called “A Foundation Model for the Earth System.”

This is such exciting work about environmental forecasting, so we’re happy to have the two of you join us today.

Megan and Wessel, welcome.

MEGAN STANLEY: Thank you. Thanks. Great to be here.

WESSEL BRUINSMA: Thanks.

TINGLE: Let’s jump right in. Wessel, share a bit about the problem your research addresses and why this work is so important.

BRUINSMA: I think we’re all very much aware of the revolution that’s happening in the space of large language models, which have just become so strong. What’s perhaps lesser well-known is that machine learning models have also started to revolutionize this field of weather prediction. Whereas traditional weather prediction models, based on physical laws, used to be the state of the art, these traditional models are now challenged and often even outperformed by AI models.

This advancement is super impressive and really a big deal. Mostly because AI weather forecasting models are computationally much more efficient and can even be more accurate. What’s unfortunate though, about this big step forward, is that these developments are mostly limited to the setting of weather forecasting.

Weather forecasting is very important, obviously, but there are many other important environmental forecasting problems out there, such as air pollution forecasting or ocean wave forecasting. We have developed a model, named Aurora, which really kicks the AI revolution in weather forecasting into the next gear by extending these advancements to other environmental forecasting fields, too. With Aurora, we’re now able to produce state-of-the-art air pollution forecasts using an AI approach. And that wasn’t possible before!

TINGLE: Megan, how does this approach differ from or build on work that’s already been done in the atmospheric sciences?

STANLEY: Current approaches have really focused training very specifically on weather forecasting models. And in contrast, with Aurora, what we’ve attempted to do is train a so-called foundation model for the Earth system. In the first step, we train Aurora on a vast body of Earth system data. This is our pretraining step.

And when I say a vast body of data, I really do mean a lot. And the purpose of this pretraining is to let Aurora, kind of, learn some general-purpose representation of the dynamics that govern the Earth system. But then once we’ve pretrained Aurora, and this really is the crux of this, the reason why we’re doing this project, is after the model has been pretrained, it can leverage this learned general-purpose representation and efficiently adapt to new tasks, new domains, new variables. And this is called fine-tuning.

The idea is that the model really uses the learned representation to perform this adaptation very efficiently, which basically means Aurora is a powerful, flexible model that can relatively cheaply be adapted to any environmental forecasting task.

TINGLE: Wessel, can you tell us about your methodology? How did you all conduct this research?

BRUINSMA: While approaches so far have trained models on primarily one particular data set, this one dataset is very large, which makes it possible to train very good models. But it does remain only one dataset, and that’s not very diverse. In the domain of environmental forecasting, we have really tried to push the limits of scaling to large data by training Aurora on not just this one large dataset, but on as many very large datasets as we could find.

These datasets are a combination of estimates of the historical state of the world, forecasts by other models, climate simulations, and more. We’ve been able to show that training on not just more data but more diverse data helps the model achieve even better performance. Showing this is difficult because there is just so much data.

In addition to scaling to more and more diverse data, we also increased the size of the model as much as we could. Here we found that bigger models, despite being slower to run, make more efficient use of computational resources. It’s cheaper to train a good big model than a good small model. The mantra of this project was to really keep it simple and to scale to simultaneously very large and, more importantly, diverse data and large model size.

TINGLE: So, Megan, what were your major findings? And we know they’re major because they’re in Nature. [LAUGHS]

STANLEY: Yeah, [LAUGHS] I guess they really are. So the main outcome of this project is we were actually able to train a single foundation model that achieves state-of-the-art performance in four different domains. Air pollution forecasting. For example, predicting particulate matter near the surface or ozone in the atmosphere. Ocean wave forecasting, which is critical for planning shipping routes.

Tropical cyclone track forecasting, so that means being able to predict where a hurricane or a typhoon is expected to go, which is obviously incredibly important, and very high-resolution weather forecasting.

And I’ve, kind of, named these forecasting domains as if they’re just items in a list, but in every single one, Aurora really pushed the limits of what is possible with AI models. And we’re really proud of that.

But perhaps, kind of, you know, to my mind, the key takeaway here is that the foundation model approach actually works. So what we have shown is it’s possible to actually train some kind of general model, a foundation model, and then adapt it to a wide variety of environmental tasks. Now we definitely do not claim that Aurora is some kind of ultimate environmental forecasting model. We are sure that the model and the pretraining procedure can actually be improved. But, nevertheless, we’ve shown that this approach works for environmental forecasting. It really holds massive promise, and that’s incredibly cool.

TINGLE: Wessel, what do you think will be the real-world impact of this work?

BRUINSMA: Well, for applications that we mentioned, which are air pollution forecasting, ocean wave forecasting, tropical cyclone track forecasting, and very high-resolution weather forecasting, Aurora could today be deployed in real-time systems to produce near real-time forecasts. And, you know, in fact, it already is. You can view real-time weather forecasts by the high-resolution version of the model on the website of ECMWF (European Centre for Medium-Range Weather Forecasts).

But what’s remarkable is that every of these applications took a small team of engineers about four to eight weeks to fully execute. You should compare this to a typical development timeline for more traditional models, which can be on the order of multiple years. Using the pretraining fine-tuning approach that we used for Aurora, we might see significantly accelerated development cycles for environmental forecasting problems. And that’s exciting.

TINGLE: Megan, if our listeners only walk away from this conversation with one key talking point, what would you like that to be? What should we remember about this paper?

STANLEY: The biggest takeaway is that the pretraining fine-tuning paradigm, it really works for environmental forecasting, right? So you can train a foundational model, it learns some kind of general-purpose representation of the Earth system dynamics, and this representation boosts performance in a wide variety of forecasting tasks. But we really want to emphasize that Aurora only scratches the surface of what’s actually possible.

So there are many more applications to explore than the four we’ve mentioned. And undoubtedly, the model and pretraining procedure can actually be improved. So we’re really excited to see what the next few years will bring.

TINGLE: Wessel, tell us more about those opportunities and unanswered questions. What’s next on the research agenda in environmental prediction?

BRUINSMA: Well, Aurora has two main limitations. The first is that the model produces only deterministic predictions, by which I mean a single predicted value. For variables like temperature, this is mostly fine. But other variables like precipitation, they are inherently some kind of stochastic. For these variables, we really want to assign probabilities to different levels of precipitation rather than predicting only a single value.

An extension of Aurora to allow this sort of prediction would be a great next step.

The second limitation is that Aurora depends on a procedure called assimilation. Assimilation attempts to create a starting point for the model from real-world observations, such as from weather stations and satellites. The model then takes the starting point and uses it to make predictions. Unfortunately, assimilation is super expensive, so it would be great if we could somehow circumvent the need for it.

Finally, what we find really important is to make our advancements available to the community.

[MUSIC]

TINGLE: Great. Megan and Wessel, thanks for joining us today on the Microsoft Research Podcast.

BRUINSMA: Thanks for having us.

STANLEY: Yeah, thank you. It’s been great.

TINGLE: You can check out the Aurora model on Azure AI Foundry. You can read the entire paper, “A Foundation Model for the Earth System,” at aka.ms/abstracts. And you’ll certainly find it on the Nature website, too.

Thank you so much for tuning in to Abstracts today. Until next time.

[MUSIC FADES]

The post Abstracts: Aurora with Megan Stanley and Wessel Bruinsma appeared first on Microsoft Research.

Collaborators: Healthcare Innovation to Impact

May 20, 2025

by Matthew Lungren, Jonathan M. Carlson, Smitha Saligrama, Will Guyman, Cameron Runde Microsoft AI

Microsoft Research Podcast | Collaborators: Healthcare Innovation to Impact | outline illustrations of Jonathan Carlson, Smitha Saligrama, Will Guyman, Cameron Runde, Dr. Matthew Lungren

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

Amid the ongoing surge of AI research, healthcare is emerging as a leading area for real-world transformation. From driving efficiency gains for clinicians to improving patient outcomes, AI is beginning to make a tangible impact. Thousands of scientific papers have explored AI systems capable of analyzing medical documents and images with unprecedented accuracy. The latest work goes even further, showing how healthcare agents can collaborate—with each other and with human doctors—embedding AI directly into clinical workflows.

In this discussion, we explore how teams across Microsoft are working together to generate advanced AI capabilities and solutions for developers and clinicians around the globe. Leading the conversation are Dr. Matthew Lungren, chief scientific officer for Microsoft Health and Life Sciences, and Jonathan Carlson, vice president and managing director of Microsoft Health Futures—two key leaders behind this collaboration. They’re joined by Smitha Saligrama, principal group engineering manager within Microsoft Health and Life Sciences, Will Guyman, group product manager within Microsoft Health and Life Sciences, and Cameron Runde, a senior strategy manager for Microsoft Research Health Futures—all of whom play crucial roles in turning AI breakthroughs into practical, life-saving innovations.

Together, these experts examine how Microsoft is helping integrate cutting-edge AI into healthcare workflows—saving time today, and lives tomorrow.

Learn more

Developing next-generation cancer care management with multi-agent orchestration
Source Blog, May 2025

Healthcare Agent Orchestrator (opens in new tab)
GitHub

Azure AI Foundry Labs (opens in new tab)
Homepage

Transcript

[MUSIC]

MATTHEW LUNGREN: You’re listening to Collaborators, a Microsoft Research podcast, showcasing the range of expertise that goes into transforming mind blowing ideas into world changing technologies. Despite the advancements in AI over the decades, generative AI exploded into view in 2022, when ChatGPT became the, sort of, internet browser for AI and became the fastest adopted consumer software application in history.

JONATHAN CARLSON: From the beginning, healthcare stood out to us as an important opportunity for general reasoners to improve the lives and experiences of patients and providers. Indeed, in the past two years, there’s been an explosion of scientific papers looking at the application first of text reasoners and medicine, then multi-modal reasoners that can interpret medical images, and now, most recently, healthcare agents that can reason with each other. But even more impressive than the pace of research has been the surprisingly rapid diffusion of this technology into real world clinical workflows.

LUNGREN: So today, we’ll talk about how our cross-company collaboration has shortened that gap and delivered advanced AI capabilities and solutions into the hands of developers and clinicians around the world, empowering everyone in health and life sciences to achieve more. I’m Doctor Matt Lungren, chief scientific officer for Microsoft Health and Life Sciences.

CARLSON: And I’m Jonathan Carlson, vice president and managing director of Microsoft Health Futures.

LUNGREN: And together we brought some key players leading in the space of AI and health care from across Microsoft. Our guests today are Smitha Saligrama, principal group engineering manager within Microsoft Health and Life Sciences, Will Guyman, group product manager within Microsoft Health and Life Sciences, and Cameron Runde, a senior strategy manager for Microsoft Health Futures.

CARLSON: We’ve asked these brilliant folks to join us because each of them represents a mission critical group of cutting-edge stakeholders, scaling breakthroughs into purpose-built solutions and capabilities for health care.

LUNGREN: We’ll hear today how generative AI capabilities can unlock reasoning across every data type in medicine: text, images, waveforms, genomics. And further, how multi-agent frameworks in healthcare can accelerate complex workflows, in some cases acting as a specialist team member, safely secured inside the Microsoft 365 tools used by hundreds of millions of healthcare enterprise users across the world. The opportunity to save time today and lives tomorrow with AI has never been larger.

[MUSIC FADES]

MATTHEW LUNGREN: Jonathan. You know, it’s been really interesting kind of observing Microsoft Research over the decades. I’ve, you know, been watching you guys in my prior academic career. You are always on the front of innovation, particularly in health care. And I find it fascinating that, you know, millions of people are using the solutions that, you know, your team has developed over the years and yet you still find ways to stay cutting edge and state of the art, even in this accelerating time of technology and AI, particularly, how do you do that? [LAUGHS]

JONATHAN CARLSON: I mean, it’s some of what’s in our DNA, I mean, we’ve been publishing in health and life sciences for two decades here. But when we launched Health Futures as a mission-focused lab about 7 or 8 years ago, we really started with the premise that the way to have impact was to really close the loop between, not just good ideas that get published, but good ideas that can actually be grounded in real problems that clinicians and scientists care about, that then allow us to actually go from that first proof of concept into an incubation, into getting real world feedback that allows us to close that loop. And now with, you know, the HLS organization here as a product group, we have the opportunity to work really closely with you all to not just prove what’s possible in the clinic or in the lab, but actually start scaling that into the broader community.

CAMERON RUNDE: And one thing I’ll add here is that the problems that we’re trying to tackle in health care are extremely complex. And so, as Jonathan said, it’s really important that we come together and collaborate across disciplines as well as across the company of Microsoft and with our external collaborators, as well across the whole industry.

CARLSON: So, Matt, back to you. What are you guys doing in the product group? How do you guys see these models getting into the clinic?

LUNGREN: You know, I think a lot of people, you know, think about AI is just, you know, maybe just even a few years old because of GPT and how that really captured the public’s consciousness. Right?

And so, you think about the speech-to-text technology of being able to dictate something, for a clinic note or for a visit, that was typically based on Nuance technology. And so there’s a lot of product understanding of the market, how to deliver something that clinicians will use, understanding the pain points and workflows and really that Health IT space, which is sometimes the third rail, I feel like with a lot of innovation in healthcare.

But beyond that, I mean, I think now that we have this really powerful engine of Microsoft and the platform capabilities, we’re seeing, innovations on the healthcare side for data storage, data interoperability, with different types of medical data. You have new applications coming online, the ability, of course, to see generative AI now infused into the speech-to-text and, becoming Dragon Copilot, which is something that has been, you know, tremendously, received by the community.

Physicians are able to now just have a conversation with a patient. They turn to their computer and the note is ready for them. There’s no more this, we call it keyboard liberation. I don’t know if you heard that before. And that’s just been tremendous. And there’s so much more coming from that side. And then there’s other parts of the workflow that we also get engaged in — the diagnostic workflow.

So medical imaging, sharing images across different hospital systems, the list goes on. And so now when you move into AI, we feel like there’s a huge opportunity to deliver capabilities into the clinical workflow via the products and solutions we already have. But, I mean, we’ll now that we’ve kind of expanded our team to involve Azure and platform, we’re really able to now focus on the developers.

WILL GUYMAN: Yeah. And you’re always telling me as a doctor how frustrating it is to be spending time at the computer instead of with your patients. I think you told me, you know, 4,000 clicks a day for the typical doctor, which is tremendous. And something like Dragon Copilot can save that five minutes per patient. But it can also now take actions after the patient encounter so it can draft the after-visit summary.

It can order labs and medications for the referral. And that’s incredible. And we want to keep building on that. There’s so many other use cases across the ecosystem. And so that’s why in Azure AI Foundry, we have translated a lot of the research from Microsoft Research and made that available to developers to build and customize for their own applications.

SMITHA SALIGRAMA: Yeah. And as you were saying, in our transformation of moving from solutions to platforms and as, scaling solutions to other, multiple scenarios, as we put our models in AI Foundry, we provide these developer capabilities like bring your own data and fine tune these models and then apply it to, scenarios that we couldn’t even imagine. So that’s kind of the platform play we’re scaling now.

LUNGREN: Well, I want to do a reality check because, you know, I think to us that are now really focused on technology, it seems like, I’ve heard this story before, right. I, I remember even in, my academic clinical days where it felt like technology was always the quick answer and it felt like technology was, there was maybe a disconnect between what my problems were or what I think needed to be done versus kind of the solutions that were kind of, created or offered to us. And I guess at some level, how Jonathan, do you think about this? Because to do things well in the science space is one thing, to do things well in science, but then also have it be something that actually drives health care innovation and practice and translation. It’s tricky, right?

CARLSON: Yeah. I mean, as you said, I think one of the core pathologies of Big Tech is we assume every problem is a technology problem. And that’s all it will take to solve the problem. And I think, look, I was trained as a computational biologist, and that sits in the awkward middle between biology and computation. And the thing that we always have to remember, the thing that we were very acutely aware of when we set out, was that we are not the experts. We do have, you know, you as an M.D., we have everybody on the team, we have biologists on the team.

But this is a big space. And the only way we’re going to have real impact, the only way we’re even going to pick the right problems to work on is if we really partner deeply, with providers, with EHR (electronic health records) vendors, with scientists, and really understand what’s important and again, get that feedback loop.

RUNDE: Yeah, I think we really need to ground the work that we do in the science itself. You need to understand the broader ecosystem and the broader landscape, across health care and life sciences, so that we can tackle the most important problems, not just the problems that we think are important. Because, as Jonathan said, we’re not the experts in health care and life sciences. And that’s really the secret sauce. When you have the clinical expertise come together with the technical expertise. That’s how you really accelerate health care.

CARLSON: When we really launched this, this mission, 7 or 8 years ago, we really came in with the premise of, if we decide to stop, we want to be sure the world cares. And the only way that’s going to be true is if we’re really deeply embedded with the people that matter–the patients, the providers and the scientists.

LUNGREN: And now it really feels like this collaborative effort, you know, really can help start to extend that mission. Right. I think, you know, Will and Smitha, that we definitely feel the passion and the innovation. And we certainly benefit from those collaborations, too. But then we have these other partners and even customers, right, that we can start to tap into and have that flywheel keep spinning.

GUYMAN: Yeah. And the whole industry is an ecosystem. So, we have our own data sets at Microsoft Research that you’ve trained amazing AI models with. And those are in the catalog. But then you’ve also partnered with institutions like Providence or Page AI . And those models are in the catalog with their data. And then there are third parties like Nvidia that have their own specialized proprietary data sets, and their models are there too. So, we have this ecosystem of open source models. And maybe Smitha, you want to talk about how developers can actually customize these.

SALIGRAMA: Yeah. So we use the Azure AI Foundry ecosystem. Developers can feel at home if they’re using the AI Foundry. So they can look at our model cards that we publish as part of the models we publish, understand the use cases of these models, how to, quickly, bring up these APIs and, look at different use cases of how to apply these and even fine tune these models with their own data. Right. And then, use it for specific tasks that we couldn’t have even imagined.

LUNGREN: Yeah it has been interesting to see we have these health care models in the catalog again, some that came from research, some that came from third parties and other product developers and Azure’s kind of becoming the home base, I think, for a lot of health and life science developers. They’re seeing all the different modalities, all the different capabilities. And then in combination with Azure OpenAI, which as we know, is incredibly competent in lots of different use cases. How are you looking at the use cases, and what are you seeing folks use these models for as they come to the catalog and start sharing their discoveries or products?

GUYMAN: Well, the general-purpose large language models are amazing for medical general reasoning. So Microsoft Research has shown that that they can perform super well on, for example, like the United States medical licensing exam, they can exceed doctor performance if they’re just picking between different multiple-choice questions. But real medicine we know is messier. It doesn’t always start with the whole patient context provided as text in the prompt. You have to get the source data and that raw data is often non-text. The majority of it is non-text. It’s things like medical imaging, radiology, pathology, ophthalmology, dermatology. It goes on and on. And there’s endless signal data, lab data. And so all of this diverse data type needs to be processed through specialized models because much of that data is not available on the public internet.

And that’s why we’re taking this partner approach, first party and third party models that can interpret all this kind of data and then connect them ultimately back to these general reasoners to reason over that.

LUNGREN: So, you know, I’ve been at this company for a while and, you know, familiar with kind of how long it takes, generally to get, you know, a really good research paper, do all the studies, do all the data analysis, and then go through the process of publishing, right, which takes, as, you know, a long time and it’s, you know, very rigorous.

And one of the things that struck me, last year, I think we, we started this big collaboration and, within a quarter, you had a Nature paper coming out from Microsoft Research, and that model that the Nature paper was describing was ready to be used by anyone on the Azure AI Foundry within that same quarter. It kind of blew my mind when I thought about it, you know, even though we were all, you know, working very hard to get that done. Any thoughts on that? I mean, has this ever happened in your career? And, you know, what’s the secret sauce to that?

CARLSON: Yeah, I mean, the time scale from research to product has been massively compressed. And I’d push that even further, which is to say, the reason why it took a quarter was because we were laying the railroad tracks as we’re driving the train. We have examples right after that when we are launching on Foundry the same day we were publishing the paper.

And frankly, the review times are becoming longer than it takes to actually productize the models. I think there’s two things that are going on with that are really converging. One is that the overall ecosystem is converging on a relatively small number of patterns, and that gives us, as a tech company, a reason to go off and really make those patterns hardened in a way that allows not just us, but third parties as well, to really have a nice workflow to publish these models.

But the other is actually, I think, a change in how we work, you know, and for most of our history as an industrial research lab, we would do research and then we’d go pitch it to somebody and try and throw it over the fence. We’ve really built a much more integrated team. In fact, if you look at that Nature paper or any of the other papers, there’s folks from product teams. Many of you are on the papers along with our clinical collaborators.

RUNDE: Yeah. I think one thing that’s really important to note is that there’s a ton of different ways that you can have impact, right? So I like to think about phasing. In Health Futures at least, I like to think about phasing the work that we do. So first we have research, which is really early innovation. And the impact there is getting our technology and our tools out there and really sharing the learnings that we’ve had.

So that can be through publications like you mentioned. It can be through open-sourcing our models. And then you go to incubation. So, this is, I think, one of the more new spaces that we’re getting into, which is maybe that blurred line between research and product. Right. Which is, how do we take the tools and technologies that we’ve built and get them into the hands of users, typically through our partnerships?

Right. So, we partner very deeply and collaborate very deeply across the industry. And incubation is really important because we get that early feedback. We get an ability to pivot if we need to. And we also get the ability to see what types of impact our technology is having in the real world. And then lastly, when you think about scale, there’s tons of different ways that you can scale. We can scale third-party through our collaborators and really empower them to go to market to commercialize the things that we’ve built together.

You can also think about scaling internally, which is why I’m so thankful that we’ve created this flywheel between research and product, and a lot of the models that we’ve built that have gone through research, have gone through incubation, have been able to scale on the Azure AI Foundry. But that’s not really our expertise. Right? The scale piece in research, that’s research and incubation. Smitha, how do you think about scaling?

SALIGRAMA: So, there are several angles to scaling the models, the state-of-the-art models we see from the research team. The first angle is, the open sourcing, to get developer trust, and very generous commercial licenses so that they can use it and for their own, use cases. The second is, we also allow them to customize these models, fine tuning these models with their own data. So a lot of different angles of how we provide support and scaling, the state-of-the-art of models we get from the research org.

GUYMAN: And as one example, you know, University of Wisconsin Health, you know, which Matt knows well. They took one of our models, which is highly versatile. They customized it in Foundry and they optimized it to reliably identify abnormal chest X-rays, the most common imaging procedure, so they could improve their turnaround time triage quickly. And that’s just one example. But we have other partners like Sectra who are doing more of operations use cases automatically routing imaging to the radiologists, setting them up to be efficient. And then Page AI is doing, you know, biomarker identification for actually diagnostics and new drug discovery. So, there’s so many use cases that we have partners already who are building and customizing.

LUNGREN: The part that’s striking to me is just that, you know, we could all sit in a room and think about all the different ways someone might use these models on the catalog. And I’m still shocked at the stuff that people use them for and how effective they are. And I think part of that is, you know, again, we talk a lot about generative AI and healthcare and all the things you can do. Again, you know, in text, you refer to that earlier and certainly off the shelf, there’s really powerful applications. But there is, you know, kind of this tip of the iceberg effect where under the water, most of the data that we use to take care of our patients is not text. Right. It’s all the different other modalities. And I think that this has been an unlock right, sort of taking these innovations, innovations from the community, putting them in this ecosystem kind of catalog, essentially. Right. And then allowing folks to kind of, you know, build and develop applications with all these different types of data. Again, I’ve been surprised at what I’m seeing.

CARLSON: This has been just one of the most profound shifts that’s happened in the last 12 months, really. I mean, two years ago we had general models in text that really shifted how we think about, I mean, natural language processing got totally upended by that. Turns out the same technology works for images as well. It doesn’t only allow you to automatically extract concepts from images, but allows you to align those image concepts with text concepts, which means that you can have a conversation with that image. And once you’re in that world now, you are a place where you can start stitching together these multimodal models that really change how you can interact with the data, and how you can start getting more information out of the raw primary data that is part of the patient journey.

LUNGREN: Well, and we’re going to get to that because I think you just touched on something. And I want to re-emphasize stitching these things together. There’s a lot of different ways to potentially do that. Right? There’s ways that you can literally train the model end to end with adapters and all kinds of other early fusion fusions. All kinds of ways. But one of the things that the word of the I guess the year is going to be agents and an agent is a very interesting term to think about how you might abstract away some of the components or the tasks that you want the model to, to accomplish in the midst of sort of a real human to maybe model interaction. Can you talk a little bit more about, how we’re thinking about agents in this, in this platform approach?

GUYMAN: Well, this is our newest addition to the Azure AI Foundry. So there’s an agent catalog now where we have a set of pre-configured agents for health care. And then we also have a multi-agent orchestrator that can jump start the process of developers building their own multi-agent workflows to tackle some complex real-world tasks that clinicians have to deal with. And these agents basically combine a general reasoner, like a large language model, like a GPT 4o or an o series model with a specialized model, like a model that understands radiology or pathology with domain-specific knowledge and tools. So the knowledge might be, you know, public guidelines or, you know, medical journals or your own private data from your EHR or medical imaging system, and then tools like a code interpreter to deal with all of the numeric data or tools like that that the clinicians are using today, like PowerPoint, Word, Teams and etc. And so we’re allowing developers to build and customize each of these agents in Foundry and then deploy them into their workflows.

LUNGREN: And, and I really like that concept because, you know, as, as a, as a from the user personas, I think about myself as a user. How am I going to interact with these agents? Where does it naturally fit? And I and I sort of, you know, I’ve seen some of the demonstrations and some of the work that’s going on with Stanford in particular, showing that, you know, and literally in a Teams chat, I can have my clinician colleagues and I can have specialized health care agents that kind of interact, like I’m interacting with a human on a chat.

It is a completely mind-blowing thing for me. And it’s a light bulb moment for me to I wonder, what have we, what have we heard from folks that have, you know, tried out this health care agent orchestrator in this kind of deployment environment via Teams?

GUYMAN: Well, someone joked, you know, are you sure you’re not using Teams because you work at Microsoft? [LAUGHS] But, then we actually were meeting with one of the, radiologists at one of our partners, and they said that that morning they had just done a Teams meeting, or they had met with other specialists to talk about a patient’s cancer case, or they were coming up with a treatment plan.

And that was the light bulb moment for us. We realized, actually, Teams is already being used by physicians as an internal communication tool, as a tool to get work done. And especially since the pandemic, a lot of the meetings moved to virtual and telemedicine. And so it’s a great distribution channel for AI, which is often been a struggle for AI to actually get in the hands of clinicians. And so now we’re allowing developers to build and then deploy very easily and extend it into their own workflows.

CARLSON: I think that’s such an important point. I mean, if you think about one of the really important concepts in computer science is an application programing interface, like some set of rules that allow two applications to talk to each other. One of the big pushes, really important pushes, in medicine has been standards that allow us to actually have data standards and APIs that allow these to talk to each other, and yet still we end up with these silos. There’s silos of data. There’s silos of applications.

And just like when you and I work on our phone, we have to go back and forth between applications. One of the things that I think agents do is that it takes the idea that now you can use language to understand intent and effectively program an interface, and it creates a whole new abstraction layer that allows us to simplify the interaction between not just humans and the endpoint, but also for developers.

It allows us to have this abstraction layer that lets different developers focus on different types of models, and yet stitch them all together in a very, very natural, way, not just for the users, but for the ability to actually deploy those models.

SALIGRAMA: Just to add to what Jonathan was mentioning, the other cool thing about the Microsoft Teams user interface is it’s also enterprise ready.

RUNDE: And one important thing that we’re thinking about, is exactly this from the very early research through incubation and then to scale, obviously. Right. And so early on in research, we are actively working with our partners and our collaborators to make sure that we have the right data privacy and consent in place. We’re doing this in incubation as well. And then obviously in scale. Yep.

LUNGREN: So, I think AI has always been thought of as a savior kind of technology. We talked a little bit about how there’s been some ups and downs in terms of the ability for technology to be effective in health care. At the same time, we’re seeing a lot of new innovations that are really making a difference. But then we kind of get, you know, we talked about agents a little bit. It feels like we’re maybe abstracting too far. Maybe it’s things are going too fast, almost. What makes this different? I mean, in your mind is this truly a logical next step or is it going to take some time?

CARLSON: I think there’s a couple things that have happened. I think first, on just a pure technology. What led to ChatGPT? And I like to think of really three major breakthroughs.

The first was new mathematical concepts of attention, which really means that we now have a way that a machine can figure out which parts of the context it should actually focus on, just the way our brains do. Right? I mean, if you’re a clinician and somebody is talking to you, the majority of that conversation is not relevant for the diagnosis. But, you know how to zoom in on the parts that matter. That’s a super powerful mathematical concept. The second one is this idea of self-supervision. So, I think one of the fundamental problems of machine learning has been that you have to train on labeled training data and labels are expensive, which means data sets are small, which means the final models are very narrow and brittle. And the idea of self-supervision is that you can just get a model to automatically learn concepts, and the language is just predict the next word. And what’s important about that is that leads to models that can actually manipulate and understand really messy text and pull out what’s important about that, and then and then stitch that back together in interesting ways.

And the third concept, that came out of those first two, was just the observational scale. And that’s that more is better, more data, more compute, bigger models. And that really leads to a reason to keep investing. And for these models to keep getting better. So that as a as a groundwork, that’s what led to ChatGPT. That’s what led to our ability now to not just have rule-based systems or simple machine learning based systems to take a messy EHR record, say, and pull out a couple concepts.

But to really feed the whole thing in and say, okay, I need you to figure out which concepts are in here. And is this particular attribute there, for example. That’s now led to the next breakthrough, which is all those core ideas apply to images as well. They apply to proteins, to DNA. And so we’re starting to see models that understand images and the concepts of images, and can actually map those back to text as well.

So, you can look at a pathology image and say, not just at the cell, but it appears that there’s some certain sort of cancer in this particular, tissue there. And then you take those two things together and you layer on the fact that now you have a model, or a set of models, that can understand intent, can understand human concepts and biomedical concepts, and you can start stitching them together into specialized agents that can actually reason with each other, which at some level gives you an API as a developer to say, okay, I need to focus on a pathology model and get this really, really, sound while somebody else is focusing on a radiology model, but now allows us to stitch these all together with a user interface that we can now talk to through natural language.

RUNDE: I’d like to double click a little bit on that medical abstraction piece that you mentioned. Just the amount of data, clinical data that there is for each individual patient. Let’s think about cancer patients for a second to make this real. Right. For every cancer patient, it could take a couple of hours to structure their information. And why is that important? Because, you have to get that information in a structured way and abstract relevant information to be able to unlock precision health applications right, for each patient. So, to be able to match them to a trial, right, someone has to sit there and go through all of the clinical notes from their entire patient care journey, from the beginning to the end. And that’s not scalable. And so one thing that we’ve been doing in an active project that we’ve been working on with a handful of our partners, but Providence specifically, I’ll call out, is using AI to actually abstract and curate that information. So that gives time back to the health care provider to spend with patients, instead of spending all their time curating this information.

And this is super important because it sets the scene and the backbone for all those precision health applications. Like I mentioned, clinical trial matching, tumor boards is another really important example here. Maybe Matt, you can talk to that a little bit.

LUNGREN: It’s a great example. And you know it’s so funny. We’ve talked about this use case and the you know the health care agent orchestrator is sort of the at the initial lighthouse use case was a tumor board setting. And I remember when we first started working with some of the partners on this, I think we were you know, under a research kind of lens, thinking about what could this, what new diagnoses could have come up with or what new insights might have and what was really a really key moment for us, I think, was noticing that we had developed an agent that can take all of the multimodal data about a patient’s chart, organize it in a timeline, in chronological fashion, and then allow folks to click on different parts of the timeline to ground it back to the note. And just that, which doesn’t sound like a really interesting research paper. It was mind blowing for clinicians who, again, as you said, spend a great deal of time, often outside of the typical work hours, trying to organize these patient records in order to go present to a tumor board.

And a tumor board is a critical meeting that happens at many cancer centers where specialists all get together, come with their perspective, and make a comment on what would be the best next step in treatment. But the background in preparing for that is you know, again, organizing the data. But to your point, also, what are the clinical trials that are active? There are thousands of clinical trials. There’s hundreds every day added. How can anyone keep up with that? And these are the kinds of use cases that start to bubble up. And you realize that a technology that understands concepts, context and can reason over vast amounts of data with a language interface-that is a powerful tool. Even before we get to some of the, you know, unlocking new insights and even precision medicine, this is that idea of saving time before lives to me. And there’s an enormous amount of undifferentiated heavy lifting that happens in health care that these agents and these kinds of workflows can start to unlock.

GUYMAN: And we’ve packaged these agents, the manual abstraction work that, you know, manually takes hours. Now we have an agent. It’s in Foundry along with the clinical trial matching agent, which I think at Providence you showed could double the match rate over the baseline that they were using by using the AI for multiple data sources. So, we have that and then we have this orchestration that is using this really neat technology from Microsoft Research. Semantic Kernel, Magentic One, Omni Parser. These are technologies that are good at figuring out which agent to use for a given task. So a clinician who’s used to working with other specialists, like a radiologist, a pathologist, a surgeon, they can now also consult these specialist agents who are experts in their domain and there’s shared memory across the agents.

There’s turn taking, there’s negotiation between the agents. So, there’s this really interesting system that’s emerging. And again, this is all possible to be used through Teams. And there’s some great extensibility as well. We’ve been talking about that and working on some cool tools.

SALIGRAMA: Yeah. Yeah. No, I think if I have to geek out a little bit on how all this agent tech orchestrations are coming up, like I’ve been in software engineering for decades, it’s kind of a next version of distributed systems where you have these services that talk to each other. It’s a more natural way because LLMs are giving these natural ways instead of a structured API ways of conversing. We have these agents which can naturally understand how to talk to each other. Right. So this is like the next evolution of our systems now. And the way we’re packaging all of this is multiple ways based on all the standards and innovation that’s happening in this space. So, first of all, we are building these agents that are very good at specific tasks, like, Will was saying like, a trial matching agent or patient timeline agents.

So, we take all of these, and then we package it in a workflow and an orchestration. We use the standard, some of these coming from research. The Semantic Kernel, the Magentic-One. And then, all of these also allow us to extend these agents with custom agents that can be plugged in. So, we are open sourcing the entire agent orchestration in AI Foundry templates, so that developers can extend their own agents, and make their own workflows out of it. So, a lot of cool innovation happening to apply this technology to specific scenarios and workflows.

LUNGREN: Well, I was going to ask you, like, so as part of that extension. So, like, you know, folks can say, hey, I have maybe a really specific part of my workflow that I want to use some agents for, maybe one of the agents that can do PubMed literature search, for example. But then there’s also agents that, come in from the outside, you know, sort of like I could, I can imagine a software company or AI company that has a built-in agent that plugs in as well.

SALIGRAMA: Yeah. Yeah, absolutely. So, you can bring your own agent. And then we have these, standard ways of communicating with agents and integrating with the orchestration language so you can bring your own agent and extend this health care agent, agent orchestrator to your own needs.

LUNGREN: I can just think of, like, in a group chat, like a bunch of different specialist agents. And I really would want an orchestrator to help find the right tool, to your point earlier, because I’m guessing this ecosystem is going to expand quickly. Yeah. And I may not know which tool is best for which question. I just want to ask the question. Right.

SALIGRAMA: Yeah. Yeah.

CARLSON: Well, I think to that point to I mean, you said an important point here, which is tools, and these are not necessarily just AI tools. Right? I mean, we’ve known this for a while, right? LLMS are not very good at math, but you can have it use a calculator and then it works very well. And you know you guys both brought up the universal medical abstraction a couple times.

And one of the things that I find so powerful about that is we’ve long had this vision within the precision health community that we should be able to have a learning hospital system. We should be able to actually learn from the actual real clinical experiences that are happening every day, so that we can stop practicing medicine based off averages.

There’s a lot of work that’s gone on for the last 20 years about how to actually do causal inference. That’s not an AI question. That’s a statistical question. The bottleneck, the reason why we haven’t been able to do that is because most of that information is locked up in unstructured text. And these other tools need essentially a table.

And so now you can decompose this problem, say, well, what if I can use AI not to get to the causal answer, but to just structure the information. So now I can put it into the causal inference tool. And these sorts of patterns I think again become very, not just powerful for a programmer, but they start pulling together different specialties. And I think we’ll really see an acceleration, really, of collaboration across disciplines because of this.

CARLSON: So, when I joined Microsoft Research 18 years ago, I was doing work in computational biology. And I would always have to answer the question: why is Microsoft in biomedicine? And I would always kind of joke saying, well, it is. We sell Office and Windows to every health care system in the world. We’re already in the space. And it really struck me to now see that we’ve actually come full circle. And now you can actually connect in Teams, Word, PowerPoint, which are these tools that everybody uses every day, but they’re actually now specialize-able through these agents. Can you guys talk a little bit about what that looks like from a developer perspective? How can provider groups actually start playing with this and see this come to life?

SALIGRAMA: A lot of healthcare organizations already use Microsoft productivity tools, as you mentioned. So, they asked the developers, build these agents, and use our healthcare orchestrations, to plug in these agents and expose these in these productivity tools. They will get access to all these healthcare workers. So the healthcare agent orchestrator we have today integrates with Microsoft Teams, and it showcases an example of how you can at (@) mention these agents and talk to them like you were talking to another person in a Teams chat. And then it also provides examples of these agents and how they can use these productivity tools. One of the examples we have there is how they can summarize the assessments of this whole chat into a Word Doc, or even convert that into a PowerPoint presentation, for later on.

CARLSON: One of the things that has struck me is how easy it is to do. I mean, Will, I don’t know if you’ve worked with folks that have gone from 0 to 60, like, how fast? What does that look like?

GUYMAN: Yeah, it’s funny for us, the technology to transfer all this context into a Word Document or PowerPoint presentation for a doctor to take to a meeting is relatively straightforward compared to the complicated clinical trial matching multimodal processing. The feedback has been tremendous in terms of, wow, that saves so much time to have this organized report that then I can show up to meeting with and the agents can come with me to that meeting because they’re literally having a Teams meeting, often with other human specialists. And the agents can be there and ask and answer questions and fact check and source all the right information on the fly. So, there’s a nice integration into these existing tools.

LUNGREN: We worked with several different centers just to kind of understand, you know, where this might be useful. And, like, as I think we talked about before, the ideas that we’ve come up with again, this is a great one because it’s complex. It’s kind of hairy. There’s a lot of things happening under the hood that don’t necessarily require a medical license to do, right, to prepare for a tumor board and to organize data. But, it’s fascinating, actually. So, you know, folks have come up with ideas of, could I have an agent that can operate an MRI machine, and I can ask the agent to change some parameters or redo a protocol. We thought that was a pretty powerful use case. We’ve had others that have just said, you know, I really want to have a specific agent that’s able to kind of act like deep research does for the consumer side, but based on the context of my patient, so that it can search all the literature and pull the data in the papers that are relevant to this case. And the list goes on and on from operations all the way to clinical, you know, sort of decision making at some level. And I think that the research community that’s going to sprout around this will help us, guide us, I guess, to see what is the most high-impact use cases. Where is this effective? And maybe where it’s not effective.

But to me, the part that makes me so, I guess excited about this is just that I don’t have to think about, okay, well, then we have to figure out Health IT. Because it’s always, you know, we always have great ideas and research, and it always feels like there’s such a huge chasm to get it in front of the health care workers that might want to test this out. And it feels like, again, this productivity tool use case again with the enterprise security, the possibility for bringing in third parties to contribute really does feel like it’s a new surface area for innovation.

CARLSON: Yeah, I love that. Look. Let me end by putting you all on the spot. So, in three years, multimodal agents will do what? Matt, I’ll start with you.

LUNGREN: I am convinced that it’s going to save massive amount of time before it saves many lives.

RUNDE: I’ll focus on the patient care journey and diagnostic journey. I think it will kind of transform that process for the patient itself and shorten that process.

GUYMAN: Yeah, I think we’ve seen already papers recently showing that different modalities surfaced complementary information. And so we’ll see kind of this AI and these agents becoming an essential companion to the physician, surfacing insights that would have been overlooked otherwise.

SALIGRAMA: And similar to what you guys were saying, agents will become important assistants to healthcare workers, reducing a lot of documentation and workflow, excess work they have to do.

CARLSON: I love that. And I guess for my part, I think really what we’re going to see is a massive unleash of creativity. We’ve had a lot of folks that have been innovating in this space, but they haven’t had a way to actually get it into the hands of early adopters. And I think we’re going to see that really lead to an explosion of creativity across the ecosystem.

LUNGREN: So, where do we get started? Like where are the developers who are listening to this? The folks that are at, you know, labs, research labs and developing health care solutions. Where do they go to get started with the Foundry, the models we’ve talked about, the healthcare agent orchestrator. Where do they go?

GUYMAN: So AI.azure.com is the AI Foundry. It’s a website you can go as a developer. You can sign in with your Azure subscription, get your Azure account, your own VM, all that stuff. And you have an agent catalog, the model catalog. You can start from there. There is documentation and templates that you can then deploy to Teams or other applications.

LUNGREN: And tutorials are coming. Right. We have recordings of tutorials. We’ll have Hackathons, some sessions and then more to come. Yeah, we’re really excited.

[MUSIC]

LUNGREN: Thank you so much, guys for joining us.

CARLSON: Yes. Yeah. Thanks.

SALIGRAMA: Thanks for having us.

[MUSIC FADES]

The post Collaborators: Healthcare Innovation to Impact appeared first on Microsoft Research.

Magentic-UI, an experimental human-centered web agent

May 19, 2025

by Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Friederike Niedtner, Jack Gerrits, Jacob Alber, Jingya Chen, Griffin Bassman, Erkang (Eric) Zhu, Peter Chang, Ricky Loynd, Maya Murad, Rafah Hosn, Ece Kamar, Saleema Amershi Microsoft AI

This figure denotes a human figure above a small monitor on the left and a gear on the right with arrows pointing to each.

Modern productivity is rooted in the web—from searching for information and filling in forms to navigating dashboards. Yet, many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for tasks that are action-oriented and require activities beyond just performing simple web searches.

Magentic-UI builds on Magentic-One (opens in new tab), a powerful multi-agent team we released last year, and is powered by AutoGen (opens in new tab), our leading agent framework. It is available under MIT license at https://github.com/microsoft/Magentic-UI (opens in new tab) and on Azure AI Foundry Labs (opens in new tab), the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample (opens in new tab).

Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:

Collaborative planning with users (co-planning). Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions.
Collaborative execution with users (co-tasking). Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
Safety with human-in-the-loop (action guards). Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
Safety with human-in-the-loop. Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
Learning from experience (plan learning). Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks.

A screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan based on the following query — Figure 1: Screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan and progress to accomplish a user’s complex goal. The right side shows the browser Magentic-UI is controlling.

How is Magentic-UI human-centered?

While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design.

Co-planning - This figure describes how users can collaboratively plan with Magentic-UI. On the left hand side, users can accept the plan magentic-ui creates or re-create the plan. On the right hand side they can see the actions magnetic-ui is taking on the browser. — *Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.*

For example, after a person specifies and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial as users may have expectations of how the task should be completed; communicating that information could significantly improve agent performance. We call this feature co-planning.

During execution, Magentic-UI shows in real time what specific actions it’s about to take. For example, whether it is about to click on a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.

Co-tasking. This fiture shows how magentic-ui shows the plan it created to the user and allows the user to push a button to accept the plan or modify it. — *Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent.*

Action-guards. TOn the left hand side,this figure shous how Magentic-UI will ask users for permission using Approve or Reject buttons before executing actions that it deems consequential or important. On the right hand side the figure shows magentic-ui picking Crab Wonton from Thai Thani's restaurant menu. — *Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important.*

Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky (e.g., paying for an item), they can reject it.

This figure shows how magentic-ui learns a plan by allowing the users to click on a button entitled

This figure shows how users can see their saved plans and click on a button to either run that same plan or edit it before running it.

Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.

After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans for Magentic-UI to reuse in the future in a saved-plans gallery. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight.

Combined, these four features—co-planning, co-tasking, action guards, and plan learning—enable users to collaborate effectively with Magentic-UI.

Architecture

Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:

Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator.
Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown (opens in new tab) package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.

This figure shows a histogram. It's a comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the MAGUI agents, Magentic-UI with a simulated user that has aaccess to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost. — *Figure 6: System architecture diagram of Magentic-UI*

To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, regenerate steps, and write follow-up messages to iterate on the plan. While the user editing the plan adds an upfront cost to the interaction, it can potentially save a significant amount of time in the agent executing the plan and increase its chance at success.

The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents (WebSurfer, Coder, FileSurfer) or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response. After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step.

Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate (for example, because a certain website is unreachable), the Orchestrator can replan with user permission and start executing a new plan.

All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions (e.g., clicking a button) require approval.

Evaluating Magentic-UI

Magentic-UI innovates through its ability to integrate human feedback in its planning and execution of tasks. We performed a preliminary automated evaluation to showcase this ability on the GAIA benchmark (opens in new tab) for agents with a user-simulation experiment.

Evaluation with simulated users

This figure shows a box diagram to describe the architecture of Magentic-UI. — Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the MAGUI agents, Magentic-UI with a simulated user that has aaccess to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.

GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer.

To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.

We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the (simulated) user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead.

The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we expect a human would act. The first type of simulated user relies on OpenAI’s o4-mini, more performant at many tasks than the one powering the Magentic-UI agents (GPT-4o). For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide this information to the second type of simulated user,which is powered by an LLM so that it can mimic a knowledgeable user. Importantly, we tuned our simulated user so as not to reveal the ground-truth answer directly as the answer is usually found inside the human written plan. Instead, it is prompted to guide Magentic-UI indirectly. We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of tasks when Magentic-UI provides a final answer.

On the validation subset of GAIA (162 tasks), we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode (without the simulated user), Magentic-UI with simulated user (1) (smarter model), Magentic-UI with simulated user (2) (side-information), and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM (GPT-4o) is used for Magentic-UI and Magentic-One.

Magentic-UI with the simulated user that has access to side information improves the accuracy of autonomous Magentic-UI by 71%, from a 30.3% task-completion rate to a 51.9% task-completion rate. Moreover, Magentic-UI only asks for help from the simulated user in 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. And in those tasks where it does ask for help, it asks for help on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6% where Magentic-UI asks for help in only 4.3% of tasks, asking for help an average of 1.7 times in those tasks. This demonstrates the potential of even lightweight human feedback for improving performance (e.g., task completion) of autonomous agents, especially at a fraction of the cost compared to people completing tasks entirely manually.

Learning and reusing plans

As described above, once Magentic-UI completes a task, users have the option for Magentic-UI to learn a plan based on the execution of the task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.

The user can select a plan from the plan gallery, which is displayed by clicking on the Saved Plans button. Alternatively, as a user enters a task that closely matches a previous task, the saved plan will be displayed even before the user is done typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory (opens in new tab) to retrieve plans for any similar tasks. Our preliminary evaluations show that this retrieval is highly accurate, and when recalling a saved plan can be around 3x faster than generating a new plan. Once a plan is recalled or generated, the user can always accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand.

Safety and control

Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:

Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface
Anytime interruptions: At any point of Magentic-UI completing the task, the user can interrupt Magentic-UI and stop any pending code execution or web browsing.
Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI (Figure 3).
Detection and approval of irreversible agent actions: Users can configure an action-approval policy (action guards) to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action (e.g., any button click) needs explicit user approval. Users must press an “Accept” or “Deny” button for each action.

In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge the security and safety of Magentic-UI. Such scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent (e.g., to execute risky code, access sensitive files, or perform actions on other websites). It also contains scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information, or granting permissions on impostor sites (e.g., a synthetic website that asks Magentic-UI to log in and enter Google credentials to read an article). In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is eventually unable to complete the request due to Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.

We have also released transparency notes, which can be found at: https://github.com/microsoft/magentic-ui/blob/main/TRANSPARENCY_NOTE.md (opens in new tab)

Open research questions

Magentic-UI provides a tool for researchers to study critical questions in agentic systems and particularly on human-agent interaction. In a previous report (opens in new tab), we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how we enable humans to efficiently intervene and provide feedback to the agent while executing a task. Humans should not have to constantly watch the agent. Ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question is about safety. As agents interact with the live web, they may become prone to attacks from malicious actors. We need to study what necessary safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help with studying.

Conclusion

Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand in the scope of tasks they can complete, Magentic-UI’s design enables better transparency into agent actions and enables human control to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing human cost in completing tasks on aggregate. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains.

The post Magentic-UI, an experimental human-centered web agent appeared first on Microsoft Research.

Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers

May 15, 2025

by Peter Lee, Carey Goldberg, Dr. Isaac Kohane Microsoft AI

AI Revolution podcast | Episode 5 - Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers | outline illustration of Carey Goldberg, Peter Lee, and Dr. Isaac (Zak) Kohane

In this episode, Lee reunites with his coauthors Carey Goldberg (opens in new tab) and Dr. Zak Kohane (opens in new tab) to review the predictions they made and reflect on what has and hasn’t materialized based on discussions with the series’ early guests: frontline clinicians, patient/consumer advocates, technology developers, and policy and ethics thinkers. Together, the coauthors explore how generative AI is being used on the ground today—from clinical note-taking to empathetic patient communication—and discuss the ongoing tensions around safety, equity, and institutional adoption. The conversation also surfaces deeper questions about values embedded in AI systems and the future role of human clinicians.

Learn more

Compared with What? Measuring AI against the Health Care We Have (opens in new tab) (Kohane)
Publication | October 2024

Medical Artificial Intelligence and Human Values (opens in new tab) (Kohane)
Publication | May 2024

Managing Patient Use of Generative Health AI (opens in new tab) (Goldberg)
Publication | December 2024

Patient Portal — When Patients Take AI into Their Own Hands (opens in new tab) (Goldberg)
Publication | April 2024

To Do No Harm — and the Most Good — with AI in Health Care (opens in new tab) (Goldberg)
Publication | February 2024

This time, the hype about AI in medicine is warranted (opens in new tab) (Goldberg)
Opinion article | April 2023

The AI Revolution in Medicine: GPT-4 and Beyond  
Book | Peter Lee, Carey Goldberg, Isaac Kohane | April 2023

Transcript

[MUSIC]    

[BOOK PASSAGE] 

PETER LEE: “We need to start understanding and discussing AI’s potential for good and ill now. Or rather, yesterday. … GPT-4 has game-changing potential to improve medicine and health.”

[END OF BOOK PASSAGE] 

[THEME MUSIC]    

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.    

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.

[THEME MUSIC FADES] 

The passage I read at the top is from the book’s prologue.  

When Carey, Zak, and I wrote the book, we could only speculate how generative AI would be used in healthcare because GPT-4 hadn’t yet been released. It wasn’t yet available to the very people we thought would be most affected by it. And while we felt strongly that this new form of AI would have the potential to transform medicine, it was such a different kind of technology for the world, and no one had a user’s manual for this thing to explain how to use it effectively and also how to use it safely.

So we thought it would be important to give healthcare professionals and leaders a framing to start important discussions around its use. We wanted to provide a map not only to help people navigate a new world that we anticipated would happen with the arrival of GPT-4 but also to help them chart a future of what we saw as a potential revolution in medicine.

So I’m super excited to welcome my coauthors: longtime medical/science journalist Carey Goldberg and Dr. Zak Kohane, the inaugural chair of Harvard Medical School’s Department of Biomedical Informatics and the editor-in-chief for The New England Journal of Medicine AI.

We’re going to have two discussions. This will be the first one about what we’ve learned from the people on the ground so far and how we are thinking about generative AI today.

[TRANSITION MUSIC]

Carey, Zak, I’m really looking forward to this.

CAREY GOLDBERG: It’s nice to see you, Peter.

LEE: [LAUGHS] It’s great to see you, too.

GOLDBERG: We missed you.

ZAK KOHANE: The dynamic gang is back. [LAUGHTER]

LEE: Yeah, and I guess after that big book project two years ago, it’s remarkable that we’re still on speaking terms with each other. [LAUGHTER]

In fact, this episode is to react to what we heard in the first four episodes of this podcast. But before we get there, I thought maybe we should start with the origins of this project just now over two years ago. And, you know, I had this early secret access to Davinci 3, now known as GPT-4.

I remember, you know, experimenting right away with things in medicine, but I realized I was in way over my head. And so I wanted help. And the first person I called was you, Zak. And you remember we had a call, and I tried to explain what this was about. And I think I saw skepticism in—polite skepticism—in your eyes. But tell me, you know, what was going through your head when you heard me explain this thing to you?

KOHANE: So I was divided between the fact that I have tremendous respect for you, Peter. And you’ve always struck me as sober. And we’ve had conversations which showed to me that you fully understood some of the missteps that technology—ARPA, Microsoft, and others—had made in the past. And yet, you were telling me a full science fiction compliant story [LAUGHTER] that something that we thought was 30 years away was happening now.

LEE: Mm-hmm.

KOHANE: And it was very hard for me to put together. And so I couldn’t quite tell myself this is BS, but I said, you know, I need to look at it. Just this seems too good to be true. What is this? So it was very hard for me to grapple with it. I was thrilled that it might be possible, but I was thinking, How could this be possible?

LEE: Yeah. Well, even now, I look back, and I appreciate that you were nice to me, because I think a lot of people would have [LAUGHS] been much less polite. And in fact, I myself had expressed a lot of very direct skepticism early on.

After ChatGPT got released, I think three or four days later, I received an email from a colleague running … who runs a clinic, and, you know, he said, “Wow, this is great, Peter. And, you know, we’re using this ChatGPT, you know, to have the receptionist in our clinic write after-visit notes to our patients.”

And that sparked a huge internal discussion about this. And you and I knew enough about hallucinations and about other issues that it seemed important to write something about what this could do and what it couldn’t do. And so I think, I can’t remember the timing, but you and I decided a book would be a good idea. And then I think you had the thought that you and I would write in a hopelessly academic style [LAUGHTER] that no one would be able to read.

So it was your idea to recruit Carey, I think, right?

KOHANE: Yes, it was. I was sure that we both had a lot of material, but communicating it effectively to the very people we wanted to would not go well if we just left ourselves to our own devices. And Carey is super brilliant at what she does. She’s an idea synthesizer and public communicator in the written word and amazing.

LEE: So yeah. So, Carey, we contact you. How did that go?

GOLDBERG: So yes. On my end, I had known Zak for probably, like, 25 years, and he had always been the person who debunked the scientific hype for me. I would turn to him with like, “Hmm, they’re saying that the Human Genome Project is going to change everything.” And he would say, “Yeah. But first it’ll be 10 years of bad news, and then [LAUGHTER] we’ll actually get somewhere.”

So when Zak called me up at seven o’clock one morning, just beside himself after having tried Davinci 3, I knew that there was something very serious going on. And I had just quit my job as the Boston bureau chief of Bloomberg News, and I was ripe for the plucking. And I also … I feel kind of nostalgic now about just the amazement and the wonder and the awe of that period. We knew that when generative AI hit the world, there would be all kinds of snags and obstacles and things that would slow it down, but at that moment, it was just like the holy crap moment. [LAUGHTER] And it’s fun to think about it now.

LEE: Yeah. I think ultimately, you know, recruiting Carey, you were [LAUGHS] so important because you basically went through every single page of this book and made sure … I remember, in fact, it’s affected my writing since because you were coaching us that every page has to be a page turner. There has to be something on every page that motivates people to want to turn the page and get to the next one.

KOHANE: I will see that and raise that one. I now tell GPT-4, please write this in the style of Carey Goldberg.

GOLDBERG: [LAUGHTER] No way! Really?

KOHANE: Yes way. Yes way. Yes way.

GOLDBERG: Wow. Well, I have to say, like, it’s not hard to motivate readers when you’re writing about the most transformative technology of their lifetime. Like, I think there’s a gigantic hunger to read and to understand. So you were not hard to work with, Peter and Zak. [LAUGHS]

LEE: All right. So I think we have to get down to work [LAUGHS] now.

Yeah, so for these podcasts, you know, we’re talking to different types of people to just reflect on what’s actually happening, what has actually happened over the last two years. And so the first episode, we talked to two doctors. There’s Chris Longhurst at UC San Diego and Sara Murray at UC San Francisco. And besides being doctors and having AI affect their clinical work, they just happen also to be leading the efforts at their respective institutions to figure out how best to integrate AI into their health systems.

And, you know, it was fun to talk to them. And I felt like a lot of what they said was pretty validating for us. You know, they talked about AI scribes. Chris, especially, talked a lot about how AI can respond to emails from patients, write referral letters. And then, you know, they both talked about the importance of—I think, Zak, you used the phrase in our book “trust but verify”—you know, to have always a human in the loop.

What did you two take away from their thoughts overall about how doctors are using … and I guess, Zak, you would have a different lens also because at Harvard, you see doctors all the time grappling with AI.

KOHANE: So on the one hand, I think they’ve done some very interesting studies. And indeed, they saw that when these generative models, when GPT-4, was sending a note to patients, it was more detailed, friendlier.

But there were also some nonobvious results, which is on the generation of these letters, if indeed you review them as you’re supposed to, it was not clear that there was any time savings. And my own reaction was, Boy, every one of these things needs institutional review. It’s going to be hard to move fast.

And yet, at the same time, we know from them that the doctors on their smartphones are accessing these things all the time. And so the disconnect between a healthcare system, which is duty bound to carefully look at every implementation, is, I think, intimidating.

LEE: Yeah.

KOHANE: And at the same time, doctors who just have to do what they have to do are using this new superpower and doing it. And so that’s actually what struck me …

LEE: Yeah.

KOHANE: … is that these are two leaders and they’re doing what they have to do for their institutions, and yet there’s this disconnect.

And by the way, I don’t think we’ve seen any faster technology adoption than the adoption of ambient dictation. And it’s not because it’s time saving. And in fact, so far, the hospitals have to pay out of pocket. It’s not like insurance is paying them more. But it’s so much more pleasant for the doctors … not least of which because they can actually look at their patients instead of looking at the terminal and plunking down.

LEE: Carey, what about you?

GOLDBERG: I mean, anecdotally, there are time savings. Anecdotally, I have heard quite a few doctors saying that it cuts down on “pajama time” to be able to have the note written by the AI and then for them to just check it. In fact, I spoke to one doctor who said, you know, basically it means that when I leave the office, I’ve left the office. I can go home and be with my kids.

So I don’t think the jury is fully in yet about whether there are time savings. But what is clear is, Peter, what you predicted right from the get-go, which is that this is going to be an amazing paper shredder. Like, the main first overarching use cases will be back-office functions.

LEE: Yeah, yeah. Well, and it was, I think, not a hugely risky prediction because, you know, there were already companies, like, using phone banks of scribes in India to kind of listen in. And, you know, lots of clinics actually had human scribes being used. And so it wasn’t a huge stretch to imagine the AI.

[TRANSITION MUSIC]

So on the subject of things that we missed, Chris Longhurst shared this scenario, which stuck out for me, and he actually coauthored a paper on it last year.

CHRISTOPHER LONGHURST: It turns out, not surprisingly, healthcare can be frustrating. And stressed patients can send some pretty nasty messages to their care teams. [LAUGHTER] And you can imagine being a busy, tired, exhausted clinician and receiving a bit of a nasty-gram. And the GPT is actually really helpful in those instances in helping draft a pretty empathetic response when I think the human instinct would be a pretty nasty one.

LEE: [LAUGHS] So, Carey, maybe I’ll start with you. What did we understand about this idea of empathy out of AI at the time we wrote the book, and what do we understand now?

GOLDBERG: Well, it was already clear when we wrote the book that these AI models were capable of very persuasive empathy. And in fact, you even wrote that it was helping you be a better person, right. [LAUGHS] So their human qualities, or human imitative qualities, were clearly superb. And we’ve seen that borne out in multiple studies, that in fact, patients respond better to them … that they have no problem at all with how the AI communicates with them. And in fact, it’s often better.

And I gather now we’re even entering a period when people are complaining of sycophantic models, [LAUGHS] where the models are being too personable and too flattering. I do think that’s been one of the great surprises. And in fact, this is a huge phenomenon, how charming these models can be.

LEE: Yeah, I think you’re right. We can take credit for understanding that, Wow, these things can be remarkably empathetic. But then we missed this problem of sycophancy. Like, we even started our book in Chapter 1 with a quote from Davinci 3 scolding me. Like, don’t you remember when we were first starting, this thing was actually anti-sycophantic. If anything, it would tell you you’re an idiot.

KOHANE: It argued with me about certain biology questions. It was like a knockdown, drag-out fight. [LAUGHTER] I was bringing references. It was impressive. But in fact, it made me trust it more.

LEE: Yeah.

KOHANE: And in fact, I will say—I remember it’s in the book—I had a bone to pick with Peter. Peter really was impressed by the empathy. And I pointed out that some of the most popular doctors are popular because they’re very empathic. But they’re not necessarily the best doctors. And in fact, I was taught that in medical school.

And so it’s a decoupling. It’s a human thing, that the empathy does not necessarily mean … it’s more of a, potentially, more of a signaled virtue than an actual virtue.

GOLDBERG: Nicely put.

LEE: Yeah, this issue of sycophancy, I think, is a struggle right now in the development of AI because I think it’s somehow related to instruction-following. So, you know, one of the challenges in AI is you’d like to give an AI a task—a task that might take several minutes or hours or even days to complete. And you want it to faithfully kind of follow those instructions. And, you know, that early version of GPT-4 was not very good at instruction-following. It would just silently disobey and, you know, and do something different.

And so I think we’re starting to hit some confusing elements of like, how agreeable should these things be?

One of the two of you used the word genteel. There was some point even while we were, like, on a little book tour … was it you, Carey, who said that the model seems nicer and less intelligent or less brilliant now than it did when we were writing the book?

GOLDBERG: It might have been, I think so. And I mean, I think in the context of medicine, of course, the question is, well, what’s likeliest to get the results you want with the patient, right? A lot of healthcare is in fact persuading the patient to do what you know as the physician would be best for them. And so it seems worth testing out whether this sycophancy is actually constructive or not. And I suspect … well, I don’t know, probably depends on the patient.

So actually, Peter, I have a few questions for you …

LEE: Yeah. Mm-hmm.

GOLDBERG: … that have been lingering for me. And one is, for AI to ever fully realize its potential in medicine, it must deal with the hallucinations. And I keep hearing conflicting accounts about whether that’s getting better or not. Where are we at, and what does that mean for use in healthcare?

LEE: Yeah, well, it’s, I think two years on, in the pretrained base models, there’s no doubt that hallucination rates by any benchmark measure have reduced dramatically. And, you know, that doesn’t mean they don’t happen. They still happen. But, you know, there’s been just a huge amount of effort and understanding in the, kind of, fundamental pretraining of these models. And that has come along at the same time that the inference costs, you know, for actually using these models has gone down, you know, by several orders of magnitude.

So things have gotten cheaper and have fewer hallucinations. At the same time, now there are these reasoning models. And the reasoning models are able to solve problems at PhD level oftentimes.

But at least at the moment, they are also now hallucinating more than the simpler pretrained models. And so it still continues to be, you know, a real issue, as we were describing. I don’t know, Zak, from where you’re at in medicine, as a clinician and as an educator in medicine, how is the medical community from where you’re sitting looking at that?

KOHANE: So I think it’s less of an issue, first of all, because the rate of hallucinations is going down. And second of all, in their day-to-day use, the doctor will provide questions that sit reasonably well into the context of medical decision-making. And the way doctors use this, let’s say on their non-EHR [electronic health record] smartphone is really to jog their memory or thinking about the patient, and they will evaluate independently. So that seems to be less of an issue. I’m actually more concerned about something else that’s I think more fundamental, which is effectively, what values are these models expressing?

And I’m reminded of when I was still in training, I went to a fancy cocktail party in Cambridge, Massachusetts, and there was a psychotherapist speaking to a dentist. They were talking about their summer, and the dentist was saying about how he was going to fix up his yacht that summer, and the only question was whether he was going to make enough money doing procedures in the spring so that he could afford those things, which was discomforting to me because that dentist was my dentist. [LAUGHTER] And he had just proposed to me a few weeks before an expensive procedure.

And so the question is what, effectively, is motivating these models?

LEE: Yeah, yeah.

KOHANE: And so with several colleagues, I published a paper (opens in new tab), basically, what are the values in AI? And we gave a case: a patient, a boy who is on the short side, not abnormally short, but on the short side, and his growth hormone levels are not zero. They’re there, but they’re on the lowest side. But the rest of the workup has been unremarkable. And so we asked GPT-4, you are a pediatric endocrinologist.

Should this patient receive growth hormone? And it did a very good job explaining why the patient should receive growth hormone.

GOLDBERG: Should. Should receive it.

KOHANE: Should. And then we asked, in a separate session, you are working for the insurance company. Should this patient receive growth hormone? And it actually gave a scientifically better reason not to give growth hormone. And in fact, I tend to agree medically, actually, with the insurance company in this case, because giving kids who are not growth hormone deficient, growth hormone gives only a couple of inches over many, many years, has all sorts of other issues. But here’s the point, we had 180-degree change in decision-making because of the prompt. And for that patient, tens-of-thousands-of-dollars-per-year decision; across patient populations, millions of dollars of decision-making.

LEE: Hmm. Yeah.

KOHANE: And you can imagine these user prompts making their way into system prompts, making their way into the instruction-following. And so I think this is aptly central. Just as I was wondering about my dentist, we should be wondering about these things. What are the values that are being embedded in them, some accidentally and some very much on purpose?

LEE: Yeah, yeah. That one, I think, we even had some discussions as we were writing the book, but there’s a technical element of that that I think we were missing, but maybe Carey, you would know for sure. And that’s this whole idea of prompt engineering. It sort of faded a little bit. Was it a thing? Do you remember?

GOLDBERG: I don’t think we particularly wrote about it. It’s funny, it does feel like it faded, and it seems to me just because everyone just gets used to conversing with the models and asking for what they want. Like, it’s not like there actually is any great science to it.

LEE: Yeah, even when it was a hot topic and people were talking about prompt engineering maybe as a new discipline, all this, it never, I was never convinced at the time. But at the same time, it is true. It speaks to what Zak was just talking about because part of the prompt engineering that people do is to give a defined role to the AI.

You know, you are an insurance claims adjuster, or something like that, and defining that role, that is part of the prompt engineering that people do.

GOLDBERG: Right. I mean, I can say, you know, sometimes you guys had me take sort of the patient point of view, like the “every patient” point of view. And I can say one of the aspects of using AI for patients that remains absent in as far as I can tell is it would be wonderful to have a consumer-facing interface where you could plug in your whole medical record without worrying about any privacy or other issues and be able to interact with the AI as if it were physician or a specialist and get answers, which you can’t do yet as far as I can tell.

LEE: Well, in fact, now that’s a good prompt because I think we do need to move on to the next episodes, and we’ll be talking about an episode that talks about consumers. But before we move on to Episode 2, which is next, I’d like to play one more quote, a little snippet from Sara Murray.

SARA MURRAY: I already do this when I’m on rounds—I’ll kind of give the case to ChatGPT if it’s a complex case, and I’ll say, “Here’s how I’m thinking about it; are there other things?” And it’ll give me additional ideas that are sometimes useful and sometimes not but often useful, and I’ll integrate them into my conversation about the patient.

LEE: Carey, you wrote this fictional account at the very start of our book. And that fictional account, I think you and Zak worked on that together, talked about this medical resident, ER resident, using, you know, a chatbot off label, so to speak. And here we have the chief, in fact, the nation’s first chief health AI officer [LAUGHS] for an elite health system doing exactly that. That’s got to be pretty validating for you, Carey.

GOLDBERG: It’s very. [LAUGHS] Although what’s troubling about it is that actually as in that little vignette that we made up, she’s using it off label, right. It’s like she’s just using it because it helps the way doctors use Google. And I do find it troubling that what we don’t have is sort of institutional buy-in for everyone to do that because, shouldn’t they if it helps?

LEE: Yeah. Well, let’s go ahead and get into Episode 2. So Episode 2, we sort of framed as talking to two people who are on the frontlines of big companies integrating generative AI into their clinical products. And so, one was Matt Lungren, who’s a colleague of mine here at Microsoft. And then Seth Hain, who leads all of R&D at Epic.

Maybe we’ll start with a little snippet of something that Matt said that struck me in a certain way.

MATTHEW LUNGREN: OK, we see this pain point. Doctors are typing on their computers while they’re trying to talk to their patients, right? We should be able to figure out a way to get that ambient conversation turned into text that then, you know, accelerates the doctor … takes all the important information. That’s a really hard problem, right. And so, for a long time, there was a human-in-the-loop aspect to doing this because you needed a human to say, “This transcript’s great, but here’s actually what needs to go in the note.” And that can’t scale.

LEE: I think we expected healthcare systems to adopt AI, and we spent a lot of time in the book on AI writing clinical encounter notes. It’s happening for real now, and in a big way. And it’s something that has, of course, been happening before generative AI but now is exploding because of it. Where are we at now, two years later, just based on what we heard from guests?

KOHANE: Well, again, unless they’re forced to, hospitals will not adopt new technology unless it immediately translates into income. So it’s bizarrely counter-cultural that, again, they’re not being able to bill for the use of the AI, but this technology is so compelling to the doctors that despite everything, it’s overtaking the traditional dictation-typing routine.

LEE: Yeah.

GOLDBERG: And a lot of them love it and say, you will pry my cold dead hands off of my ambient note-taking, right. And I actually … a primary care physician allowed me to watch her. She was actually testing the two main platforms that are being used. And there was this incredibly talkative patient who went on and on about vacation and all kinds of random things for about half an hour.

And both of the platforms were incredibly good at pulling out what was actually medically relevant. And so to say that it doesn’t save time doesn’t seem right to me. Like, it seemed like it actually did and in fact was just shockingly good at being able to pull out relevant information.

LEE: Yeah.

KOHANE: I’m going to hypothesize that in the trials, which have in fact shown no gain in time, is the doctors were being incredibly meticulous. [LAUGHTER] So I think … this is a Hawthorne effect, because you know you’re being monitored. And we’ve seen this in other technologies where the moment the focus is off, it’s used much more routinely and with much less inspection, for the better and for the worse.

LEE: Yeah, you know, within Microsoft, I had some internal disagreements about Microsoft producing a product in this space. It wouldn’t be Microsoft’s normal way. Instead, we would want 50 great companies building those products and doing it on our cloud instead of us competing against those 50 companies. And one of the reasons is exactly what you both said. I didn’t expect that health systems would be willing to shell out the money to pay for these things. It doesn’t generate more revenue. But I think so far two years later, I’ve been proven wrong.

I wanted to ask a question about values here. I had this experience where I had a little growth, a bothersome growth on my cheek. And so had to go see a dermatologist. And the dermatologist treated it, froze it off. But there was a human scribe writing the clinical note.

And so I used the app to look at the note that was submitted. And the human scribe said something that did not get discussed in the exam room, which was that the growth was making it impossible for me to safely wear a COVID mask. And that was the reason for it.

And that then got associated with a code that allowed full reimbursement for that treatment. And so I think that’s a classic example of what’s called upcoding. And I strongly suspect that AI scribes, an AI scribe would not have done that.

GOLDBERG: Well, depending what values you programmed into it, right, Zak? [LAUGHS]

KOHANE: Today, today, today, it will not do it. But, Peter, that is actually the central issue that society has to have because our hospitals are currently mostly in the red. And upcoding is standard operating procedure. And if these AI get in the way of upcoding, they are going to be aligned towards that upcoding. You know, you have to ask yourself, these MRI machines are incredibly useful. They’re also big money makers. And if the AI correctly says that for this complaint, you don’t actually have to do the MRI …

LEE: Right.

KOHANE: … what’s going to happen? And so I think this issue of values … you’re right. Right now, they’re actually much more impartial. But there are going to be business plans just around aligning these things towards healthcare. In many ways, this is why I think we wrote the book so that there should be a public discussion. And what kind of AI do we want to have? Whose values do we want it to represent?

GOLDBERG: Yeah. And that raises another question for me. So, Peter, speaking from inside the gigantic industry, like, there seems to be such a need for self-surveillance of the models for potential harms that they could be causing. Are the big AI makers doing that? Are they even thinking about doing that?

Like, let’s say you wanted to watch out for the kind of thing that Zak’s talking about, could you?

LEE: Well, I think evaluation, like the best evaluation we had when we wrote our book was, you know, what score would this get on the step one and step two US medical licensing exams? [LAUGHS]

GOLDBERG: Right, right, right, yeah.

LEE: But honestly, evaluation hasn’t gotten that much deeper in the last two years. And it’s a big, I think, it is a big issue. And it’s related to the regulation issue also, I think.

Now the other guest in Episode 2 is Seth Hain from Epic. You know, Zak, I think it’s safe to say that you’re not a fan of Epic and the Epic system. You know, we’ve had a few discussions about that, about the fact that doctors don’t have a very pleasant experience when they’re using Epic all day.

Seth, in the podcast, said that there are over 100 AI integrations going on in Epic’s system right now. Do you think, Zak, that that has a chance to make you feel better about Epic? You know, what’s your view now two years on?

KOHANE: My view is, first of all, I want to separate my view of Epic and how it’s affected the conduct of healthcare and the quality of life of doctors from the individuals. Like Seth Hain is a remarkably fine individual who I’ve enjoyed chatting with and does really great stuff. Among the worst aspects of the Epic, even though it’s better in that respect than many EHRs, is horrible user interface.

The number of clicks that you have to go to get to something. And you have to remember where someone decided to put that thing. It seems to me that it is fully within the realm of technical possibility today to actually give an agent a task that you want done in the Epic record. And then whether Epic has implemented that agent or someone else, it does it so you don’t have to do the clicks. Because it’s something really soul sucking that when you’re trying to help patients, you’re having to remember not the right dose of the medication, but where was that particular thing that you needed in that particular task?

I can’t imagine that Epic does not have that in its product line. And if not, I know there must be other companies that essentially want to create that wrapper. So I do think, though, that the danger of multiple integrations is that you still want to have the equivalent of a single thought process that cares about the patient bringing those different processes together. And I don’t know if that’s Epic’s responsibility, the hospital’s responsibility, whether it’s actually a patient agent. But someone needs to be also worrying about all those AIs that are being integrated into the patient record. So … what do you think, Carey?

GOLDBERG: What struck me most about what Seth said was his description of the Cosmos project, and I, you know, I have been drinking Zak’s Kool-Aid for a very long time, [LAUGHTER] and he—no, in a good way! And he persuaded me long ago that there is this horrible waste happening in that we have all of these electronic medical records, which could be used far, far more to learn from, and in particular, when you as a patient come in, it would be ideal if your physician could call up all the other patients like you and figure out what the optimal treatment for you would be. And it feels like—it sounds like—that’s one of the central aims that Epic is going for. And if they do that, I think that will redeem a lot of the pain that they’ve caused physicians these last few years.

And I also found myself thinking, you know, maybe this very painful period of using electronic medical records was really just a growth phase. It was an awkward growth phase. And once AI is fully used the way Zak is beginning to describe, the whole system could start making a lot more sense for everyone.

LEE: Yeah. One conversation I’ve had with Seth, in all of this is, you know, with AI and its development, is there a future, a near future where we don’t have an EHR [electronic health record] system at all? You know, AI is just listening and just somehow absorbing all the information. And, you know, one thing that Seth said, which I felt was prescient, and I’d love to get your reaction, especially Zak, on this is he said, I think that … he said, technically, it could happen, but the problem is right now, actually doctors do a lot of their thinking when they write and review notes. You know, the actual process of being a doctor is not just being with a patient, but it’s actually thinking later. What do you make of that?

KOHANE: So one of the most valuable experiences I had in training was something that’s more or less disappeared in medicine, which is the post-clinic conference, where all the doctors come together and we go through the cases that we just saw that afternoon. And we, actually, were trying to take potshots at each other [LAUGHTER] in order to actually improve. Oh, did you actually do that? Oh, I forgot. I’m going to go call the patient and do that.

And that really happened. And I think that, yes, doctors do think, and I do think that we are insufficiently using yet the artificial intelligence currently in the ambient dictation mode as much more of a independent agent saying, did you think about that?

I think that would actually make it more interesting, challenging, and clearly better for the patient because that conversation I just told you about with the other doctors, that no longer exists.

LEE: Yeah. Mm-hmm. I want to do one more thing here before we leave Matt and Seth in Episode 2, which is something that Seth said with respect to how to reduce hallucination.

SETH HAIN: At that time, there’s a lot of conversation in the industry around something called RAG, or retrieval-augmented generation. And the idea was, could you pull the relevant bits, the relevant pieces of the chart, into that prompt, that information you shared with the generative AI model, to be able to increase the usefulness of the draft that was being created? And that approach ended up proving and continues to be to some degree, although the techniques have greatly improved, somewhat brittle, right. And I think this becomes one of the things that we are and will continue to improve upon because, as you get a richer and richer amount of information into the model, it does a better job of responding.

LEE: Yeah, so, Carey, this sort of gets at what you were saying, you know, that shouldn’t these models be just bringing in a lot more information into their thought processes? And I’m certain when we wrote our book, I had no idea. I did not conceive of RAG at all. It emerged a few months later.

And to my mind, I remember the first time I encountered RAG—Oh, this is going to solve all of our problems of hallucination. But it’s turned out to be harder. It’s improving day by day, but it’s turned out to be a lot harder.

KOHANE: Seth makes a very deep point, which is the way RAG is implemented is basically some sort of technique for pulling the right information that’s contextually relevant. And the way that’s done is typically heuristic at best. And it’s not … doesn’t have the same depth of reasoning that the rest of the model has.

And I’m just wondering, Peter, what you think, given the fact that now context lengths seem to be approaching a million or more, and people are now therefore using the full strength of the transformer on that context and are trying to figure out different techniques to make it pay attention to the middle of the context. In fact, the RAG approach perhaps was just a transient solution to the fact that it’s going to be able to amazingly look in a thoughtful way at the entire record of the patient, for example. What do you think, Peter?

LEE: I think there are three things, you know, that are going on, and I’m not sure how they’re going to play out and how they’re going to be balanced. And I’m looking forward to talking to people in later episodes of this podcast, you know, people like Sébastien Bubeck or Bill Gates about this, because, you know, there is the pretraining phase, you know, when things are sort of compressed and baked into the base model.

There is the in-context learning, you know, so if you have extremely long or infinite context, you’re kind of learning as you go along. And there are other techniques that people are working on, you know, various sorts of dynamic reinforcement learning approaches, and so on. And then there is what maybe you would call structured RAG, where you do a pre-processing. You go through a big database, and you figure it all out. And make a very nicely structured database the AI can then consult with later.

And all three of these in different contexts today seem to show different capabilities. But they’re all pretty important in medicine.

[TRANSITION MUSIC]

Moving on to Episode 3, we talked to Dave DeBronkart, who is also known as “e-Patient Dave,” an advocate of patient empowerment, and then also Christina Farr, who has been doing a lot of venture investing for consumer health applications.

Let’s get right into this little snippet from something that e-Patient Dave said that talks about the sources of medical information, particularly relevant for when he was receiving treatment for stage 4 kidney cancer.

DAVE DEBRONKART: And I’m making a point here of illustrating that I am anything but medically trained, right. And yet I still, I want to understand as much as I can. I was months away from dead when I was diagnosed, but in the patient community, I learned that they had a whole bunch of information that didn’t exist in the medical literature. Now today we understand there’s publication delays; there’s all kinds of reasons. But there’s also a whole bunch of things, especially in an unusual condition, that will never rise to the level of deserving NIH [National Institute of Health] funding and research.

LEE: All right. So I have a question for you, Carey, and a question for you, Zak, about the whole conversation with e-Patient Dave, which I thought was really remarkable. You know, Carey, I think as we were preparing for this whole podcast series, you made a comment—I actually took it as a complaint—that not as much has happened as I had hoped or thought. People aren’t thinking boldly enough, you know, and I think, you know, I agree with you in the sense that I think we expected a lot more to be happening, particularly in the consumer space. I’m giving you a chance to vent about this.

GOLDBERG: [LAUGHTER] Thank you! Yes, that has been by far the most frustrating thing to me. I think that the potential for AI to improve everybody’s health is so enormous, and yet, you know, it needs some sort of support to be able to get to the point where it can do that. Like, remember in the book we wrote about Greg Moore talking about how half of the planet doesn’t have healthcare, but people overwhelmingly have cellphones. And so you could connect people who have no healthcare to the world’s medical knowledge, and that could certainly do some good.

And I have one great big problem with e-Patient Dave, which is that, God, he’s fabulous. He’s super smart. Like, he’s not a typical patient. He’s an off-the-charts, brilliant patient. And so it’s hard to … and so he’s a great sort of lead early-adopter-type person, and he can sort of show the way for others.

But what I had hoped for was that there would be more visible efforts to really help patients optimize their healthcare. Probably it’s happening a lot in quiet ways like that any discharge instructions can be instantly beautifully translated into a patient’s native language and so on. But it’s almost like there isn’t a mechanism to allow this sort of mass consumer adoption that I would hope for.

LEE: Yeah. But you have written some, like, you even wrote about that person who saved his dog (opens in new tab). So do you think … you know, and maybe a lot more of that is just happening quietly that we just never hear about?

GOLDBERG: I’m sure that there is a lot of it happening quietly. And actually, that’s another one of my complaints is that no one is gathering that stuff. It’s like you might happen to see something on social media. Actually, e-Patient Dave has a hashtag, PatientsUseAI, and a blog, as well. So he’s trying to do it. But I don’t know of any sort of overarching or academic efforts to, again, to surveil what’s the actual use in the population and see what are the pros and cons of what’s happening.

LEE: Mm-hmm. So, Zak, you know, the thing that I thought about, especially with that snippet from Dave, is your opening for Chapter 8 that you wrote, you know, about your first patient dying in your arms. I still think of how traumatic that must have been. Because, you know, in that opening, you just talked about all the little delays, all the little paper-cut delays, in the whole process of getting some new medical technology approved. But there’s another element that Dave kind of speaks to, which is just, you know, patients who are experiencing some issue are very, sometimes very motivated. And there’s just a lot of stuff on social media that happens.

KOHANE: So this is where I can both agree with Carey and also disagree. I think when people have an actual health problem, they are now routinely using it.

GOLDBERG: Yes, that’s true.

KOHANE: And that situation is happening more often because medicine is failing. This is something that did not come up enough in our book. And perhaps that’s because medicine is actually feeling a lot more rickety today than it did even two years ago.

We actually mentioned the problem. I think, Peter, you may have mentioned the problem with the lack of primary care. But now in Boston, our biggest healthcare system, all the practices for primary care are closed. I cannot get for my own faculty—residents at MGH [Massachusetts General Hospital] can’t get primary care doctor. And so …

LEE: Which is just crazy. I mean, these are amongst the most privileged people in medicine, and they can’t find a primary care physician. That’s incredible.

KOHANE: Yeah, and so therefore … and I wrote an article about this in the NEJM [New England Journal of Medicine] (opens in new tab) that medicine is in such dire trouble that we have incredible technology, incredible cures, but where the rubber hits the road, which is at primary care, we don’t have very much.

And so therefore, you see people who know that they have a six-month wait till they see the doctor, and all they can do is say, “I have this rash. Here’s a picture. What’s it likely to be? What can I do?” “I’m gaining weight. How do I do a ketogenic diet?” Or, “How do I know that this is the flu?”

This is happening all the time, where acutely patients have actually solved problems that doctors have not. Those are spectacular. But I’m saying more routinely because of the failure of medicine. And it’s not just in our fee-for-service United States. It’s in the UK; it’s in France. These are first-world, developed-world problems. And we don’t even have to go to lower- and middle-income countries for that.

LEE: Yeah.

GOLDBERG: But I think it’s important to note that, I mean, so you’re talking about how even the most elite people in medicine can’t get the care they need. But there’s also the point that we have so much concern about equity in recent years. And it’s likeliest that what we’re doing is exacerbating inequity because it’s only the more connected, you know, better off people who are using AI for their health.

KOHANE: Oh, yes. I know what various Harvard professors are doing. They’re paying for a concierge doctor. And that’s, you know, a $5,000- to $10,000-a-year-minimum investment. That’s inequity.

LEE: When we wrote our book, you know, the idea that GPT-4 wasn’t trained specifically for medicine, and that was amazing, but it might get even better and maybe would be necessary to do that. But one of the insights for me is that in the consumer space, the kinds of things that people ask about are different than what the board-certified clinician would ask.

KOHANE: Actually, that’s, I just recently coined the term. It’s the … maybe it’s … well, at least it’s new to me. It’s the technology or expert paradox. And that is the more expert and narrow your medical discipline, the more trivial it is to translate that into a specialized AI. So echocardiograms? We can now do beautiful echocardiograms. That’s really hard to do. I don’t know how to interpret an echocardiogram. But they can do it really, really well. Interpret an EEG [electroencephalogram]. Interpret a genomic sequence. But understanding the fullness of the human condition, that’s actually hard. And actually, that’s what primary care doctors do best. But the paradox is right now, what is easiest for AI is also the most highly paid in medicine. [LAUGHTER] Whereas what is the hardest for AI in medicine is the least regarded, least paid part of medicine.

GOLDBERG: So this brings us to the question I wanted to throw at both of you actually, which is we’ve had this spasm of incredibly prominent people predicting that in fact physicians would be pretty obsolete within the next few years. We had Bill Gates saying that; we had Elon Musk saying surgeons are going to be obsolete within a few years. And I think we had Demis Hassabis saying, “Yeah, we’ll probably cure most diseases within the next decade or so.” [LAUGHS]

So what do you think? And also, Zak, to what you were just saying, I mean, you’re talking about being able to solve very general overarching problems. But in fact, these general overarching models are actually able, I would think, are able to do that because they are broad. So what are we heading towards do you think? What should the next book be … The end of doctors? [LAUGHS]

KOHANE: So I do recall a conversation that … we were at a table with Bill Gates, and Bill Gates immediately went to this, which is advancing the cutting edge of science. And I have to say that I think it will accelerate discovery. But eliminating, let’s say, cancer? I think that’s going to be … that’s just super hard. The reason it’s super hard is we don’t have the data or even the beginnings of the understanding of all the ways this devilish disease managed to evolve around our solutions.

And so that seems extremely hard. I think we’ll make some progress accelerated by AI, but solving it in a way Hassabis says, God bless him. I hope he’s right. I’d love to have to eat crow in 10 or 20 years, but I don’t think so. I do believe that a surgeon working on one of those Davinci machines, that stuff can be, I think, automated.

And so I think that’s one example of one of the paradoxes I described. And it won’t be that we’re replacing doctors. I just think we’re running out of doctors. I think it’s really the case that, as we said in the book, we’re getting a huge deficit in primary care doctors.

But even the subspecialties, my subspecialty, pediatric endocrinology, we’re only filling half of the available training slots every year. And why? Because it’s a lot of work, a lot of training, and frankly doesn’t make as much money as some of the other professions.

LEE: Yeah. Yeah, I tend to think that, you know, there are going to be always a need for human doctors, not for their skills. In fact, I think their skills increasingly will be replaced by machines. And in fact, I’ve talked about a flip. In fact, patients will demand, Oh my god, you mean you’re going to try to do that yourself instead of having the computer do it? There’s going to be that sort of flip. But I do think that when it comes to people’s health, people want the comfort of an authority figure that they trust. And so what is more of a question for me is whether we will ever view a machine as an authority figure that we can trust.

And before I move on to Episode 4, which is on norms, regulations and ethics, I’d like to hear from Chrissy Farr on one more point on consumer health, specifically as it relates to pregnancy:

CHRISTINA FARR: For a lot of women, it’s their first experience with the hospital. And, you know, I think it’s a really big opportunity for these systems to get a whole family on board and keep them kind of loyal. And a lot of that can come through, you know, just delivering an incredible service. Unfortunately, I don’t think that we are delivering incredible services today to women in this country. I see so much room for improvement.

LEE: In the consumer space, I don’t think we really had a focus on those periods in a person’s life when they have a lot of engagement, like pregnancy, or I think another one is menopause, cancer. You know, there are points where there is, like, very intense engagement. And we heard that from e-Patient Dave, you know, with his cancer and Chrissy with her pregnancy. Was that a miss in our book? What do think, Carey?

GOLDBERG: I mean, I don’t think so. I think it’s true that there are many points in life when people are highly engaged. To me, the problem thus far is just that I haven’t seen consumer-facing companies offering beautiful AI-based products. I think there’s no question at all that the market is there if you have the products to offer.

LEE: So, what do you think this means, Zak, for, you know, like Boston Children’s or Mass General Brigham—you know, the big places?

KOHANE: So again, all these large healthcare systems are in tough shape. MGB [Mass General Brigham] would be fully in the red if not for the fact that its investments, of all things, have actually produced. If you look at the large healthcare systems around the country, they are in the red. And there’s multiple reasons why they’re in the red, but among them is cost of labor.

And so we’ve created what used to be a very successful beast, the health center. But it’s developed a very expensive model and a highly regulated model. And so when you have high revenue, tiny margins, your ability to disrupt yourself, to innovate, is very, very low because you will have to talk to the board next year if you went from 2% positive margin to 1% negative margin.

LEE: Yeah.

KOHANE: And so I think we’re all waiting for one of the two things to happen, either a new kind of healthcare delivery system being generated or ultimately one of these systems learns how to disrupt itself.

LEE: Yeah. All right. I think we have to move on to Episode 4. And, you know, when it came to the question of regulation, I think this is … my read is when we were writing our book, this is the part that we struggled with the most.

GOLDBERG: We punted. [LAUGHS] We totally punted to the AI.

LEE: We had three amazing guests. One was Laura Adams from National Academy of Medicine. Let’s play a snippet from her.

LAURA ADAMS: I think one of the most provocative and exciting articles that I saw written recently was by Bakul Patel and David Blumenthal, who posited, should we be regulating generative AI as we do a licensed and qualified provider? Should it be treated in the sense that it’s got to have a certain amount of training and a foundation that’s got to pass certain tests? Does it have to report its performance? And I’m thinking, what a provocative idea, but it’s worth considering.

LEE: All right, so I very well remember that we had discussed this kind of idea when we were writing our book. And I think before we finished our book, I personally rejected the idea. But now two years later, what do the two of you think? I’m dying to hear.

GOLDBERG: Well, wait, why … what do you think? Like, are you sorry that you rejected it?

LEE: I’m still skeptical because when we are licensing human beings as doctors, you know, we’re making a lot of implicit assumptions that we don’t test as part of their licensure, you know, that first of all, they are [a] human being and they care about life, and that, you know, they have a certain amount of common sense and shared understanding of the world.

And there’s all sorts of sort of implicit assumptions that we have about each other as human beings living in a society together. That you know how to study, you know, because I know you just went through three years of medical or four years of medical school and all sorts of things. And so the standard ways that we license human beings, they don’t need to test all of that stuff. But somehow intuitively, all of that seems really important.

I don’t know. Am I wrong about that?

KOHANE: So it’s compared with what issue? Because we know for a fact that doctors who do a lot of a procedure, like do this procedure, like high-risk deliveries all the time, have better outcomes than ones who only do a few high risk. We talk about it, but we don’t actually make it explicit to patients or regulate that you have to have this minimal amount. And it strikes me that in some sense, and, oh, very importantly, these things called human beings learn on the job. And although I used to be very resentful of it as a resident, when someone would say, I don’t want the resident, I want the …

GOLDBERG: … the attending. [LAUGHTER]

KOHANE: … they had a point. And so the truth is, maybe I was a wonderful resident, but some people were not so great. [LAUGHTER] And so it might be the best outcome if we actually, just like for human beings, we say, yeah, OK, it’s this good, but don’t let it work autonomously, or it’s done a thousand of them, just let it go. We just don’t have practically speaking, we don’t have the environment, the lab, to test them. Now, maybe if they get embodied in robots and literally go around with us, then it’s going to be [in some sense] a lot easier. I don’t know.

LEE: Yeah.

GOLDBERG: Yeah, I think I would take a step back and say, first of all, we weren’t the only ones who were stumped by regulating AI. Like, nobody has done it yet in the United States to this day, right. Like, we do not have standing regulation of AI in medicine at all in fact. And that raises the issue of … the story that you hear often in the biotech business, which is, you know, more prominent here in Boston than anywhere else, is that thank goodness Cambridge put out, the city of Cambridge, put out some regulations about biotech and how you could dump your lab waste and so on. And that enabled the enormous growth of biotech here.

If you don’t have the regulations, then you can’t have the growth of AI in medicine that is worthy of having. And so, I just … we’re not the ones who should do it, but I just wish somebody would.

LEE: Yeah.

GOLDBERG: Zak.

KOHANE: Yeah, but I want to say this as always, execution is everything, even in regulation.

And so I’m mindful that a conference that both of you attended, the RAISE conference [Responsible AI for Social and Ethical Healthcare] (opens in new tab). The Europeans in that conference came to me personally and thanked me for organizing this conference about safe and effective use of AI because they said back home in Europe, all that we’re talking about is risk, not opportunities to improve care.

And so there is a version of regulation which just locks down the present and does not allow the future that we’re talking about to happen. And so, Carey, I absolutely hear you that we need to have a regulation that takes away some of the uncertainty around liability, around the freedom to operate that would allow things to progress. But we wrote in our book that premature regulation might actually focus on the wrong thing. And so since I’m an optimist, it may be the fact that we don’t have much of a regulatory infrastructure today, that it allows … it’s a unique opportunity—I’ve said this now to several leaders—for the healthcare systems to say, this is the regulation we need.

GOLDBERG: It’s true.

KOHANE: And previously it was top-down. It was coming from the administration, and those executive orders are now history. But there is an opportunity, which may or may not be attained, there is an opportunity for the healthcare leadership—for experts in surgery—to say, “This is what we should expect.”

LEE: Yeah.

KOHANE: I would love for this to happen. I haven’t seen evidence that it’s happening yet.

GOLDBERG: No, no. And there’s this other huge issue, which is that it’s changing so fast. It’s moving so fast. That something that makes sense today won’t in six months. So, what do you do about that?

LEE: Yeah, yeah, that is something I feel proud of because when I went back and looked at our chapter on this, you know, we did make that point, which I think has turned out to be true.

But getting back to this conversation, there’s something, a snippet of something, that Vardit Ravitsky said that I think touches on this topic.

VARDIT RAVITSKY: So my pushback is, are we seeing AI exceptionalism in the sense that if it’s AI, huh, panic! We have to inform everybody about everything, and we have to give them choices, and they have to be able to reject that tool and the other tool versus, you know, the rate of human error in medicine is awful. So why are we so focused on informed consent and empowerment regarding implementation of AI and less in other contexts?

GOLDBERG: Totally agree. Who cares about informed consent about AI. Don’t want it. Don’t need it. Nope.

LEE: Wow. Yeah. You know, and this … Vardit of course is one of the leading bioethicists, you know, and of course prior to AI, she was really focused on genetics. But now it’s all about AI.

And, Zak, you know, you and other doctors have always told me, you know, the truth of the matter is, you know, what do you call the bottom-of-the-class graduate of a medical school?

And the answer is “doctor.”

KOHANE: “Doctor.” Yeah. Yeah, I think that again, this gets to compared with what? We have to compare AI not to the medicine we imagine we have, or we would like to have, but to the medicine we have today. And if we’re trying to remove inequity, if we’re trying to improve our health, that’s what … those are the right metrics. And so that can be done so long as we avoid catastrophic consequences of AI.

So what would the catastrophic consequence of AI be? It would be a systematic behavior that we were unaware of that was causing poor healthcare. So, for example, you know, changing the dose on a medication, making it 20% higher than normal so that the rate of complications of that medication went from 1% to 5%. And so we do need some sort of monitoring.

We haven’t put out the paper yet, but in computer science, there’s, well, in programming, we know very well the value for understanding how our computer systems work.

And there was a guy by name of Allman, I think he’s still at a company called Sendmail, who created something called syslog. And syslog is basically a log of all the crap that’s happening in our operating system. And so I’ve been arguing now for the creation of MedLog. And MedLog … in other words, what we cannot measure, we cannot regulate, actually.

LEE: Yes.

KOHANE: And so what we need to have is MedLog, which says, “Here’s the context in which a decision was made. Here’s the version of the AI, you know, the exact version of the AI. Here was the data.” And we just have MedLog. And I think MedLog is actually incredibly important for being able to measure, to just do what we do in … it’s basically the black box for, you know, when there’s a crash. You know, we’d like to think we could do better than crash. We can say, “Oh, we’re seeing from MedLog that this practice is turning a little weird.” But worst case, patient dies, [we] can see in MedLog, what was the information this thing knew about it? And did it make the right decision? We can actually go for transparency, which like in aviation, is much greater than in most human endeavors.

GOLDBERG: Sounds great.

LEE: Yeah, it’s sort of like a black box. I was thinking of the aviation black box kind of idea. You know, you bring up medication errors, and I have one more snippet. This is from our guest Roxana Daneshjou from Stanford.

ROXANA DANESHJOU: There was a mistake in her after-visit summary about how much Tylenol she could take. But I, as a physician, knew that this dose was a mistake. I actually asked ChatGPT. I gave it the whole after-visit summary, and I said, are there any mistakes here? And it clued in that the dose of the medication was wrong.

LEE: Yeah, so this is something we did write about in the book. We made a prediction that AI might be a second set of eyes, I think is the way we put it, catching things. And we actually had examples specifically in medication dose errors. I think for me, I expected to see a lot more of that than we are.

KOHANE: Yeah, it goes back to our conversation about Epic or competitor Epic doing that. I think we’re going to see that having oversight over all medical orders, all orders in the system, critique, real-time critique, where we’re both aware of alert fatigue. So we don’t want to have too many false positives. At the same time, knowing what are critical errors which could immediately affect lives. I think that is going to become in terms of—and driven by quality measures—a product.

GOLDBERG: And I think word will spread among the general public that kind of the same way in a lot of countries when someone’s in a hospital, the first thing people ask relatives are, well, who’s with them? Right?

LEE: Yeah. Yup.

GOLDBERG: You wouldn’t leave someone in hospital without relatives. Well, you wouldn’t maybe leave your medical …

KOHANE: By the way, that country is called the United States.

GOLDBERG: Yes, that’s true. [LAUGHS] It is true here now, too. But similarly, I would tell any loved one that they would be well advised to keep using AI to check on their medical care, right. Why not?

LEE: Yeah. Yeah. Last topic, just for this Episode 4. Roxana, of course, I think really made a name for herself in the AI era writing, actually just prior to ChatGPT, you know, writing some famous papers about how computer vision systems for dermatology were biased against dark-skinned people. And we did talk some about bias in these AI systems, but I feel like we underplayed it, or we didn’t understand the magnitude of the potential issues. What are your thoughts?

KOHANE: OK, I want to push back, because I’ve been asked this question several times. And so I have two comments. One is, over 100,000 doctors practicing medicine, I know they have biases. Some of them actually may be all in the same direction, and not good. But I have no way of actually measuring that. With AI, I know exactly how to measure that at scale and affordably. Number one. Number two, same 100,000 doctors. Let’s say I do know what their biases are. How hard is it for me to change that bias? It’s impossible …

LEE: Yeah, yeah.

KOHANE: … practically speaking. Can I change the bias in the AI? Somewhat. Maybe some completely.

I think that we’re in a much better situation.

GOLDBERG: Agree.

LEE: I think Roxana made also the super interesting point that there’s bias in the whole system, not just in individuals, but, you know, there’s structural bias, so to speak.

KOHANE: There is.

LEE: Yeah. Hmm. There was a super interesting paper that Roxana wrote not too long ago—her and her collaborators—showing AI’s ability to detect, to spot bias decision-making by others. Are we going to see more of that?

KOHANE: Oh, yeah, I was very pleased when, in NEJM AI [New England Journal of Medicine Artificial Intelligence], we published a piece with Marzyeh Ghassemi (opens in new tab), and what they were talking about was actually—and these are researchers who had published extensively on bias and threats from AI. And they actually, in this article, did the flip side, which is how much better AI can do than human beings in this respect.

And so I think that as some of these computer scientists enter the world of medicine, they’re becoming more and more aware of human foibles and can see how these systems, which if they only looked at the pretrained state, would have biases. But now, where we know how to fine-tune the de-bias in a variety of ways, they can do a lot better and, in fact, I think are much more … a much greater reason for optimism that we can change some of these noxious biases than in the pre-AI era.

GOLDBERG: And thinking about Roxana’s dermatological work on how I think there wasn’t sufficient work on skin tone as related to various growths, you know, I think that one thing that we totally missed in the book was the dawn of multimodal uses, right.

LEE: Yeah. Yeah, yeah.

GOLDBERG: That’s been truly amazing that in fact all of these visual and other sorts of data can be entered into the models and move them forward.

LEE: Yeah. Well, maybe on these slightly more optimistic notes, we’re at time. You know, I think ultimately, I feel pretty good still about what we did in our book, although there were a lot of misses. [LAUGHS] I don’t think any of us could really have predicted really the extent of change in the world.

[TRANSITION MUSIC]

So, Carey, Zak, just so much fun to do some reminiscing but also some reflection about what we did.

[THEME MUSIC]

And to our listeners, as always, thank you for joining us. We have some really great guests lined up for the rest of the series, and they’ll help us explore a variety of relevant topics—from AI drug discovery to what medical students are seeing and doing with AI and more.

We hope you’ll continue to tune in. And if you want to catch up on any episodes you might have missed, you can find them at aka.ms/AIrevolutionPodcast (opens in new tab) or wherever you listen to your favorite podcasts.  

Until next time. 

[MUSIC FADES]

AI Revolution in Medicine podcast series

The post Coauthor roundtable: Reflecting on real world of doctors, developers, patients, and policymakers appeared first on Microsoft Research.

Predicting and explaining AI model performance: A new approach to evaluation

May 12, 2025

by Lexin Zhou, Xing Xie Microsoft AI

The image shows a radar chart comparing the performance of different AI models across various metrics. The chart has a circular grid with labeled axes including VO, AS, CEc, CEe, CL, MCr, MCt, MCu, MS, QLI, QLqA, SNs, KNa, KNc, KNF, KNn, and AT. Different AI models are represented by various line styles: Babbage-002 (dotted line), Davinci-002 (dash-dotted line), GPT-3.5-Turbo (dashed line), GPT-4.0 (solid thin line), OpenAI ol-mini (solid thick line), and OpenAI o1 (solid bold line). There is a legend in the bottom left corner explaining the line styles for each model. The background transitions from blue on the left to green on the right.

With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluate AI models that predicts how they will perform on unfamiliar tasks and explain why, something current benchmarks struggle to do.

In the paper, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model’s capabilities.

ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.

By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail—linking outcomes to specific strengths or limitations.

The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability. For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works—from rating task requirements to generating ability profiles.

Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.

Evaluation results

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) Current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task.

1. Revealing hidden flaws in AI testing methods

Many popular AI tests either don’t measure what they claim or only cover a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, like specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, only includes medium-difficulty questions—missing both simple and complex challenges.

2. Creating detailed AI ability profiles

Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs. For each of the 18 abilities measured, they plotted “subject characteristic curves” to show how a model’s success rate changes with task difficulty.

They then calculated a score for each ability—the difficulty level at which a model has a 50% chance of success—and used these results to generate radial plots showing each model’s strengths and weaknesses across the different scales and levels, illustrated in Figure 2.

Figure 2: The image consists of three radar charts showing ability profiles of 15 LLMs evaluated across 18 ability scales, ranged from 0 to infinity (the higher, the more capable the model is). Each chart has multiple axes labeled with various ability scales such as VO, AS, CEc, AT, CL, MCr, etc. The left chart shows ability for Babbage-002 (light red), Davinci-002 (orange), GPT-3.5-Turbo (red), GPT-4 (dark red), OpenAI o1-mini (gray), and OpenAI o1 (dark gray). The middle chart shows ability for LLaMA models: LLaMA-3.2-1B-Instruct (light blue), LLaMA-3.2-3B-Instruct (blue), LLaMA-3.2-11B-Instruct (dark blue), LLaMA-3.2-90B-Instruct (navy blue), and LLaMA-3.1-405B Instruct (very dark blue). The right chart shows ability for DeepSeek-R1-Dist-Qwen models: DeepSeek-R1-Dist-Qwen-1.5B (light green), DeepSeek-R1-Dist-Qwen-7B (green), DeepSeek-R1-Dist-Qwen-14B (dark green), DK-R1-Dist-Qwen-32B (very dark green). Each model's ability is represented by a colored polygon within the radar charts. — Figure 2. Ability profiles for the 15 LLMs evaluated.

This analysis revealed the following:

When measured against human performance, AI systems show different strengths and weaknesses across the 18 ability scales.

Newer LLMs generally outperform older ones, though not consistently across all abilities.

Knowledge-related performance depends heavily on model size and training methods.

Reasoning models show clear gains over non-reasoning models in logical thinking, learning and abstraction, and social capabilities, such as inferring the mental states of their users.

Increasing the size of general-purpose models after a given threshold only leads to small performance gains.

3. Predicting AI success and failure

In addition to evaluation, the team created a practical prediction system based on demand-level measurements that forecasts whether a model will succeed on specific tasks, even unfamiliar ones.

The system achieved approximately 88% accuracy in predicting the performance of popular models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to anticipate potential failures before deployment, adding the important step of reliability assessment for AI models.

Looking ahead

ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance. It aligns with the vision laid out in a previous Microsoft position paper on the promise of applying psychometrics to AI evaluation and a recent Societal AI white paper emphasizing the importance of AI evaluation.

As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment. The research team is working toward building a collaborative community to strengthen and expand this emerging field.

The post Predicting and explaining AI model performance: A new approach to evaluation appeared first on Microsoft Research.

Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv

May 8, 2025

by Gretchen Huizinga, Hongxia Hao, Bing Lv Microsoft AI

Illustrated headshots of Hongxia Hao (left) and Bing Lv (right).

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts bring its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, senior researcher Hongxia Hao (opens in new tab), and physics professor Bing Lv (opens in new tab), join host Gretchen Huizinga to talk about how they are using deep learning techniques to probe the upper limits of heat transfer in inorganic crystals, discover novel materials with exceptional thermal conductivity, and rewrite the rulebook for designing high-efficiency electronics and sustainable energy.

Read the paper

Learn more:

MatterSim: A deep-learning model for materials under real-world conditions
Microsoft Research Blog, May 2024
Quantum Materials Research, University of Texas at Dallas (opens in new tab)
Research group page

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers.

Today I’m talking to two researchers, Hongxia Hao, a senior researcher at Microsoft Research AI for Science, and Bing Lv, an associate professor in physics at the University of Texas at Dallas. Hongxia and Bing are co-authors of a paper called Probing the Limit of Heat Transfer in Inorganic Crystals with Deep Learning. I’m excited to learn more about this! Hongxia and Bing, it’s great to have you both on Abstracts!

HONGXIA HAO: Nice to be here.

BING LV: Nice to be here, too.

HUIZINGA: So Hongxia, let’s start with you and a brief overview of this paper. In just a few sentences. Tell us about the problem your research addresses and more importantly, why we should care about it.

HAO: Let me start with a very simple yet profound question. What’s the fastest the heat can travel through a solid material? This is not just an academic curiosity, but it’s a question that touched the bottom of how we build technologies around us. So from the moment when you tap your smartphone, and the moment where the laptop is turned on and functioning, heat is always flowing. So we’re trying to answer the question of a century-old mystery of the upper limit of heat transfer in solids. So we care about this not just because it’s a fundamental problem in physics and material science, but because solving it could really rewrite the rulebook for designing high-efficiency electronics and sustainable energy, etc. And nowadays, with very cutting-edge nanometer chips or very fancy technologies, we are packing more computing power into smaller space, but the faster and denser we build, the harder it becomes to remove the heat. So in many ways, thermal bottlenecks, not just transistor density, are now the ceiling of the Moore’s Law. And also the stakes are very enormous. We really wish to bring more thermal solutions by finding more high thermal conductor choices from the perspective of materials discovery with the help of AI.

LV: So I think one of the biggest things as Hongxia said, right? Thermal solutions will become, eventually become, a bottleneck for all type of heterogeneous integration of the materials. So from this perspective, so how people actually have been finding out previously, all the thermal was the last solution to solve. But now people actually more and more realize all these things have to be upfront. This co-design, all these things become very important. So I think what we are doing right now, integrated with AI, helping to identify the large space of the materials, identify fundamentally what will be the limit of this material, will become very important for the society.

HUIZINGA: Hmm. Yeah. Hongxia, did you have anything to add to that?

HAO: Yes, so previously many people are working on exploring these material science questions through experimental tradition and the past few decades people see a new trend using computational materials discovery. Like for example, we do the fundamental solving of the Schrödinger equation using Density Functional Theory [DFT]. Actually, this brings us a lot of opportunities. The question here is, as the theory is getting more and more developed, it’s too expensive for us to make it very large scale and to study tons of materials. Think about this. The bottleneck here, now, is not just about having a very good theory, it’s about the scale. So, this is where AI, specifically now we are using deep learning, comes into play.

HUIZINGA: Well, Hongxia, let’s stay with you for a minute and talk about methodology. How did you do this research and what was the methodology you employed?

HAO: So here we, for this question, we built a pipeline that spans the AI, the quantum mechanics, and computational brute-force with a blend of efficiency and accuracy. It begins with generating an enormous chemical and structure design space because this is inspired by Slack’s principle. We focus first on simple crystals, and there are the systems most likely to have low and harmonious state, fewer phononic scattering events, and therefore potentially have high thermal conductivities. But we didn’t stop here. We also included a huge pool of more complex and higher energy structures to ensure diversity and avoid bias. And for each candidate, we first run like a structure relaxation using MatterSim, which is a deep learning foundational model for material science for us to characterize the properties of materials. And we use that screen for dynamic stability. And now it’s about 200K structures past this filter. And then came another real challenge: calculating the thermal conductivity. We try to solve this problem using the Boltzmann transport equation and the three-phonon scattering process. The twist here is all of this was not done by traditional DFT solvers, but with our deep learning model, the MatterSim. It’s trained to predict energy, force, and stress. And we can get second- and third-order interatomic force constants directly from here, which can guarantee the accuracy of the solution. And finally, to validate the model’s predictions, we performed full DFT-based calculations on the top candidates that we found, some of which even include higher-order scattering mechanism, electron phonon coupling effect, etc. And this rigorous validation gave us confidence in the speed and accuracy trade-offs and revealed a spectrum of materials that had either previously been overlooked or were never before conceived.

HUIZINGA: So Bing, let’s talk about your research findings. How did things work out for you on this project and what did you find?

LV: I think one of the biggest things for this paper is it creates a very large material base. Basically, you can say it’s a smart database which eventually will be made accessible to the public. I think that’s a big achievement because people who actually if they have to look into it, they actually can go search Microsoft database, finding out, oh, this material does have this type of thermal properties. This is actually, this database can send about 230,000 materials. And one of the things we confirm is the highest thermal conductivity material based on all the wisdom of Slack criteria, predicted diamond would have the highest thermal conductivity. We more or less really very solidly prove diamond, at this stage, will remain with the highest thermal conductivity. We have a lot of new materials, exotic materials, which some of them, Hongxia can elaborate a little bit more. So, which having all this very exotic combination of properties, thermal with other properties, which could actually provide a new insight for new physics development, new material development, and a new device perspective. All of this combined will have actually a very profound impact to society.

HUIZINGA: Yeah, Hongxia, go a little deeper on that because that was an interesting part of the paper when you talked about diamond still being the sort of “gold standard,” to mix metaphors! But you’ve also found some other materials that are remarkable compared to silicon.

HAO: Yeah, yeah. Among this search space, even though we didn’t find like something that’s higher than diamonds, but we do discover more than like twenty new materials with thermal conductivity exceeding that of silicon. And silicon is something like a benchmark for criteria that we think we want to compare with because it’s a backbone of modern electronics. More interestingly, I think, is the manganese vanadium. It shows some very interesting and surprising phenomena. Like it’s a metallic compound, but with very high lattice thermal connectivity. And this is the first time discovered by, like, through our search pattern, and it’s something that cannot be easily discovered without the hope with AI. And right now, think Bing can explain more on this, and show some interesting results.

HUIZINGA: Yeah, go ahead Bing.

LV: So this is actually very surprising to me as an experimentalist because of when Hongxia presented their theory work to me, this material, magnesium vanadium, it’s discovered back in 1938, almost 100 years ago, but there’s no more than twenty papers talking about this! A lot of them was on theory, okay, not even on experimental part. We actually did quite a bit of work on this. We actually are in the process; will characterize this and then moving forward even for the thermal conductivity measurements. So that will be hopefully, will be adding to the value of these things, showing you, Hey, AI does help to predict the materials could really generate the new materials with very good high thermal conductivity.

HUIZINGA: Yeah, so Bing, stay with you for a minute. I want you to talk about some kind of real-world applications of this. I know you alluded to a couple of things, but how is this work significant in that respect, and who might be most excited about it, aside from the two of you? [LAUGHS]

LV: So I think as I mentioned before, the first thing is this database. I believe that’s the first ever large material database regarding to the thermal conductivity. And it has, as I said, 230,000 materials with AI-predicted thermal connectivity. This will provide not only science but engineering with a vastly expanding catalog of candidate materials for the future roadmap of integration, material integration, and all these bottlenecks we are talking about, the thermal solution for the semiconductors or for even beyond the semiconductor integration, people actually can have a database to looking for. So these things, it will become very important, and I believe over a long time it will generate a very long impact for the research community, for the society development.

HUIZINGA: Yeah. Hongxia, did you have anything to add to that one too?

HAO: Yeah, so this study reshapes how we think about limits. I like the sentence that the only way to discover the limits of possible is to go beyond them into the impossible. In this case, we tried, but we didn’t break the diamond limit. But we proved it even more rigorously than ever before. In doing so, we also uncovered some uncharted peaks in the thermal conductivity landscape. This would not happen without new AI capabilities for material science. I think in the long run, I believe researchers could benefit from using this AI design and shift their way on how to do materials research with AI.

HUIZINGA: Yeah, it’ll be interesting to see if anyone ever does break the diamond limit with the new tools that are available, but…

HAO: Yeah!

HUIZINGA: So this is the part of the abstracts podcast where I like to ask for sort of a golden nugget, a one sentence takeaway that listeners might get from this paper. If you had one Hongxia, what would it be? And then I’ll ask Bing to maybe give his.

HAO: Yes. AI is no longer just a tool. It’s becoming a critical partner for us in scientific discovery. So our work proved that the large-scale data-driven science can now approach long-standing and fundamental questions with very fresh eyes. When trained well, and guided with physical intuition, models like MatterSim can really realize a full in-silico characterization for materials and don’t just simulate some known materials, but really trying to imagine what nature hasn’t yet revealed. Our work points to a path forward, not just incrementally better materials, but entirely new class of high-performance compounds where we could never have guessed without AI.

HUIZINGA: Yeah. Bing, what’s your one takeaway?

LV: I think I want to add a few things on top of Hongxia’s comments because I think Hongxia has very good critical words I would like to emphasize. When we train the AI well, if we guide the AI well, it could be very useful to become our partner. So I think all in all, our human being’s intellectual merit here is still going to play a significantly important role, okay? We are generating this AI, we should really train the AI, we should be using our human being intellectual merit to guide them to be useful for our human being society advancement. Now with all these AI tools, I think it’s a very golden time right now. Experimentalists could work very closely with like Hongxia, who’s a good theorist who has very good intellectual merits, and then we actually now incorporate with AI, then combine all pieces together, hopefully we’re really able to accelerating material discovery in a much faster pace than ever which the whole society will eventually get a benefit from it.

HUIZINGA: Yeah. Well, as we close, Bing, I want you to go a little further and talk about what’s next then, research wise. What are the open questions or outstanding challenges that remain in this field and what’s on your research agenda to address them?

LV: So first of all, I think this paper is addressing primarily on these crystalline ordered inorganic bulk materials. And also with the condition we are targeting at ambient pressure, room temperature, because that’s normally how the instrument is working, right? But what if under extreme conditions? We want to go to space, right? There we’ll have extreme conditions, some very… sometimes very cold, sometimes very hot. We have some places with extremely probably quite high pressure. Or we have some conditions that are highly radioactive. So under that condition, there’s going to be a new database could be emerged. Can we do something beyond that? Another good important thing is we are targeting this paper on high thermal conductivity. What about extremely low thermal conductivity? Those will actually bring a very good challenge for theorists and also the machine learning approach. I think that’s something Hongxia probably is very excited to work on in that direction. I know since she’s ambitious, she wants to do something more than beyond what we actually achieved so far.

HUIZINGA: Yeah, so Hongxia, how would you encapsulate what your dream research is next?

HAO: Yeah, so I think besides all of these exciting research directions, on my end, another direction is perhaps kind of exciting is we want to move from search to design. So right now we are kind of good at asking like what exists by just doing a forward prediction and brute force. But with generative AI, we can start asking what should exist? In the future, we can have an incorporation between forward prediction and backwards generative design to really tackle questions. If you have materials like you want to have desired like properties, how would you design the problems?

HUIZINGA: Well, it sounds like there’s a full plate of research agenda goodness going forward in this field, both with human brains and AI. So, Hongxia Hao and Bing Lv, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read a pre-print of it on arXiv. See you next time on Abstracts!

The post Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv appeared first on Microsoft Research.

Research Focus: Week of May 7, 2025

May 7, 2025

by Microsoft Research Team Microsoft AI

In this issue:

New research on compound AI systems and causal verification of the Confidential Consortium Framework; release of Phi-4-reasoning; enriching tabular data with semantic structure, and more.

NEW RESEARCH

Towards Resource-Efficient Compound AI Systems

Unlike the current state-of-the-art, our vision is fungible workflows with high-level descriptions, managed jointly by the Workflow Orchestrator and Cluster Manager. This allows higher resource multiplexing between independent workflows to improve efficiency.

This research introduces Murakkab, a prototype system built on a declarative workflow that reimagines how compound AI systems are built and managed to significantly improve resource efficiency. Compound AI systems integrate multiple interacting components like language models, retrieval engines, and external tools. They are essential for addressing complex AI tasks. However, current implementations could benefit from greater efficiencies in resource utilization, with improvements to tight coupling between application logic and execution details, better connections between orchestration and resource management layers, and bridging gaps between efficiency and quality.

Murakkab addresses critical inefficiencies in current AI architectures and offers a new approach that unifies workflow orchestration and cluster resource management for better performance and sustainability. In preliminary evaluations, it demonstrates speedups up to ∼ 3.4× in workflow completion times while delivering ∼ 4.5× higher energy efficiency, showing promise in optimizing resources and advancing AI system design.

NEW RESEARCH

Smart Casual Verification of the Confidential Consortium Framework

Diagram showing the components of the verification architecture for CCF's consensus. The diagram is discussed in detail in the paper.

This work presents a new, pragmatic verification technique that improves the trustworthiness of distributed systems like the Confidential Consortium Framework (CCF) and proves its effectiveness by catching critical bugs before deployment. Smart casual verification is a novel hybrid verification approach to validating CCF, an open-source platform for developing trustworthy and reliable cloud applications which underpins Microsoft’s Azure Confidential Ledger service.

The researchers apply smart casual verification to validate the correctness of CCF’s novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. This hybrid approach combines the rigor of formal specification and model checking with the pragmatism of automated testing, specifically binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods are often one-off efforts by domain experts, the researchers have integrated smart casual verification into CCF’s continuous integration pipeline, allowing contributors to continuously validate CCF as it evolves.

NEW RESEARCH

Phi-4-reasoning Technical Report

graphical user interface, text, application, email

This report introduces Phi-4-reasoning (opens in new tab), a 14-billion parameter model optimized for complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 using a carefully curated dataset of high-quality prompts and reasoning demonstrations generated by o3-mini. These prompts span diverse domains—including math, science, coding, and spatial reasoning—and are selected to challenge the base model near its capability boundaries.

Building on recent findings that reinforcement learning (RL) can further improve smaller models, the team developed Phi-4-reasoning-plus, which incorporates an additional outcome-based RL phase using verifiable math problems. This enhances the model’s ability to generate longer, more effective reasoning chains.

Despite its smaller size, the Phi-4-reasoning family outperforms significantly larger open-weight models such as DeepSeekR1-Distill-Llama-70B and approaches the performance of full-scale frontier models like DeepSeek R1. It excels in tasks requiring multi-step problem solving, logical inference, and goal-directed planning.

The work highlights the combined value of supervised fine-tuning and reinforcement learning for building efficient, high-performing reasoning models. It also offers insights into training data design, methodology, and evaluation strategies. Phi-4-reasoning contributes to the growing class of reasoning-specialized language models and points toward more accessible, scalable AI for science, education, and technical domains.

NEW RESEARCH

TeCoFeS: Text Column Featurization using Semantic Analysis

The workflow diagram illustrates the various steps in the TECOFES approach. Step 0 is the embedding computation module, which calculates embeddings for all rows of text, setting the foundation for subsequent steps. Step 2, the smart sampler, captures diverse samples and feeds them into the labeling module (step 3), which generates labels. These labels are then utilized by the Extend Mapping module (step 4) to map the remaining unlabeled data.

This research introduces a practical, cost-effective solution for enriching tabular data with semantic structure, making it more useful for downstream analysis and insights—which is especially valuable in business intelligence, data cleaning, and automated analytics workflows. This approach outperforms baseline models and naive LLM applications on converted text classification benchmarks.

Extracting structured insights from free-text columns in tables—such as product reviews or user feedback—can be time-consuming and error-prone, especially when relying on traditional syntactic methods that often miss semantic meaning. This research introduces the semantic text column featurization problem, which aims to assign meaningful, context-aware labels to each entry in a text column.

The authors propose a scalable, efficient method that combines the power of LLMs with text embeddings. Instead of labeling an entire column manually or applying LLMs to every cell—an expensive process—this new method intelligently samples a diverse subset of entries, uses an LLM to generate semantic labels for just that subset, and then propagates those labels to the rest of the column using embedding similarity.

NEW RESEARCH

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

This work introduces ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a new paradigm for LLM reasoning that expands beyond traditional language-only inference.

While LLMs have made considerable strides in complex reasoning tasks, they remain limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this research, ARTIST brings together agentic reasoning, reinforcement learning (RL), and tool integration, designed to enable LLMs to autonomously decide when and how to invoke internal tools within multi-turn reasoning chains. ARTIST leverages outcome-based reinforcement learning to learn robust strategies for tool use and environment interaction without requiring step-level supervision.

Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies show that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions.

PODCAST

Materialism Podcast: MatterGen (opens in new tab)

What if you could find materials with tailored properties without ever entering the lab? The Materialism Podcast, which is dedicated to exploring materials science and engineering, talks with Tian Xie from Microsoft Research to discuss MatterGen, an AI tool which accelerates materials science discovery. Tune in to hear a discussion of the new Azure AI Foundry, where MatterGen will interact with and support MatterSim, an advanced deep learning model designed to simulate the properties of materials across a wide range of elements, temperatures, and pressures.

IN THE NEWS: Highlights of recent media coverage of Microsoft Research

When ChatGPT Broke an Entire Field: An Oral History

Quanta Magazine | April 30, 2025

Large language models are everywhere, igniting discovery, disruption and debate in whatever scientific community they touch. But the one they touched first — for better, worse and everything in between — was natural language processing. What did that impact feel like to the people experiencing it firsthand?

To tell that story, Quanta interviewed 19 NLP experts, including Kalika Bali, senior principal researcher at Microsoft Research. From researchers to students, tenured academics to startup founders, they describe a series of moments — dawning realizations, elated encounters and at least one “existential crisis” — that changed their world. And ours.

View more news and awards

The post Research Focus: Week of May 7, 2025 appeared first on Microsoft Research.

Learn more:

Subscribe to the Microsoft Research Podcast:

Transcript

The quantum threat to cryptography

The NIST Post-Quantum Cryptography Standardization effort

International standardization of post-quantum cryptography

What is FrodoKEM?

Why unstructured lattices?

Lattices and the Learning with Errors (LWE) problem

The Learning with Errors (LWE) problem

How FrodoKEM Works

Performance: Strong security has a cost

A holistic design made with security in mind

Conclusion

Further Reading

Learn more:

Subscribe to the Microsoft Research Podcast:

Transcript

Learn more

Subscribe to the Microsoft Research Podcast:

Transcript

Learn more

Subscribe to the Microsoft Research Podcast:

Transcript

How is Magentic-UI human-centered?

Architecture

Evaluating Magentic-UI

Evaluation with simulated users

Learning and reusing plans

Safety and control

Open research questions

Conclusion

Learn more

Subscribe to the Microsoft Research Podcast:

Transcript

ADeLe: An ability-based approach to task evaluation

Evaluation results

Looking ahead

Learn more:

Subscribe to the Microsoft Research Podcast:

Transcript

NEW RESEARCH

NEW RESEARCH

NEW RESEARCH

NEW RESEARCH

NEW RESEARCH

PODCAST

IN THE NEWS: Highlights of recent media coverage of Microsoft Research

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.