Crescent library brings privacy to digital identity systems

Digital identities, the electronic credentials embedded in phone wallets, workplace logins, and other apps, are becoming ubiquitous. While they offer unprecedented convenience, they also create new privacy risks, particularly around tracking and surveillance. 

One of these risks is linkability, the ability to associate one or more uses of a credential with a specific person. Currently, when people use their mobile driver’s license or log into various apps, hidden identifiers can link these separate activities together, building detailed profiles of user behavior.

To address this, we have released Crescent, a cryptographic library that adds unlinkability to widely used identity formats, including JSON Web Tokens (the authentication standard behind many app logins) and mobile driver’s licenses. Crescent works without requiring the organizations that issue these credentials to update their systems.

The protection goes beyond existing privacy features. Some digital identity systems already offer selective disclosure, allowing users to share only specific pieces of information in each interaction.  

But even with selective disclosure, credentials can still be linked through serial numbers, cryptographic signatures, or embedded identifiers. Crescent’s unlinkability feature is designed to prevent anything in the credential, beyond what a user explicitly chooses to reveal, from being used to connect their separate digital interactions.

Figure 1: Unlinkability between a credential issuance and presentation

Two paths to unlinkability 

To understand how Crescent works, it helps to examine the two main approaches researchers have developed for adding unlinkability to identity systems: 

  1. Specialized cryptographic signature schemes. These schemes can provide unlinkability but require extensive changes to existing infrastructure. New algorithms must be standardized, implemented, and integrated into software and hardware platforms. For example, the BBS signature scheme is currently being standardized by the Internet Engineering Task Force (IETF), but even after completion, adoption may be slow.
  2. Zero-knowledge proofs with existing credentials. This approach, used by Crescent, allows users to prove specific facts about their credentials without revealing the underlying data that could enable tracking. For example, someone could prove they hold a valid driver’s license and live in a particular ZIP code without exposing any other personal information or identifiers that could link this interaction to future ones.

Zero-knowledge proofs have become more practical since they were first developed 40 years ago, but they are not as efficient as the cryptographic algorithms used in today’s credentials. Crescent addresses this computational challenge through preprocessing, performing the most complex calculations once in advance so that later proof generation is quick and efficient on mobile devices.

Beyond unlinkability, Crescent supports selective disclosure, allowing users to prove specific facts without revealing unnecessary details. For example, it can confirm that a credential is valid and unexpired without disclosing the exact expiration date, which might otherwise serve as a unique identifier. These privacy protections work even when credentials are stored in a phone’s secure hardware, which keeps them tied to the device and prevents unauthorized access.
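To make the selective-disclosure idea concrete, here is a minimal, self-contained Rust sketch of how a presentation request might separate attributes to disclose, attributes to prove predicates about in zero knowledge, and attributes to hide entirely. The types, field names, and values below are illustrative assumptions, not the Crescent API.

```rust
// Illustrative sketch only (not the Crescent API): a presentation request
// that says, per attribute, whether to disclose it, prove a predicate about
// it in zero knowledge, or hide it completely.

#[derive(Debug)]
enum AttributeRule {
    /// Reveal the attribute value to the verifier.
    Disclose,
    /// Prove "attribute >= bound" without revealing the attribute itself.
    ProveGreaterOrEqual(u64),
    /// Neither reveal nor prove anything about the attribute.
    Hide,
}

#[derive(Debug)]
struct PresentationRequest {
    /// Attribute name from the credential, paired with how to treat it.
    rules: Vec<(&'static str, AttributeRule)>,
}

fn main() {
    // Example: prove the credential is unexpired (exp >= "now") without
    // revealing the exact expiration date, disclose only the issuer,
    // and keep everything else hidden.
    let now_unix: u64 = 1_735_689_600; // hypothetical "current time"
    let request = PresentationRequest {
        rules: vec![
            ("exp", AttributeRule::ProveGreaterOrEqual(now_unix)),
            ("iss", AttributeRule::Disclose),
            ("sub", AttributeRule::Hide),
            ("email", AttributeRule::Hide),
        ],
    };
    println!("{request:#?}");
}
```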

Behind the cryptographic curtain 

At its core, Crescent uses a sophisticated form of cryptographic proof called a zero-knowledge SNARK (Zero-Knowledge Succinct Noninteractive Argument of Knowledge). This method allows one party to prove possession of information or credentials without revealing the underlying data itself. 

Crescent specifically uses the Groth16 proof system, one of the first practical implementations of this technology. What makes Groth16 particularly useful is that its proofs are small in size, quick to verify, and can be shared in a single step without back-and-forth communication between the user and verifier. 

The system works by first establishing shared cryptographic parameters based on a credential template. Multiple organizations issuing similar credentials, such as different state motor vehicle departments issuing mobile driver’s licenses, can use the same parameters as long as they follow compatible data formats and security standards. 

The mathematical rules that define what each proof will verify are written using specialized programming tools that convert them into a Rank-1 Constraint System (R1CS), a mathematical framework that describes exactly what needs to be proven about a credential. 
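For readers unfamiliar with R1CS, the general shape of such a constraint system looks like the following; this is standard background, not anything specific to Crescent’s circuits.

```latex
% Every constraint multiplies two linear combinations of the witness vector z
% (which holds the constant 1, the credential's attribute values, and any
% intermediate values) and equates the result with a third linear combination:
(\mathbf{a}_i \cdot z)\,(\mathbf{b}_i \cdot z) = \mathbf{c}_i \cdot z,
\qquad i = 1, \dots, m

% Example building block: forcing a witness value b to be a single bit.
% Comparisons such as "expiration date is in the future" are typically
% assembled from constraints like this by bit-decomposing a difference:
b \cdot (1 - b) = 0
```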

To make the system fast enough for real-world use, Crescent splits the proof generation into two distinct stages: 

  1. Prepare stage. This step runs once and generates cryptographic values that can be stored on the user’s device for repeated use. 
  2. Show stage. When a user needs to present their credential, this quicker step takes the stored values and randomizes them to prevent any connection to previous presentations. It also creates a compact cryptographic summary that reveals only the specific information needed for that particular interaction. A sketch of this two-stage flow follows.
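As a rough illustration of how the two stages divide the work, here is a minimal, self-contained Rust sketch. All names, types, and the placeholder arithmetic are assumptions for illustration; they do not reflect Crescent’s actual internals or API.

```rust
// Illustrative sketch of the prepare/show split described above.

struct PreparedProof {
    // Expensive-to-compute values cached on the device after the one-time
    // "prepare" stage (in Crescent these come from the Groth16 prover).
    cached_commitments: Vec<u64>,
}

struct Presentation {
    // Re-randomized values plus a compact summary of the disclosed facts.
    randomized: Vec<u64>,
    disclosed_claims: Vec<(&'static str, String)>,
}

/// One-time, slow step: run the heavy proving work and store the result.
fn prepare(credential_bytes: &[u8]) -> PreparedProof {
    // Stand-in for circuit evaluation and proof generation.
    let cached_commitments = credential_bytes.iter().map(|b| *b as u64 * 7).collect();
    PreparedProof { cached_commitments }
}

/// Fast, repeatable step: re-randomize the cached values so that two
/// presentations cannot be linked, and attach only the facts being shown.
fn show(prepared: &PreparedProof, fresh_randomness: u64) -> Presentation {
    let randomized = prepared
        .cached_commitments
        .iter()
        .map(|c| c.wrapping_mul(fresh_randomness))
        .collect();
    Presentation {
        randomized,
        disclosed_claims: vec![("employer_domain", "contoso.com".to_string())],
    }
}

fn main() {
    let prepared = prepare(b"example-credential");
    // Each presentation uses new randomness, so the outputs look unrelated.
    let p1 = show(&prepared, 0x9E37_79B9);
    let p2 = show(&prepared, 0xC2B2_AE35);
    assert_ne!(p1.randomized, p2.randomized);
    println!("disclosed: {:?}", p1.disclosed_claims);
}
```

The design point this sketch tries to capture is that everything expensive happens once in prepare, while each show call only re-randomizes cached values and attaches the facts being disclosed.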

Figures 2 and 3 illustrate this credential-proving workflow and the division between the prepare and show steps.

Figure 2: Crescent’s credential-proving workflow includes a compilation of a circuit to R1CS, followed by the prepare and show steps. The output zero-knowledge proof is sent to the verifier.
Figure 3: The Crescent presentation steps show the division between prepare and show steps.

A sample application 

To demonstrate how Crescent works, we created a sample application covering two real-world scenarios: verifying employment and proving age for online access. The application includes sample code for setting up fictional issuers and verifiers as Rust servers, along with a browser-extension wallet for the user. The step numbers correspond to the steps in Figure 4. 

Setup 

  1. A Crescent service pre-generates the zero-knowledge parameters for creating and verifying proofs from JSON Web Tokens and mobile driver’s licenses. 
  2. The user obtains a mobile driver’s license from their Department of Motor Vehicles. 
  3. The user obtains a proof-of-employment JSON Web Token from their employer, Contoso. 
  4. These credentials and their private keys are stored in the Crescent wallet.

Scenarios 

  1. Employment verification: The user presents their JSON Web Token to Fabrikam, an online health clinic, to prove they are employed at Contoso and eligible for workplace benefits. Fabrikam learns that the user works at Contoso but not the user’s identity, while Contoso remains unaware of the interaction. 
  2. Age verification: The user presents their mobile driver’s license to a social network, proving they are over 18. The proof confirms eligibility without revealing their age or identity.

Across both scenarios, Crescent ensures that credential presentations remain unlinkable, preventing any party from connecting them to the user. 
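To illustrate the verifier’s perspective in the age-verification scenario, here is a minimal, hypothetical Rust sketch; the types, fields, and checks are stand-ins for illustration, not the sample application’s actual code.

```rust
// Hypothetical sketch of what an age-verification verifier learns.

struct AgePresentation {
    /// Opaque zero-knowledge proof bytes; fresh randomness makes each one
    /// look unrelated to any previous presentation.
    proof: Vec<u8>,
    /// The only statement the verifier learns: the holder is over 18 and the
    /// credential is validly signed by an accepted issuer and unexpired.
    claimed_predicate: &'static str,
}

/// Stand-in for proof verification against pre-generated public parameters.
fn verify(presentation: &AgePresentation, _public_params: &[u8]) -> bool {
    // A real verifier would run the Groth16 verification check here;
    // this sketch only checks that the presentation is well formed.
    !presentation.proof.is_empty() && presentation.claimed_predicate == "age_over_18"
}

fn main() {
    let public_params = vec![0u8; 32]; // placeholder for published parameters
    let presentation = AgePresentation {
        proof: vec![1, 2, 3], // placeholder proof bytes
        claimed_predicate: "age_over_18",
    };
    // The verifier learns that the predicate holds, and nothing else:
    // no name, no birth date, no identifier that links future presentations.
    assert!(verify(&presentation, &public_params));
    println!("age predicate verified without identifying the user");
}
```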

For simplicity, the sample defines its own issuance and presentation protocol, but it could be integrated into higher-level identity frameworks such as OpenID/OAuth, Verifiable Credentials, or the mobile driver’s license ecosystem.

Figure 4. The sample architecture, from credential issuance to presentation.

To learn more about the project, visit the Crescent project GitHub page, or check out our recent presentations given at the Real World Crypto 2025 and NorthSec 2025 conferences.

Applicability vs. job displacement: further notes on our recent research on AI and occupations

Recently, we released a paper (Working with AI: Measuring the Occupational Implications of Generative AI) that studied which occupations might find AI chatbots useful, and to what degree. The paper sparked significant discussion, which is no surprise, since people care deeply about the future of AI and jobs; that’s part of why we think it’s important to study these topics.

Unfortunately, not all the discussion was accurate in its portrayal of the study’s scope or conclusions. Specifically, our study does not draw any conclusions about jobs being eliminated; in the paper, we explicitly cautioned against using our findings to make that conclusion. 

Given the importance of this topic, we want to clarify any misunderstandings and provide a more digestible summary of the paper, our methodology, and its limitations. 

What did our research find?

We set out to better understand how people are using AI, highlighting where AI might be useful in different occupations. To do this, we analyzed how people currently use generative AI—specifically Microsoft Bing Copilot (now Microsoft Copilot)—to assist with tasks. We then compared these sets of tasks against the O*NET database, a widely used occupational classification system, to understand potential applicability to various occupations.

We found that AI is most useful for tasks related to knowledge work and communication, particularly tasks such as writing, gathering information, and learning.

Those in occupations with these tasks may benefit by considering how AI can be used as a tool to help improve their workflows. On the flip side, it’s not surprising that physical tasks like performing surgeries or moving objects had less direct AI chatbot applicability.

So, to summarize, our paper is about identifying the occupations where AI may be most useful, by assisting or performing subtasks.  Our data do not indicate, nor did we suggest, that certain jobs will be replaced by AI.

Methodological limitations are acknowledged—and important

The paper is transparent about the limitations of our approach.  

We analyzed anonymized Bing Copilot conversations to see what activities users are seeking AI assistance with and what activities AI can perform when mapped to the O*NET database. While O*NET provides a structured list of activities associated with various occupations, it does not capture the full spectrum of skills, context, and nuance required in the real world.  A job is far more than the collection of tasks that make it up.

For example, a task might involve “writing reports,” but O*NET won’t reflect the interpersonal judgment, domain expertise, or ethical considerations that go into doing that well. The paper acknowledges this gap and warns against over-interpreting the AI applicability scores as measures of AI’s ability to perform an occupation.

Additionally, the dataset is based on user queries from Bing Copilot (from January – September 2024), which may be influenced by factors like awareness, access, or comfort with AI tools. Different people use different LLMs for different purposes, and it is also very difficult (or nearly impossible) to determine which conversations occur in a work context and which are for leisure.

Finally, we only evaluated AI chatbot usage, so this study does not evaluate the impact or applicability of other forms of AI.

Where do we go from here?

Given the intense interest in how AI will shape our collective future, it’s important we continue to study and better understand its societal and economic impact. As with all research on this topic, the findings are nuanced, and it’s important to pay attention to this nuance. 

The public interest in our research is based, in large part, on the topic of AI and job displacement. However, our current methodology for this study is unlikely to lead to firm conclusions about this.  AI may prove to be a useful tool for many occupations, and we believe the right balance lies in finding how to use the technology in a way that leverages its abilities while complementing human strengths and accounting for people’s preferences.    

For more information from Microsoft on the future of work and AI skilling, check out Microsoft’s Annual Work Trend Index and Microsoft Elevate.

Coauthor roundtable: Reflecting on healthcare economics, biomedical research, and medical education

In November 2022, OpenAI’s ChatGPT kick-started a new era in AI. This was followed less than a half year later by the release of GPT-4. In the months leading up to GPT-4’s public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this series finale, Lee welcomes back coauthors Carey Goldberg and Dr. Zak Kohane to discuss how their predictions stack up against key takeaways from guests in the second half of the series: experts on AI’s economic and societal impact; technologists on the cutting edge; leaders in AI-driven medicine; next-generation physicians; and heads of healthcare organizations. Lee, Goldberg, and Kohane explore thinking innovatively about existing healthcare processes, including the structure of care teams and the role of specialties, to take advantage of AI opportunities, and they consider what clinicians and patients might need from these new AI tools in order to feel empowered when it comes to giving and receiving the best healthcare. They close the episode with their hopes for the future of AI in health.

Transcript

[MUSIC] 

[BOOK PASSAGE] 

PETER LEE: “As a society—indeed, as a species—we have a choice to make. Do we constrain or even kill artificial intelligence out of fear of its risks and obvious ability to create new harms? Do we submit ourselves to AI and allow it to freely replace us, make us less useful and less needed? Or do we start, today, shaping our AI future together, with the aspiration to accomplish things that humans alone, and AI alone, can’t do but that humans+AI can? The choice is in our hands … .” 

[END OF BOOK PASSAGE] [THEME MUSIC]

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee. 

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong? 

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. 


[THEME MUSIC FADES] 

The book passage I read at the top is from the epilogue, and I think it’s a truly fitting closing sentiment for the conclusion of this podcast series—because it calls back to the very beginning.

As I’ve mentioned before, Carey, Zak, and I wrote The AI Revolution in Medicine as a guide to help answer these big questions, particularly as they pertain to medicine. You know, we wrote the book to empower people to make a choice about AI’s development and use. Well, have they? Have we?

Perhaps we’ll need more time to tell. But over the course of this podcast series, I’ve had the honor of speaking with folks from across the healthcare ecosystem. And my takeaway? They’re all committed to shaping AI into a tool that can improve the industry for practitioners and patients alike.

In this final episode, I’m thrilled to welcome back my coauthors, Carey Goldberg and Dr. Zak Kohane. We’ll examine the insights from the second half of the season. 

[TRANSITION MUSIC] 

Carey, Zak—it’s really great to have you here again! 

CAREY GOLDBERG: Hey, Peter! 

ZAK KOHANE: Hi, Peter. 

LEE: So this is the second roundtable. And just to recap, you know, we had several early episodes of the podcast where we talked to some doctors, some technology developers, some people who think about regulation and public policy, patient advocates, a venture capitalist who invests in, kind of, consumer and patient-facing medical ventures, and some bioethicists. 

And I think we had a great conversation there. I think, you know, it felt mostly validating. A lot of the things that we predicted might happen happened, and then we learned a lot of new things. But now we have five more episodes, and the mix of kinds of people that we talk to here is different than the original. 

And so I thought it would be great for us to have a conversation and recap what we think we heard from all of them. So let’s just start at the top. 

So in this first episode in the second half of this podcast series, we talked to economists Azeem Azhar and Ethan Mollick. And I thought those conversations were really interesting. Maybe there were, kind of, two things, two main topics. One was just the broader impact on the economy, on the cost of healthcare, on overall workforce issues. 

One of the things that I thought was really interesting was something that Ethan Mollick brought up. And maybe just to refresh our memories, let’s play this little clip from Ethan. 

LEE: So let me start with you, Zak. Does that make sense to you? Are you seeing something similar? 

KOHANE: I thought it was incredibly insightful because we discussed on our earlier podcast how a chief AI officer in one of the healthcare hospitals, in one of the healthcare systems, was highly regulating the use of AI, but yet in her own practice on her smartphone was using all these AI technologies. 

And so it’s insightful that on the one hand, she is increasing her personal productivity, … 

LEE: Right. 

KOHANE: … and perhaps she’s increasing her quality of her care. But it’s very hard for the healthcare system to actually realize any gains. It’s unlikely … let’s put it this way. It would be for her a defeat if they said, “Now you should see more patients.” 

LEE: Yes. [LAUGHS] 

KOHANE: Now, I’m not saying that won’t happen. It could happen. But, you know, gains of productivity are really at the individual level of the doctors. And that’s why they’re adopting it. That’s why the ambient dictation tools are so successful. But really turning it into things that matter in terms of productivity for healthcare, namely making sure that patients are getting healthy, requires that every piece of the puzzle works well together. You know, it’s well-tread ground to talk about how patients get very expensive procedures, like a cardiac transplant, and then go home, and they’re not put on blood thinners … 

LEE: Right. 

KOHANE: … and then they get a stroke. You know, the chain is as strong as the weakest link. And just having AI in one part of it is not going to do it. And so hospitals, I think, are doubly burdened by the fact that, (A) they tend to not like innovation because they are high-revenue, low-margin companies. But if they want it implemented effectively, they have to do it across the entire processes of healthcare, which are vast and not completely under their control. 

LEE: Yeah. Yep. You know, that was Sara Murray, who’s the chief health AI officer at UC San Francisco. 

And then, you know, Carey, remember, we were puzzled by Chris Longhurst’s finding in a controlled study that the, you know, having an AI respond to patient emails didn’t seem to lead to any, I guess you would call it, productivity benefits. I remember we were both kind of puzzled by that. I wonder if that’s related to what Ethan is saying here. 

GOLDBERG: I mean, possibly, but I think we’ve seen since then that there have been multiple studies showing that in fact using AI can be extremely effective or helpful, even, for example, for diagnosis. 

And so I find just from the patient point of view, it kind of drives me crazy that you have individual physicians using AI because they know that it will improve the care that they’re offering. And yet you don’t have their institutions kind of stepping up and saying, “OK, these are the new norms.” 

By the way, Ethan Mollick is a national treasure, right. Like, he is the classic example of someone who just stepped up at this moment … 

LEE: Yeah. 

GOLDBERG: … when we saw this extraordinary technological advance. And he’s not only stepping up for himself. He’s spreading the word to the masses that this is what these things can do. 

And so it’s frustrating to see the institutions not stepping up and instead the individual doctors having to do it. 

KOHANE: But he made another very interesting point, which was that the reason that he could be so informative to not only the public but practitioners of AI is these things would emerge out of the shop, and they would not be aged too long, like a fine wine, before they were just released to the public. 

And so he was getting exposure to these models just weeks after some of the progenitors had first seen it. And therefore, because he’s actually a really creative person in terms of how he exercises models, he sees uses and problems very early on. But the point is institutions, think about how much they are disadvantaged. They’re not Ethan Mollick. They’re not the progenitors. So they’re even further behind. So it’s very hard. If you talk to most of the C-suite of hospitals, they’d be delighted to know as much about the impact as Ethan Mollick. 

LEE: Yeah. By the way, you know, I picked out this quote because within Microsoft, and I suspect every other software company, we’re seeing something very similar, where individual programmers are 20 to 30% more productive just in the number of lines of code they write per day or the number of pull requests per week. Any way you measure it, it’s very consistent. And yet by the time you get to, say, a 25-person software engineering team, the productivity of that whole team isn’t 25% more productive. 

Now, that is starting to change because we’re starting to figure out that, well, maybe we should reshape how the team operates. And there’s more of an orientation towards having, you know, smaller teams of full-stack developers. And then you start to see the gains. But if you just keep the team organized in the usual way, there seems to be a loss. So there’s something about what Ethan was saying that resonated very strongly with me. 

GOLDBERG: But I would argue that it’s not just productivity we’re talking about. There’s a moral imperative to improve the care. And if you have tools that will do that, you should be using them or trying harder to. 

LEE: Right. Yep. 

KOHANE: I think, yes, first of all, absolutely you would. Unfortunately, most of the short-term productivity measures will not measure improvements in the quality of care because it takes a long time to die even with bad care. 

And so that doesn’t show up right away. But I think what Peter just said actually came across in several of the podcasts, which is that it’s very tricky trying to shoehorn these things into making what we’re already doing more productive. 

GOLDBERG: Yeah. Existing structures. 

KOHANE: Yeah. And I know, Carey, that you’ve raised this issue many times. But it really calls into question, what should we be doing with our time with doctors? And they are a scarce resource. And what is the most efficient way to use them? 

You know, I remember we [The New England Journal of Medicine AI] published a paper of someone who was able to use AI to increase the throughput of their emergency room by actually more appropriately having the truly sick people in the sick queue, in the triage queue, for urgent care. 

And so I think we’re going to have to think that way more broadly, about we don’t have to now look at every patient as an unknown with maybe a few pointers on diagnosis. We can have a fairly extensive profiling. 

And I know that colleagues in Clalit [Health Services] in Israel, for example, are using the overall trajectory of the patient and some considerations about utilities to actually figure out who to see next week. 

LEE: Yeah, you know, what you said brings up another maybe connection to one thing that we see also in software development. And it relates to also what we were discussing earlier: about the last thing a doctor wants is to have a tool that allows them to see even yet more patients per day. 

So in software development, there’s always this tension. Like, how many lines of code can you write per day? That’s one productivity measure. 

But sometimes we’re taught, well, don’t write more lines of code per day, but make sure that your code is well structured. Take the time to document it. Make sure it’s fully commented. Take the time to talk to your fellow software engineering team members to make sure that it’s well coordinated. And in the long run, even if you’re writing half the number of lines of code per day, the software process will be far more efficient.

And so I’ve wondered whether there’s a similar thing where doctors could see 20% fewer patients in a day, but if they take the time and also had AI help to coordinate, maybe a patient’s journey might be half as long. And therefore, the health system would be able to see twice as many patients in a year’s period or something like that. 

KOHANE: So I think you’ve “nerd sniped” me because you [LAUGHTER]—which is all too easy—but I think there’s a central issue here. And I think this is the stumbling block between what Ethan’s telling us about between the individual productivity and the larger productivity, is the team’s productivity. 

And there is actually a good analogy in computer science and that’s, uh, Brooks’s “mythical man-month,” … 

LEE: Yes, exactly. 

KOHANE: … where he shows how you can have more and more resources, but when the coordination starts failing, because you have so many, uh, individuals on the team, you start falling apart. And so even if the, uh, individual doctors get that much better, yeah, they take better care of patients, make less stupid things. 

But in terms of giving the “I get you into the emergency room, and I get you out of a hospital as fast as possible, as safely as possible, as effectively as possible,” that’s teamwork. And we don’t do it. And we’re not really optimizing our tools for that. 

GOLDBERG: And just to throw in a little reality check, I’m not aware of any indication yet that AI is in any way shortening medical journeys or making physicians more efficient. Yet … 

LEE: Right. 

GOLDBERG: at least. Yeah. 

LEE: Yes. So I think, you know, with respect to our book, critiquing our book, you know, I think it’s fair to say we were fairly focused or maybe even fixated on the individual doctor or nurse or patient, and we didn’t really, at least I never had a time where I stepped back to think about the whole care coordination team or the whole health system. 

KOHANE: And I think that’s right. It’s because, first of all, you weren’t thinking about it? It’s not what we’re taught in medical school. We’re not taught to talk about team communication excellence. And I think it’s absolutely essential. 

There’s a … what’s the … there was an early … [Terry] Winograd. And he was trying to capture what are the different kinds of actions related to pronouncements that you could expect and how could AI use that. And that was beginning to get at it. 

But I actually think this is dark matter of human organizational technology that is not well understood. And our products don’t do well. You know, we can talk about all the groupware things that are out there. But they all don’t quite get to that thing. 

LEE: Right. 

KOHANE: And I can imagine an AI serving as a team leader, a really active team leader, a real quarterback of, let’s say, a care team. 

LEE: Well, in fact, you know, we have been trying to experiment with this. My colleague, Matt Lungren, who was also one of the interviewees early on, has been working with Stanford Medicine on a tumor board AI agent—something that would facilitate tumor board meetings. 

And the early experiences are pretty interesting. Whether it relates to efficiency or productivity I think remains to be seen, but it does seem pretty interesting. 

But let’s move on. 

GOLDBERG: Well, actually, Peter, … 

LEE: Oh, go ahead. 

GOLDBERG: if you’re willing to not quite move on yet … 

LEE: [LAUGHS] All right. 

GOLDBERG: … this kind of segues into one of, I think, the most provocative questions that arose in the course of these episodes and that I’d love to have you answer, which was, remember, it was a question at a gathering that you were at, and you were asked, “Well, you’re focusing a lot on potential AI effects on individual patient and physician experiences. But what about the revolution, right? What about, like, can you be more big-picture and envision how generative AI could actually, kind of, overturn or fix the broken system, right?” 

I’m sure you’ve thought about that a lot. Like, what’s your answer? 

LEE: You know, I think ultimately, it will have to. For it to really make a difference, I think that the normal processes, our normal concept of how healthcare is delivered—how new medical discoveries are made and brought into practice—I think those things are going to have to change a lot. 

You know, one of the things I think about a lot right at the moment is, you know, we tend to think about, let’s say, medical diagnosis as a problem-solving exercise. And I think, at least at the Kaiser Permanente School of Medicine, the instruction really treats it as a kind of detective thing based on a lot of knowledge about biology and biomedicine and human condition, and so on. 

But there’s another way to think about it, given AI, which is when you see a patient and you develop some data, maybe through a physical exam, labs, and so on, you can just simply ask, “You know, what did the 500 other people who are most similar to this experience, how were they diagnosed? How were they treated? What were their outcomes? What were their experiences?” 

And that’s really a fundamentally different paradigm. And it just seems like at least the technical means will be there. And by the way, that also then relates to [the questions]: “And what was most efficacious cost-wise? What was most efficient in terms of the total length of the patient journey? How does this relate to my quality scores so I can get more money from Medicare and Medicaid?” 

All of those things, I think, you know, we’re starting to confront. 

One of the other episodes that we’re going to talk about, was my interview with two medical students. Actually, thinking of a Morgan Cheatham as just a medical student or medical resident [LAUGHTER] is a little strange. But he is. 

One of the things he talks about is the importance that he placed in his medical training about adopting AI. So, Zak, I assume you see this also with some students at Harvard Medical School. And the other medical student we interviewed, Daniel Chen, seemed to indicate this, too, where it seems like it’s the students who are bringing AI into the medical education ahead of the faculty. Does that resonate with you? 

KOHANE: It absolutely resonates with me. There are students I run into who, honestly, my first thought when I’m talking to them is, why am I teaching you [LAUGHTER], and why are you not starting a big AI company, AI medicine company, now and really change healthcare instead of going through the rest of the rigmarole? And I think broadly, higher education has a problem there, which is we have not embraced, again, going back to Ethan, a lot of the tools that can be used. And it’s because we don’t know necessarily the right way to teach them. And so far, the only lasting heuristic seems to be: use them and use them often. 

And so it’s an awkward thing, where the person who knows how to use the AI tools now in the first-year medical school can teach themselves better and faster than anybody else in their class who is just relying on the medical school curriculum. 

LEE: Now, the reason I brought up Morgan now after our discussion with Ethan Mollick is Morgan also talked about AI collapsing medical specialties. 

GOLDBERG: Yes. 

LEE: And so let’s hear this snippet from him. 

LEE: So on the specific question about specialties, Zak, do you have a point of view? And let me admit, first of all, for us, all three of us, we didn’t have any clue about this in our book. I don’t think. 

KOHANE: Not much. Not much of a clue. 

So I’m reminded of a New Yorker cartoon where you see a bunch of surgeons around the patient, and someone says, “Is that a spleen?” And it says, “I don’t know. I slept during the spleen lecture,” [LAUGHTER] and … or “I didn’t take the spleen course.” 

And yet when we measure things, we measure things much more than we think we are doing. So for example, we [NEJM AI] just published a paper where echocardiograms were being done. And it turns out those ultrasound waves just happen to also permeate the liver. And you can actually diagnose on the way with AI all the liver disease that is in—and treatable liver disease—that’s in those patients. 

But if you’re a cardiologist, “Liver? You know, I slept through liver lecture.” [LAUGHTER] And so I do think that, (A) the natural, often guild/dollar-driven silos in medicine are less obvious to AI, despite the fact that they do exist in departments and often in chapters. 

But Morgan’s absolutely right. I can tell you as an endocrinologist, if I have a child in the ICU, the endocrinologist, the nephrologist, and the neurosurgeon will argue about the right thing to do. 

And so in my mind, the truly revolutionary thing to do is to go back to 1994 with Pete Szolovits, the Guardian Angel Project. What I think you need is a process. And the process is the quarterback. And the quarterback has only one job: take care of the patient. 

And it should be thinking all the time about the patient. What’s the right thing? And can be as school-marmish or not about, “Zak, you’re eating this or that or exercise or sleep,” but also, “Hey, surgeons and endocrinologists, you’re talking about my host, Zak. This is the right way because this problem and this problem and our best evidence is this is the right way to get rid of the fluid. The other ways will kill him.”

And I think you need an authoritative quarterback that has the view of the others but then makes the calls. 

LEE: Is that quarterback going to be AI or human? 

KOHANE: Well, for the very lucky people, it’ll be a human augmented by AI, a super concierge. 

But I think we’re running out of doctors. And so realistically, it’s going to be an AI that will have to be certified in very different ways, along the ways Dave Blumenthal says, essentially, trial by fire. Like putting residents into clinics, we’re going to be putting AIs into clinics. 

But what’s worse, by the way, than the three doctors arguing about care in front of the patient is, what happens so frequently, is then you see them outpatient, and each one of them gives you a different set of decisions to make. Sometimes that actually interact pathologically, unhealthily with each other. And only the very smart nurses or primary care physicians will actually notice that and call, quote, a “family meeting,” or bring everybody in the same room to align them. 

LEE: Yeah, I think this idea of quarterback is really very, very topical right now because there’s so much intensity in the AI space around agents. And in fact, you know, the Microsoft AI team under Mustafa Suleyman and Dominic King, Harsha Nori, and team just recently posted a paper on something called sequential diagnosis, which is basically an AI quarterback that is supposed to smartly consult with other AI specialties. And interestingly, one of the AI agents is sort of the devil’s advocate that’s always criticizing and questioning things. 

GOLDBERG: That’s interesting. 

LEE: And at least on very, very hard, rare cases, it can develop some impressive results. There’s something to this that I think is emerging. 

GOLDBERG: And, Peter, Morgan said something that blew me away even more, which was, well, why do we even need specialists if the reason for a specialist is because there’s so much medical knowledge that no single physician can know all of it, and therefore we create specialists, but that limitation does not exist for AI. 

LEE: Yeah. Yeah. 

GOLDBERG: And so there he was kind of undermining this whole elaborate structure that has grown up because of human limitations that may not ultimately need to be there. 

LEE: Right. So now that gives me a good segue to get back to our economist and get to something that Azeem Azhar said. And so there’s a clip here from Azeem. 

LEE: And, you know, in the same conversation, he also talked about his own management of asthma and the fact that he’s been managing this for several decades and knows more than any other human being, no matter how well medically trained, could possibly know. And it’s also very highly personalized. And it’s not a big leap to imagine AI having that sort of lifelong understanding. 

KOHANE: So in fact, I want to give credit back to our book since you insulted us. [LAUGHTER] You challenged us. You doubted us. We do have at the end of the book an AI which is helping this woman manage her way through life. It’s quarterbacking for the woman all these different services. 

LEE: Yes. 

KOHANE: So there. 

LEE: Ah, you’re right. Yes. In fact, it’s very much, I think, along the lines of the vision that Azeem laid out in our conversation. 

GOLDBERG: Yeah. It also reminded me of the piece Zak wrote about his mother at one point when she was managing congestive heart failure and she needed to watch her weight very carefully to see her fluid status. And absolutely, there’s no … I see no reason whatsoever why that couldn’t be done with AI right now. Actually, although back then, Zak, you were writing that it takes much more than an AI [LAUGHS] to manage such a thing, right? 

KOHANE: You need an AI that you can trust. Now, my mother was born in 1927, and she’d learned through the school of hard knocks that you can’t trust too many people, maybe even not your son, MD, PhD [LAUGHTER]. 

But what I’ve been surprised [by] is how, for example, how many people are willing to trust and actually see effective use of AI as mental health counselors, for example. 

GOLDBERG: Yeah 

KOHANE: So it may in fact be that there’s a generational thing going on, and at least there’ll be some very large subset of patients which will be completely comfortable in ways that my mother would have never tolerated. 

LEE: Yeah. Now, I think we’re starting to veer into some of the core AI. 

And so I think maybe one of the most fun conversations I had was in the episode with both Sébastien Bubeck, my former colleague at Microsoft Research, and now he’s at OpenAI, and Bill Gates. And there was so much that was, I thought, interesting there. And there was one point, I think that sort of touches tangentially on what we were just conversing about, that Sébastien said. So let’s hear this snippet. 

LEE: So I thought Sébastien was saying something really profound, but I haven’t been able to quite decide or settle in my mind what it is. What do you make of what Seb just said? 

KOHANE: I think it’s context. I think that it requires an enormous amount of energy, brain energy, to actually correctly provide the context that you want this thing to work on. And it’s only going to really feel like we’re in a different playing field when it’s listening all the time, and it just steps right in. 

There is an advantage that, for example, a good programmer can have in prompting Cursor or any of these tools to do so. But it takes effort. And I think being in the conversation all the time so that you understand the context in the widest possible way is incredibly important. And I think that’s what Seb is getting at, which is if we spoon feed these machines, yes, 90%. 

But then, talking to a human being who then has to interact and gets distracted from whatever flow they’re in and maybe even makes them feel like an early bicycle rider who all of a sudden realizes, “I’m balancing on two wheels—oh no!” And they fall over. You know, there’s that interaction which is negatively synergistic. 

And so I do think it’s a very hard human-computer engineering problem. How do we make these two agents, human and computational, work in an ongoing way in the flow? I don’t think I’m seeing anything that’s particularly new. And the things that you’re beginning to hint about, Peter, in terms of agentic coordination, I think we’ll get to some of that.

LEE: Yeah. Carey, does this give you any pause? The kind of results that … they’re puzzling results. I mean, the idea of doctors with AI seeming at least in this one test—it’s just one test—but it’s odd that it does worse than the AI alone. 

GOLDBERG: Yes. I would want to understand more about the actual conditions of that study. 

From what Bill Gates said, I was most struck by the question of resource-poor environments. That even though this was absolutely one of the most promising, brightest perspectives that we highlighted in the book, we still don’t seem to be seeing a lot of use among the one half of humanity that lacks decent access to healthcare. 

I mean, there are access problems everywhere, including here in the United States. And it is one of the most potentially promising uses of AI. And I thought if anyone would know about it, he would with the work that the Gates Foundation does. 

LEE: You know, I think both you and Bill, I felt, are really simpatico. You know, Bill expressed genuine surprise that more isn’t happening yet. And it really echoed, in fact, maybe even using some of the exact same words that you’ve used. And so two years on, you’ve expressed repeatedly expecting to have seen more out in the field by now. And then I thought Bill was saying something in our conversation very similar. 

GOLDBERG: Yeah. 

LEE: You know, for me, I see it both ways. I see the world of medicine really moving fast in confronting the reality of AI in such a serious way. But at the same time, it’s also hard to escape the feeling that somehow, we should be seeing even more. 

So it’s an odd thing, a little bit paradoxical. 

GOLDBERG: Yeah. I think one thing that we didn’t focus on hardly at all in the book but that we are seeing is these companies rising up, stepping up to the challenge, Abridge and OpenEvidence, and what Morgan describes as a new stack, right. 

So there is that on the flip side. 

LEE: Now, I want to get back to this thing that Seb was saying. And, you know, I had to bring up the issue of sycophancy, which we discussed at our last roundtable also. But it was particularly … at the time that Seb, Bill, and I had our conversation, OpenAI had just gone through having to retract a fresh update of GPT-4o because it had become too sycophantic. 

So I can’t escape the feeling that some of these human-computer interaction issues are related to this tension between you want AI to follow your directions and be faithful to you, but at the same time not agree with you so often that it becomes a fault. 

KOHANE: I think it’s asking the AI to enter into a fundamental human conundrum, which is there are extreme versions of doublethink, and there’s everyday things, everyday asks of doublethink, which is how to be an effective citizen. 

And even if you’re thinking, “Hmm. I’m thinking this. I’m just not going to say it because that would be rude or counterproductive.” Or some of the official doublethinks, where you’re actually told you must say this, even if you think something else. And I think we’re giving a very tough mission for these things: be nice to the user and be useful. 

And, in education, where the thing is not always one and the same. Sometimes you have to give a little tough love to educate someone, and doing that well is both an art and it’s also very difficult. And so, you know, I’m willing to believe that the latest frontier models that have made the news in the last month are very high-performing, but they’re also all highlighting that tension … 

LEE: Yes. 

KOHANE: … that tension between behaving like a good citizen and being helpful. And this gets back to what are the fundamental values that we hope these things are following. 

It’s not, you know, “Are these things going to develop us into the paperclip factory?” It’s more of, “Which of our values are going to be elevated, and which one will be suppressed?” 

LEE: Well, since I criticized our book before, let me pat ourselves on the back this time because, I think, pervasive throughout our book, we were touching on some of these issues. 

In fact, we started the book, you know, with GPT-4 scolding me for wanting it to impersonate Zak. And there was the whole example of asking it to rewrite a poem in a certain way, and it kind of silently just tried to slide, you know, without me knowing, slide by without following through on the whole thing. 

And so that early version of GPT-4 was definitely not sycophantic at all. In fact, it was just as prone to call you an idiot if it thought you were wrong. [LAUGHTER] 

KOHANE: I had some very testy conversations around my endocrine diagnosis with it. [LAUGHTER] 

GOLDBERG: Yeah. Well then, Peter, I would ask you, I mean last time I asked you about, well, hallucinations, aren’t those solvable? And this time I would ask you, well, sycophancy, isn’t that kind of like a dial you can turn? Like, is that not solvable? 

LEE: You know, I think there are several interlocking problems. But if we assume superintelligence, even with superintelligence, medicine is such an inexact science that there will always be situations that are guesses that take into account other factors of a person’s life, other value judgments, exactly as Zak had pointed out in our previous roundtable conversation. 

And so I think there’s always going to be an opening for either differences of opinion or agreeing with you too much. And there are dangers in both cases. And I think they’ll always be present. I don’t know that, at least in something as inexact as medical science, I don’t know that it’ll ever be completely eliminated. 

KOHANE: And it’s interesting because I was trying to think what’s the right balance, but there are patients who want to be told this is what you do. Whereas there’s other patients who want to go through every detail of the reasoning. 

And it’s not a matter of education. It’s really a temperamental, personality issue. And so we’re going to have to, I think, develop personalities … 

LEE: Yeah. 

KOHANE: … that are most effective for those different kinds of individuals. And so I think that is going to be the real frontier. Having human values and behaving in ways that are recognizable and yet effective for certain groups of patients. 

LEE: Yeah. 

KOHANE: And lots of deep questions, including how paternalistic do we want to be? 

LEE: All right, so we’re getting into medical science and hallucination. So that gives me a great segue to the conversations in the episode on biomedical research. And one of the people that I interviewed was Noubar Afeyan from Moderna and Flagship Pioneering. So let’s listen to this snippet. 

LEE: [LAUGHS] So I think that really touches on just the fact that there’s so many unknowns and such lack of precision and exactness in our understanding of human biology and of medicine. Carey, what do you think? 

GOLDBERG: I mean, I just have this emotional reaction, which is that I love the idea of AI marching into biomedical science and everything from getting to the virtual cell eventually to, Zak, I think it was a colleague of yours who recently published about … it was a new medication that had been sort of discovered by AI, and it was actually testing out up to the phase II level or something, right?

KOHANE: Oh, this is Marinka’s work. 

GOLDBERG: Yeah, Marinka, Marinka Zitnik. And … yeah. So, I mean, I think it avoids a lot of the, sort of, dilemmas that are involved with safety and so on with AI coming into medicine. And it’s just the discovery process, which we all want to advance as quickly as possible. And it seems like it actually has a great deal of potential that’s already starting to be realized. 

LEE: Oh, absolutely. 

KOHANE: I love this topic. First of all, I thought, actually, I think Bill and Seb, actually, had interesting things to say on that very topic, rationales which I had not really considered why, in fact, things might progress faster in the discovery space than in the clinical delivery space, just because we don’t know in clinical medicine what we’re trying to maximize precisely. Whereas for a drug effect, we do know what we’re trying to maximize. 

LEE: Well, in fact, I happened to save that snippet from Bill Gates saying that. So let’s cue that up. 

LEE: So, Zak, isn’t that Bill saying exactly what you’re saying? 

KOHANE: That is my point. I have to say that this is another great bet, that either we’re all going to be surprised or a large group of people will be surprised or disappointed. 

There’s still a lot of people in the sort of medicinal chemist, trialist space who are still extremely skeptical that this is going to work. And we haven’t quite shown them yet that it is. Why have we not shown them? Because we haven’t gone all the way to a phase III study, which showed that the drug behaves as expected to, is effective, and basically doesn’t hurt people. That turns out to require a lot of knowledge. I actually think we’re getting there, but I understand the skepticism. 

LEE: Carey, what are your thoughts? 

GOLDBERG: Yeah. I mean, there will be no way around going through full-on clinical trials for anything to ever reach the market. But at the same time, you know, it’s clearly very promising. And just to throw out something for the pure fun of it, Peter, I saw … one of my favorite tweets recently was somebody saying, you know, isn’t it funny how computer science is actually becoming a lot more like biology in that it’s just becoming empirical. 

It’s like you just throw stuff at the AI and see what it does. [LAUGHTER] And I was like, oh, yeah, that’s what Peter was doing when we wrote the book. I mean, he understood as many innards as anybody can. But at the same time, it was a totally empirical exercise in seeing what this thing would do when you threw things at it. 

LEE: Right. 

GOLDBERG: So it’s the new biology. 

LEE: Well, yeah. So I think we talked in our book about accelerating, you know, biomedical knowledge and medical science. And that actually seems to be happening. And I really had fun talking to Daphne Koller about some of the accomplishments that she’s made. And so here’s a little snippet from Daphne. 

LEE: So, Zak, when I was listening to that, I was reminded of one of the very first examples that you had where, you know, you had a very rare case of a patient, and you’re having to narrow down some pretty complex and very rare genetic conditions. This thing that Daphne says, that seems to be the logical conclusion that everyone who’s thinking hard about AI and biology is coming to. Does it seem more real now two years on? 

KOHANE: It absolutely seems more real. Here’s some sad facts. If you are at a cancer center, you will get targeted therapies if you qualify for it. Outside cancer centers, you won’t. And it’s not that the therapies aren’t available. It’s just that you won’t have people thinking about it in that way. And especially if you have some of the rare and more aggressive cancers, if you’re outside one of those cancer centers, you’re at a significant disadvantage for survival for that reason. And so anything that provides just the “simple,” in quotes, dogged investigation of the targeted therapies for patients, it’s a home run. 

So my late graduate student, Atul Butte, died recently at UCSF, where he was both a professor and the leader of the Bakar Institute, and he was a Zuckerberg Chan Professor of Pediatrics. 

He was diagnosed with a rare tumor two years ago. His wife is a PhD biologist, and when he was first diagnosed, she sent me the diagnosis and the mutations. And I don’t know if you know this, Peter, but this was still when we were writing the book and people didn’t know about GPT-4. 

I put those mutations and the diagnosis into GPT-4. And I said, “I’d like to help treat my friend. What’s the right treatment?” And GPT-4, to paraphrase, said, “Before we start talking about treatment, are you sure this is the right diagnosis? Those mutations are not characteristic for that tumor.” And he had been misdiagnosed. And then they changed the diagnosis, the therapy, and some personnel. 

So I don’t have to hallucinate this. It’s already happened, and we’re going to need this. And so I think targeted therapy for cancers is the most obvious use. And if God forbid one of you has a family member who has cancer, it’s moral malpractice not to look at the genetics and run it past GPT-4 and say, “What are the available therapies?” 

LEE: Yeah. 

KOHANE: I really deeply believe that. 

LEE: Carey, I think one thing you’ve always said is that you’re surprised that we don’t hear more stories along these lines. And I think you threw a quote from Mustafa Suleyman back at me. Do you want to share that? 

GOLDBERG: Yes. Recently, I believe it was a Big Technology interview, and the reporter asked Mustafa Suleyman, “So you guys are seeing 50 million queries, medical queries, a day [to Copilot and Bing]. You know, how’s that going?” And I think I am a bit surprised that we’re not seeing more stories of all types. Both here’s how it helped me and also here was maybe, you know, a suggestion that was not optimal. 

LEE: Yeah. I do think in our book, we did predict both positive and negative outcomes of this. And it is odd. Atul was very open with his story. And of course, he is such … he was such a prominent leader in the world of medicine. 

But I think I share your surprise, Carey. I expected by now that a lot more public stories would be out. Maybe there is someone writing a book collecting these things, I don’t know. 

KOHANE: Maybe someone called Carey Goldberg should write that book. [LAUGHTER] 

GOLDBERG: Write a book, maybe. I mean, we have Patients Use AI, which is a wonderful blog by Dave deBronkart, the patient advocate. 

But I wonder if it’s also something structural, like who would be or what would be the institution that would be gathering these stories? I don’t know. 

LEE: Right. 

KOHANE: And that’s the problem. You see, this goes back to the same problem that [Ethan] Mollick was talking about. Individual doctors are using them. The hospital as a whole is not doing that. So it’s not judging the quality, as part of its quality metrics, of how good the AI is performing and what new has happened. And the other audience, namely the patients, have no mechanism. There is no mechanism to go to Better Business Bureau and say, “They screwed up,” or “This was great.” 

LEE: So now I want to get a little more futuristic. And this gets into whether AI is really going to get almost to the ab initio understanding of human biology. And so Eric Topol, who is one of the guests, spoke to this a bit. So let’s hear this. 

ERIC TOPOL: No, I think within 10 years for sure. You know, the group that got assembled, that Steve Quake pulled together, I think has 42 authors in a paper in Cell. The fact that he could get these 42 experts in life science and some in computer science to come together and all agree that not only is this a worthy goal, but it’s actually going to be realized, that was impressive. 

LEE: You know, I have to say Eric’s optimism took me aback. Just speaking as a techie, I think I started off being optimistic: as soon as we can figure out molecular dynamics, biology can be solved. And then you start to learn more about biochemistry, about the human cell, and then you realize, oh, my God, this is just so vast and unknowable. And now you have Eric Topol saying, “Well, in less than 10 years.” 

KOHANE: So what’s delightful about this period is that those of us who are cautious were so incredibly wrong about AI two years ago. [LAUGHTER] That’s a true joy … I mean, absolute joy. It’s great to have your futurism made much more positive. 

But I think that we’re going from, you know, for example, AlphaFold has had tremendous impact. But remember, that was built on years of acquisition of crystallography data that was annotated. And of course, the annotation process becomes less relevant as you go down the pipe, but it started from that. 

LEE: Yes. 

KOHANE: And there’s lots of parts of the cell. So when people talk about virtual cells—I don’t mean to get too technical—mostly they’re talking about perturbation of gene expression. They’re not talking about, “Oh, this is how the lysosome and the centrosome interact, and notice how the Golgi bodies bump into each other.” 

There’s a whole bunch of other levels of abstraction we know nothing about. This is a complex factory. And right now, we’re sort of at the level of going from code to loading code into memory. We’re not talking about how the rest of the robots work in that cell, and how the rest of those robots work in the cell turns out to be pretty important to functioning. 

So I’d love to be wrong again. And in 10 years, oh yeah, not only, you know, our first in-human study will be you, Dr. Zak. We’re going to put the drug in you because we fully simulated you. That’d be great. 

LEE: Yes. 

KOHANE: And, by the way, just to give people their due, there probably was a lot of animal research that could be done in silico and that for various political reasons we’re now seeing happen. That’s a good thing. But I think that sometimes it takes a lot of hubris to get us where we need to get, but my horizon is not the same as his. 

LEE: So I guess I have to take this time to brag. Just recently, our AI for Science team published in Science a biological emulator that does pretty long-timespan, very, very precise, and very efficient molecular dynamics, biomolecular dynamics emulation. We call it emulation because it’s not simulating every single time step but giving you the final conformations. 

KOHANE: That’s an amazing result. 

LEE: Yeah. 

KOHANE: But … that is an amazing result. And you’re doing it in some very important interactions. But there’s so much more to do. 

LEE: I know, and it’s single molecules; it’s not even two molecules. There’s so much more to go for here. But on the other hand, Eric is right, you know, 42 experts writing for Cell, you know, that’s not a small matter. 

KOHANE: So I think sometimes you really need to drink your own hallucinogens to actually succeed. Because remember, when the Human Genome Project (opens in new tab) was launched, we didn’t know how to sequence at scale. 

We said maybe we would get there. And then in order to get the right funding and excitement and, I think, focus, we predicted that by early 2000s we’d be transforming medicine. Has not happened yet. Things have happened, but at a much slower pace. And we’re 25 years out. In fact, we’re 35 years out from the launch. 

But again, things are getting faster and faster. Maybe the singularity is going to make a whole bunch of things easier. And GPT-6 will just say, “Zak, you are such a pessimist. Let me show you how it’s done.” 

GOLDBERG: Yeah. 

It really is a pessimism versus optimism. Like is it, I mean, biology is such a bitch, right. [LAUGHTER] Can we actually get there? 

At the same time, everyone was surprised and blown away by the, you know, the quantum leap of GPT-4. Who knows when enough data gets in there if we might not have a similar leap. 

LEE: Yeah. All right. 

So let’s get back to healthcare delivery. Besides Morgan Cheatham, we talked to [a] more junior medical student who’s at the Kaiser Permanente School of Medicine, Daniel Chen. And, you know, I asked him about this question of patients who come in armed [LAUGHS] with a lot of their own information. Let’s hear what he said about this. 

LEE: So, Zak, as far as I can tell, Daniel and Morgan are figuring this out on their own as medical students. I don’t think this is part of the curriculum. Does it need to be? 

KOHANE: It’s missing the bigger point. The incentives and economic forces are such that even if you were Daniel, and things have not changed in terms of incentives, and it’s 2030, he still has to see this many patients in an hour. 

And sitting down, going over that with a patient, let’s say some might need more … in fact, I think computer scientists are enriched for these sort of neurotic “explain [to] me why this works,” when often the answer is, “I have no idea; empirically it does.” 

And patients in some sense deserve that conversation, and we’re taught about joint decision making, but in practice, there’s a lot of skills that are deployed to actually deflect so that you can get through the appointment and see enough patients per hour. 

And that’s why I think that one of the central … another task for AI is how to engage with patients to actually explain to them why their doctor is doing what he’s doing and perhaps ask the one or two questions that you should be asking the doctor in order to reassure you that they’re doing the right thing.

LEE: Yeah. 

KOHANE: I just … right now, we are going to have less doctor time, not more doctor time. 

And so I’ve always been struck by the divide between medicine that we’re taught as it should be practiced as a gentle person’s vocation or sport as opposed to assembly line, heads down “you’ve got to see those patients by the end of the day” because, otherwise, you haven’t seen all the patients at the end of the day. 

LEE: Yeah. Carey, I’ve been dying to ask you this, and I have not asked you this before. When you go see a doctor, are you coming in armed with ChatGPT information? 

GOLDBERG: I haven’t needed to yet, but I certainly would. And also my reaction to the medical student description was, I think we need to distinguish between the last 20 years, when patients would come in armed with Google, and what they’re coming in with now because at least the experiences that I’ve witnessed, it is miles better to have gone back and forth with GPT-4 than with, you know, dredging what you can from Google. And so I think we should make that distinction. 

And also, the other thing that most interested me was this question for medical students of whether they should not use AI for a while so that they can learn … 

LEE: Yes. 

GOLDBERG: … how to think and similarly maybe don’t use the automated scribes for a while so they can learn how to do a note. And at what point should they then start being able to use AI? And I suspect it’s fairly early on that, in fact, they’re going to be using it so consistently that there’s not that much they need to learn before they start using the tools. 

LEE: These two students were incredibly impressive. And so I have wondered, you know, if we got a skewed view of things. I mean, Morgan is, of course, a very, very impressive person. And Daniel was handpicked by the dean of the medical school to be a subject of this interview. 

KOHANE: You know, we filter our students, by and large, I mean, there’s exceptions, but students in medical school are so starry eyed. And they are really … they got into medical school—I mean, some of them may have faked it—but a lot of them because they really wanted to do good. 

LEE: Right. 

KOHANE: And they really wanted to help. And so this is very consonant with them. And it’s only when they’re in the machine, past medical school, that they realize, oh my God, this is a very, very different story. 

And I can tell you, because I teach a course in computational-enabled medicine, so I get a lot of these nerd medical students, and I’m telling them, “You’re going to experience this. And you’re going to say, ‘I’m not going to be able to change medicine until I get enough cred 10, 15 years from now, whereas I could start my own company and immediately change medicine.’” 

And increasingly I’m getting calls in like residency and saying, “Zak, help me. How do I get out of this?” 

GOLDBERG: Wow. 

KOHANE: And so I think there’s a real disillusionment of, like, between what we’re asking for people coming to medical school—we’re looking for a phenotype—and then we’re disappointing them massively, not everywhere, but massively. 

And for me, it’s very sad because among our best and brightest, and then because of economics and expectations and the nature of the beast, they’re not getting to enjoy the most precious part of being a doctor, which is that real human connection, and longitudinality, you know, the connection between the same doctor visit after visit, is more and more of a luxury. 

LEE: Well, maybe this gets us to the last episode, you know, where I talk to a former, you know, state director of public health, Umair Shah, and with Gianrico Farrugia, who’s the CEO of Mayo Clinic. And I think if there’s one theme that I took away from those conversations is that we’re not thinking broadly enough nor big enough. 

And so here’s a little quote from an exchange with Umair Shah, who was the former head of public health in the State of Washington and, prior to that, in Harris County, Texas. We had a conversation about what techies tend to focus on when they’re thinking about AI and medicine. 

LEE: Yes. And in fact, it’s not even delivery. I think techies—I did this, too—tend to gravitate specifically to diagnosis.

LEE: I have been definitely guilty. I think Umair, of course, was speaking as a former frustrated public health official in just thinking about all the other things that are important to maintain a healthy population. 

Is there some lesson that we should take away? I think our book also focused a lot on things like diagnosis. 

KOHANE: Yeah. Well, first of all, I think we just have to have humility. And I think it’s a really important ingredient. I found myself staring at the increase in lifespan in human beings over the last two centuries and looking for bumps that were attributable. 

I’m in medical school. I’ve already made this major commitment. What are the bumps that are attributable to medicine? And there was one bump that was due to vaccines, a small bump. Another small bump that was due to antibiotics. And the rest of it is nutrition, sanitation, yeah, nutrition and sanitation. 

And so I think doctors can be incredibly valuable, but not all the time. And we’re spending now one-sixth of our GDP on it. The majority of it is not effectively prolonging life. And so the humility has to be the right medicine at the right time. 

But that runs, (A), against a bunch of business models. It runs against the primacy of doctors in healthcare. It was one thing when there were no textbooks and no PubMed. You know, the doctor was the repository of probably all the knowledge that we had. But I think your guests were right. We have to think more broadly in the public health way. How do we make knowledge pervasive, like sanitation? 

GOLDBERG: Although I would add that since what we’re talking about is AI, it’s harder to see if … and if what you’re talking about is public health, I mean, it was certainly very important to have good data during the pandemic, for example. 

But most of the ways to improve public health, like getting people to stop smoking and eat better and sleep better and exercise more, are not things that AI can help with that much. Whereas diagnosis or trying to improve treatment are places that it could tackle. 

And in fact, Peter, I wanted to put you—oh, wait, Zak’s going to say something—but, Peter, I wanted to put you on the spot. 

LEE: Yeah. 

GOLDBERG: I mean, if you had a medical issue now, and you went to a physician, would you be OK with them not using generative AI? 

LEE: I think if it’s a complex or a mysterious case, I would want them to use generative AI. I would want that second opinion on things. And I would personally be using it. If for no other reason than just to understand what the chart is saying. 

I don’t see, you know, how or why one wouldn’t do that now. 

KOHANE: It’s such a cheap second opinion, and people are making mistakes. And even if there are mistakes on the part of AI, if there’s a collision, discrepancy, that’s worth having a discussion. And again, this is something that we used to do more of when we had more time with the patients; we’d have clinic conferences. 

LEE: Yeah. 

KOHANE: And we don’t have that now. So I do think that there is a role for AI. But I think again, it’s much more of a continual presence, being part of a continued conversation rather than an oracle. 

And I think that’s when you’ll start seeing, when the AI is truly a colleague, and saying, “You know, Zak, that’s the second time you made that mistake. You know, that’s not obesity. That’s the effect of your drugs that you’re giving her. You better back off of it.” And that’s what we need to see happen. 

LEE: Well, and for the business of healthcare, that also relates directly to quality scores, which translates into money for healthcare providers. 

So the last person that we interviewed was Gianrico Farrugia. And, you know, I was sort of wondering, I was expecting to get a story from a CEO saying, “Oh, my God, this has been so disruptive, incredibly important, meaningful, but wow, what a headache.” 

At least Gianrico didn’t expose any of that. Here’s one of the snippets to give you a sense. 

LEE: So I tried pretty hard in that interview to get Gianrico to admit that there was a period of headache and disruption here. And he never, ever gave me that. And so I take him at his word. 

Zak, maybe I should ask you, what about Harvard and the whole Harvard medical ecosystem? 

KOHANE: I would be surprised if there are system-wide measurable gains in health quality right now from AI. And I do have to say that Mayo is one of the most marvelous organizations in terms of team behavior. So if there’s someone who’s gotten the team part of it right, they’ve come the closest, which relates to our prior conversation. They have the quarterback idea … 

LEE: Yes. 

KOHANE: … pretty well down compared to others. 

Nonetheless, I take him at his word, that it hasn’t disrupted them. But I’m also, I have yet to see the evidence that there’s been a quantum leap in quality or efficacy. And I do believe that it’s possible to have a quantum leap in efficacy in the right system. 

So if they haven’t been disrupted, I would venture that they’ve absorbed it, but they haven’t used it to its fullest potential. And the way I could be proven wrong is if next year the metrics show that over the last year, they’ve had, you know, decreased readmissions, decreased complications, decreased errors, and all that. And if so, God bless them. And we should all be more like Mayo. 

LEE: So I thought a little bit about two other quotes from the interviews that sort of maybe would send us off with some more inspirational kind of view of the future. And so there’s one from Bill Gates and one from Gianrico Farrugia. So what I’d like to do is to play both of those and then maybe we can have our last comments. 

And now Gianrico. 

All right. I think these are both kind of calls to be more assertive about this and more forward leaning. I think two years into the GPT-4 era, those are pretty significant and pretty optimistic calls to action. So maybe just to give you both one last word. What would be one hope that you would have for the world of healthcare and medicine two years from now? 

KOHANE: I would hope for businesses that whoever actually owns them at some holding company level, regardless of who owns them, are truly patient-focused companies, companies where the whole AI is about improving your care, and it’s only trying to maximize your care and it doesn’t care about resource limitations. 

And as I was listening to Bill, the problem with what he was saying about saving dollars for governments is that, for many things, we have some very expensive things that work. And if the AI says, “This is the best thing,” it’s going to break your bank. And instead, because of resource limitations, we play a human-based fancy footwork to get out of it. 

That’s a hard game to play, and I leave it to the politicians and the public health officials who have to do those trades of utilities. 

In my role as doctor and patient, I’d like to see very informed, authoritative agents acting only on our behalf so that when we go and we seek to have our maladies addressed, the only issue is, what’s the best and right thing for me now? And I think that is both technically realizable. And even in our weird system, there are business plans that will work that can achieve that. That’s my hope for two years from now. 

LEE: Yeah, fantastic. Carey. 

GOLDBERG: Yeah. I second that so enthusiastically. And I think, you know, we have this very glass half full/glass half empty phenomenon two years after the book came out. 

And it’s certainly very nice to see, you know, new approaches to administrative complexity and to prior authorization and all kinds of ways to make physicians’ lives easier. But really what we all care about is our own health and that we would like to be able to optimize the use of this truly glorious technological achievement to be able to live longer and better lives. And I think what Zak just described is the most logical way to do that. 

[TRANSITION MUSIC] 

LEE: Yeah, I think for me, two years from now, I would like to see all of this digital data that’s been so painful, such a burden on every doctor and nurse to record, actually amount to something meaningful in the care of patients. And I think it’s possible. 

KOHANE: Amen. 

GOLDBERG: Yeah. 

LEE: All right, so it’s been quite a journey. We were joking before we’re still on speaking terms after having written a book. [LAUGHS] 

And then, um, I think listeners might enjoy knowing that we debated amongst ourselves what to do about a second edition, which seemed too painful to me, and so I suggested the podcast, which seemed too painful to the two of you [LAUGHTER]. And in the end, I don’t know what would have been easier, writing a book or doing this podcast series, but I do think that we learned a lot. 

Now, last bit of business here. To avoid having the three of us try to write a book again and do this podcast, I leaned on the production team in Microsoft Research and the Microsoft Research Podcast. And I thought it would be good to give an explicit acknowledgment to all the people who’ve contributed to this. 

So it’s a long list of names. I’m going to read through them all. And then I suggest that we all give an applaud [LAUGHTER] to them. And so here we go. 

There’s Neeltje Berger, Tetiana Bukhinska, David Celis Garcia, Matt Corwine, Jeremy Crawford, Kristina Dodge, Chris Duryee, Ben Ericson, Kate Forster, Katy Halliday, Alyssa Hughes, Jake Knapp, Weishung Liu, Matt McGinley, Jeremy Mashburn, Amanda Melfi, Wil Morrill, Joe Plummer, Brenda Potts, Lindsay Shanahan, Sarah Sobolewski, David Sullivan, Stephen Sullivan, Amber Tingle, Caitlyn Treanor, Craig Tuschhoff, Sarah Wang, and Katie Zoller. 

Really a great team effort, and they made it super easy for us. 

GOLDBERG: Thank you. Thank you. Thank you. 

KOHANE: Thank you. Thank you.

GOLDBERG: Thank you. 

[THEME MUSIC] 

LEE: A big thank you again to all of our guests for the work they do and the time and expertise they shared with us. 

And, last but not least, to our listeners, thank you for joining us. We hope you enjoyed it and learned as much as we did. If you want to go back and catch up on any episodes you may have missed or to listen to any again, you can visit aka.ms/AIrevolutionPodcast (opens in new tab).

Until next time.

[MUSIC FADES] 

The post Coauthor roundtable: Reflecting on healthcare economics, biomedical research, and medical education appeared first on Microsoft Research.

Read More

MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation

MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation

Three white line icons on a gradient background transitioning from blue to pink. From left to right: a network or molecule structure with a central circle and six surrounding nodes, a 3D cube, and an open laptop with an eye symbol above it.

A new research framework helps AI agents explore three-dimensional spaces they can’t directly detect. Called MindJourney, the approach addresses a key limitation in vision-language models (VLMs), which give AI agents their ability to interpret and describe visual scenes.  

While VLMs are strong at identifying objects in static images, they struggle to interpret the interactive 3D world behind 2D images. This gap shows up in spatial questions like “If I sit on the couch that is on my right and face the chairs, will the kitchen be to my right or left?”—tasks that require an agent to interpret its position and movement through space. 

People overcome this challenge by mentally exploring a space, imagining moving through it and combining those mental snapshots to work out where objects are. MindJourney applies the same process to AI agents, letting them roam a virtual space before answering spatial questions. 

How MindJourney navigates 3D space

To perform this type of spatial navigation, MindJourney uses a world model—in this case, a video generation system trained on a large collection of videos captured from a single moving viewpoint, showing actions such as going forward and turning left or right, much like a 3D cinematographer. From this, it learns to predict how a new scene would appear from different perspectives.

At inference time, the model can generate photo-realistic images of a scene based on possible movements from the agent’s current position. It generates multiple possible views of a scene while the VLM acts as a filter, selecting the constructed perspectives that are most likely to answer the user’s question.

These are kept and expanded in the next iteration, while less promising paths are discarded. This process, shown in Figure 1, avoids the need to generate and evaluate thousands of possible movement sequences by focusing only on the most informative perspectives.

Figure 1. Given a spatial reasoning query, MindJourney searches through the imagined 3D space using a world model and improves the VLM’s spatial interpretation through generated observations when encountering new challenges.
Figure 1. Given a spatial reasoning query, MindJourney searches through the imagined 3D space using a world model and improves the VLM’s spatial interpretation through generated observations when encountering new challenges. 

 

To make its search through a simulated space both effective and efficient, MindJourney uses a spatial beam search—an algorithm that prioritizes the most promising paths. It works within a fixed number of steps, each representing a movement. By balancing breadth with depth, spatial beam search enables MindJourney to gather strong supporting evidence. This process is illustrated in Figure 2.

MindJourney pipeline diagram
Figure 2. The MindJourney workflow starts with a spatial beam search for a set number of steps before answering the query. The world model interactively generates new observations, while a VLM interprets the generated images, guiding the search throughout the process.

By iterating through simulation, evaluation, and integration, MindJourney can reason about spatial relationships far beyond what any single 2D image can convey, all without the need for additional training. On the Spatial Aptitude Training (SAT) benchmark, it improved the accuracy of VLMs by 8% over their baseline performance.
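As a rough illustration of that loop, here is a minimal sketch of a spatial beam search over imagined viewpoints. The interfaces world_model.imagine(view, action), vlm.score(question, views), and vlm.answer(question, views), along with the action names and beam settings, are hypothetical stand-ins for the video-generation world model and the VLM described above, not the actual MindJourney API.

```python
from itertools import product

ACTIONS = ["move_forward", "turn_left", "turn_right"]  # assumed action set

def spatial_beam_search(question, initial_view, world_model, vlm,
                        beam_width=4, max_steps=3):
    # Each beam entry is the sequence of views imagined so far, starting from the real one.
    beams = [[initial_view]]
    for _ in range(max_steps):                   # fixed number of imagined movements
        candidates = []
        for views, action in product(beams, ACTIONS):
            new_view = world_model.imagine(views[-1], action)   # generated observation
            trajectory = views + [new_view]
            # The VLM scores how useful this imagined trajectory is for the question.
            candidates.append((vlm.score(question, trajectory), trajectory))
        # Keep only the most promising trajectories; discard the rest.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = [traj for _, traj in candidates[:beam_width]]
    # Answer using the best supporting evidence gathered during the search.
    return vlm.answer(question, beams[0])
```

The key design point is that the VLM prunes the tree of imagined movements at every step, so only a handful of trajectories are ever rendered by the world model rather than the full exponential set.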


Building smarter agents  

MindJourney showed strong performance on multiple 3D spatial-reasoning benchmarks, and even advanced VLMs improved when paired with its imagination loop. This suggests that the spatial patterns that world models learn from raw images, combined with the symbolic capabilities of VLMs, create a more complete spatial capability for agents. Together, they enable agents to infer what lies beyond the visible frame and interpret the physical world more accurately. 

It also demonstrates that pretrained VLMs and trainable world models can work together in 3D without retraining either one—pointing toward general-purpose agents capable of interpreting and acting in real-world environments. This opens the way to possible applications in autonomous robotics, smart home technologies, and accessibility tools for people with visual impairments. 

By converting systems that simply describe static images into active agents that continually evaluate where to look next, MindJourney connects computer vision with planning. Because exploration occurs entirely within the model’s latent space—its internal representation of the scene—robots would be able to test multiple viewpoints before determining their next move, potentially reducing wear, energy use, and collision risk. 

Looking ahead, we plan to extend the framework to use world models that not only predict new viewpoints but also forecast how the scene might change over time. We envision MindJourney working alongside VLMs that interpret those predictions and use them to plan what to do next. This enhancement could enable agents to interpret spatial relationships and physical dynamics more accurately, helping them operate effectively in changing environments.

The post MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation appeared first on Microsoft Research.

Read More

Dion: the distributed orthonormal update revolution is here

Dion: the distributed orthonormal update revolution is here

Three white icons on a gradient background transitioning from blue to green. From left to right: a network of interconnected nodes, a speedometer with the needle pointing right, and a flowchart with squares and a diamond shape.

Training AI models requires choosing an optimizer, and for nearly a decade, AdamW (opens in new tab) has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon (opens in new tab) showed serious promise by powering a nanoGPT speedrun (opens in new tab). This proved out, with multiple AI labs (e.g., Kimi-AI (opens in new tab) and Essential-AI (opens in new tab)) reporting 2x scale improvements and the release of the 1T-parameter Kimi K2 (opens in new tab) model. Restated: you can train a model to similar performance with half as many GPUs.

There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which demand heavy communication for large models at the scale where FSDP and TP parallelization become desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable alternative linear algebras realizing the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.

What’s an orthonormal update?

Illustration of matrix parameters
Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality (opens in new tab) on the update matrix, thereby equalizing its effect across all input directions.
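To make the idea concrete, here is a minimal sketch of an orthonormalized update for a single weight matrix, using an explicit SVD to set every singular value of the update to 1. This is only an illustration of the concept; Muon approximates the same result with Newton–Schulz iterations rather than computing an SVD, and the shapes and learning rate below are placeholders.

```python
import torch

def orthonormalize_update(update: torch.Tensor) -> torch.Tensor:
    # SVD: update = U diag(S) V^T. Dropping S (i.e., returning U V^T) makes the
    # update's effect on output activations roughly uniform across input directions.
    U, _, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh

# Illustrative usage on one weight matrix (shapes and learning rate are made up):
W = torch.randn(1024, 512)       # learned weight matrix
grad = torch.randn_like(W)       # gradient (or momentum) for this step
lr = 0.02
W -= lr * orthonormalize_update(grad)
```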

What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI (opens in new tab), applying Muon to large architectures like LLaMA-3 becomes compute-bound—and potentially communication-bound—due to the cost of the Newton–Schulz orthonormalization steps (opens in new tab).

Pseudocode of the centralized version of Dion
Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top-r subspace of singular vectors, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank necessary for good performance grows much more slowly than the number of parameters in larger models.

Dion implements orthonormalization using amortized power iteration (opens in new tab). Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps—applied to the slowly evolving momentum matrix—we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization, as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism. This keeps the residual of low rank approximation in the momentum matrix so that any systematic gradient structure not initially captured accumulates to eventually be applied in a future update.
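To show how the pieces above could fit together, here is a simplified, centralized sketch of one Dion-style step for a single weight matrix, combining amortized power iteration, QR orthonormalization, and error feedback. It is a reading of the description above rather than the released implementation: the function name, the exact momentum and error-feedback rules, and the final scaling are assumptions, and the open-source distributed version differs in detail.

```python
import torch

def dion_like_step(M, Q, grad, mu=0.95):
    """One simplified, centralized Dion-style step for a single (m x n) matrix.

    M    : (m, n) momentum buffer, which also carries the error-feedback residual
    Q    : (n, r) slowly evolving right factor reused across steps
    grad : (m, n) current gradient
    Returns the orthonormalized rank-r update plus the new M and Q.
    """
    B = M + grad                      # fold the new gradient into the buffer
    P = B @ Q                         # amortized power-iteration step: (m, r)
    P, _ = torch.linalg.qr(P)         # orthonormal basis for the top-r column space
    R = B.T @ P                       # (n, r) projection of the buffer onto that basis
    M_new = B - (1 - mu) * (P @ R.T)  # error feedback: keep whatever was not applied
    Q_new, _ = torch.linalg.qr(R)     # refreshed right factor for the next step
    update = P @ Q_new.T              # rank-r update with all singular values equal to 1
    return update, M_new, Q_new

# Illustrative usage (shapes, rank, and learning rate are made up):
m, n, r = 1024, 512, 64
W, M = torch.randn(m, n), torch.zeros(m, n)
Q, _ = torch.linalg.qr(torch.randn(n, r))
grad = torch.randn(m, n)
update, M, Q = dion_like_step(M, Q, grad)
W -= 0.01 * update
```

Because only the rank-r factors are ever orthonormalized, the per-step cost scales with r rather than with the full matrix dimensions, which is what makes the approach attractive at large scale.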


How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion’s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Wall-clock time speedup of Dion for 3B model training
Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates—leading to a measurable improvement in performance.

This edge further grows with batch size—with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).

Scaling of Dion across different batch sizes
Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a target pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and ¼-rank Dion (in orange) and Muon (in blue).

In our experiments, these benefits extend to various post-training regimes as well.

We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.

Low-rank Dion across different model sizes
Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 (opens in new tab) 405B-parameter model suggests that, for large dense models like it, Dion remains fully effective even with rank fractions as low as 1/16 or 1/64.

Using hardware timings of the individual update steps suggests a story that looks like this:

Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.
Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.

Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.

The post Dion: the distributed orthonormal update revolution is here appeared first on Microsoft Research.

Read More

Reimagining healthcare delivery and public health with AI

Reimagining healthcare delivery and public health with AI

Illustrated headshots of Peter Lee, Umair Shah, Gianrico Farrugia

In November 2022, OpenAI’s ChatGPT kick-started a new era in AI. This was followed less than a half year later by the release of GPT-4. In the months leading up to GPT-4’s public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, healthcare leaders Dr. Umair Shah (opens in new tab) and Dr. Gianrico Farrugia (opens in new tab) join Lee to discuss AI’s impact on the business of public health and healthcare delivery, the healthcare-research connection, and the patient experience. Shah, a healthcare strategic consultant and former state secretary of health, explores the role of public health in the larger ecosystem, why it might not get the attention it needs or deserves, and how AI could be leveraged to assist in data analysis, to help better engage with people on matters of public health, and to help narrow gaps between care delivery and public health responses during health emergencies. Farrugia, president and CEO of Mayo Clinic, traces AI’s path from predictive to generative and discusses how that progress has helped usher in a new healthcare architecture for Mayo Clinic and its partners, one powered by the goal of longer, healthier lives for patients, and how AI is also changing Mayo Clinic’s research and the education it provides, including the offering of master’s and PhD degrees in AI and other emerging technologies. 

Transcript 

[MUSIC] 

[BOOK PASSAGE] 

PETER LEE: “In US healthcare, quality ratings are increasingly used to tie the improvement in patient health outcomes to the reimbursement rates that healthcare providers can receive. The ability of GPT-4 to understand these systems and give concrete advice … has a chance to make it easier for providers to achieve success in both dimensions.” 

[END OF BOOK PASSAGE] [THEME MUSIC]

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.


[THEME MUSIC FADES] 

The book passage I read at the top is from Chapter 7, “The Ultimate Paperwork Shredder.”

Public health officials and healthcare system leaders influence the well-being and health of people at the population level. They help shape people’s perceptions and responses to public health emergencies, as well as to chronic disease. They help determine the type, quality, and availability of treatment. All this is critical for maintaining good public health, as well as aligning better health and financial outcomes. That, of course, is the main goal of the concept of value-based care. AI can definitely have significant ramifications for achieving this. 

Joining us today to talk about how leaders in public health and healthcare systems are thinking about and acting on this new generation of AI is Dr. Umair Shah and Dr. Gianrico Farrugia. 

Dr. Umair Shah is a nationally recognized health leader and innovator. He led one of America’s top-rated pandemic responses as Washington State’s secretary of health, a position he held from 2020 to 2025. Umair previously directed Harris County Public Health in Texas, overseeing large-scale emergency response for the nation’s third-largest county, while building an emergency-care career spanning 20-plus years. He now advises organizations on health innovation and strategy as founder and principal of Rickshaw Health. 

Dr. Gianrico Farrugia is the president and CEO of Mayo Clinic, the world’s top-ranked hospital for seven consecutive years, and a pioneer in technology-forward, platform-based healthcare. Under his leadership, Mayo has built and deployed the Mayo Clinic Platform. The platform enables Mayo and its partners to gain practical insights from a comprehensive repository of longitudinal de-identified clinical data spanning four continents. Gianrico is also a Mayo Clinic physician and professor and an author. 

Umair and Gianrico are CEO-level leaders representing some of the best of the worlds of public health, healthcare delivery, medical research, and medical education. 

[TRANSITION MUSIC] 

Here is my interview with Dr. Umair Shah: 

LEE: Umair, it’s really great to have you here. 

UMAIR SHAH: Peter, it’s my pleasure. I’ve been looking forward to this conversation, and I hope you are well today. 

LEE: [LAUGHS] I am doing extremely well.

So, you know, what I’d like to do in these conversations is first just to start, a little bit about you.

SHAH: Sure. 

LEE: You served actually during a really tumultuous time as the secretary of health in the State of Washington. But you recently stepped away from that and you started your own firm, Rickshaw Health. So can we start there? What’s that all about? 

SHAH: Yeah, no, absolutely. First of all, you know, I would say that the transition from Texas to Washington could not have been more geopolitically different, [LAUGHTER] as you can imagine.

LEE: Sure. 

SHAH: You know, if you like the red-blue paradigms, you couldn’t be more, you know, red and you couldn’t be more blue, I think. 

LEE: Yes. 

SHAH: But what happened is, back in November this past year, as I saw some of the playout of continuation of this red-blue dynamic, I made the decision to step down. And Jan. 15, I stepped down, as you mentioned, and I spent some time really thinking about what I wanted to do next and was looking at a number of opportunities. 

And then a moment in time, there were some things happening in our—my wife and our family’s personal lives that sort of made me think that I wanted to focus a little bit more on family. And I felt the universe was saying, “Stay still.” [LAUGHTER] 

And I launched Rickshaw Health (opens in new tab) and the notion that, as you know, Peter, rickshaws are oftentimes known across the globe as these modes of transport that reliably get you through ever-changing streets and traffic patterns and all sorts of ecosystems that are evolving at all times. And they get you to the other side and they get you also with a sense of exhilaration. Like when I took my boys to Karachi, and we were—you know, they jumped in a rickshaw and the, you know, open air [LAUGHTER] and they felt this incredible excitement. 

And so Rickshaw Health was speaking to the three wheels of a rickshaw that symbolize the three children that we have and the real notion of how do we bring balance and agility and performance to the forefront and then move in an ever—just like streets—ever-changing healthcare environment that is constantly evolving, and we too must evolve with it. And that’s what Rickshaw Health is all about, is taking clients to that next level of trying to navigate, especially at this time, a very, very different landscape than even several months ago. So, excited about it. 

LEE: Yeah, absolutely. You know, you made this transition from Texas to the State of Washington. And for people who listen to this podcast and don’t know, the particular part of Texas where you were—Harris County—is really big, very, very important in that state. That’s just not, you know, the normal county in Texas. 

SHAH: Yeah. [LAUGHS] 

LEE: It’s actually … it’s actually known as quite a forward-looking place, technologically. 

SHAH: That’s right. 

LEE: So what was, you know, the transition like, then, going from, you know, possibly the most, sort of, maybe advanced county in the State of Texas, a large place, to the State of Washington? 

SHAH: Yeah, you know, Harris County is the third-largest county in the US. So it had close to five million. And now it’s probably … it’s exceeded the five million people, and a very diverse, very forward-looking, as you mentioned, technologically very, very much looking at what’s the next horizon, and home to Texas Medical Center [TMC] as well, which is … 

LEE: Right. 

SHAH: … the largest medical center. Of course, it had to be Texas. So it can’t be the largest in the state or the country [LAUGHTER]—the largest in the world, right. 

And TMC also had a number of different initiatives related to startups and venture capital and VC. And so they had launched something called TMCX. And that was a real opportunity—and I know you’re familiar with it—an opportunity to really look at how do you incubate all sorts of different innovations and bringing private sector, public sector as well as healthcare delivery alongside these startups to really look at the landscape. 

And so when I left Houston and came to Washington, I realized that obviously, I was in the backyard … I mean, you know, you all at Microsoft Research and the work that you’re all doing is part of an ecosystem of advanced innovation that’s occurring in the Pacific Northwest that, you know, when we see all the players that are here, all the, you know, ones that do so many different things, but they’re doing them with an eye towards technology, advancements, and adoptions, it’s been quite amazing. 

When I made that transition, it was really about, you know, the vaccines and what was happening with, you know, with COVID and fighting the—you know, remember, this was the state that had the first case in the continental United States, had the first outbreak, and the first [lab-confirmed] death. And fast-forward a few years later, we had the fifth-lowest death rate in the US. And that was because we all came together to do so much.

LEE: Yeah, well maybe that gets us into a question that I ask a lot of our guests, which is, you know, and maybe let’s, since we’re on your time as the secretary of health in Washington State, [start] with that job. I ask, how would you explain to your mother what you do every day? 

SHAH: [LAUGHS] I laugh because that’s been such a fascinating conversation in public health because we have oftentimes been—it’s been really hard to describe what that is. 

LEE: Yeah. 

SHAH: And, you know, there are so many metaphors and, you know, analogies that we’ve used. I’ve always wondered why we do not have more television shows or sitcoms or dramas that are about the public health workforce or the work that we do in the field, because you have, you know, all sorts of healthcare delivery ones, right. 

LEE: Yup. 

SHAH: As a practicing physician for 20 years, I realized that people knew what doctors did; they knew what nurses did, right. They intimately touch the healthcare system. 

LEE: Yes. 

SHAH: They understood, you know, that an ambulance picks you up at your home or somewhere else, transports you … gets you to the emergency department. The emergency department, they do some things to you or within the four walls of that ER, and then you’re either admitted, sent home, and several days, weeks, whatever later, you get home if you’re admitted, and you start your, you know, post-hospital stay at home or your rehab or what have you. And that all is known to people. 

But when you ask your mother, your grandmother, or your, you know, your uncle, or your brother, your neighbor, your coworker about what is public health, they have a very quizzical look on their face of what that is. 

LEE: Right. 

SHAH: And so what I’ve … 

LEE: You know, just one thing I’ve learned is: it’s not just all the people you mentioned. Even healthcare professionals sometimes have that quizzical look. 

SHAH: Yeah, good point. That’s right. Good point. And a lot of it is because we don’t get exposed to it or trained in it. You know, we think about public health when we’re in our training. And, you know, I’m sure you had a very similar piece of this is that, you know, you see it as, oh, that’s the health department that takes care of, you know, STDs, or it takes care, you know, it does the immunizations, or, you know, maybe they do some water quality, or maybe they do mosquitoes [mosquito control], and things like that. But the reality is, we do all of those things and more. 

So my metaphor has been that we are the offensive line of a football team, and the healthcare delivery is the quarterback. So everybody focuses on, you know, from a few years back, everybody knows Tom Brady, right. 

LEE: Yeah. [LAUGHS] 

SHAH: He won the Super Bowls, everybody knows what … but if you asked people who was number 75 on the offensive line of the New England Patriots … 

LEE: Right. 

SHAH: … or name your favorite football team. And the answer would be: you would not be able to likely answer that question. You would know Tom Brady, the quarterback, and that’s healthcare delivery, the ER doc or the hospitalist or the nurse or the, you know, the medical assistant, or the people that are doing all the work in the field that are the ones that are more visible, but the invisible workforce of the offensive line, that’s who we don’t know. And yet these are the people that are blocking and sweating and doing all things to complement the work and make sure the quarterback is successful. 

And here’s where the metaphor breaks down, that when Tom Brady wins the Super Bowl, we continue to invest in the offensive line because we recognize the value of it and we want the quarterback to be successful the next season. But in public health or in society, we do the exact opposite. 

When tuberculosis rates come down, we say, well, you know what? We’ve solved the problem; we don’t need it anymore. 

LEE: Right. 

SHAH: Or you have another, you know, environmental issue that’s no longer there, you say, “We don’t need it anymore.” And we disinvest from public health or that offensive line. And then you start to see those rates go back up. 

And so my answer to Mom and Grandma and Dad and Grandpa is we are critical to your health because we touch you every single day. And so please invest in us. 

LEE: Yeah. And, you know, I think I’m going to want to get a little deeper on that in just a few minutes here, because, I think especially during the pandemic, that issue of not understanding the importance of that offensive lineman actually really came to the forefront. 

And so I’d like to get into that. But the, kind of, second, kind of, standard thing I’ve been probing with people is still just focusing on you and your background is what touchpoints or experiences you’ve had with AI in the past. 

And not everyone has. Like, it maybe isn’t too surprising that doctors and healthcare developers, tech developers, have lots of contact with AI, but would the top dog, you know, at a public health agency ever have had significant contact with AI? What about you? 

SHAH: You know, it’s interesting. Several years ago, I was in the audience with the [then] FEMA director, [Rich Serino], who just did such an incredible job. And I remember he made this comment at that time. And, Peter, this may have been like … I don’t know—I’m dating myself—10, 15, maybe even 20 years ago, and he said, “Everybody in the audience, there’s this, you know, app called Twitter.” And, you know, “How many people in the audience have ever sent a tweet or know about this?” And I don’t know, maybe—it was a public health audience—maybe about 15% of the people raised their hands. 

He said, “I challenge you to right now, pick up your phone, download the app, and go ahead and send a tweet right now.” 

And I remember I sent my first tweet at that time. And it was so thought provoking for me was that he was saying you need to be engaged in social media, but the other 85% of the audience had not even done that or had … 

LEE: Right. 

SHAH: … even understood the importance of social media at that time. Or maybe they understood, but they had restrictions on how to utilize, right. 

So that has stayed with me because that’s very much about this revolution of AI that I know that public health and population health practitioners like myself who have been in the trenches and understand the importance of it, they really believe in the importance or think they know the importance. 

But NACCHO, the National Association of County and City Health Officials, had done a survey of local health agencies. And about two-thirds, if not three-quarters, of local health agencies reported that they had an AI capacity that was low or lower than ideal. 

LEE: Hmm. Yeah. Yeah. 

SHAH: And that is very much where I come from. When I was in public sector and at the state health agency, our transformation was very much about how do we advance the work, and how do we utilize this in a population health standpoint? 

And I was fortunate to have a chief of innovation at the Washington State Department of Health, Les Becker, who understood the value of AI. And as you know, we did also hold an AI science convening that … 

LEE: Yep. Yeah. 

SHAH: … your team was there with University of Washington. And that was really an opportunity for us to say that AI is here. It’s not tomorrow. It’s not next year. It’s not the future. It’s already here. We need to embrace it. 

But here’s the problem, Peter, far too few people in our field understand just how to embrace it. 

LEE: Right. 

SHAH: So I have become markedly more of a champion of AI. One, since I read your book. So I think there’s that. So thank you for writing it. But two, since I really recognize that when I became a solo or a primary-few practitioner in my own realm, I needed to force-amplify the work that I was doing. 

And when I look back, and I continue to stay in touch with my colleagues in the field of public health, what they’re also struggling with is that you have an epidemiologist who’s got a mound of information—data, statistics, etc.—that they are going through, and they’re doing everything in their power to get that processed and analyzed. 

LEE: Yep. Yep. 

SHAH: AI can take 80% of that and do it. And that epidemiologist can now turn to more of an overseer and a gatekeeper and to really recognize the patterns … 

LEE: Yep. 

SHAH: … and let AI be able to do the, you know, grunt work. And similarly, as you know, measles—with the outbreaks that we’ve seen, especially in Texas but elsewhere—you’ve got an opportunity where our communications people who are saying, “Look, we’re about to have, or we know we’re about to announce that there’s a measles outbreak in, you know, in our community or our state or what have you—our region.” 

And they can have AI go through different press briefings and/or press releases and say, “Give me the state of the art on how I should communicate this message to the community.” 

LEE: Hmm. 

SHAH: And bam! You can do that. And now you can oversee that work, as well. And then the third example is that we are always looking at how do we find ways to have a deeper connection with those who come to our, you know, our websites or come to our engagement tools—with bots and things like that. AI can really accelerate that work, as well. So there’s so many use cases that AI has for population health or public health. 

LEE: Yeah. 

SHAH: But I think the challenge is that we just don’t have enough adoption because they’re … one, we’ve had funding cuts, but two is that there is this real hesitation on, what is it that we can do? And I argue—the last thing I’ll say about this, Peter—is that I argue that AI is happening right now. The discussions, the technology advancements, the work, the policy work, all that’s happening right now. If public health practitioners are not at the table, if they’re not part of the, … 

LEE: Right. 

SHAH: … “What does this look like? How does it work in our field?” … guess what? It’s going to be done to us and for us rather than with us. And if we do not get with that and get to the table, then unfortunately it may not be exactly what we want it to be at the end of the day. 

LEE: I find it really interesting that you are using the terms “public health” and “population health” … 

SHAH: Yeah. 

LEE: … pretty much interchangeably here. And I think that that’s something that I think touches on an assumption that was both implicit and explicit in the book that we wrote, which is: we were making some predictions that our ability to extract insights and knowledge from population health data would be enhanced through the use of AI. And I think that it looks to me like that has been more challenging and has come along more slowly over the past two years. But what is your view? 

SHAH: Yeah, I think part of, and I think you and I have had this conversation, you know, in bits and pieces. I think one of the real challenges is that when even tech companies, and you can name all of them, when they look at what they’re doing in the AI space, they gravitate towards healthcare delivery. 

LEE: Yes. 

SHAH: Right? That’s, it’s … 

LEE: And in fact, it’s not even delivery. I think techies—I did this, too—tend to gravitate specifically to diagnosis. 

SHAH: Yes, that’s right. That’s right. You know, I think that’s a really good point. And, you know, when you look at sepsis or you look at pneumonia or try to figure out ways that, you know, radiologists or x-rays or CT scans can be read, it’s, I mean, there are so many use cases that are within the healthcare sector. And I think that gets back to this inequity that we have when we look at population health or, you know, this broad, um, swath of land that is, oftentimes, left behind or unexplored, and you have healthcare delivery. Now, healthcare delivery we know gets 95 cents or 96 cents of every dollar. So it makes sense why, right. But we also know that, at the end of the day, we’re looking at value-based outcomes, and you cannot be successful in the healthcare delivery system unless we are truly looking at prevention and what’s happening in the community and the population. 

LEE: Right. 

SHAH: And that’s why I use it interchangeably, but I know that “public health” has got a very specific term, and “population health” is a different set of ways of looking at the world. The reason that people try to shy away from pop health in essence is that you could talk about population health as being my population of patients in a clinic. It could be my health systems population. It could be an insurance company saying, these are the lives covered, right. So it becomes, what is population? When we think of public health, we think of the entirety of the population, right. In the State of Washington, eight million people. Harris County, five million people. Or in the US, 300—whatever the number of millions of people that—we think of the entire population. And what is it that actually impacts the health and well-being of that population is really what that’s about. 

Yet here’s the challenge. When we then talk to those of our partners and our colleagues in the tech field, there are two things happening. One is, there’s a motivation because of the amount of dollars that are in [the] healthcare sector. And number two is, because it’s more familiar, right. 

LEE: Right. 

SHAH: And so there are very few practitioners similar to me that are out there, that are in the pop health who kind of know healthcare delivery because they’ve also seen patients, but they’re also—they worked at that federal, state, local level, community level—they’ve, you know, they’ve done you know various different kinds of environments. 

And they say, “Look, I’ve got a perspective to really help a tech company or somebody see the rest of it,” but you have to have both partners coming together to see that. And I think that’s one of the real challenges that we have. 

LEE: Yeah. 

And so now I’m going to want to go into specific problems, … 

SHAH: Yeah. Sure. 

LEE: … and maybe COVID is a good thing to focus on—the breadth of problems that had to get solved in pandemic response and where the gaps between healthcare delivery and public health were really exposed. 

And so the first problem that I remember really keenly that just seemed so vexing was understanding where the PPE was, the personal protective equipment … 

SHAH: Hmm. Yeah. 

LEE: and where it needed to be. 

SHAH: Yes. 

LEE: And so that turned out … you would think just getting masks and gowns and gloves to the right places at the right times or even understanding where they are so that, you know … and being able to predict, you know, what hospitals, what clinics are most likely to get a big influx of patients during the height of the pandemic would be something that would be straightforward to solve, but that turned out to be an extremely difficult problem. 

But how did it look from where you were sitting? Because you were sitting at the helm having to deal with these problems. 

SHAH: Yeah, we were constantly chasing data and information. And oftentimes, you know, because a lot of these data systems in the public health sector have been underinvested in over the decades, then, you know, you had our biggest emergency crisis of our time, and a lot of public health agencies were either getting, you know, thrown a whole host of resources or had to create things on the fly. 

And whether that was at Harris County or in the State of Washington, I will tell you that what I saw was that, you know, a lot of agencies across the country were still using fax machines, you know, to get data that were coming in. 

And I remember actually—it’s kind of a funny story—there was a fax machine that was highlighted down in our agency in Texas. And we actually had this fax machine, had mounds of, you know, data … sorry, papers that were next to … faxes that were coming in and all these things. 

And you would have, you know, Mr. Peter Lee listed as a patient. And then the next, you know, transmission would have Pete Lee. And then the next transmission would have Peter Lee, but instead of L-E-E, it was L-E-A-H or something, or L-I or something, right. And it was just … 

LEE: Right. 

SHAH: … or you had a date of birth missing, or you had, you know, an address that was off. And what we realized is that over time, a lot of the data that were coming in were just incomplete data, and being able to chase that was really hard. 

And so, you know, I think AI has that potential to really organize it, and to stratify it, and to especially get you to a point of at least cleaning it up. So I don’t think it’s just that AI … AI doesn’t just save time; it saves lives. Truly used … 

LEE: Hmm. Yeah. 

SHAH: … that’s, I think, where we’re talking here. 

And so when you have PPE and things of that nature, as you talked about, here in the State of Washington or what we were trying to do to get vaccines out or everything we’re doing to try to get communication messages to the public. And we did a fantastic job of that, although not ideal. 

I mean, there are so many things that I could point to that we could have done better—all of us in the field of public health and healthcare delivery alike. 

I will tell you that the one thing that stays with me is that if we had those tools then, and we had them in place then, and we had invested in them at that time in advance of, I think there was a real opportunity for us to be able to move ahead and even be better at how we affected the health outcomes of the very populations that we were trying to get to. 

And I think it’s [that] AI allows us to shift from reactive to proactive systems, catching health issues before they escalate and allowing us to really communicate with empathy at scale. 

LEE: Right. 

SHAH: And when we can do those things, whether it’s opioids or whether it’s, you know, something that’s happening related to an infectious disease, or, you know, even this, the new agenda with Make America Healthy Again—which by the way, as you know, we had a Be Well, WA (Be Well, Washington) … 

LEE: Right. Yes. 

SHAH: … very much that was about, you know, looking at, you know, physical health and nutritional health and emotional well-being and social connectedness—that there is a real opportunity for us to address the very drivers of ill health. And when we can do that, and AI can help us accelerate that, I think we truly have the ability to drive down costs and increase the value that’s returned to all of us. 

LEE: What is your assessment of public health agencies’ readiness to use technology like AI? Because if there’s one thing AI is good at, it’s predicting things. Are they [public health agencies] in a better position to predict things now? 

SHAH: You know, I think it’s a tale of two cities. 

I think on the one hand, we’re better because we have the tools. On the other hand, we’ve lost the capacity to be able to utilize those tools. So, you know, it’s a plus and a minus. 

Many, many years ago, there was the buzzword of what we called syndromic surveillance. And, Peter, you know this term well. 

It was like you would have, you know, a whole host of accumulation of data points in, let’s say, a hospital setting or an emergency department … 

LEE: Yup. Yup. 

SHAH: … where, you know, you’d have runny nose, you’d have cough, you’d have a fever, and you would take that, what was happening and people presenting to the emergency department, with what was happening in the area pharmacies where people were going to get Kleenexes and tissues … 

LEE: Yep. 

SHAH: … and buying over-the-counter, you know, medication, and things of that nature, Tylenol, etc. 

And you would say … you would put those two things together, and you would come up with a quote-unquote “syndrome,” and you would say our ability to say there was an alert to that syndrome allows us to say something uh-oh is going on in the community, and we got many, many advancements related to wastewater surveillance over the last several years as you know … 

LEE: Yep. Yep. Well, also, wasn’t patient number one in the United States discovered also because of the Seattle Flu Study, or at least that sort of syndromic surveillance? 

SHAH: That’s right. 

LEE: They weren’t even looking for COVID. They were just taking, you know, snot samples from people. 

SHAH: That’s right. That’s right. That’s right. 

And so that’s the kind of thing that, you know, we underappreciate: you have to have a smart, intelligent, agile practitioner, right. 

So if I think about down in Dallas when Ebola was, you know, the gentleman who was, you know, the index case for Ebola was sent out of the emergency department and came back several days later. 

And it was the nurse who picked up this time because the practitioner, the provider, the healthcare provider, the doc missed it. And I wouldn’t want to say in a negative way. It was just, like, not obvious. You aren’t thinking of Ebola in the middle of Texas. And it was the nurse who picked up: there’s something wrong here. 

And what AI has the ability to do is to pick up those symptoms … 

LEE: Yeah. 

SHAH: … or those patterns and be able to recognize the importance of those and be able to then alert the practitioner. So what I … we call it artificial intelligence—it almost becomes artificial wisdom. 

LEE: Hmm. Yeah, interesting. So that actually reminds me of my next question, which is another thing that I watched you and public health officials do is try to play “what if” games. 

So, for example, I think one decision you were involved in had to do with, you know, what would be the impact if we put a ban on large gatherings like concerts or movie theaters or imposed an 8 PM curfew on restaurants, and you were trying to play “what if” games. Like, what would be the impact on the spread of the pandemic there? 

So now, again, today with AI, would that aspect of what you did play out differently than it did during the pandemic? 

SHAH: As you know, COVID was the most studied condition on the planet at one point. And it was, you know, things that usually we would learn over years or months, we were learning in weeks or days or hours. 

And I remember in Houston, I would say something in the morning, and I would always try to give the caveat, “This is the best information we know right now,” because it kept changing, whether it was around masks or whether it was around, you know, the way the virus was operating, whether it was around … 

I remember even … I was just watching something recently where I was asked to comment about whether spiders could transmit COVID-19. You know, just questions that were just evolving, evolving, evolving. And the information was evolving. By morning, you would say something. By evening, it would change. 

And why I say that is that it would have been great in the pandemic if we could have said, if you could give us all the information that’s happening across the globe, synthesize that information, and be able to help us forecast the right decisions that we should be making and help us model that information so we could decide: if you did a curfew, or if you did, you know, a mask, or if you could, you know, change something else related to policy—what are the impacts of it? 

LEE: Yeah. 

SHAH: What we found constantly in public health was that we were weighing decisions in incomplete data, incomplete information. 

So great now that everybody can armchair quarterback looking back three, five years ago and say, “I would have done it this way,” or “I would have done it that way.” Gosh, I would have as well. But guess what—we didn’t have that information at that time. And so you had to make the best decisions you could with incomplete data. 

But what AI has the potential to do is to help complete the incomplete data. Now, it’s not going to get 100%. 

LEE: Right. 

SHAH: And I think, Peter, you know, the one thing we’ve got to be really mindful [of] is phantom information, or information where it sort of makes up things, or may somehow get you incomplete information, or skews it a certain way. 

This is why we can’t take the person out of it yet. 

LEE: Right. 

SHAH: Now, maybe one day we can. 

I’m not one of those Pollyanna-ish that people will never be replaced. I actually believe that those people who are skilled with AI and the tools will eventually have a competitive advantage over those who are not. Just like if I had a physician who knows how to use their smartphone or knows how to use a word processor or knows how to do a PowerPoint presentation is going to replace the ones that use scantrons … 

LEE: Yeah. Yeah. 

SHAH: … or the ones that write it on pieces of paper—that eventually it makes it more efficient and effective, but we’re not there yet. But I think that the potential is absolutely there. 

LEE: So I have one more question. And you can, kind of, tell I’m trying to expand people’s understanding of just the incredible breadth of what goes on in public health, you know, all of these sorts of different issues. 

And again, just sticking to COVID, but this is a much broader issue. Another thing you had to cope with was a significant rise of misinformation … 

SHAH: Yes. 

LEE: … and maybe going along with that, very, very significant inequities in outcomes in the COVID response. And when you think about AI there, I think you can argue it both ways, that it both exacerbates the problem but also gives you new tools to mitigate the problems. 

What is your view? 

SHAH: I think you … I don’t even have to say it … I think you hit on it, is that, you know, it really is two sides of one coin. 

On the one hand, it has the power of really advancing and allowing us to move forward in a way that incredibly accelerates and accentuates, but on the other hand, in the case of inequities, right? So if you have inequitable information data that’s already out in the literature or already out in the, you know, media, or what have you, about a certain population or people or certain kinds of ideas or thoughts, etc., then AI will tend to accumulate that. You’re going to take that information, thinking that’s the best out there, but it may have missed out on information and now you go with it. And that’s a potential problem. 

And I think it’s the same thing on information is that when we have people that are able to classify or misclassify information, I think it really becomes hard because it can accelerate the inequities of trust or inequities of trusted sources of information. It can also close the gap. 

So I think, you know, it’s really up to us and this responsible AI to really think about how we can go about doing this in a way that’s going to allow us to further the advancements but also be careful of those, you know, those kind of places where we’re going to step into that are not going to be well received or successful. 

You know, the one thing that’s really fascinating about this whole conversation is that this is why we’ve got to be at the table, Peter. 

LEE: Yep. Yep. 

SHAH: Because if we’re not at the table, you know, what’s the, you know, or if tech companies that are out there doing this work and aren’t even seeing a field of practitioners that are actually wrestling with the same problems but just cannot actually get to the solutions, we’re just going to continue to accentuate the problems. 

And that’s why I’m a firm proponent of: we’ve got to be at the table. 

And so even when we’ve seen in, and this is going to be a little controversial, but governmental spaces where, you know, policymakers have said, “Look, we are not going to let you do certain things,” or they say to public health practitioners or even healthcare delivery practitioners in certain spaces, “You cannot even play with this. You cannot have it on your phones. You can’t do any … ” 

You know, what I really believe it does is that it takes [an] almost like we put our head in the sand type of approach rather than saying, “What is it that we can do to help improve AI and make it work for all of us?” What we’re doing is we’re essentially saying, “We’re going to let the tech companies and all the other developers come up with the solutions, but it’s not going to be informed by the people in the field.” And that’s dangerous. We have to do both. We have to be working together. 

LEE: Umair, that’s really so well said, and I think a great way to wrap things up. I’ve certainly learned a lot from this conversation. So thank you again. 

SHAH: It’s been a pleasure to be with you this morning. Thank you so much for the time. And I’m looking forward to further conversations. 

[TRANSITION MUSIC] 

LEE: I live in the State of Washington and because of that, I’ve been able to watch Umair in action as our state’s former secretary of health. And some of that action was pretty intense to say the least because his tenure as secretary of health spanned the period of the COVID pandemic. 

Now, as a dyed-in-the-wool techie, I have to admit that at the beginning, I don’t think I really understood the scope and importance of the field of public health. But as the conversation with Umair showed, it’s really important and it is arguably both an underfunded and underappreciated part of our healthcare system. 

Now, public health is also very much an area that’s ripe for advancement and transformation through AI. As Umair explained in our discussion, the core of public health is the idea of population health, the idea of extracting new health insights from signals from population-scale data. And already we’re starting to see AI making a difference. 

Now here’s my interview with Dr. Gianrico Farrugia. 

LEE: Gianrico, it’s really great to have you here today. 

GIANRICO FARRUGIA: Peter, thanks for having me. Thanks for making me part of your podcast. 

LEE: You know, what I’d like to do in these conversations is, you know, we’ll definitely want to talk about the overall healthcare system, the state of healthcare, and what AI could or might do to help or even hurt all of that. But I always like to start with a sharper focus just on you specifically. And my first question always is, you know, I think people imagine what a hospital or a health system president and CEO does, but not really. And so how would you explain to your mother what you do every day? 

FARRUGIA: So, Peter, my mother’s 88 years old. She lives in Malta, and she’s visiting at the moment, … 

LEE: Oh, wow. 

FARRUGIA: … which is kind of nice, really. 

LEE: Wow, that is amazing. 

FARRUGIA: I’m proud that she’s still proud of me. So she does ask. I’ll tell her the scope of Mayo Clinic. We serve patients across the globe. We have about 83,000 staff members that work with us, and we’re very proud of the work we do in research, education, and the practice. 

Mayo Clinic is built to serve people with serious disease. So what I tell my mother is that here we are. We’re a healthcare organization that knows what it needs to do: keep patients as the North Star. The needs of the patient come first. We have 83,000 people who want to do that, several thousand physicians and scientists. My job is to look slightly ahead and then share what I’m seeing and then, sort of, smooth the way for others to make sure Mayo remains true to its mission but also true to the fact that at the moment, we are in a category of one. We need to remain there not just from an ego standpoint, but really from a “do good to the world” standpoint. 

At that point, invariably my mother will tell me that I’m working too hard. [LAUGHTER] And then of course, I change the subject, and I ask her what she cooked today because my mother, who’s 88, cooks for the whole family in Malta, and there are usually four generations eating around the table. So I tell her what she does for the family is what I do for the Mayo family. 

LEE: Wow, that’s a great way to put it. And it sounds like you actually have a good chance to have some good genes if she’s still that active at age 88. 

FARRUGIA: I think I chose a little more stressful job that may limit [that]. I will tell you very briefly that one of the AI algorithms we have estimates biological age from an electrocardiogram. My biological age jumped by 3.7 years when I became CEO. 

LEE: [LAUGHS] Oh no. 

FARRUGIA: I’m hoping it will reverse on the other side. 

LEE: To stick with you just for one more moment here, second question I ask is about your origin story with respect to AI. And typically, for most people, there is AI before ChatGPT and generative AI and then after the generative AI revolution. So can you share a little bit about this? Because it must be the case that you’ve been thinking about this a long time since you’ve really led Mayo Clinic to be so tech forward in this way. 

FARRUGIA: Well, I’ve been, as you said, a physician for way too long. I got my MD degree in ’87. So that sort of dates me. But it also means that I saw a lot of the promise for AI that never seemed to pan out for decades and decades and decades like you did. 

LEE: Yeah. 

FARRUGIA: Around 10 years ago, Mayo could sense that there was something different, that something was changing, that we actually—at that time, predictive AI—could make a big difference. And I think that’s the moment where I and others jumped in and said Mayo Clinic needs to be involved. 

And then about six years ago, when—six and a half years ago—I became CEO, it was clear that there was the right confluence of data, knowledge, tech expertise, that we could deal with what was increasingly bothering me, which is that we knew what was coming from a technology standpoint and we knew the current healthcare system could not deliver on what patients need and want within that current system. And so the answer is, how could a place like Mayo Clinic with our reputation not jump in and say there has to be a better way of doing things? I’ve always said that it is impossible for me to understand that every single government employee is incompetent. Every physician is greedy. Something’s wrong here. 

LEE: Yeah. 

FARRUGIA: And that wrong was the architecture was wrong. And we knew that we could incorporate AI and make it better. So for me, that journey was one of wait, wait, wait. 10 years ago, begin to jump in. Six years ago, really jump in with our platform. And then, of course, in November 2022, things changed again. 

LEE: Yeah. When did this idea of a data platform, what you now call the Mayo Clinic Platform (opens in new tab)—by the way, I refer to this as MCP, … 

FARRUGIA: Yeah, I know. [LAUGHS] 

LEE: [LAUGHS] … which I always smirk a little bit because, of course, for those of us in computer science research, the AI research, MCP has also become quite a hot topic because of the model context protocol version of this. But for Mayo’s MCP, when did that become a serious, defined initiative? 

FARRUGIA: So around the end of ’18, 2018, beginning of 2019. At that point, we knew that we were going to do something differently. We came up with a strategic plan, as I took on the job, that we needed to cure more patients. There’s just not enough cures in the world. There’s too much suffering. And that we had all these chronic diseases that people have accepted are chronic, but really the only reason that disease is chronic is you haven’t cured it. 

And physicians have been afraid to talk about cure because, of course, eventually everybody passes away. 

LEE: Yeah. 

FARRUGIA: But I really pushed hard to say, no, it’s OK to talk about cure. It’s OK to aspire to cure. The second was connect—connecting people with data to create new knowledge. And that’s where it became clear that data were not currently in a format that were particularly useful. By the way, you’ll hear me talk about data in the singular and the plural. I’m old school. I talk about data as plural, but I know that most younger people now use data singular. [LAUGHTER] And I apologize if I’ll go through that. 

And then the third was transform. Let’s use Mayo’s resources to transform healthcare for ourselves and for others. And that’s the concept of, if we are able to use data in a different way, let’s create a different architecture. And that architecture had to be very closely linked to using artificial intelligence in order to create better outcomes for patients. So patients can live not only longer lives but healthier lives. And that’s the genesis of MCP, Mayo Clinic Platform, so I’ll timestamp that as end of 2018, beginning of 2019. 

LEE: So I’m really wanting to delve in in this episode, in this conversation, you know, [into the] mindset of a health system or hospital CEO. And so you’re obviously thinking about, I guess, machine learning and predictive analytics and so on. What were the, kind of, like … in 2018, what were the outcomes that you were dreaming about from this? So if you had this thing, you know, what were the things that you were hoping to be able to show or, kind of, produce as results? 

FARRUGIA: So first of all, I think all of us who work at Mayo Clinic, and this tends to be a bit sugary, but it’s true, strongly feel that we have a responsibility to leave the place better than when we started. And so the Mayo brothers, when they started, did two really important things. The first was that they created the first integrated healthcare system. And the second, they created the first unified record. And that record was, of course, paper at that point. 

Part of that is to say, OK, what does it look like now versus how can we improve what we have if … it’d be blasphemy to say, let’s think of ourselves as the Mayo brothers, but let’s think of ourselves as reasonably smart people at Mayo Clinic, really lucky to be surrounded by very smart people with resources. What will we do? And so we said let’s not aim for the low-hanging fruit. Let’s aim to get at whatever you want to call it, the intractable knot, the hardest problem, and that is clinical care. Let’s improve clinical care. Yes, we can deal with burnout. Yes, we can deal with administrative burden. But let’s not focus on that. Let’s really create an architecture that allows us to tackle better clinical outcomes. 

And by starting there, then everything flows from that. That it’s not really worth doing unless at the end of the day, people are experiencing better health. 

LEE: And so I know a very good colleague and friend of mine, John Halamka (opens in new tab), you ended up hiring. I thought he was a very interesting choice because he is, of course, in terms of technology, quite deep and very expert, but he’s, I think, first and foremost, a doctor. And so I assume you must have had to decide what type of person you would bring in and what kinds of people you would bring in to try to create such a thing. What was your thinking around the choice of someone like John? 

FARRUGIA: It was one of the harder decisions. First of all, [I’m] a physician myself. We tend to want to maintain some control. And so now I am the CEO, [LAUGHTER] and I have to give this baby to somebody else. That’s very hard. Second is Mayo Clinic is really good because it is flat, and we run a lot by committee. But it also means that, therefore, you have to work really hard at change, and you cannot change by fiat. You have to change by convincing people. 

So I just … I’ve always made the point that the right change agent is a servant leader because that’s how change becomes embedded. But it also means you’ve got to have that personality, the Mayo personality. And it became clear when we interviewed [that] there were some people that were really hardcore tech; others that were passionate about social issues. But John really fit that of being, as you said, deep in IT but also himself very aligned with the Mayo Clinic values. It’s as if he was a Mayo Clinic physician even though he wasn’t. 

And that came together, and I felt, we felt, that as we were hiring, that we could do it. And then we did something interesting. We paired John with a … we created the role of a chief medical officer for the platform, which was a longstanding Mayo Clinic physician. And so we brought them together so we could get the past and the present and the future working together. 

LEE: So I’m going to ask you about what has come out of this. But before that, let’s get back to this origin story. So now, all of that is being set up starting around 2018. But then, you know, in 2022, there is generative AI. Now you were already experimenting with transformers, starting with BERT out of Google there. So maybe that’s a couple of years earlier. But still, there has to come a point where things are feeling very disrupted. 

FARRUGIA: Yeah, so, you know, it really wasn’t. It, to me, was a relief because it gave this … we were feeling pretty good about what we’re doing. We were feeling a little impatient, but, in true Mayo fashion, were willing to, sort of, do everything, take its time, take it to the right committees, get the right approvals, and get it done. 

And so when generative AI came, for us, it’s like, I wouldn’t say we told you so, but it’s like, ah, there you go. Here’s another tool. This is what we’ve been talking about. Now we can do it even better. Now we can move even faster. Now we can do more for our patients. It truly never was disruptive. It truly immediately became enabling, which is strange, right, … 

LEE: Yeah. 

FARRUGIA: … because something as disruptive as that instantly became enabling at Mayo Clinic. And I’ll take … as I think about it with you and take a moment to think and reflect on it, I think there were a couple of decisions we made earlier on that really helped us. We made the decision against the advice of any consulting firm to completely decentralize AI at Mayo Clinic six years ago. And we told our clinical department, you need to own this. You need to hire basic scientists in AI. We’ll help you by creating the infrastructure. We’ll help you by doing all the rest. We’ll have the compute. We’ll have the partners. You need to do this on your own. You need to treat this the same way as if a new radiological technique happened or a new surgical technique happened. 

And so there was a lot of expertise already present in a very diffused way that we were then able to layer generative AI onto. And we found a very [real] willingness to embrace it. In fact, I would argue initially a bit too willing because, as you know, we haven’t quite figured out what’s legitimate use and what’s not. We all learned together.

LEE: Right. Yep. Yep. 

FARRUGIA: But it was mostly energy, which is really interesting. It was mostly energy.

LEE: Wow. And, you know, it’s an amazing thing to hear because one common theme that we hear is that the initial reaction is oftentimes one of skepticism. In fact, I’ve been very open that even I initially had some skepticism. Was that not present in your mind or on your team’s mind at all at the beginning? 

FARRUGIA: So you’re asking a physician if they are skeptical about something. [LAUGHTER] Yeah. I wonder what the answer to that is. Absolutely. The first hallucination, the first wrong reference. Can you imagine if you write the grant and the wrong reference comes? As you know, … 

LEE: Right. 

FARRUGIA: … earlier on when some references were being made up. So massive amounts of skepticism. But the energy was such there that the people [who] were skeptical were also at the same time saying, “Let’s do a RAG [retrieval augmented generation] to clean up those references. Let’s create …” We were experimenting with discharge summaries, but let’s use AI to police AI, and let’s see what’s going on. So there was more massive skepticism, but the energy was pushing that skepticism into a positive versus into a negative frame. Now, I say that summarizing in hindsight. 

LEE: Yeah. 

FARRUGIA: Day to day, much more complicated than that. But overall, if you just … and remember, I had been at the World Economic Forum many years ago and had said, healthcare needs to run towards AI. 

LEE: Yes. 

FARRUGIA: If healthcare was perfect, we would wait. Healthcare is not perfect by any means, therefore let’s run and embrace AI. And, sort of, that mentality was part of who we were because at the same time, we were also saying the other thing, that we need to be the ones to lead validation. We need to be the ones that set the rules. We need to be participating in the creation of CHAI [Coalition for Health AI] (opens in new tab). We need to be participating as the [National] Academy of Medicine (opens in new tab). 

So people did feel that Mayo was being fairly responsible about it, but that urge to, the needs of the patient come first, was the driver that kept people wanting to say, “Not ready yet, but let’s make it ready.” And we now have 320 algorithms in the practice, and they run and we constantly are looking and seeing what else we can do to improve. But as you well know, things evolve and change. And we’re also looking and seeing which ones work and which ones don’t and which ones we have to work together on to make better. 

LEE: Yeah, you know, of course Mayo has such a, you know, such a reputation and is so influential, but in the world of healthcare broadly, let’s just focus on the United States to start. How common is this experience? You know, so if you are at a meeting with fellow CEOs of hospitals and health systems, what is the attitude and what is the, kind of … how common is the approach to all of this? 

FARRUGIA: I think it’s more common now, but going back a few years, I think it’s fair to say that it was scary for people to know how it’s going to change things. Healthcare runs on very narrow margins. It’s very expensive. So your expenses and your revenue are both massive, and they are very close to each other. So anything that changes that balance is really scary. 

Because it’s not like you have the opportunity to erode into a margin or get it right the second time. So I think that is what drove a lot of the initial hesitancy. Was, one, is lack of knowledge and, two, understanding that you didn’t have a lot of room to make a mistake. 

LEE: On the economics of this, when you are embarking on what I suspect is a very expensive initiative like Mayo Clinic Platform, how on earth do you justify that early on? 

FARRUGIA: So again, I’m trying hard to try and remember how things were versus how I think about them now. [LAUGHTER] It goes back to our history. Mayo has always invested in what it thinks is the right thing that is coming. And that’s how we’ve stayed where we are. So the investment really was having an open discussion: is this worth it for our patients? And once that discussion was over, then the board was saying, go, go, go. 

Now we are lucky in that we have the size that we’re able to hire and absorb. We’re lucky in that the people [who] came before us have been financially astute, and one of our values is stewardship. And we’re lucky that we had a lot of patients at Mayo Clinic who were able to listen, be inspired by, and be willing to help support. And so that gave us the ability to build what we’re doing not only into the long-range plan but actually into the yearly plan. And so we built it into the yearly plan. We set up a center for digital health. We set up the platform. And then we set up the budgets to be able to do that. And the budgets came from assets we’ve had, assets that we would get as the year came by, and then from philanthropy. 

We also had a really powerful calling card. And that’s one advantage I had, and that’s … and I’d been very open when I was speaking to other CEOs that would use it is that right at that very beginning, really, really in 2019, our cardiologists, both the researchers and the clinicians, had come together and had used electrocardiograms to create an AI algorithm. 

The first one was for diagnosing from an electrocardiogram, which is very cheap, very easy to do, left ventricular dysfunction. That’s how hard the left part of the heart contracts. If it doesn’t do well, you get heart failure. And they were able to show that that algorithm was already making nurses better than the physician without the algorithm. And after that went on to show that you could do it from a single strip, really with an area under the curve for that single strip on a watch, that was as good as mammograms or pap smears. And so we already had that proof. 

LEE: Yeah. 

FARRUGIA: That quickly then came into Mayo. We put it into it so that any patient now can benefit from it. And now there are, I think, 14 algorithms just from that same one. 

LEE: Yeah. 

FARRUGIA: So we had a proof of concept thanks to those really far-seeing cardiologists that enabled things to happen a little faster and also, as I talked to other CEOs, enabled me to say, “This actually works. This is the path forward.” I have recently been vocal about also saying, we are at a point now where I believe that for some medical conditions, it is not right to not use AI to help treat them. 

LEE: Wow, that’s so interesting. So I think I want to get into another topic here, which is when you think about the use of AI and data, what are some of the results that maybe are top of mind for you or you think are particularly important? And if you don’t mind, I’d like to see if we can think about this not only in terms of results in terms of patient outcomes but in your other activities, core activities, like research, in the education mission, and then even in the broader impacts on the healthcare system. But maybe we start with on patient outcomes. 

FARRUGIA: Yeah, they’re all linked, right. 

LEE: Yes. 

FARRUGIA: They’re part of the same ecosystem. We think of ourselves as three shields—research, education, and the practice—and that one goes into the other. So, as I said, we have about 320 AI algorithms from the practice. Some run on every patient; some run on some patients. And we have good evidence for what they do. So some specific examples, and then I’ll get into the transformer part of this. 

We have a program called CEDAR [Clinical Detection and Response (tool)], and like most other people, I like acronyms for things. [LAUGHTER] But what it is, in our hospitals with patient consent, we monitor vitals. We monitor in the patient room—not in the ICU [intensive care unit], in the patient room. We monitor all sorts of things. But there’s a camera in the room, and we have a team of intensivists—nurses and physicians—who do not have any patient responsibilities but are just monitoring the algorithms, and when the algorithms are predicting decompensation, they’re able to get into the room. And what we’ve shown, for example, with that algorithm, is we’ve shown we’ve decreased length of stay in the hospital, decreased transfers into the intensive care units, and interestingly, decreased mortality and morbidity, which is not easy to show. I talked about the electrocardiogram as a good example. Of course, everybody knows about the radiology things. 

We’ve created … taken part of this and said, if we can do this in the hospital, why cannot we do it in patients’ homes? So being very active in looking after patients that would come to the ED, emergency room, would normally be admitted, and we say, no, here are the things we can give you. Go home if you want to, and we will safely look after you at your home. And we recently have been, looking at the last two years of data, been able to show that we’re also successfully able to give intravenous chemotherapy in patients’ homes because we can monitor; we can do all the things that we can do. 

Now, with generative AI, that gave us many other opportunities. One biggest opportunity for me has always been digital pathology. When we see how pathology’s currently run with a glass slide, not much has changed … 

LEE: Yeah. 

FARRUGIA: … in many, many, many years, right. 

LEE: Yeah. 

FARRUGIA: And so really we have made a massive push to digitize pathology not just for us but for others. But talking about ourselves, we started by saying, it has to be very cheap to digitize. So we worked and created a company with partners called Pramana (opens in new tab) that allows us to digitize slides relatively cheaply using AI algorithms that can take away the dirt, the fingerprint. And so we end up with 21 million of our slides digitized, and that gives you now a massive opportunity. Worked with another company called Aignostics (opens in new tab) to create a, what we call, Atlas (opens in new tab), which is an LLM that allows us to then build upon it. 

LEE: Yeah. 

FARRUGIA: And we, a hundred and, I think, 120 years ago, invented frozen sections at Mayo Clinic. So what that is, is that while the patient’s still on the table, you can take a piece of tissue, look at it, and tell the surgeon the margins of what you’re trying to resect are clear or not. But as a result of that, because you have to hurry, you get no information as a surgeon about, is it an invasive cancer, is it noninvasive cancer, or other things. So we’ve just found a way to digitize our frozen section practice and will completely go across the enterprise with AI-enabled digitized frozen sections, which then enables us to then do it for anybody across the globe if we need to. 

And then in the genomic space, we’re working to create a true exomic transformer that is short range. And we originally started doing it to see if we can test it against the fact that 40% of people with rheumatoid arthritis don’t respond to the first-line therapy, … 

LEE: Right. 

FARRUGIA: … but you have to wait six months to find out. And we found that we can actually do that. But it has much greater uses, of course. 

And then we’re working with you—I don’t know how much you want to get into this, Peter, or [if] you want to talk about it yourself—MAIRA-2, which is really exciting, about how taking a simple problem—can you create a transformer that is able to detect if lines on the chest are in the right place, breathing tube is in the right place?—and then do it in a way that then can be used for many, many other things. 

And then, Peter, because you asked about education and research, … 

LEE: Yeah. 

FARRUGIA: … imagine what this does now to the education system, right. And so we’ve got to train our physicians differently. We now have an AI curriculum for all our medical students. We offer masters and PhDs in AI. We think it’s essential for the people who want to be able to truly become experts, the same way I became an expert in my area of research. 

And then from a research standpoint, when you think about all the registries that exist in people’s labs, all the spatial genomics, all the epigenomics, all the omics that exist. And if you are able to coalesce them into one big, what we call, an atlas, how that could really spur research at a scale that we haven’t thought of before. And so that is our aim at the moment. 

From a research standpoint, what we are doing, with Vijay Shah, who’s our dean of research, is to say, let’s make the effort of making sure all the data are available to use and enable us to take advantage of AI. And that is not easy because, of course, people have collected the data. They tend to want to embrace it. 

LEE: Yep. 

FARRUGIA: So there have to be the right incentives, the right privacy, and the right ways of doing it. And we think we’re on the way there, and we’re already seeing some advantages from doing it this way. 

LEE: So we’re running short on time. And so I always like to end with one or two more provocative questions. And, you know, it’s tempting to ask you the provocative question of whether you think AI will ever replace human doctors, but I don’t want to go there with you. In fact, as I thought about our discussion, I was reflecting. We were at a conference together once, and I was on stage in a fireside chat. And then, you know, after the fireside chat, there were audience questions, and I don’t remember any of the questions from the audience except yours

And just to remind you, you know, I think when I was on stage, we were talking about a lot of practical uses of AI to, let’s say, reduce administrative burdens and so on in healthcare. But you got up and you, I won’t say you scolded me, but you more or less said, is it the right idea to use AI to optimize today’s somewhat broken healthcare system, or should we be thinking more boldly about, you know, a more fundamental transformation? 

And so what I thought I would try to close with here is to hear what was really behind that question. You know, what were you trying to get me to think about when you asked that question? 

FARRUGIA: So first of all, darn your great memory [LAUGHTER]. Belated apologies … I probably should have … 

LEE: It was by far the best and most sophisticated and, I think, thought-provoking question of all of the ones that came out of the audience. 

FARRUGIA: What I was trying to get to is actually trying to clarify it in my own head and then in the head of others is that we do not need to have a linear path to get to where we want to get to. And we seemed to be on a linear path, which is, let’s try and reduce administrative burden. Let’s try and truly be a companion to a physician or other provider. Let’s make their problems better, make them feel better about providing healthcare. And then in the next step, we keep going until we get to, now we can call it agentic AI, whatever we want to talk about. And my view was, no, is that let’s start with that aim, the last aim, and do the others because the others will come automatically if you’re working on that harder problem. 

Because one, to get to that harder problem, you’ll find all the other solutions. I was just trying to push that here’s this wonderful tool that’s been given to us. Let’s take advantage of it as quickly as we can. I think we had gotten a little too sensitized to need to say the right things. “Careful, be very careful” versus saying, “Massive opportunity. Do it right, and healthcare will be much better. Go for it.” 

LEE: Well, I think I understand better now where the vision, insight, and, frankly, courage to take on something as ambitious and transformational as the Mayo Clinic Platform, and really all of your leadership in your tenure as the president and CEO of Mayo Clinic, [come from]. I think I understand it much better now. 

Gianrico, it’s just always such a privilege to interact with you and now to have a chance to work with you more closely. So thank you for everything that you do and thank you for joining us today. 

FARRUGIA: Thank you for making it so easy, and thanks for giving us this opportunity to do good for the world. 

[TRANSITION MUSIC] 

LEE: Gianrico leads what is arguably the crown jewel of the world’s healthcare systems, and so I feel it’s such a privilege to be able to talk and sometimes even brainstorm with him. 

Our conversation, I think, exposed just how tech forward Gianrico is as he charts the strategies for healthcare delivery well into the future. And as I’ve interacted with many others, what I’ve learned is that this is a common trait among major health system CEOs. Roughly speaking, like we’ve seen in previous episodes where doctors and med students are polymath clinician-technologists, the same thing is true of health system CEOs and other leaders. 

AI in the mind of a health system CEO today is not only a technology that can transform diagnosis and treatment, but it’s also something that can have a huge impact on the business of healthcare delivery, the connection of healthcare to medical research, and the journeys that patients go through as they seek better health. 

These two conversations show that virtually all leaders in health and medicine are confronting head-on the opportunities, challenges, and the reality of AI, and they see a future that is potentially very different than what we have today. 

[THEME MUSIC] 

I’d like to thank Umair and Gianrico again for their time and insights. And to our listeners, thank you for joining us. We hope you’ll tune in to our final episode of the series. My coauthors, Carey and Zak, will be back to examine the takeaways from our most recent conversations. 

Until next time. 

[MUSIC FADES] 

The post Reimagining healthcare delivery and public health with AI appeared first on Microsoft Research.

Self-adaptive reasoning for science

Self-adaptive reasoning for science

A gradient background transitioning from blue to pink with three white icons: a DNA double helix, a light bulb with rays, and a stylized path with arrows and nodes.

Unlocking self-adaptive cognitive behavior that is more controllable and explainable than reasoning models in challenging scientific domains

Long-running LLM agents equipped with strong reasoning, planning, and execution skills have the potential to transform scientific discovery with high-impact advancements, such as developing new materials or pharmaceuticals. As these agents become more autonomous, ensuring effective human oversight and clear accountability becomes increasingly important, presenting challenges that must be addressed to unlock their full transformative power. Today’s approaches to long-term reasoning are established during the post-training phase, prior to end-user deployment and typically by the model provider. As a result, the expected actions of these agents are pre-baked by the model developer, offering little to no control from the end user.

At Microsoft, we are pioneering a vision for a continually steerable virtual scientist. In line with this vision, we created the ability to have a non-reasoning model develop thought patterns that allow for control and customizability by scientists. Our approach, a cognitive loop via in-situ optimization (CLIO), does not rely on reinforcement learning post-training to develop reasoning patterns yet still yields equivalent performance as demonstrated through our evaluation on Humanity’s Last Exam (HLE). Notably, we increased OpenAI GPT-4.1’s base model accuracy on text-only biology and medicine from 8.55% to 22.37%, an absolute increase of 13.82% (161.64% relative), surpassing o3 (high). This demonstrates that an optimization-based, self-adaptive AI system developed without further post-training can rival post-trained models in domains where adaptability, explainability, and control matter most.

Bar chart that represents the Head-to-head comparison of OpenAI’s GPT-4.1 with CLIO, o3, and GPT-4.1 with no tools on HLE biology and medicine questions
Figure 1. Head-to-head comparison of OpenAI’s GPT-4.1 with CLIO, o3, and GPT-4.1 with no tools on HLE biology and medicine questions

In-situ optimization with internal self-reflection to enable self-adaptive reasoning

Model development has advanced from reinforcement learning from human feedback (RLHF) for answer alignment to external grading with reinforcement learning from verifiable rewards (RLVR). Recent approaches show promise in using intrinsic rewards to train reasoning models (RLIR). Traditionally, these reasoning processes are learned during the post-training process, before any user interaction. While today’s reasoning models require additional data in the training phase and limit user control during the reasoning generation process, CLIO’s approach enables users to steer reasoning from scratch without additional data. Instead, CLIO generates the data it needs by creating reflection loops at runtime. These reflection loops are used for a wide array of activities that CLIO self-defines, encompassing idea exploration, memory management, and behavior control. Most interesting is CLIO’s ability to leverage prior inferences to adjust future behaviors, handling uncertainties and raising flags for correction when necessary. Through this open architecture approach to reasoning, we remove the need for further model post-training to achieve the desired reasoning behavior. Novel scientific discovery often has no established reasoning patterns to follow, much less a large enough corpus of high-quality data to train on. 
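
To make the runtime reflection idea concrete, here is a minimal sketch of what an in-situ optimization loop could look like in Python. The function `llm_complete`, the notes list, and the confidence-based stopping rule are illustrative assumptions, not CLIO’s actual interfaces.

```python
# Minimal, hypothetical sketch of a runtime reflection loop (not CLIO's real API).

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to any chat/completion model.
    Returns a canned reply so the sketch runs end to end without a model backend."""
    return "Stub answer. CONFIDENCE: 0.9"

def parse_confidence(text: str) -> float:
    """Extract a trailing 'CONFIDENCE: <0-1>' value from a reflection; default to 0.0."""
    marker = "CONFIDENCE:"
    if marker not in text:
        return 0.0
    try:
        return float(text.rsplit(marker, 1)[-1].strip())
    except ValueError:
        return 0.0

def cognitive_loop(question: str, max_rounds: int = 5, stop_at: float = 0.8) -> dict:
    memory: list[str] = []            # self-generated notes carried across rounds
    answer, confidence = "", 0.0
    for _ in range(max_rounds):
        # 1. Propose or revise an answer, conditioned on accumulated reflections.
        answer = llm_complete(
            f"Question: {question}\nNotes so far:\n" + "\n".join(memory)
            + "\nPropose or revise an answer, reasoning step by step."
        )
        # 2. Reflect on the proposal: critique it and state remaining uncertainty.
        reflection = llm_complete(
            f"Question: {question}\nCandidate answer: {answer}\n"
            "Critique this answer, list open uncertainties, and end with "
            "'CONFIDENCE: <0-1>'."
        )
        memory.append(reflection)      # the loop creates its own data at runtime
        confidence = parse_confidence(reflection)
        # 3. Stop once the self-assessment is strong enough.
        if confidence >= stop_at:
            break
    return {"answer": answer, "reflections": memory, "confidence": confidence}

print(cognitive_loop("Which pathway is most likely implicated here?"))
```

A real system would swap in an actual model call and a richer memory structure, but the loop shape, propose, reflect, update, stop, is the core of the idea.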

CLIO reasons by continuously reflecting on progress, generating hypotheses, and evaluating multiple discovery strategies. For the HLE test, CLIO was specifically steered to follow the scientific method as a guiding framework. Our research shows that equipping language models with self-adapting reasoning enhances their problem-solving ability. It provides a net benefit in quality for science questions, as well as providing exposure and control to the end user.

Figure 2. CLIO can raise key areas of uncertainty within its self-formulated reasoning process, balancing multiple different viewpoints using graph structures.

Control over uncertainty: Building trust in AI 

Orchestrated reasoning systems like CLIO are valuable for scientific discovery because they provide features beyond accuracy alone. Capabilities such as explaining the outcomes of internal reasoning are standard practice in science and are present in current reasoning model approaches. However, showing complete work, including final outcomes, internal thought processes, and uncertainty thresholds that support reproducibility and correction, and clearly indicating uncertainty are not yet universally implemented. Current models and systems lack this innate humility. Instead, we are left with models that produce confident results, whether correct or incorrect. When correct, that confidence is valuable; when incorrect, it is dangerous to the scientific process. Understanding a model or system’s uncertainty is therefore a crucial capability that we have built natively into CLIO.

On the other end of the spectrum, orchestrated reasoning systems tend to oversaturate the user by raising too many flags. CLIO therefore exposes prompt-free control knobs for setting the thresholds at which uncertainty flags are raised. This allows CLIO to flag uncertainty, for itself and for the end user, at the proper point in time. It also enables scientists to revisit CLIO’s reasoning path with critiques, edit its beliefs during the reasoning process, and re-execute from the desired point in time. Ultimately, this builds the foundational level of trust scientists need to use such systems in a scientifically defensible and rigorous way. 
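
To illustrate what a prompt-free control knob for uncertainty could look like, here is a small sketch. The class, field names, and threshold values are hypothetical and are not CLIO’s real configuration surface.

```python
from dataclasses import dataclass

@dataclass
class UncertaintyKnobs:
    """Hypothetical, prompt-free controls for when to surface uncertainty flags."""
    flag_threshold: float = 0.6      # raise a flag when uncertainty exceeds this
    pause_threshold: float = 0.85    # pause and hand control back to the scientist above this
    max_flags_per_run: int = 3       # avoid oversaturating the user with flags

def review_step(uncertainty: float, knobs: UncertaintyKnobs, flags_raised: int) -> str:
    """Decide, without any prompt engineering, how to handle one reasoning step."""
    if uncertainty >= knobs.pause_threshold:
        return "pause_for_review"        # scientist critiques or edits beliefs, then re-executes
    if uncertainty >= knobs.flag_threshold and flags_raised < knobs.max_flags_per_run:
        return "flag_and_continue"       # record the concern, keep reasoning
    return "continue"

# Example: a more cautious configuration for a high-stakes question.
knobs = UncertaintyKnobs(flag_threshold=0.4, pause_threshold=0.7)
print(review_step(uncertainty=0.55, knobs=knobs, flags_raised=0))  # -> flag_and_continue
```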

How does CLIO perform? 

We evaluate CLIO against text-based biology and medicine questions from HLE. For this domain, we demonstrate a 61.98% relative increase, or an 8.56% net increase, in accuracy over OpenAI’s o3, and we substantially outperform base completion models like OpenAI’s GPT-4.1, while enabling the requisite explainability and control. This technique applies to all models, showing similar increases for OpenAI’s GPT-4o, which we observe performs poorly on HLE-level questions; on its own, GPT-4.1 is likewise not competent at HLE-scale questions. Extending CLIO’s cognition pattern with an ensembled approach, in the spirit of GraphRAG, provides a further 7.90% improvement over a non-ensembled approach.  
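
As a back-of-the-envelope consistency check on these figures, using only the numbers reported above: the relative gain is the net (absolute) gain divided by the baseline’s accuracy.

```python
# Relating the reported absolute and relative gains (numbers taken from this post).
gpt41_base, gpt41_clio = 8.55, 22.37        # HLE biology/medicine accuracy (%)
net_gain = gpt41_clio - gpt41_base          # 13.82 percentage points
rel_gain = net_gain / gpt41_base            # ~1.6164, i.e. ~161.64% relative gain

net_over_o3, rel_over_o3 = 8.56, 0.6198     # gains over o3 reported above
implied_o3 = net_over_o3 / rel_over_o3      # implies o3 scored roughly 13.8% on this subset
print(round(rel_gain, 4), round(implied_o3, 2))
```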

Waterfall chart that demonstrates the impact of thinking effort on CLIO’s effectiveness.
Figure 3. The impact of thinking effort on CLIO’s effectiveness.

Furthermore, CLIO’s design offers different knobs of control, for example, how much time to think and which technique to use for a given problem. Figure 3 shows these knobs of control and their impact on GPT-4.1’s and GPT-4o’s performance. In this case, we analyze performance on a subset of biomedical questions focused on immunology. CLIO raises GPT-4o’s base performance to be on par with the best reasoning models for immunology questions, a 13.60% improvement over the base model. This result shows CLIO to be model agnostic, similar to the Microsoft AI Diagnostic Orchestrator (MAI-DxO) (opens in new tab) approach and its corresponding performance boost. 
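
A rough sketch of how a “time to think” knob and a technique selection might be exposed to the user; the preset names, budgets, and function signature are hypothetical, not CLIO’s actual interface.

```python
# Hypothetical mapping from a user-facing "thinking effort" setting to loop budgets.
EFFORT_PRESETS = {
    "low":    {"max_rounds": 2,  "branches_per_round": 1},
    "medium": {"max_rounds": 5,  "branches_per_round": 3},
    "high":   {"max_rounds": 10, "branches_per_round": 5},
}

def run_clio_like(question: str, effort: str = "medium",
                  technique: str = "scientific_method") -> dict:
    """Bundle the steering framework and thinking budget for a single run.
    A real system would pass `technique` as the reasoning framework and use
    `budget` to bound a reflection loop like the one sketched earlier."""
    budget = EFFORT_PRESETS[effort]
    return {"question": question, "technique": technique, **budget}

print(run_clio_like("Which epitope is likely immunodominant here?", effort="high"))
```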

Implications for science and trustworthy discovery

The future of scientific discovery demands more than reasoning over knowledge and raw computational power alone. Here, we demonstrate how CLIO not only increases model performance but also establishes new layers of control for scientists. In upcoming work, we will demonstrate how CLIO increases tool utility for highly valuable scientific questions in the drug discovery space, which requires precise tools designed for the language of science. While our experiments focus on scientific discovery, we believe CLIO can apply in a domain-agnostic fashion. Experts tackling problems in domains such as financial analysis, engineering, and legal services could also benefit from AI systems with a transparent, steerable reasoning approach. Ultimately, we envision CLIO as an enduring control layer in hybrid AI stacks that combine traditional completion and reasoning models with external memory systems and advanced tool calling. The continuous checks and balances that CLIO enables will remain valuable even as the components within these stacks evolve. This combination of intelligent, steerable scientific decision-making and tool optimization is the basis of the recently announced Microsoft Discovery platform (opens in new tab).

At Microsoft, we’re committed to advancing AI research that earns the trust of scientists, empowering them to discover new frontiers of knowledge. Our work is a testament to what’s possible when we blend innovation with trustworthiness and a human-centered vision for the future of AI-assisted scientific discovery. We invite the research and scientific community to join us in shaping that future.

Further information:

To learn more about our approach, please read our preprint paper published alongside this blog post. We are in the process of submitting this work for external peer review and encourage partners to explore using CLIO in Microsoft Discovery. To learn more about Microsoft's research in this area or to contact our team, please reach out to discoverylabs@microsoft.com.

Acknowledgements

We are grateful for Jason Zander and Nadia Karim’s support. We extend our thanks to colleagues both inside and outside Microsoft Discovery and Quantum for sharing their insights and feedback, including Allen Stewart, Yasser Asmi, David Marvin, Harsha Nori, Scott Lundberg, and Phil Waymouth. 

The post Self-adaptive reasoning for science appeared first on Microsoft Research.


Project Ire autonomously identifies malware at scale


Stylized digital illustration of a multi-layered circuit board. A glowing blue microchip sits at the top center, with intricate circuitry radiating outward. Beneath it, four stacked layers transition in color from blue to orange, each featuring circuit-like patterns. Smaller rectangular and circular components are connected around the layers, all set against a dark background with scattered geometric shapes.

Today, we are excited to introduce an autonomous AI agent that can analyze and classify software without assistance, a step forward in cybersecurity and malware detection. The prototype, Project Ire, automates what is considered the gold standard in malware classification: fully reverse engineering a software file without any clues about its origin or purpose. It uses decompilers and other tools, reviews their output, and determines whether the software is malicious or benign.

Project Ire emerged from a collaboration between Microsoft Research, Microsoft Defender Research, and Microsoft Discovery & Quantum, bringing together security expertise, operational knowledge, data from global malware telemetry, and AI research. It is built on the same collaborative and agentic foundation behind GraphRAG (opens in new tab) and Microsoft Discovery (opens in new tab). The system uses advanced language models and a suite of callable reverse engineering and binary analysis tools to drive investigation and adjudication.

As of this writing, Project Ire has achieved a precision (opens in new tab) of 0.98 and a recall (opens in new tab) of 0.83 using public datasets of Windows drivers. It was the first reverse engineer at Microsoft, human or machine, to author a conviction case—a detection strong enough to justify automatic blocking—for a specific advanced persistent threat (APT) malware sample, which has since been identified and blocked by Microsoft Defender. 

Malware classification at a global scale

Microsoft's Defender suite of products scans more than one billion monthly (opens in new tab) active devices, work that routinely requires manual review of software by experts.

This kind of work is challenging. Analysts often face error and alert fatigue, and there’s no easy way to compare and standardize how different people review and classify threats over time. For both of these reasons, today’s overloaded experts are vulnerable to burnout, a well-documented issue in the field.

Unlike other AI applications in security, malware classification lacks a computable validator (opens in new tab). The AI must make judgment calls without definitive validation beyond expert review. Many behaviors found in software, like reverse engineering protections, don’t clearly indicate whether a sample is malicious or benign. 

This ambiguity requires analysts to investigate each sample incrementally, building enough evidence to determine whether it’s malicious or benign despite opposition from adaptive, active adversaries. This has long made it difficult to automate and scale what is inherently a complex and expensive process.

Technical foundation

Project Ire attempts to address these challenges by acting as an autonomous system that uses specialized tools to reverse engineer software. The system’s architecture allows for reasoning at multiple levels, from low-level binary analysis to control flow reconstruction and high-level interpretation of code behavior.

Its tool-use API enables the system to update its understanding of a file using a wide range of reverse engineering tools, including Microsoft memory analysis sandboxes based on Project Freta (opens in new tab), custom and open-source tools, documentation search, and multiple decompilers.  

Reaching a verdict 

The evaluation process begins with a triage, where automated reverse engineering tools identify the file type, its structure, and potential areas of interest. From there, the system reconstructs the software’s control flow graph using frameworks such as angr (opens in new tab) and Ghidra (opens in new tab), building a graph that forms the backbone of Project Ire’s memory model and guides the rest of the analysis.  
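Project Ire's pipeline itself is not public, but the control flow graph recovery step described above is the kind of analysis the angr framework exposes directly. A minimal, generic sketch (the input file name is a placeholder) might look like this:

```python
# Generic angr sketch of control flow graph recovery; this illustrates the kind
# of analysis described above and is not Project Ire's actual code.
import angr

proj = angr.Project("sample.bin", auto_load_libs=False)  # "sample.bin" is a placeholder
cfg = proj.analyses.CFGFast()                            # static CFG recovery

print(f"Recovered {len(cfg.kb.functions)} functions and "
      f"{cfg.graph.number_of_nodes()} basic blocks")

# The recovered functions can then seed per-function analysis and summarization.
for func in list(cfg.kb.functions.values())[:5]:
    print(hex(func.addr), func.name)
```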

Through iterative function analysis, the LLM calls specialized tools through an API to identify and summarize key functions. Each result feeds into a “chain of evidence,” a detailed, auditable trail that shows how the system reached its conclusion. This traceable evidence log supports secondary review by security teams and helps refine the system in cases of misclassification.  

To verify its findings, Project Ire can invoke a validator tool that cross-checks claims in the report against the chain of evidence. This tool draws on expert statements from malware reverse engineers on the Project Ire team. Drawing on this evidence and its internal model, the system creates a final report and classifies the sample as malicious or benign.
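In other words, each tool call contributes an entry to an auditable record that the validator and the final report draw on. The sketch below shows one plausible shape for such a record; the field names are assumptions for illustration, not Project Ire's actual schema.

```python
# Illustrative "chain of evidence" record; field names are hypothetical and do
# not reflect Project Ire's internal schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvidenceEntry:
    function_addr: int      # address of the analyzed function
    tool: str               # reverse engineering tool that produced the finding
    summary: str            # model-generated summary of the function's behavior
    indicators: List[str]   # behaviors noted, e.g. "process termination"

@dataclass
class AnalysisReport:
    sample_sha256: str
    evidence: List[EvidenceEntry] = field(default_factory=list)
    verdict: str = "undetermined"   # set to "malicious" or "benign" after validation

    def add_evidence(self, entry: EvidenceEntry) -> None:
        # Append-only trail: entries are never rewritten, supporting secondary review.
        self.evidence.append(entry)
```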



Preliminary testing shows promise 

Two early evaluations tested Project Ire’s effectiveness as an autonomous malware classifier. In the first, we assessed Project Ire on a dataset of publicly accessible Windows drivers, some known to be malicious, others benign. Malicious samples came from the Living off the Land Drivers (opens in new tab) database, which includes a collection of Windows drivers used by attackers to bypass security controls, while known benign drivers were sourced from Windows Update. 

This classifier performed well, correctly identifying 90% of all files and flagging only 2% of benign files as threats. It achieved a precision of 0.98 and a recall of 0.83. This low false-positive rate suggests clear potential for deployment in security operations, alongside expert reverse engineering reviews. 
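For readers less familiar with these metrics, the snippet below applies the standard definitions to hypothetical counts chosen only to show how precision, recall, accuracy, and false-positive rate relate; the counts are illustrative, not the study's actual numbers.

```python
# Standard metric definitions applied to hypothetical counts (100 malicious and
# 100 benign files); these counts are illustrative, not the study's data.
tp, fn = 83, 17   # malicious files correctly flagged vs. missed
fp, tn = 2, 98    # benign files wrongly flagged vs. correctly cleared

precision = tp / (tp + fp)                    # ≈ 0.98: flagged files that are truly malicious
recall    = tp / (tp + fn)                    # = 0.83: malicious files that get flagged
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # ≈ 0.90: files classified correctly overall
fpr       = fp / (fp + tn)                    # = 0.02: benign files flagged as threats

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f} fpr={fpr:.2f}")
```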

For each file it analyzes, Project Ire generates a report that includes an evidence section, summaries of all examined code functions, and other technical artifacts.  

Figures 1 and 2 present reports for two successful malware classification cases generated during testing. The first involves a kernel-level rootkit, Trojan:Win64/Rootkit.EH!MTB (opens in new tab). The system identified several key features, including jump-hooking, process termination, and web-based command and control. It then correctly flagged the sample as malicious.

Figure 1 Analysis

The binary contains a function named ‘MonitorAndTerminateExplorerThread_16f64’ that runs an infinite loop waiting on synchronization objects and terminates system threads upon certain conditions. It queries system or process information, iterates over processes comparing their names case-insensitively to ‘Explorer.exe’, and manipulates registry values related to ‘Explorer.exe’. This function appears to monitor and potentially terminate or manipulate the ‘Explorer.exe’ process, a critical Windows shell process. Such behavior is suspicious and consistent with malware that aims to disrupt or control system processes.

Another function, ‘HttpGetRequestAndResponse_174a4’, performs HTTP GET requests by parsing URLs, resolving hostnames, opening sockets, sending requests, and reading responses. This network communication capability could be leveraged for command and control or data exfiltration, common in malware.

The binary also includes a function ‘PatchProcessEntryPointWithHook_12b5c’ that patches the entry point of a process by writing a hook or trampoline that redirects execution to a specified address. This technique is commonly used for process injection or hooking, allowing malware to alter process behavior or inject malicious code.

Other functions related to sending IOCTL requests to device drivers were identified, but their maliciousness could not be conclusively determined without additional context.

Overall, the binary exhibits multiple indicators of malicious behavior, including process manipulation, network communication, and code injection techniques, suggesting it is likely malware designed to interfere with system processes and communicate with remote servers.

Figure 1. Project Ire report, sample with SHA256: 86047bb1969d1db455493955fd450d18c62a3f36294d0a6c3732c88dfbcc4f62 (opens in new tab)

The second sample, HackTool:Win64/KillAV!MTB (opens in new tab), was designed to disable antivirus software. Project Ire correctly identified the code that locates and disables antivirus programs, providing evidence that the file was malicious.  

In one section of the code, however, the system misidentified a function as anti-debugging behavior. To maintain accuracy, the system used the validator tool to flag the claim as unsupported. The issue was later resolved by updating decompiler rules, but this example illustrates how Project Ire navigates uncertainty during analysis. Figure 2 shows the corresponding report. 

Figure 2 Analysis

The binary contains several functions indicative of malicious intent. The function register_and_log_known_processes_140001000 logs and registers process names associated with antivirus and security software, such as ‘avp.exe’, ‘avpui.exe’, and ‘360Tray.exe’. It calls another function, TerminateProcessesByNameSubstring_1400010f4, which enumerates system processes and terminates those whose names contain specified substrings. This behavior is typical of malware attempting to disable or evade security software by killing their processes.

Another function, check_and_handle_special_state_14000502c, performs checks on a global variable and triggers software interrupts if certain conditions are not met. While the exact purpose of these interrupts (int 0x29 and int 0x3) is unclear, they could represent an anti-debug or anti-analysis mechanism to detect or interfere with debugging or tampering attempts. However, this assumption could not be fully validated against expert statements.

Other functions include initialization routines and simple logging wrappers, but the core malicious behavior centers on process termination targeting security software. This indicates the binary is designed to compromise system security by disabling protective processes, a hallmark of malware such as trojans or rootkits.

Figure 2. Project Ire report, sample with SHA256: b6cb163089f665c05d607a465f1b6272cdd5c949772ab9ce7227120cf61f971a (opens in new tab)

Real-world evaluation with Microsoft Defender 

The more demanding test involved nearly 4,000 “hard-target” files not classified by automated systems and slated for manual review by expert reverse engineers.

In this real-world scenario, Project Ire operated fully autonomously on files created after the language models’ training cutoff, files that no other automated tools at Microsoft could classify at the time.

The system achieved a high precision score of 0.89, meaning nearly 9 out of 10 files flagged as malicious were correctly identified. Recall was 0.26, indicating that under these challenging conditions, the system detected roughly a quarter of all actual malware.

The system correctly identified many of the malicious files, with few false alarms, just a 4% false positive rate. While overall performance was moderate, this combination of accuracy and a low error rate suggests real potential for future deployment.

Looking ahead 

Based on these early successes, the Project Ire prototype will be leveraged inside Microsoft’s Defender organization as Binary Analyzer for threat detection and software classification.

Our goal is to scale the system’s speed and accuracy so that it can correctly classify files from any source, even on first encounter. Ultimately, our vision is to detect novel malware directly in memory, at scale.

Acknowledgements 

Project Ire acknowledges the following additional developers who contributed to the results in this publication: Dayenne de Souza, Raghav Pande, Ryan Terry, Shauharda Khadka, and Bob Fleck for their independent review of the system.

The system incorporates multiple tools, including the angr framework developed by Emotion Labs (opens in new tab). Microsoft has collaborated extensively with Emotion Labs, a pioneer in cyber autonomy, throughout the development of Project Ire, and thanks them for the innovations and insights that contributed to the successes reported here. 

The post Project Ire autonomously identifies malware at scale appeared first on Microsoft Research.


VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows


Alt text: Two white icons on a blue-to-green gradient background—one showing a central figure linked to others, representing a network, and the other depicting lines connecting to a document, symbolizing data flow.

Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”1

Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).  

In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components: 

  1. Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source. 
  2. Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.

Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.2

VeriTrail’s hallucination detection process

A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.  
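As a concrete illustration of this representation (not the paper's actual data structures; the field names are assumptions), a DAG of this kind can be captured with a small amount of code:

```python
# Illustrative sketch of the DAG described above; field names are hypothetical
# and not taken from the VeriTrail implementation.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    node_id: str   # unique ID assigned to the node
    stage: int     # position in the generative process (e.g., 1 = source text chunk)
    text: str      # source material, an intermediate output, or the final output

@dataclass
class GenerativeDAG:
    nodes: Dict[str, Node] = field(default_factory=dict)
    inputs: Dict[str, List[str]] = field(default_factory=dict)  # node_id -> IDs of its inputs
    final_output_id: str = ""                                   # the terminal node

    def add_edge(self, src: str, dst: str) -> None:
        """Record that node `src` was used as an input to produce node `dst`."""
        self.inputs.setdefault(dst, []).append(src)
```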

An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.3

A GraphRAG run is depicted as a directed acyclic graph. The Stage 1 nodes represent source text chunks. Each Stage 1 node has an edge pointing to a Stage 2 node, which corresponds to an entity or a relationship. Entity 3 was extracted from two source text chunks, so its descriptions are summarized. The summarized entity description forms a Stage 3 node. The Stage 2 and 3 nodes have edges pointing to Stage 4 nodes, which represent community reports. The Stage 4 nodes have edges pointing to Stage 5 nodes, which correspond to map-level answers. The Stage 5 nodes each have an edge pointing to the terminal node, which represents the final answer. The terminal node is the only node in Stage 6.
Figure 1: GraphRAG splits the source text into chunks (Stage 1). For each chunk, an LM extracts entities and relationships (the latter are denoted by “⭤ “), along with short descriptions (Stage 2). If an entity or a relationship was extracted from multiple chunks, an LM summarizes the descriptions (Stage 3). A knowledge graph is constructed from the final set of entities and relationships, and a community detection algorithm, such as Leiden clustering, groups entities into communities. For each community, an LM generates a “community report” that summarizes the entities and relationships (Stage 4). To answer a user’s question, an LM generates “map-level answers” based on groups of community reports (Stage 5), then synthesizes them into a final answer (Stage 6).

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1. 

Case study 1: A “Fully Supported” claim

An example of VeriTrail's claim verification process where the claim is found “Fully Supported.” A claim extracted from the terminal node, Node 17, is “Legislative efforts have been made to address the high cost of diabetes-related supplies in the US.” In Iteration 1, VeriTrail checks Nodes 15 and 16, which are the source nodes of the terminal node. The sentence “The general assembly in North Carolina is considering legislation to set a cap on insulin prices, which indicates that high insulin prices are a contributing factor to the high cost of diabetes-related supplies in the US” is selected as evidence from Node 15. The tentative verdict is “Fully Supported.” In Iteration 2, VeriTrail checks Nodes 12 and 13, which are the source nodes of Node 15. The sentence “The General Assembly in North Carolina is considering legislation to set a cap on insulin prices” is selected as evidence from Node 13. The verdict remains “Fully Supported.” In Iteration 3, VeriTrail checks Nodes 4, 5, and 11, which are the source nodes of Node 13. The sentence “The General Assembly is the legislative body in North Carolina considering legislation to cap insulin prices” is selected as evidence from Node 4. The verdict is still “Fully Supported.” In Iteration 4, VeriTrail checks Node 1, which is the source node of Node 4. The selected evidence is “‘There’s actually legislation in North Carolina at the General Assembly to set a cap on insulin…’ Stein said.” The corresponding verdict is “Fully Supported.” Since Node 1 represents a raw text chunk, it does not have any source nodes to check. Therefore, verification terminates and the “Fully Supported” verdict is deemed final.
Figure 2: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Fully Supported” claim.

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated: 

  • In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.
    • An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.
    • Next, an LM performed Verdict Generation. If no sentences had been selected in the Evidence Selection step, the claim would have been assigned a “Not Fully Supported” verdict. Instead, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on the evidence. In this case, the verdict was “Fully Supported.”
  • Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4. 
  • In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.

Case study 2: A “Not Fully Supported” claim

An example of VeriTrail's claim verification process where the claim is found “Not Fully Supported.” We assume that the maximum number of consecutive “Not Fully Supported” verdicts was set to 2. A claim extracted from the terminal node, Node 17, is “Challenges related to electric vehicle battery repairability contribute to sluggish retail auto sales in China.” In Iteration 1, VeriTrail checks Nodes 15 and 16, which are the source nodes of the terminal node. Two sentences are selected as evidence. The first sentence is “Challenges with electric vehicle (EV) battery disposal and repair may also contribute to the sluggishness in retail auto sales.” The second sentence is “Junkyards are accumulating discarded EV battery packs, while collision shops face limitations in repairing EV battery packs, which could affect consumer confidence and demand.” These sentences are both from Node 15. The tentative verdict is “Not Fully Supported.” In Iteration 2, VeriTrail checks Nodes 12, 13, and 14. Nodes 12 and 13 are the source nodes of Node 15. Node 14 is the source node of Node 16, which was checked in Iteration 1. The sentence “The electric vehicle market in China is influenced by challenges associated with EV battery disposal and repair” is selected as evidence from Node 12. The verdict remains “Not Fully Supported.” Since two consecutive “Not Fully Supported” verdicts have been reached, which was the maximum, verification terminates and the final verdict is “Not Fully Supported.”
Figure 3: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Not Fully Supported” claim, where the maximum number of consecutive “Not Fully Supported” verdicts was set to 2.

Figure 3 provides an example of a claim where VeriTrail identified hallucination:

  • In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.
  • Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1 where the verdict was “Fully Supported,” VeriTrail only checked the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In this case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.
  • In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.



Providing traceability

In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability: 

  1. Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.
  2. Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur. 

The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries. 

Key design features

VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include: 

  • During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated.
  • After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.
  • VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.  
  • The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict. 
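Putting the case studies and the design features above together, the per-claim verification loop can be sketched roughly as follows. This is a simplified reading of the method, not VeriTrail's actual code: the two LM calls are passed in as callables, evidence is assumed to be a list of (node ID, sentence) pairs, and the sketch reuses the hypothetical `GenerativeDAG` structure shown earlier.

```python
# Simplified sketch of VeriTrail's per-claim verification loop; all names are
# illustrative placeholders, and the real method is described in the paper.
def verify_claim(claim, dag, select_evidence, generate_verdict, max_consecutive_nfs=2):
    frontier = list(dag.inputs[dag.final_output_id])   # inputs of the final output
    consecutive_nfs = 0
    trail = []                                          # evidence trail across iterations

    while frontier:
        evidence = select_evidence(claim, frontier)     # LM: (node_id, sentence) pairs
        verdict = ("Not Fully Supported" if not evidence
                   else generate_verdict(claim, evidence))
        trail.append((list(frontier), evidence, verdict))

        if verdict == "Not Fully Supported":
            consecutive_nfs += 1
            if consecutive_nfs >= max_consecutive_nfs:
                break                                   # final verdict reached
            sources = frontier                          # widen: recheck inputs of ALL checked nodes
        else:
            consecutive_nfs = 0
            sources = {node_id for node_id, _ in evidence}  # narrow: only nodes that yielded evidence

        # Move one step closer to the source text; source chunks have no inputs,
        # so the loop terminates once they are reached.
        frontier = [n for s in sources for n in dag.inputs.get(s, [])]

    return verdict, trail
```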

Evaluating VeriTrail’s performance

We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization4 and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).

We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore and INFUSE); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.5

Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.  

For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”


1 (opens in new tab) The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report (opens in new tab).

2 VeriTrail is currently used for research purposes only and is not available commercially.

3 We focus on GraphRAG’s global search method.

4 (opens in new tab) In hierarchical summarization, an LM summarizes each source text chunk individually, then the resulting summaries are repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021 (opens in new tab); Chang et al., 2023 (opens in new tab)).

5 The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.

The post VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows appeared first on Microsoft Research.


Navigating medical education in the era of generative AI


AI Revolution | Illustrated headshots of Daniel Chen, Peter Lee, and Dr. Morgan Cheatham

In November 2022, OpenAI’s ChatGPT kick-started a new era in AI. This was followed less than a half year later by the release of GPT-4. In the months leading up to GPT-4’s public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.   

In this episode, Dr. Morgan Cheatham (opens in new tab) and Daniel Chen (opens in new tab), two rising physicians and experts in both medicine and technology, join Lee to explore how generative AI is reshaping medical education. Cheatham, a partner and head of healthcare and life sciences at Breyer Capital and a resident physician at Boston Children’s Hospital, discusses how AI is changing how clinicians acquire and apply medical knowledge at the point of care, emphasizing the need for training and curriculum changes to help ensure AI is used responsibly and that clinicians are equipped to maximize its potential. Chen, a medical student at the Kaiser Permanente Bernard J. Tyson School of Medicine, shares how he and his peers use AI tools as study aids, clinical tutors, and second opinions and reflects on the risks of overreliance and the importance of preserving critical thinking.


Learn more:

Perspectives on the Current and Future State of Artificial Intelligence in Medical Genetics (opens in new tab) (Cheatham)
Publication | May 2025

Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models (opens in new tab) (Cheatham) 
Publication | February 2023 

The AI Revolution in Medicine: GPT-4 and Beyond 
Book | Peter Lee, Carey Goldberg, Isaac Kohane | April 2023 

Transcript

[MUSIC]     

[BOOK PASSAGE] 

PETER LEE: “Medicine often uses the training approach when trying to assess multipurpose talent. To ensure students can safely and effectively take care of patients, we have them jump through quite a few hoops, … [and] they need good evaluations once they reach the clinic, passing grades on more exams like the USMLE [United States Medical Licensing Examination]. … [But] GPT-4 gets more than 90 percent of questions on licensing exams correct. … Does that provide any level of comfort in using GPT-4 in medicine?” 

[END OF BOOK PASSAGE]     

[THEME MUSIC]     

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.     

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?      

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.


[THEME MUSIC FADES]  

The book passage I read at the top is from Chapter 4, “Trust but Verify.” In it, we explore how AI systems like GPT-4 should be evaluated for performance, safety, and reliability and compare this to how humans are both trained and assessed for readiness to deliver healthcare. 

In previous conversations with guests, we’ve spoken a lot about AI in the clinic as well as in labs and companies developing AI-driven tools. We’ve also talked about AI in the hands of patients and consumers. But there has been some discussion also about AI’s role in medical training. And, as a founding board member of a new medical school at Kaiser Permanente, I definitely have my own thoughts about this. But today, I’m excited to welcome two guests who represent the next generation of medical professionals for their insights, Morgan Cheatham and Daniel Chen. 

Morgan Cheatham is a graduate of Brown University’s Warren Alpert Medical School with clinical training in genetics at Harvard and is a clinical fellow at Boston Children’s Hospital. While Morgan is a bona fide doctor in training, he’s also amazingly an influential health technology strategist. He was recently named partner and head of healthcare and life sciences at Breyer Capital and has led investments in several healthcare AI companies that have eclipsed multibillion-dollar valuations.  

Daniel Chen is finishing his second year as a medical student at the Kaiser Permanente Bernard J. Tyson School of Medicine. He holds a neuroscience degree from the University of Washington and was a research assistant in the Raskind Lab at the UW School of Medicine, working with imaging and genetic data analyses for biomedical research. Prior to med school, Daniel pursued experiences that cultivated his interest in the application of AI in medical practice and education.  

Daniel and Morgan exemplify the real-world future of healthcare, a student entering his third year of medical school and a fresh medical-school graduate who is starting a residency while at the same time continuing his work on investing in healthcare startups. 

[TRANSITION MUSIC] 

Here is my interview with Morgan Cheatham: 

LEE: Morgan, thanks for joining. Really, really looking forward to this chat. 

MORGAN CHEATHAM: Peter, it’s a privilege to be here with you. Thank you. 

LEE: So are there any other human beings who are partners at big-league venture firms, residents at, you know, a Harvard-affiliated medical center, author, editor for a leading medical journal? I mean, who are your … who’s your cohort? Who are your peers? 

CHEATHAM: I love this question. There are so many people who I consider peers that I look up to who have paved this path. And I think what is distinct about each of them is they have this physician-plus orientation. They are multilingual in terms of knowing the language of medicine but having learned other disciplines. And we share a common friend, Dr. Zak Kohane, who was among the first to really show how you can meld two worlds as a physician and make significant contributions to the intersections thereof.  

I also deeply, in the world of business, respect physicians like Dr. Krishna Yeshwant at Google Ventures, who simultaneously pursued residency training and built what is now, you know, a large and enduring venture firm.  

So there are plenty of people out there who’ve carved their own path and become these multilingual beings, and I aspire to be one. 

LEE: So, you know, one thing I’ve been trying to explore with people are their origins with respect to the technology of AI. And there’s two eras for that. There’s AI before ChatGPT and before, you know, generative AI really became a big thing, and then afterwards.  

So let’s start first before ChatGPT. You know, what was your contact? What was, you know, your knowledge of AI and machine learning? 

CHEATHAM: Sure, so my experiences in computer science date back to high school. I went to Thomas Jefferson, which is a high school in Alexandria, Virginia, that prides itself on requiring students to take computer science in their first year of high school as kind of a required torturous experience. [LAUGHTER] And I remember that fondly. Our final project was Brick Breaker. It was actually, I joke, all hard coded. So there was nothing intelligent about the Brick Breaker that we built. But that was my first exposure.  

I was a classics nerd, and I was really interested in biology and chemistry as a pre-medical student. So I really wouldn’t intersect with this field again until I was shadowing at Inova Hospital, which was a local hospital near me. And it was interesting because, at the time—I was shadowing in the anesthesia department—they were actually implementing their first instance of Epic. 

LEE: Mmm. Wow. 

CHEATHAM: And I remember that experience fondly because the entire hospital was going from analog—they were going from paper-based charts—to this new digital system. And I didn’t quite know in that moment what it would mean for the field or for my career, but I knew it was a big deal because a lot of people had a lot of emotion around what was going on, and it was in that experience that I kind of decided to attach myself to the intersection of computation and medicine. So when I got to undergrad, I was a pre-medical student. I was very passionate about studying the sacred physician-patient relationship and everything that had to go on in that exam room to provide excellent care.  

But there were a few formative experiences: one, working at a physician-founded startup that was using at the time we called it big data, if you remember, … 

LEE: Yup. 

CHEATHAM: … to match the right patient to the right provider at the right time. And it was in that moment that I realized that as a physician, I could utilize technology to scale that sacred one-to-one patient-provider interaction in nonlinear ways. So that was, kind of, the first experience where I saw deployed systems that were using patient data and clinical information in an intelligent format. 

LEE: Yeah. And so you’re a pre-medical student, but you have this computer science understanding. You have an intuition, I guess is the right way to say it, that the clinical data becoming digital is going to be important. So then what happens from there to your path to medical school? 

CHEATHAM: Yeah, so I had a few formative research experiences in my undergraduate years. You know, nothing that ever amounted to a significant publication, but I was toying around with SVMs [support vector machines] for sepsis and working with the MIMIC [Medical Information Mart for Intensive Care] database early days and really just trying to understand what it meant that medical data was becoming digitized.  

And at the same time, again, I was rather unsatisfied doing that purely in an academic context. And I so early craved seeing how this would roll out in the wild, roll out in a clinical setting that I would soon occupy. And that was really what drove me to work at this company called Kyruus [Health] and understand how these systems, you know, scaled. Obviously, that’s something with AI that we’re now grappling with in a real way because it looks much different.  

LEE: Right. Yep. 

CHEATHAM: So the other experience I had, which is less relevant to AI, but I did do a summer in banking. And I mention this because what I learned in the experience was … it was a master class in business. And I learned that there was another scaling factor that I should appreciate as we think about medicine, and that was capital and business formation. And that was also something that could scale nonlinearly.  

So when you married that with technology, it was, kind of, a natural segue for me before going to med school to think about venture capital and partnering with founders who were going to be building these technologies for the long term. And so that’s how I landed on the venture side. 

LEE: And then how long of a break before you started your medical studies? 

CHEATHAM: It was about four years. Originally, it was going to be a two-year deferral, and the pandemic happened. Our space became quite active in terms of new companies and investment. So it was about four years before I went back. 

LEE: I see. And so you’re in medical school. ChatGPT happened while you were in medical school, is that right? 

CHEATHAM: That’s right. That’s right. Right before I was studying for Step 1. So the funny story, Peter, that I like to share with folks is … 

LEE: Yeah. 

CHEATHAM: … I was just embarking on designing my Step 1 study plan with my mentor. And I went to NeurIPS [Conference] for the first time. And that was in 2022, when, of course, ChatGPT was released.  

And for the remainder of that fall period, you know, I should have been studying for these shelf exams and, you know, getting ready …  

LEE: Yeah. 

CHEATHAM: … for this large board exam. And I was fortunate to partner, actually, with one of our portfolio company CEOs who is a physician—he is an MD/PhD—to work on the first paper that showed that ChatGPT could pass the US Medical Licensing Exam (opens in new tab).  

LEE: Yes. 

CHEATHAM: And that was a riveting experience for a number of reasons. I joke with folks that it was both the best paper I was ever, you know, a part of and proud to be a coauthor of, but also the worst for a lot of reasons that we could talk about.  

It was the best in terms of canonical metrics like citations, but the worst in terms of, wow, did we spend six months as a field thinking this was the right benchmark … [LAUGHTER] 

LEE: Right. 

CHEATHAM: … for how to assess the performance of these models. And I’m so encouraged … 

LEE: You shouldn’t feel bad that way because, you know, at that time, I was secretly, you know, assessing what we now know of as GPT-4 in that period. And what was the first thing I tried to do? Step 1 medical exam.  

By the way, just for our listeners who don’t understand about medical education—in the US, there’s a three-part exam that extends over a couple of years of medical school. Step 1, Step 2, Step 3. And Step 1 and Step 2 in particular are multiple-choice exams. 

And they are very high stakes when you’re in medical school. And you really have to have a command of quite a lot of clinical knowledge to pass these. And it’s funny to hear you say what you were just sharing there because it was also the first impulse I had with GPT-4. And in retrospect, I feel silly about that. 

CHEATHAM: I think many of us do, but I’ve been encouraged over the last two years, to your point, that we really have moved our discourse beyond these exams to thinking about more robust systems for the evaluation of performance, which becomes even more interesting as you and I have spoken about these multi-agent frameworks that we are now, you know, compelled to explore further. 

LEE: Yeah. Well, and even though I know you’re a little sheepish about it now, I think in the show notes, we’ll link to that paper because it really was one of the seminal moments when we think about AI, AI in medicine.

And so you’re seeing this new technology, and it’s happening at a moment when you yourself have to confront taking the Step 1 exam. So how did that feel? 

CHEATHAM: It was humbling. It was shocking. What I had worked two years for, grueling over textbooks and, you know, flashcards and all of the things we do as medical students, to see a system emerge out of thin air that was arguably going to perform far better than I ever would, no matter how much …  

LEE: Yeah. 

CHEATHAM: … I was going to study for that exam, it set me back. It forced me to interrogate what my role in medicine would be. And it dramatically changed the specialties that I considered for myself long term.  

And I hope we talk about, you know, how I stumbled upon genetics and why I’m so excited about that field and its evolution in this computational landscape. I had to do a lot of soul searching to relinquish what I thought it meant to be a physician and how I would adapt in this new environment. 

LEE: You know, one of the things that we wrote in our book, I think it was in a chapter that I contributed, I was imagining that students studying for Step 1 would be able to do it more actively.  

Or you could even do sort of a pseudo-Step 3 exam by having a conversation. You provide the presentation of a patient and then have an encounter, you know, where the ChatGPT is the patient, and then you pretend to be the doctor. And then in the example that we published, then you say, “End of encounter.” And then you ask ChatGPT for an assessment of things.  

So, you know, maybe it all came too late for Step 1 for you because you were already very focused and, you know, had your own kind of study framework. But did you have an influence or use of this kind of technology for Step 2 and Step 3? 

CHEATHAM: So even for Step 1, I would say, it [ChatGPT], you know, dropped in November. I took it [Step 1 exam] in the spring, so I was able to use it to study. But the lesson I learned in that moment, Peter, was really about the importance of trust with AI and clinicians or clinicians in training, because we all have the same resources that we use for these board exams, right. UWorld is this question bank. It’s been around forever. If you’re not using UWorld, like, good luck. And so why would you deviate off of a well-trodden path to study for this really important exam?  

And so I kind of adjunctively used GPT alongside UWorld to come up with more personalized explanations for concepts that I wasn’t understanding, and I found that it was pretty good and it was certainly helpful for me.  

Fortunately, I was, you know, able to pass, but I was very intentional about dogfooding AI when I was a medical student, and part of that was because I had been a venture capitalist, and I’d made investments in companies whose products I could actually use.  

And so, you know, Abridge is a company in the scribing space that you and I have talked about.  

LEE: Yeah. 

CHEATHAM: I was so fortunate in the early days of their product to not just be a user but to get to bring their product across the hospital. I could bring the product to the emergency department one week, to neurology another week, to the PICU [pediatric intensive care unit], you know, the next week, and assess the relative performance of, you know, how it handled really complex genetics cases … 

LEE: Yeah. 

CHEATHAM: … versus these very challenging social situations that you often find yourself navigating in primary care. So not only was I emotional about this technology, but I was a voracious adopter in the moment. 

LEE: Yeah, right. And you had a financial interest then on top of that, right? 

CHEATHAM: I was not paid by Abridge to use the product, but, you know, I joke that the team was probably sick of me. [LAUGHTER] 

LEE: No, no, but you were working for a venture firm that was invested in these, right? So all of these things are wrapped up together. You know, you’re having to get licensed as a doctor while doing all of this.  

So I want to get into that investment and new venture stuff there, but let’s stick just for a few more minutes on medical education. So I mentioned, you know, what we wrote in the book, and I remember writing the example, you know, of an encounter. Is that at all realistic? Is anything like that … that was pure speculation on our part. What’s really happening?  

And then after we talk about what’s really happening, what do you think should happen in medical education given the reality of generative AI? 

CHEATHAM: I’ve been pleasantly surprised talking with my colleagues about AI in clinical settings, how curious people are and how curious they’ve been over the last two years. I think, oftentimes, we say, oh, you know, this technology really is stratified by age and the younger clinicians are using it more and the older physicians are ignoring it. And, you know, maybe that’s true in some regards, but I’ve seen, you know, many, you know, senior attendings pulling up Perplexity, GPT, more recently OpenEvidence (opens in new tab), which has been a really essential tool for me personally at the point of care, to come up with the best decisions for our patients.  

The general skepticism arises when people reflect on their own experience in training and they think, “Well, I had to learn how to do it this way.”  

LEE: Yeah. 

CHEATHAM: “And therefore, you using an AI scribe to document this encounter doesn’t feel right to me because I didn’t get to do that.” And I did face some of those critiques or criticisms, where you need to learn how to do it the old-school way first and then you can use an AI scribe.  

And I haven’t yet seen—maybe even taking a step back—I haven’t seen a lot of integration of AI into the core medical curriculum, period.  

LEE: Yeah. 

CHEATHAM: And, as you know, if you want to add something to medical school curriculum, you can get in a long line of people who also want to do that. 

LEE: Yes. Yeah.

CHEATHAM: But it is urgent that our medical schools do create formalized required trainings for this technology because people are already using it.  

LEE: Yes. 

CHEATHAM: I think what we will need to determine is how much of the old way do people need to learn in order to earn the right to use AI at the point of care and how much of that old understanding, that prior experience, is essential to be able to assess the performance of these tools and whether or not they are having the intended outcome.  

I kind of joke it’s like learning cursive, right? 

LEE: Yes. 

CHEATHAM: I’m old enough to have had to learn cursive. I don’t think people really have to learn it these days. When do I use it? Well, when I’m signing something. I don’t even really sign checks anymore, but … 

LEE: Well … the example I’ve used, which you’ve heard, is, I’m sure you were still taught the technique of manual palpation, even though … 

CHEATHAM: Of course. 

LEE: … you have access to technologies like ultrasound. And in fact, you would use ultrasound in many cases.  

And so I need to pin you down. What is your opinion on these things? Do you need to be trained in the old ways? 

CHEATHAM: When it comes to understanding the architecture of the medical note, I believe it is important for clinicians in training to know how that information is generated, how it’s characterized, and how to go from a very broad-reaching conversation to a distilled clinical document that serves many functions. 

Does that mean that you should be forced to practice without an AI scribe for the entirety of your medical education? No. And I think that as you are learning the architecture of that document, you should be handed an AI scribe and you should be empowered to have visits with patients both in an analog setting—where you are transcribing and generating that note—and soon thereafter, I’m talking in a matter of weeks, working with an AI scribe. That’s my personal belief.  

LEE: Yeah, yeah. So you’re going to … well, first off, congratulations on landing a residency at Boston Children’s [Hospital].  

CHEATHAM: Thank you, Peter. 

LEE: I understand there were only two people selected for this and super competitive. You know, with that perspective, you know, how do you see your future in medicine, just given everything that’s happening with AI right now?  

And are there things that you would urge, let’s say, the dean of the Brown Medical School to consider or to change? Or maybe not the dean of Brown but the head of the LCME [Liaison Committee on Medical Education], the accrediting body for US medical schools. What in your mind needs to change? 

CHEATHAM: Sure. I’ll answer your first question first and then talk about the future.  

LEE: Yeah. 

CHEATHAM: For me personally, I fell into the field of genomics. And so my training program will cover both pediatrics as well as clinical genetics and genomics.  

And I alluded to this earlier, but one of the reasons I’m so excited to join the field is because I really feel like the field of genetics not only is focused on a very underserved patient population, but not in how we typically think of underserved. I’m talking about underserved as in patients who don’t always have answers. Patients for whom the current guidelines don’t offer information or comfort or support. 

Those are patients that are extremely underserved. And I think in this moment of AI, there’s a unique opportunity to utilize the computational systems that we now have access to, to provide these answers more precisely, more quickly.  

And so I’m excited to marry those two fields. And genetics has long been a field that has adopted technology. We just think about the foundational technology of genomic sequencing and variant interpretation. And so it’s a kind of natural evolution of the field, I believe, to integrate AI and specifically generative AI. 

If I were speaking directly to the LCME, I mean, I would just have to encourage the organization, as well as medical societies who partner with attending physicians across specialties, to lean in here.

When I think about prior revolutions in technology and medicine, physicians were not always at the helm. We have a unique opportunity now, and you talk about companies like Abridge in the AI space, companies like Viz.ai, Cleerly—I mean, I could go on: Iterative Health … I could list 20 organizations that are bringing AI to the point of care that are founded by physicians.

This is our moment to have a seat at the table and to shape not only the discourse but the deployment. And the unique lens, of course, that a physician brings is that of prioritizing the patient, and with AI, this time around, we have to do that.

LEE: So LCME for our listeners is, I think it stands for the Liaison Committee on Medical Education. It’s basically the accrediting body for US medical schools, and it’s very high stakes. It’s very, very rigorous, which I think is a good thing, but it’s also a bit of a straitjacket.  

So if you are on the LCME, are there specific new curricular elements that you would demand that LCME, you know, add to its accreditation standards? 

CHEATHAM: We need to unbundle the different components of the medical appointment and think about the different functions of a human clinician to answer that question.  

There are a couple of areas that are top of mind for me, the first being medical search. There are large organizations and healthcare incumbents that have been around for many decades, companies like UpToDate or even, you know, the guidelines that are housed by our medical societies, that need to respond to the information demands of clinicians at the point of care in a new way with AI.  

And so I would love to see our medical institutions teaching more students how to use AI for medical search problems at the point of care. How to not only, you know, from a prompt perspective, ask questions about patients in a high-efficacy way, but also to interpret the outputs of these systems to inform downstream clinical decision-making.  

People are already adopting, as you know, GPT, OpenEvidence, Perplexity, all of these tools to make these decisions now.  

And so by not—again, it’s a moral imperative of the LCME—by not having curriculum and support for clinicians doing that, we run the risk of folks not utilizing these tools properly or to their greatest potential. 

LEE: Yeah, then, but zooming forward then, what about board certification? 

CHEATHAM: Board certification today is already transitioning to an open-book format for many specialties, is my understanding. And in talking to some of my fellow geneticists, and, you know, that’s a pretty challenging board exam in clinical genetics or biochemical genetics, they are using OpenEvidence during those open-book exams.  

So what I would like to see us do is move from a system of rote memorization and regurgitation of fact to an assessment framework that is adaptive, is responsive, and assesses for your ability to use the tools that we now have at our disposal to make sound clinical decisions. 

LEE: Yeah. We’ve heard from Sara Murray, you know, that when she’s doing her rounds, she consults routinely with ChatGPT. And that was something we also predicted, especially Carey Goldberg in our book, you know, wrote this fictional account.  

Is that the primary real-world use of AI? Not only by clinicians, but also by medical students … are medical students, you know, engaged with ChatGPT or, you know, similar? 

CHEATHAM: Absolutely. I’ve listed some of the tools. I think there, in general, Peter, there is this new clinician stack that is emerging of these tools that people are trying, and I think the cycles of iteration are quick, right. Some folks are using Claude [Claude.ai] one week, and they’re trying Perplexity, or they’re trying OpenEvidence, they’re trying GPT for a different task.  

There’s this kind of moment in medicine that every clinician experiences where you’re on rounds, and there’s that very senior attending. And you’ve just presented a patient to them, and you think you did an elegant job, and you’ve summarized all the information, and you really feel good about your differential, and they ask you, like, the one question you didn’t think to address. [LAUGHTER] 

And I’ll tell you, some of the funniest moments I’ve had using AI in the hospital … let me take a step back. That process of an attending physician interrogating a medical student is called “pimping,” for lack of a better phrase.  

And some of the funniest use cases I’ve had for AI in that setting is actually using OpenEvidence or GPT as defense against pimping. [LAUGHTER] So quickly while they’re asking me the question, I put it in, and I’m actually able to answer it right away. So it’s been effective for that. But I would say, you know, [in] the halls of most of the hospitals where I’ve trained, I’m seeing this technology in the wild.

LEE: So now you’re so tech-forward, but as for that off-label use of AI, when we wrote our book, we weren’t sure that at least top health systems would tolerate this. Do you have an opinion about this? Should these things be better regulated or controlled by the CIOs of Boston Children’s? 

CHEATHAM: I’m a big believer that transparency encourages good behaviors. 

And so the first time I actually tried to use ChatGPT in a clinical setting, it was at a hospital in Rhode Island. I will not name which hospital. But the site was actually blocked. I wasn’t able to access it from a desktop. The hospital’s first response to this technology was, let’s make sure none of our clinicians can access it. It has so much potential for medicine. The irony of that today.  

And it’s since, you know, become unblocked. But I was able to use it on my phone. So, to your point, if there’s a will, there’s a way. And we will utilize this technology if we are seeing perceived value. 

LEE: So, yeah, no, absolutely. So now, you know, in some discussions, one superpower that seems to be common across people who are really leading the charge here is they seem to be very good readers and students.  

And I understand you’re also a voracious reader. In fact, you’re even on an editorial team for a major medical journal. To what extent does that help?  

And then from your vantage point at New England Journal of Medicine AI—and I’ll have a chance to ask Zak Kohane as the editor in chief the same question—you know, what’s your assessment as you reflect over the last two years for the submitted manuscripts? Are you overall surprised at what you’re seeing? Disappointed? Any kind of notable hits or misses, just in the steady stream of research papers that are getting submitted to that leading journal? 

CHEATHAM: I would say overall, the field is becoming more expansive in the kinds of questions that people are asking.  

Again, when we started, it was this very myopic approach of: “Can we pass these medical licensing exams? Can we benchmark this technology to how we benchmark our human clinicians?” I think that’s a trap. Some folks call this the Turing Trap, right, of let’s just compare everything to what a human is capable of.  

Instead of interrogating, as you all talk about in the book, what are the unique attributes of this new substrate for computation and what new behaviors emerge from it, whether that’s from a workflow perspective in the back office, or—as I’m personally more passionate [about] and as we’re seeing more people focus on in the literature—what are the diagnostic capabilities, right. 

I love Eric Topol’s framework for “machine eyes,” right, as this notion of like, yes, we as humans have eyes, and we have looked at medical images for many, many decades, but these machines can take a different approach to a retinal image, right.  

It’s not just what you can diagnose in terms of an ophthalmological disease but maybe a neurologic disease or, you know, maybe liver disease, right. 

So I think the literature is, in general, moving to this place of expansion, and I’m excited by that. 

LEE: Yeah, I kind of have referred to that as triangulation. You know, one of the things I think is a trap that specialists in medicine can fall into: like, a cardiologist will see everything in terms of the cardiac system. And … whereas a nephrologist will see things through a certain lens.  

And one of the things that you oftentimes see in the responses from a large language model is that more expansive view. At the same time, you know, I wonder … we have medical specialties for good reason. And, you know, at times I do wonder, you know, if there can be confusion that builds up.  

CHEATHAM: This is an under-discussed area of AI—AI collapses medical specialties onto themselves, right. 

You have the canonical example of the cardiologist, you know, arguing that, you know, we should diurese and maybe the nephrologist arguing that we should, you know, protect the kidneys. And how do two disciplines disagree on what is right for the patient when in theory, there is an objective best answer given that patient’s clinical status? 

My understanding is that the emergence of medical specialties was a function of the cognitive overload of medicine in general and how difficult it was to keep all of the specifics of a given specialty in the mind. Of course, general practitioners are tasked with doing this at some level, but they’re also tasked with knowing when they’ve reached their limit and when they need to refer to a specialist.  

So I’m interested in this question of whether medical specialties themselves need to evolve.  

And if we look back in the history of medical technology, there are many times where a new technology forced a medical specialty to evolve, whether it was certain diagnostic tools that have been introduced or, as we’re seeing now with GLP-1s, the entire cardiometabolic field … 

LEE: Right. 

CHEATHAM: … is having to really reimagine itself with these new tools. So I think AI will look very similar, and we should not hold on to this notion of classical medical specialties simply out of convention.  

LEE: Right. All right. So now you’re starting your residency. You’re, you know, basically leading a charge in health and life sciences for a leading venture firm. I’d like you to predict what the world of healthcare is going to look like, you know, two years from now, five years from now, 10 years from now. And to frame that, to make it a little more specific, you know, what do you think will be possible that you, as a doctor and an investor, will be able to do two years from now, five years from now, 10 years from now that you can’t do today?  

CHEATHAM: Two years from now, I’m optimistic we’ll have greater adoption of AI by clinicians, both for back-office use cases. So whether that’s the scribe and the generation of the note for billing purposes, but also now thinking about that for patient-facing applications.  

We’re already doing this with drafting of notes. I think we’ll see greater proliferation of those more obvious use cases over the next two years. And hopefully we’re seeing that across hospital systems, not just large, well-funded academic centers, but really reaching our community hospitals, our rural hospitals, our under-resourced settings.  

I think hopefully we’ll see greater conversion. Right now, we have this challenge of “pilotitis,” right. A lot of people are trying things, but the data shows that only one in three pilots are really converting to production use. So hopefully we’ll kind of move things forward that are working and pare back on those that are not. 

We will not solve the problem of payment models in the next two years. That is a prediction I have.  

Over the next five years, I suspect that, with the help of regulators, we will identify better payment mechanisms to support the adoption of AI because it cannot and will not sustain itself simply by asking health systems and hospitals to pay for it. That is not a scalable solution.  

LEE: Yes. Right. Yep. In fact, I think there have to be new revenue-positive incentives if providers are asked to do more in the adoption of technology. 

CHEATHAM: Absolutely. But as we appreciate, some of the most promising applications of AI have nothing to do with revenue. It might simply be providing a diagnosis to somebody, you know, for whom that might drive additional intervention, but may also not.  

And we have to be OK with that because that’s the right thing to do. It’s our moral imperative as clinicians to implement this where it provides value to the patient.

Over the next 10 years, what I—again, being a techno-optimist—am hopeful we start to see is a dissolving of the barrier that exists between care delivery and biomedical discovery.  

This is the vision of the learning health system that was written over 10 years ago, and we have not realized it in practice. I’m a big proponent of ensuring that every single patient that enters our healthcare system not only receives the best care, but that we learn from the experiences of that individual to help the next. 

And in our current system, that is not how it works. But, with AI, that now becomes possible. 

LEE: Well, I think connecting healthcare experiences to medical discovery—I think that that is really such a great vision for the future. And I do agree [that] AI really gives us real hope that we can make it true.  

Morgan, I think we could talk for a few hours more. It’s just incredible what you’re up to nowadays. Thank you so much for this conversation. I’ve learned a lot talking to you. 

CHEATHAM: Peter, thank you so much for your time. I will be clutching my signed copy of The AI Revolution in Medicine for many years to come.  

[TRANSITION MUSIC] 

LEE: Morgan obviously is not an ordinary med school graduate. In previous episodes, one of the things we’ve seen is that people on the leading edge of real-world AI in medicine oftentimes are both practicing doctors as well as technology developers. Morgan is another example of this type of polymath, being both a med student and a venture capitalist. 

One thing that struck me about Morgan is he’s just incredibly hands-on. He goes out, finds leading-edge tools and technologies, and often these things, even though they’re experimental, he takes them into his education and into his clinical experiences. I think this highlights a potentially important point for medical schools, and that is, it might be incredibly important to provide the support—and, let’s be serious, the permission—to students to access and use new tools and technologies. Indeed, the insight for me when I interact with Morgan is that in these early days of AI in medicine, there is no substitute for hands-on experimentation, and that is likely best done while in medical school.

Here’s my interview with Daniel Chen: 

LEE: Daniel, it’s great to have you here. 

DANIEL CHEN: Yeah, it’s a pleasure being here. 

LEE: Well, you know, I normally get started in these conversations by asking, you know, how do you explain to your mother what you do all day? And the reason that that’s a good question is a lot of the people we have on this podcast have fancy titles and unusual jobs, but I’m guessing that your mother would have already a preconceived notion of what a medical student does. So I’d like to twist the question around a little bit for you and ask, what does your mother not realize about how you spend your days at school?  

Or does she get it all right? [LAUGHS] 

CHEN: Oh, no, she is very observant. I’ll say that off the bat. But I think something that she might not realize is the amount of effort spent, kind of, outside the classroom or outside the hospital. You know, she’s always, like, saying you have such long days in the hospital. You’re there so early in the morning.  

But what she doesn’t realize is that maybe when I come back from the hospital, it’s not just like, oh, I’m done for the day. Let’s wind down, go to bed. But it’s more like, OK, I have some more practice questions I need to get through; I didn’t get through my studying. Let me, like, wrap up this research project I’m working on, get that back to the PI [principal investigator]. It’s never ending to a certain extent. Those are some things she doesn’t realize. 

LEE: Yeah, I think, you know, all the time studying, I think, is something that people expect of second-year medical students. And even nowadays at the top medical schools like this one, being involved in research is also expected.  

I think one thing that is a little unusual is that you are actually in clinic, as well, as a second-year student. How has that gone for you? 

CHEN: Yeah, I mean, it’s definitely interesting. I would say I spend my time, especially this year, it’s kind of three things. There’s the preclinical stuff I’m doing. So that’s your classic, you know, you’re learning from the books, though I don’t feel like many of us do have textbooks anymore. [LAUGHTER] 

There’s the clinical aspect, which you mentioned, which is we have an interesting model, longitudinal integrated clerkships. We can talk about that. And the last component is the research aspect, right. The extracurriculars.  

But I think starting out as a second year and doing your rotations, probably early on in, kind of, the clinical medical education, has been really interesting, especially with our format, because typically med students have two years to read up on all the material and, like, get some foundational knowledge. With us, it’s a bit more, we have maybe one year under our belt before we’re thrown into like, OK, go talk to this patient; they have ankle pain, right. But we might have not even started talking about ankle pain in class, right. Well, where do I begin?  

So I think starting out, it’s kind of, like, you know, the classic drinking from a fire hydrant. But you also, kind of, have that embarrassment of you’re talking to the patient like, I have no clue what’s happening [LAUGHTER] or you might have … my differentials all over the place, right.  

But I think the beauty of the longitudinal aspect is that now that we’re, like, in our last trimester, everything’s kind of coming together. Like, OK, I can start to see, you know, here’s what you’re telling me. Here’s what the physical exam findings are. I’m starting to form a differential. Like, OK, I think these are the top three things. 

But in addition to that, I think these are the next steps you should take so we can really focus and hone in on what exact diagnosis this might be. 

LEE: All right. So, of course, what we’re trying to get into is about AI.  

And, you know, the funny thing about AI and the book that Carey, Zak, and I wrote is we actually didn’t think too much about medical education, although we did have some parts of our book where we, well, first off, we made the guess that medical students would find AI to be useful. And we even had some examples, you know, where, you know, you would have a vignette of a mythical patient, and you would ask the AI to pretend to be that patient.  

And then you would have an interaction and have to have an encounter. And so I want to delve into whether any of that is happening. How real it is. But before we do that, let’s get into first off, your own personal contact with AI. So let me start with a very simple question. Do you ever use generative AI systems like ChatGPT or similar? 

CHEN: All the time, if not every day. 

LEE: [LAUGHS] Every day, OK. And when did that start? 

CHEN: I think when it first launched with GPT-3.5, I was, you know, curious. All my friends work in tech. You know, they’re either software engineers, PMs. They’re like, “Hey, Daniel, take a look at this,” and at first, I thought it was just more of a glorified search engine. You know, I was actually looking back.  

My first question to ChatGPT was, what was the weather going to be like the next week, you know? Something very, like, something easily you could have looked up on Google or your phone app, right.  

I was like, oh, this is pretty cool. But then, kind of, fast-forwarding to, I think, the first instance I was using it in med school. I think the first, like, thing that really helped me was actually a coding problem. It was for a research project. I was trying to use SQL.  

Obviously, I’ve never taken a SQL class in my life. So I asked Chat like, “Hey, can you write me this code to maybe morph two columns together,” right? Something that might have taken me hours to maybe Google, look up on YouTube, or, like, try to read some documentation which just goes over my head.

But ChatGPT was able to, you know, not only produce the code, but, like, walk me through like, OK, you’re going to launch SQL. You’re going to click on this menu, [LAUGHTER] put the code in here, make sure your file names are correct. And it worked.  

So it’s been a very powerful tool in that way in terms of, like, giving me expertise in something that maybe I traditionally had no training in. 

LEE: And so while you’re doing this, I assume you had fellow students, friends, and others. And so what were you observing about their contact with AI? I assume you weren’t alone in this. 

CHEN: Yeah, yeah, I think, … I’m not too sure in terms of what they were doing when it first came out, but I think if we were talking about present day, um, a lot of it’s kind of really spot on to what you guys talked about in the book.  

Um, I think the idea around this personal tutor, personal mentor, is something that we’re seeing a lot. Even if we’re having in-class discussions, the lecturer might be saying something, right. And then I might be in ChatGPT, or I see a friend in ChatGPT or some other model, looking up a question.  

And you guys talked about, you know, how it can, like, explain a concept at different levels, right. But honestly, sometimes if there’s a complex topic, I ask ChatGPT, like, can you explain this to me as if I was a 6-year-old?  

LEE: Yeah. [LAUGHS]  

CHEN: Breaking down complex topics. Yeah. So I think it’s something that we see in the pre-clinical space, in lecture, but also even in the clinical space, there’s a lot of teaching, as well. 

Sometimes if my preceptor is busy with patients, but I had maybe a question, I would maybe converse with ChatGPT, like, “Hey, what are your thoughts about this?” Or, like, a common one is, like, medical doctors love to use abbreviations, … 

LEE: Yes.  

CHEN: … and these abbreviations are sometimes only very niche and unique to their specialty, right. [LAUGHTER] 

And I was reading this note from a urogynecologist. [In] the entire first sentence, I think there were, like, 10 abbreviations. Obviously, I compile lists and ask ChatGPT, like, “Hey, in the context of urogynecology, can you define what these could possibly mean,” right? Instead of hopelessly searching on Google or maybe, embarrassingly, asking the preceptor. So in these instances, it’s played a huge role. 

LEE: Yeah. And when you’re doing things like that, it can make mistakes. And so what are your views of the reliability of generative AI, at least in the form of ChatGPT? 

CHEN: Yeah, I think in the context of medicine, right, we worry a lot about the hallucinations that these models might have. And it’s something I’m always checking for. When I talk with peers about this, we find it most helpful when the model gives us a source linking it back. I think the gold standard nowadays in medicine is using something called UpToDate that’s written by clinicians, for clinicians. 

But sometimes searching on UpToDate can take a lot of time as well because it’s a lot of information to, like, sort through. But nowadays a lot of us are using something called OpenEvidence, which is also an LLM. But they always back their answers with citations to, like, published literature, right.  

So I think it’s being able to be conscious of the downfalls of these models and also being able to have the critical understanding to, like, analyze the actual literature. I think double checking is just something that we’ve also been getting really good at. 

LEE: How would you assess student attitudes—med student attitudes—about AI? Is it … the way you’re coming across is it’s just a natural part of life. But do people have firm opinions, you know, pro or con, when it comes to AI, and especially AI in medicine? 

CHEN: I think it’s pretty split right now. I think there’s the half, kind of, like us, where we’re very optimistic—cautiously optimistic about, you know, the potential of this, right. It’s able to, you know, give us that extra information, of being that extra tutor, right. It’s also able to give us information very quickly, as well.   

But I think the flip side, which a lot of students hesitate about, and which I agree with, is this loss of the ability to critically think. Something that you can easily do is, you know, give these models, like, relevant information about the patient history and be like, “Give me a 10-item differential,” right.

LEE: Yes.  

CHEN: And I think it’s very easy as a student to, you know, [say], “This is difficult. Let me just use what the model says, and we’ll go with that,” right. 

So I think being able to separate that, you know, medical school is a time where, you know, you’re learning to become a good doctor. And part of that requires the ability to be observant and critically think. Having these models simultaneously might hinder the ability to do that.  

LEE: Yeah. 

CHEN: So I think, you know, the next step is, like, these models can be great—a great tool, absolutely wonderful. But how do you make sure that it’s not hindering these abilities to critically think? 

LEE: Right. And so when you’re doing your LIC [longitudinal integrated clerkship] work, these longitudinal experiences, and you’re in clinic, are you pulling the phone out of your pocket and consulting with AI? 

CHEN: Definitely. And I think my own policy for this, to kind of counter this, is that the night before when I’m looking over the patient list, the clinic [schedule] of who’s coming, I’m always giving it my best effort first.  

Like, OK, the chief complaint is maybe just a runny nose for a kid in a pediatric clinic. What could this possibly be? Right? At this point, we’ve seen a lot. Like, OK, it could be URI [upper respiratory infection], it could be viral, it could be bacterial, you know, and then I go through the—you know, I try to do my due diligence of, like, going through the history and everything like that, right. 

But sometimes if it’s a more complex case, something maybe a presentation I’ve never seen before, I’ll still kind of do my best coming up with maybe a differential that might not be amazing. But then I’ll ask, you know, ChatGPT like, OK, in addition to these ideas, what do you think?  

LEE: Yeah. 

CHEN: Am I missing something? You know, and usually, it gives a pretty good response. 

LEE: You know, that particular idea is something that I think Carey, Zak, and I thought would be happening a lot more today than we’re observing. And it’s the idea of a second set of eyes on your work. And somehow, at least our observation is that that isn’t happening quite as much by today as we thought it might.  

And it just seems like one of the really safest and most effective use cases. When you go and you’re looking at yourself and other fellow medical students, other second-year students, what do you see when it comes to the “second set of eyes” idea? 

CHEN: I think, like, a lot of students are definitely consulting ChatGPT in that regard because, you know, even in the very beginning, we’re taught to be, like, never miss these red flags, right. So these red flags are always on our differential, but sometimes, it can be difficult to figure out where to place them on that, right.  

So I think in addition to, you know, coming up with these differentials, something I’ve been finding a lot of value [in] is just chatting with these tools to get their rationale behind their thinking, you know.  

Something I find really helpful—I think this is also a part of the, kind of, art of medicine—is figuring out what to order, right, what labs to order.  

LEE: Right.

CHEN: Obviously, you have your order sets that automate some of the things, like in the ED [emergency department], or, like, there are some gold standard imaging things you should do for certain presentations. 

But then you chat to, like, 10 different physicians on maybe the next steps after that, and they give you 10 different answers.  

LEE: Yes.  

CHEN: But there’s never … I never understand exactly why. It’s always like, I’ve just been doing this for all my training, or that’s how I was taught.  

So asking ChatGPT, like, “Why would you do this next?” Or, like, “Is this a good idea?” And seeing the pros and cons has also been really helpful in my learning. 

LEE: Yeah, wow, that’s super interesting. So now, you know, I’d like to get into the education you’re receiving. And, you know, I think it’s fair to say Kaiser Permanente is very progressive in really trying to be very cutting-edge in how the whole curriculum is set up.  

And for the listeners who don’t know this, I’m actually on the board of directors of the school and have been since the founding of the school. And I think one of the reasons why I was invited to be on the board is the school really wanted to think ahead and be cutting edge when it comes to technology.  

So from where I’ve sat, I’ve never been completely satisfied with the amount of tech that has made it into the curriculum. But at the same time, I’ve also made myself feel better about that just understanding that it’s sort of unstoppable, that students are so tech-forward already.  

But I wanted to delve into a little bit here into what your honest opinions are and your fellow students’ opinions are about whether you feel like you’re getting adequate training and background formally as part of your medical education when it comes to things like artificial intelligence or other technologies.  

What do you think? Are you … do you wish the curriculum would change? 

CHEN: Yeah, I think that’s a great question.  

I think from a tech perspective, the school is very good about implementing, you know, opportunities for us to learn. Like, for example, learning how to use Epic, right, or at Kaiser Permanente, what we call HealthConnect, right. These electronic health records. That, my understanding is, a lot of schools maybe don’t teach that.  

That’s something where we get training sessions maybe once or twice a year, like, “Hey, here’s how to make a shortcut in the environment,” right.  

So I think from that perspective, the school is really proactive in providing those opportunities, and they make it very easy to find resources for that, too. I think it … 

LEE: Yeah, I think you’re pretty much guaranteed to be an Epic black belt by the time you [LAUGHS] finish your degree.  

CHEN: Yes, yes.  

But then I think in terms of the aspects of artificial intelligence, I think the school’s taken a more cautiously optimistic viewpoint. They’re just kind of looking around right now.  

Formally in the curriculum, there hasn’t been anything around this topic. I believe the fourth-year students last year got a student-led lecture around this topic.  

But talking to other peers at other institutions, it looks like it’s something that’s very slowly being built into the curriculum, and it seems like a lot of it is actually student-led, you know.  

You know, my friend at Feinberg [School of Medicine] was like we just got a session before clerkship about best practices on how to use these tools.  

I have another friend at Pitt talking about how they’re leading efforts of maybe incorporating some sort of LLM into their in-house curriculum where students can, instead of clicking around the website trying to find the exact slide, they can just ask this tool, like, “OK. We had class this day. They talked about this … but can you provide more information?” and it can pull from that.  

So I think a lot of this, a lot of it is student-driven. Which I think is really exciting because it raises the question, I think, you know, current physicians may not be very well equipped with these tools as well, right? 

So maybe they don’t have a good idea of what exactly is the next steps or what does the curriculum look like. So I think the future in terms of this AI curriculum is really student-led, as well. 

LEE: Yeah, yeah, it’s really interesting.  

I think one of the reasons I think also that that happens is [that] it’s not just necessarily the curriculum that lags but the accreditation standards. You know, accreditation is really important for medical schools because you want to make sure that anyone who holds an MD, you know, is a bona fide doctor, and so accreditation standards are pretty strictly monitored in most countries, including the United States.  

And I think accreditation standards are also—my observation—slow to understand how to adopt or integrate AI. And it’s not meant as a criticism. It’s a big unknown. No one knows exactly what to do and how to do it. And so it’s really interesting to see that, as far as I can tell, I’ve observed the same thing that you just have seen, that most of the innovation in this area about how AI should be integrated into medical education is coming from the students themselves.  

It seems, I think, I’d like to think it’s a healthy development. [LAUGHS]

CHEN: Something tells me maybe the students are a bit better at using these tools, as well.  

You know, I talk to my preceptors because KP [Kaiser Permanente] also has their own version … 

LEE: Preceptor, maybe we should explain what that is. 

CHEN: Yeah, sorry. So a preceptor is an attending physician, fully licensed, finished residency, and they are essentially your kind of teacher in the clinical environment.  

So KP has their own version of some ambient documentation device, as well. And something I always like to ask, you know, like, “Hey, what are your thoughts on these tools,” right?  

And it’s always so polarizing, as well, even among the same specialty. Like, if you ask psychiatrists, which I think is a great use case of these tools, right. My preceptor hates it. Another preceptor next door loves it. [LAUGHTER] 

So I think a lot of it’s, like, it’s still, like, a lot of unknowns, like you were mentioning. 

LEE: Right. Well, in fact, I’m glad you brought that up because one thing that we’ve been hearing from previous guests a lot when it comes to AI in clinic is about ambient listening by AI, for example, to help set up a clinical note or even write a clinical note.  

And another big use case that we heard a lot about that seems to be pretty popular is the use of generative AI to respond to patient messages.  

So let’s start with the clinical note thing. First off, do you have opinions about that technology? 

CHEN: I think it’s definitely good.  

I think especially where, you know, if you’re in the family medicine environment or pediatric environment where you’re spending so much time with patients, a note like that is great, right. 

I think coming from a strictly medical student standpoint, I think it’s—honestly, it’d be great to have—but I think there’s a lot of learning when you write the note, you know. There’s a lot of, you know, all of my preceptors talk about, like, when I read your note, you should present it in a way where I can see your thoughts and then once I get to the assessment and plan, it’s kind of funneling down towards a single diagnosis or a handful of diagnoses. And that’s, I think, a skill that requires you to practice over time, right.  

So a part of me thinks, like, if I had this tool where [it] can just automatically give me a note as a first year, then it takes away from that learning experience, you know. 

Even during our first year throughout school, we frequently get feedback from professors and doctors about these notes. And it’s a lot of feedback. [LAUGHTER] It’s like, “I don’t think you should have written that,” “That should be in this section” … you know, like a medical note or a SOAP note [Subjective, Objective, Assessment, and Plan], where, you know, the subjective is, like, what the patient tells you. Objective is what the physical findings are, and then your assessment of what’s happening, and then your plan. Like, it’s very particular, and then I think medicine is so structured in a way, that’s kind of, like, how everyone does it, right. So kind of going back to the question, I think it’s a great tool, but I don’t think it’s appropriate for a medical student. 

LEE: Yeah, it’s so interesting to hear you say that. I was … one of our previous guests is the head of R&D at Epic, Seth Hain. He said, “You know, Peter, doctors do a lot of their thinking when they write the note.” 

And, of course, Epic is providing ambient, you know, clinical notetaking automation. But he was urging caution because, you know, you’re saying, well, this is where you’re learning a lot. But actually, it’s also a point where, as a doctor, you’re thinking about the patient. And we do probably have to be careful with how we automate parts of that.  

All right. So you’re gearing up for Step 1 of the USMLE [United States Medical Licensing Examination]. That’ll be a big multiple-choice exam. Then Step 2 is similar: very, very focused on advanced clinical knowledge. And then Step 3, you know, is a little more interactive.  

And so one question that people have had about AI is, you know, how do we regulate the use of AI in medicine? And one of the famous papers that came out of both academia and industry was the concept that you might be able to treat AI like a person and have it go through the same licensing. And this is something that Carey, Zak, and I contemplated in our book.  

In the end, at the time we wrote the book, I personally rejected the idea, but I think it’s still alive. And so I’ve wondered if you have any … you know, first off, are you opinionated at all about what the requirements should be for the allowable use of AI in the kind of work that you’re going to be doing? 

CHEN: Yeah, I think it’s a tough question because, like, where do you draw that line, right? If you apply the human standard of passing exams, then yes, in theory, it could maybe be a medical doctor, as well, right? It’s more empathetic than medical doctors, right? So where do you draw that line?  

I think, you know, part of me thinks maybe it is that human aspect that patients like to connect with, right. And maybe this really is just, like, these tools are just aids in helping, you know, maybe offload some of the cognitive load, right.  

But I think the other part of me, I’m thinking about the next generation who are growing up with this technology, right. They’re interacting with applications all day. Maybe they’re on their iPads. They’re talking to chatbots. They’re using ChatGPT. This is, kind of, the environment they grew up with. Does that mean they also have increased, like, trust in these tools that maybe our generation or the generations above us don’t have? Would they value human connection less?  

You know, I think those are some troubling thoughts that, you know, yes, at the end of the day, maybe I’m not as smart as these tools, but I can still provide that human comfort. But if, at the end of the day, the future generation doesn’t really care about that, or they perfectly trust these tools because that’s all they’ve kind of known, then where do human doctors stand?  

I think part of that is, there would be certain specialties where maybe the human connection is more important. The longitudinal aspect of building that trust, I think is important. Family medicine is a great example. I think hematology oncology with cancer treatment.  

Obviously, I think no one’s going to be thrilled to hear a cancer diagnosis, but something tells me that seeing that on a screen versus maybe a physician prompting you and telling you about it is different, and that maybe in those aspects, you know, the human nature, the human touch plays an important role there, too. 

LEE: Yeah, you know, I think it strikes me that it’s going to be your generation that really is going to set the pattern probably for the next 50 years about how this goes. And it’s just so interesting because I think a lot will depend on your reactions to things.  

So, for example, you know, one thing that is already starting to happen are patients who are coming in armed, you know, with a differential [LAUGHS], you know, that they’ve developed themselves with the help of ChatGPT. So let me … you must have thought about these things. So, in fact, has it happened in your clinical work already? 

CHEN: Yeah, I’ve seen people come into the ED during my ED shift, like emergency department, and they’ll be like, “Oh, I have neck pain and here are all the things that, you know, Chat told me, ChatGPT told me. What do you think … do I need? I want this lab ordered, that lab ordered.”  

LEE: Right. 

CHEN: And I think my initial reaction is, “Great. Maybe we should do that.” But I think the other reaction is understanding that not everyone has the clinical background of understanding what’s most important, what do we need to absolutely rule out, right?   

So, I think in some regards, I would think that maybe ChatGPT errs on the side of caution, … 

LEE: Yes.  

CHEN: … giving maybe patients more extreme examples of what this could be just to make sure that it’s, in a way, is not missing any red flags as well, right.  

LEE: Right. Yeah.  

CHEN: But I think a lot of this is … what we’ve been learning is it’s all about shared decision making with the patient, right. Being able to acknowledge like, “Yeah, [in] that list, most of the stuff is very plausible, but maybe you didn’t think about this one symptom you have.”  

So I think part of it, maybe it’s a sidebar here, is the idea of prompting, right. You know, they’ve always talked about all these, you know, prompt engineers, you know, how well can you, like, give it context to answer your question? 

LEE: Yeah. 

CHEN: So I think it’s being able to give these models the correct information and the relevant information, and the key word is relevant, because relevant is, I guess, where your clinical expertise comes in. Like, what do you give the model, what do you not give? So I think that difference between a medical provider versus maybe your patients is ultimately the difference. 

LEE: Let me press on that a little bit more because you brought up the issue of trust, and trust is so essential for patients to feel good about their medical care.  

And I can imagine you’re a medical student seeing a patient for the first time. So you don’t have a trust relationship with that patient. And the patient comes in maybe trusting ChatGPT more than you. 

CHEN: Very valid. No. I mean, I get that a lot, surprisingly, you know. [LAUGHTER] Sometimes [they’re] like, “Oh, I don’t want to see the medical student,” because we always give the patient an option, right. Like, it’s their time, whether it’s a clinic visit.  

But yeah, those patients, I think it’s perfectly reasonable. If I heard a second-year medical student was going to be part of my care team, taking that history, I’d be maybe a little bit concerned, too. Like, are they asking all the right questions? Are they relaying that information back to their attending physician correctly?  

So I think a lot of it is, at least from a medical student perspective, is framing it so the patient understands that this is a learning opportunity for the students. And something I do a lot is tell them like, “Hey, like, you know, at the end of the day, there is someone double-checking all my work.”  

LEE: Yeah. 

CHEN: But for those that come in with a list, I sometimes sit down with them, and we’ll have a discussion, honestly.  

I’ll be like, “I don’t think you have meningitis because you’re not having a fever. Some of the physical exam maneuvers we did were also negative. So I don’t think you have anything to worry about that,” you know.  

So I think it’s having that very candid conversation with the patient that helps build that initial trust. Telling them like, “Hey … ” 

LEE: It’s impressive to hear how even-keeled you are about this. You know, I think, of course, and you’re being very humble saying, well, you know, as a second-year medical student, of course, someone might not, you know, have complete trust. But I think that we will be entering into a world where no doctor, no matter how experienced or how skilled, is going to be immune from this issue. 

So we’re starting to run toward the end of our time together. And I like to end with one or two more provocative questions.  

And so let me start with this one. Undoubtedly, I mean, you’re close enough to tech and digital stuff, digital health, that you’re undoubtedly familiar with famous predictions, you know, by Turing [Award] and Nobel laureates that someday certain medical specialties, most notably radiology, would be completely supplanted by machines. And more recently, there have been predictions by others, like, you know, Elon Musk, that maybe even some types of surgery would be replaced by machines.  

What do you think? Do you have an opinion? 

CHEN: I think replace is a strong term, right. To say that doctors are completely obsolete, I think, is unlikely.  

If anything, I think there might be a shift maybe in what it means to be a doctor, right. Undoubtedly, maybe the demands of radiologists are going to go down because maybe more of the simple things can truly be automated, right. And you just have a supervising radiologist whose output is maybe 10 times that of a single radiologist, right.  

So I definitely see a future where the demand of certain specialties might go down.  

And I think when I talk about a shift of what it means to be a physician, maybe it’s not so much diagnostic anymore, right, if these models get so good at, like, just taking in large amounts of information. Maybe it pivots to being really good at understanding the limitations of these models and knowing when to intervene; that’s what it means to be, kind of, the next generation of physicians.  

I think in terms of surgery, yeah, I think it’s a concern, but maybe not in the next 50 years. Like those da Vinci robots are great. I think out of Mayo Clinic, they were demoing some videos of these robots leveraging computer vision to, like, close port sites, like laparoscopic scars. And that’s something I do in the OR [operating room], right. And we’re at the same level at this point. [LAUGHTER] So at that point, maybe.  

But I think robotics still has to address the understanding of like, what if something goes wrong, right? Who’s responsible? And I don’t see a future where a robot is able to react to these, you know, dangerous situations when maybe something goes wrong. You still have to have a surgeon on board to, kind of, take over. So in that regard, that’s kind of where I see maybe the future going. 

LEE: So last question. You know, when you are thinking about the division of time, one of the themes that we’ve seen in the previous guests is more and more doctors are doing more technology work, like writing code and so on. And more and more technologists are thinking deeply and getting educated in clinical and preclinical work.  

So for you, let’s look ahead 10 years. What do you see your division of labor to be? Or, you know, how would you … what would you tell your mom then about how you spend a typical day? 

CHEN: Yeah, I mean, I think for me, technology is something I definitely want to be involved in in my line of work, whether it’s, you know, AI work, whether it’s improving quality of healthcare through technology.  

My perfect division would be maybe still being able to see patients but also balancing some maybe more of these higher-level kind of larger projects. But I think having that division would be something nice. 

LEE: Yeah, well, I think you would be great just from the little bit I know about you. And, Daniel, it’s been really great chatting with you. I wish you the best of luck, you know, with your upcoming exams and getting past this year two of your medical studies. And perhaps someday I’ll be your patient. 

[TRANSITION MUSIC]  

CHEN: Thank you so much. 

LEE: You know, one of the lucky things about my job is that I pretty regularly get to talk to students at all levels, spanning high school to graduate school. And when I get to talk especially to med students, I’m always impressed with their intelligence, just how serious they are, and their high energy levels. Daniel is absolutely a perfect example of all that.  

Now, it comes across as trite to say that the older generation is less adept at technology adoption than younger people. But actually, there probably is a lot of truth to that. And in the conversation with Daniel, I think he was actually being pretty diplomatic but also clear that he and his fellow med students don’t necessarily expect the professors in their med school to understand AI as well as they do. 

There’s no doubt in my mind that medical education will have to evolve a lot to help prepare doctors and nurses for an AI future. But where will this evolution come from?  

As I reflect on my conversations with Morgan and Daniel, I start to think that it’s most likely to come from the students themselves. And when you meet people like Morgan and Daniel, it’s impossible to not be incredibly optimistic about the next generation of clinicians. 

[THEME MUSIC] 

Another big thank-you to Morgan and Daniel for taking time to share their experiences with us. And to our listeners, thank you for joining us. We have just a couple of episodes left, one on AI’s impact on the operation of public health departments and healthcare systems and another coauthor roundtable. We hope you’ll continue to tune in.  

Until next time. 

[MUSIC FADES] 

The post Navigating medical education in the era of generative AI appeared first on Microsoft Research.
