Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore

While AI has rapidly advanced in recent years, one challenge remains stubbornly unresolved: how to move promising algorithmic models from controlled experiments into practical, real-world use. The effort to balance algorithmic innovation with real-world application has been a consistent theme in the career of Xinxing Xu, principal researcher at Microsoft Research Asia – Singapore, and also represents one of the foundational pillars of the newly established Singapore lab.

Xinxing Xu, Principal Researcher, Microsoft Research Asia – Singapore

“Innovative algorithms can only demonstrate their true value when tested with real-world data and in actual scenarios, where they can be continuously optimized through iteration,” he says.

Xu’s commitment to balancing algorithmic innovation with practical application has shaped his entire career. During his PhD studies at Nanyang Technological University, Singapore, Xu focused on emerging technologies like multiple kernel learning methods and multimodal machine learning. Today he’s applying these techniques to real-world use cases like image recognition and video classification.

After completing his doctorate, he joined the Institute of High Performance Computing at Singapore’s Agency for Science, Technology and Research (A*STAR), where he worked on interdisciplinary projects ranging from medical image recognition to AI systems for detecting defects on building facades. These experiences broadened his perspective and deepened his passion for translating AI into real-world impact.

In 2024, Xu joined Microsoft Research Asia, where he began a new chapter focused on bridging academic research and real-world AI applications.

“Microsoft Research Asia is committed to integrating scientific exploration with real-world applications, which creates a unique research environment,” Xu says. “It brings together top talent and resources, and Microsoft’s engineering and product ecosystem strongly supports turning research into impactful technology. The lab’s open and inclusive culture encourages innovation with broader societal impact. It reflects the approach to research I’ve always hoped to contribute to.”

Bringing cross-domain expertise to AI’s real-world frontiers

As a key hub in Microsoft Research’s network across Asia, the Singapore lab is guided by a three-part mission: to drive industry-transforming AI deployment, pursue fundamental breakthroughs in the field, and promote responsible, socially beneficial applications of the technology.

To reach these goals, Xu and his colleagues are working closely with local collaborators, combining cross-disciplinary expertise to tackle complex, real-world challenges.

Xu draws on his experience in healthcare as he leads the team’s collaboration with Singapore’s SingHealth to explore how AI can support precision medicine. Their efforts focus on leveraging SingHealth’s data and expertise to develop AI capabilities aimed at delivering personalized analysis and enhanced diagnostic accuracy to enable better patient outcomes.

Beyond healthcare, the team is also targeting key sectors like finance and logistics. By developing domain-specific foundation models and AI agents, they aim to support smarter decision-making and accelerate digital transformation across industries. “Singapore has a strong foundation in these sectors,” Xu notes, “making it an ideal environment for technology validation and iteration.”

The team is also partnering with leading academic institutions, including the National University of Singapore (NUS) and Nanyang Technological University, Singapore (NTU Singapore), to advance the field of spatial intelligence. Their goal is to develop embodied intelligence systems capable of carrying out complex tasks in smart environments.

As AI becomes more deeply embedded in everyday life, researchers at the Singapore lab are also increasingly focused on what they call “societal AI”—building AI systems that are culturally relevant and trustworthy within Southeast Asia’s unique cultural and social contexts. Working with global colleagues, they are helping to advance a more culturally grounded and responsible approach to AI research in the region.

Microsoft Research Asia – Singapore: Expanding global reach, connecting regional innovation 

Realizing AI’s full potential requires more than technical breakthroughs. It also depends on collaboration—across industries, academia, and policy. Only through this intersection of forces can AI move beyond the lab to deliver meaningful societal value. 

Singapore’s strengths in science, engineering, and digital governance make it an ideal setting for this kind of work. Its collaborative culture, robust infrastructure, international talent pool, and strong policy support for science and technology make it fertile ground for interdisciplinary research. 

This is why Microsoft Research Asia continues to collaborate closely with Singapore’s top universities, research institutions, and industry partners. These partnerships support joint research, talent development, and technical exchange. Building on this foundation, Microsoft Research Asia – Singapore will further deepen its collaboration with NUS, NTU Singapore, and Singapore Management University (SMU) to advance both fundamental and applied research, while equipping the next generation of researchers with real-world experience. In addition, Microsoft Research Asia is fostering academic exchange and strengthening the research ecosystem through summer schools and joint workshops with NUS, NTU Singapore, and SMU. 

The launch of the Singapore lab further marks an important step in expanding the company’s global research footprint, serving as a bridge between regional innovation and Microsoft’s global ecosystem. Through its integrated lab network, Microsoft Research fosters the sharing of technologies, methods, and real-world insights, creating a virtuous cycle of innovation.

“We aim to build a research hub in Singapore that is globally connected and deeply rooted in the local ecosystem,” Xu says. “Many breakthroughs come from interdisciplinary and cross-regional collaboration. By breaking boundaries—across disciplines, industries, and geographies—we can drive research that has lasting impact.”

As AI becomes more deeply woven into industry and everyday life, Xu believes that meaningful research must be closely connected to regional development and social well-being.

“Microsoft Research Asia – Singapore is a future-facing lab,” he says. “While we push technological frontiers, we’re equally committed to the responsibility of technology—ensuring AI can help address society’s most pressing challenges.”

In a world shaped by global challenges, Xu sees collaboration and innovation as essential to real progress. With Singapore as a launchpad, he and his team are working to extend AI’s impact and value across Southeast Asia and beyond.

Xinxing Xu (center) with colleagues at Microsoft Research Asia – Singapore

Three essential strengths for the next generation of AI researchers

AI’s progress depends not only on technical breakthroughs but also on the growth and dedication of talent. At Microsoft Research Asia, there is a strong belief that bringing research into the real world requires more than technical coordination—it depends on unlocking the full creativity and potential of researchers.

In Singapore—a regional innovation hub that connects Southeast Asia—Xu and his colleagues are working to push AI beyond the lab and into fields like healthcare, finance, and manufacturing. For young researchers hoping to shape the future of AI, this is a uniquely powerful stage.

To help guide the next generation, Xu shares three pieces of advice:

  • Build a strong foundation – “Core knowledge in machine learning, linear algebra, and probability and statistics is the bedrock of AI research,” Xu says. “A solid theoretical base is essential to remain competitive in a rapidly evolving field. Even today’s hottest trends in generative AI rely on longstanding principles of optimization and model architecture design.” While code generation tools are on the rise, Xu emphasizes that mathematical fundamentals remain essential for understanding and innovating in AI.
  • Understand real-world applications – Technical skills alone aren’t enough. Xu encourages young researchers to deeply engage with the problems they’re trying to solve. Only by tightly integrating technology with its context can researchers create truly valuable solutions.

    “In healthcare, for example, researchers may need to follow doctors in clinics to gain a true understanding of clinical workflows. That context helps identify the best entry points for AI deployment. Framing research problems around real-world needs is often more impactful than just tuning model parameters,” Xu says.

  • Develop interdisciplinary thinking – Cross-disciplinary collaboration is becoming essential to AI innovation. Xu advises young researchers to learn how to work with experts from other fields to explore new directions together. “These kinds of interactions often spark fresh, creative ideas,” he says.

    Maintaining curiosity is just as important. “Being open to new technologies and fields is what enables researchers to continually break new ground and produce original results.”

Xu extends an open invitation to aspiring researchers from all backgrounds to join Microsoft Research Asia – Singapore. “We offer a unique platform that blends cutting-edge research with real-world impact,” he says. “It’s a place where you can work on the frontiers of AI—and see how your work can help transform industries and improve lives.”

To learn more about the current opening, please visit: https://jobs.careers.microsoft.com/global/en/job/1849717/Senior-Researcher

The post Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore appeared first on Microsoft Research.

Read More

Technical approach for classifying human-AI interactions at scale


As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale—efficiently, reliably, and in near real-time—is no longer a nice-to-have. It’s essential. The Semantic Telemetry project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.

But building a pipeline that can handle this volume isn’t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.

In this latest post in our series on Semantic Telemetry, we’ll walk through the engineering behind that system—how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we’ll share what it takes to build a real-time LLM classification pipeline.

For additional project background: Semantic Telemetry: Understanding how users interact with AI systems and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project.

System architecture highlights

The Semantic Telemetry pipeline is a highly scalable, highly configurable data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:

  • Hybrid compute engine
    The pipeline combines the distributed power of PySpark with the speed and simplicity of Polars, enabling it to scale across large datasets or run lightweight jobs in Spark-less environments—without code changes. (A minimal sketch of this idea follows Figure 1 below.)
  • LLM-centric transformation layer
    At the core of the pipeline is a multi-stage transformation process tailored for running across multiple LLM endpoints. It:
    • Runs model-agnostic: a generic LLM interface is provided, with model-specific adapters built on top of it.
    • Defines prompt templates using the Prompty language specification for consistency and reuse, with options for users to include custom prompts.
    • Applies parsing and cleaning logic to ensure structured, schema-aligned outputs even when LLM responses are imperfect, such as removing extra characters, resolving near-miss label matches (e.g., “create” versus “created”), and relabeling invalid classifications.
Figure 1. Architecture diagram of the LLM workflow

The pipeline supports multiple classification tasks (e.g., user expertise, topic, satisfaction) through modular prompt templates and configurable execution paths—making it easy to adapt to new use cases or environments.
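To make the hybrid compute engine concrete, here is a minimal sketch of how an engine-selection layer might look, assuming parquet input and a per-row transform (such as an LLM classification call). The function names and I/O details are illustrative assumptions, not the pipeline’s actual code.

```python
# A minimal sketch of a hybrid compute layer: use Spark when a cluster is
# available, otherwise fall back to Polars on a single machine. Function
# names and the parquet-based I/O are illustrative assumptions.
from typing import Callable

import polars as pl


def get_engine(prefer_spark: bool = True) -> Callable:
    """Return a callable that applies a per-row transform to a parquet dataset."""
    if prefer_spark:
        try:
            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()

            def run_with_spark(path: str, transform: Callable[[dict], dict]) -> list:
                df = spark.read.parquet(path)
                # Distribute the transform (e.g., an LLM classification call) across executors.
                return df.rdd.map(lambda row: transform(row.asDict())).collect()

            return run_with_spark
        except ImportError:
            pass  # no Spark available; fall through to the lightweight engine

    def run_with_polars(path: str, transform: Callable[[dict], dict]) -> list:
        df = pl.read_parquet(path)
        # Same transform, applied locally without a cluster.
        return [transform(row) for row in df.iter_rows(named=True)]

    return run_with_polars
```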

Engineering challenges & solutions

Building a high-throughput, LLM-powered classification pipeline at scale introduced a range of engineering challenges—from managing latency and token limits to ensuring system resilience. Below are the key hurdles we encountered and how we addressed them.

LLM endpoint latency & variability

Challenge: LLM endpoints, especially those hosted remotely (e.g., Azure OpenAI), introduce unpredictable latency due to model load, prompt complexity, and network variability. This made it difficult to maintain consistent throughput across the pipeline.

Solution: We implemented a combination of:

  • Multiple Azure OpenAI endpoints in rotation to increase throughput and distribute workload. We can analyze throughput and redistribute as needed.
  • Saving output at intervals, writing data asynchronously so partial results are preserved in case of network errors.
  • Utilizing models with higher tokens-per-minute (TPM) limits, such as OpenAI’s GPT-4o mini, whose 2M TPM limit represents a 25x throughput increase over GPT-4 (80K TPM -> 2M TPM).
  • Timeouts and retries with exponential backoff (a minimal sketch follows this list).
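The sketch below shows how endpoint rotation, timeouts, and jittered exponential backoff might be combined using the openai Python SDK. The endpoint URLs, keys, deployment name, and retry thresholds are placeholders rather than the production pipeline’s values.

```python
# Sketch: rotate across multiple Azure OpenAI endpoints and retry failed
# calls with jittered exponential backoff. Endpoints, keys, deployment
# name, and thresholds are placeholders.
import itertools
import random
import time

from openai import APITimeoutError, AzureOpenAI, RateLimitError

clients = itertools.cycle([
    AzureOpenAI(azure_endpoint="https://eastus-example.openai.azure.com",
                api_key="<key-1>", api_version="2024-06-01"),
    AzureOpenAI(azure_endpoint="https://westus-example.openai.azure.com",
                api_key="<key-2>", api_version="2024-06-01"),
])


def classify(prompt: str, max_retries: int = 5, timeout_s: float = 30.0) -> str:
    delay = 1.0
    for _ in range(max_retries):
        client = next(clients)  # spread load across endpoints
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # deployment name (placeholder)
                messages=[{"role": "user", "content": prompt}],
                timeout=timeout_s,
            )
            return response.choices[0].message.content
        except (RateLimitError, APITimeoutError):
            time.sleep(delay + random.random())  # jitter to avoid thundering herd
            delay *= 2                            # exponential backoff
    raise RuntimeError("LLM call failed after all retries")
```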

Evolving LLM models & prompt alignment

Challenge: Each new LLM release—such as Phi, Mistral, DeepSeek, and successive generations of GPT (e.g., GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o)—brings improvements, but also subtle behavioral shifts. These changes can affect classification consistency, output formatting, and even the interpretation of prompts. Maintaining alignment with baseline expectations across models became a moving target.

Solution: We developed a model evaluation workflow to test prompt alignment across LLM versions:

  • Small-sample testing: We ran the pipeline on a representative sample using the new model and compared the output distribution to a known baseline.
  • Distribution analysis: If the new model’s output aligned closely, we scaled up testing. If not, we iteratively tuned the prompts and re-ran comparisons.
  • Interpretation flexibility: We also recognized that a shift in distribution isn’t always a regression. Sometimes it reflects a more accurate or nuanced classification, especially as models improve.

To support this process, we used tools like Sammo, which allowed us to compare outputs across multiple models and prompt variants. This helped us quantify the impact of prompt changes and model upgrades and make informed decisions about when to adopt a new model or adjust our classification schema.

Dynamic concurrency scaling for LLM calls

Challenge: LLM endpoints frequently encounter rate limits and inconsistent response times under heavy usage. The models’ speeds can also vary, complicating the selection of optimal concurrency levels. Furthermore, users may choose suboptimal settings due to lack of familiarity, and default concurrency configurations are rarely ideal for every situation. Dynamic adjustments based on throughput, measured in various ways, can assist in determining optimal concurrency levels.

Solution: We implemented a dynamic concurrency control mechanism that proactively adjusts the number of parallel LLM calls based on real-time system behavior:

  • External task awareness: The system monitors the number of parallel tasks running across the pipeline (e.g., Spark executors or async workers) and uses this to inform the initial concurrency level.
  • Success/failure rate monitoring: The system tracks the rolling success and failure rates of LLM calls. A spike in failures triggers a temporary reduction in concurrency, while sustained success allows for gradual ramp-up.
  • Latency-based feedback loop: Rather than waiting for rate-limit errors, the system measures the response time of LLM calls. If latency increases, it reduces concurrency; if latency decreases and success rates remain high, it cautiously scales up. (A minimal sketch of such a controller follows this list.)
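Here is a minimal sketch of such a controller; the window sizes, latency thresholds, and step sizes are illustrative assumptions, not tuned values from the pipeline.

```python
# Minimal sketch of a latency- and failure-aware concurrency controller.
# Window sizes, thresholds, and step sizes are illustrative assumptions.
from collections import deque


class ConcurrencyController:
    def __init__(self, initial: int = 8, minimum: int = 1, maximum: int = 64):
        self.limit = initial               # current number of allowed parallel LLM calls
        self.minimum = minimum
        self.maximum = maximum
        self.latencies = deque(maxlen=50)  # rolling window of response times (seconds)
        self.failures = deque(maxlen=50)   # rolling window of 0/1 failure flags

    def record(self, latency_s: float, ok: bool) -> None:
        """Call after every LLM request with its latency and success flag."""
        self.latencies.append(latency_s)
        self.failures.append(0 if ok else 1)
        self._adjust()

    def _adjust(self) -> None:
        if len(self.latencies) < 10:
            return  # not enough signal yet
        failure_rate = sum(self.failures) / len(self.failures)
        avg_latency = sum(self.latencies) / len(self.latencies)
        if failure_rate > 0.10 or avg_latency > 20.0:
            # Back off quickly when errors spike or latency climbs.
            self.limit = max(self.minimum, self.limit // 2)
        elif failure_rate < 0.02 and avg_latency < 5.0:
            # Ramp up cautiously while calls stay fast and successful.
            self.limit = min(self.maximum, self.limit + 1)
```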

Optimization experiments

To further improve throughput and efficiency, we ran a series of optimization experiments. Each approach came with trade-offs that we carefully measured.

Batch endpoints (Azure/OpenAI)

Batch endpoints are a cost-effective, moderately high-throughput way of executing LLM requests. Batch endpoints process large lists of LLM prompts over a 24-hour period, recording responses in a file. They are about 50% cheaper than non-batch endpoints and have separate token limits, enabling increased throughput when used alongside regular endpoints. However, they can take up to 24 hours to complete requests and provide lower overall throughput compared to non-batch endpoints, making them unsuitable for situations needing quick results.
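For reference, submitting a batch job through the OpenAI-style batch API looks roughly like the sketch below; the file name, custom_id scheme, and model name are placeholders, and Azure OpenAI’s global batch feature follows the same general shape with deployment-specific details.

```python
# Sketch of submitting a batch job via the OpenAI-style batch API. Each line
# of the JSONL file is one classification request; file name, custom_id
# scheme, and model name are placeholders.
import json

from openai import OpenAI

client = OpenAI()

# One request per sampled conversation.
with open("classification_batch.jsonl", "w") as f:
    for i, conversation in enumerate(["<conversation text 1>", "<conversation text 2>"]):
        f.write(json.dumps({
            "custom_id": f"conv-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user",
                              "content": f"Classify the user's expertise:\n{conversation}"}],
            },
        }) + "\n")

batch_file = client.files.create(file=open("classification_batch.jsonl", "rb"), purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive within 24 hours
)
print(batch_job.id, batch_job.status)
```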

Conversation batching in prompts during pipeline runtime

Batching multiple conversations for classification at once can significantly increase throughput and reduce token usage, but it may impact the accuracy of results. In our experiment with a domain classifier, classifying 10 conversations simultaneously led to an average of 15-20% of domain assignments changing between repeated runs of the same prompt. To address this, one mitigation approach is to use a grader LLM prompt: first classify the batch, then have the LLM identify any incorrectly classified conversations, and finally re-classify those as needed. While batching offers efficiency gains, it is important to monitor for potential drops in classification quality.
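As one illustration of this batching-plus-grader pattern, the prompt assembly might look like the following sketch. The wording is a placeholder and not the project’s actual Prompty templates.

```python
# Illustrative prompt assembly for batching several conversations into a
# single classification call, followed by a grader pass over the results.
# The wording below is a placeholder, not the project's Prompty templates.
def build_batch_prompt(conversations: list[str], labels: list[str]) -> str:
    numbered = "\n\n".join(
        f"[Conversation {i + 1}]\n{text}" for i, text in enumerate(conversations)
    )
    return (
        f"Classify each conversation into one of: {', '.join(labels)}.\n"
        "Return one line per conversation in the form '<index>: <label>'.\n\n"
        + numbered
    )


def build_grader_prompt(conversations: list[str], predictions: list[str]) -> str:
    paired = "\n\n".join(
        f"[Conversation {i + 1}] predicted label: {pred}\n{text}"
        for i, (text, pred) in enumerate(zip(conversations, predictions))
    )
    return (
        "Review the predicted labels below and list the indexes of any "
        "conversations that appear misclassified, or reply 'none'.\n\n" + paired
    )


# Example: classify a batch, then ask the grader which items to re-classify.
batch_prompt = build_batch_prompt(["<conversation 1>", "<conversation 2>"],
                                  ["programming", "travel", "other"])
```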

Combining classifiers in a single prompt

Combining multiple classifiers into a single prompt increases throughput by allowing one call to the LLM instead of multiple calls. This not only multiplies the overall throughput by the number of classifiers processed but also reduces the total number of tokens used, since the conversation text is only passed in once. However, this approach may compromise classification accuracy, so results should be closely monitored.

Classification using text embeddings

An alternative approach is to train custom neural network models for each classifier using only the text embeddings of conversations. This method delivers both cost and time savings by avoiding making multiple LLM requests for every classifier and conversation—instead, the system only needs to request conversation text embeddings once and can reuse these embeddings across all classifier models.

For example, starting with a set of conversations to validate and test the new model, run these conversations through the original prompt-based classifier to generate a set of golden classifications, then obtain text embeddings (using a model such as text-embedding-3-large) for each conversation. These embeddings and their corresponding classifications are used to train a model such as a multi-layer perceptron. In production, the workflow involves retrieving the text embedding for each conversation and passing it through the trained model; if there is a model for each classifier, a single embedding retrieval per conversation suffices for all classifiers.

The benefits of this approach include significantly increased throughput and cost savings—since it’s not necessary to call the LLM for every classifier and conversation. However, this setup can require GPU compute which can increase costs and infrastructure complexity, and the resulting models may not achieve the same accuracy as prompt-based classification methods.
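A rough sketch of this embedding-plus-MLP setup is shown below, using scikit-learn for the downstream models. The endpoint, key, labels, and tiny placeholder dataset are assumptions for illustration only.

```python
# Sketch of the embedding-based alternative: request one embedding per
# conversation, then reuse it across several lightweight classifiers.
# The endpoint, key, labels, and tiny dataset below are placeholders.
import numpy as np
from openai import AzureOpenAI
from sklearn.neural_network import MLPClassifier

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)


def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in response.data])


# "Golden" labels come from a one-time run of the original prompt-based classifiers.
train_conversations = ["How do I fix this Python import error?", "Plan a weekend trip to Tokyo."]
topic_labels = ["programming", "travel"]
expertise_labels = ["intermediate", "novice"]

# One small model per classifier, all trained on the same embeddings.
X_train = embed(train_conversations)
topic_clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X_train, topic_labels)
expertise_clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X_train, expertise_labels)

# In production, a single embedding request per conversation serves every classifier.
X_new = embed(["Why does my for loop never terminate?"])
print(topic_clf.predict(X_new), expertise_clf.predict(X_new))
```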

Prompt compression

Compressing prompts by eliminating unnecessary tokens or by using a tool such as LLMLingua to automate prompt compression can optimize classification prompts either ahead of time or in real-time. This approach increases overall throughput and results in cost savings due to a reduced number of tokens, but there are risks: changes to the classifier prompt or conversation text may impact classification accuracy, and depending on the compression technique, it could even decrease throughput if the compression process takes longer than simply sending uncompressed text to the LLM.

Text truncation

Truncating conversations to a specific length limits the overall number of tokens sent through an endpoint, offering cost savings and increased throughput like prompt compression. By reducing the number of tokens per request, throughput rises because more requests can be made before reaching the endpoint’s tokens-per-minute (TPM) limit, and costs decrease due to fewer tokens being processed. However, the ideal truncation length depends on both the classifiers and the conversation content, so it’s important to assess how truncation affects output quality before implementation. While this approach brings clear efficiency benefits, it also poses a risk: long conversations may have their most important content cut off, which can reduce classification accuracy.
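A simple token-budget truncation along these lines can be written with tiktoken, as in the sketch below; the 2,000-token budget is an arbitrary example that should be tuned per classifier and conversation type.

```python
# Token-budget truncation sketch using tiktoken. The 2,000-token budget is an
# arbitrary example; the right length depends on the classifier and content.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-class models


def truncate_conversation(text: str, max_tokens: int = 2000) -> str:
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Keep the beginning of the conversation; long tails are dropped.
    return encoding.decode(tokens[:max_tokens])
```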

Conclusion

Building a scalable, high-throughput pipeline for LLM-based classification is far from trivial. It requires navigating a constantly shifting landscape of model capabilities, prompt behaviors, and infrastructure constraints. As LLMs become faster, cheaper, and more capable, they’re unlocking new possibilities for real-time understanding of human-AI interactions at scale. The techniques we’ve shared represent a snapshot of what’s working today. But more importantly, they offer a foundation for what’s possible tomorrow.

The post Technical approach for classifying human-AI interactions at scale appeared first on Microsoft Research.

Read More

AI Testing and Evaluation: Reflections


Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In the series finale, Amanda Craig Deckard, senior director of public policy in Microsoft’s Office of Responsible AI, rejoins Sullivan to discuss what Microsoft has learned about testing as a governance tool and what’s next for the company’s work in the AI governance space. The pair explores high-level takeaways (i.e., testing is important and challenging!); the roles of rigor, standardization, and interpretability in making testing a reliable governance tool; and the potential for public-private partnerships to help advance not only model-level evaluation but deployment-level evaluation, too.


Learn more:

Learning from other domains to advance AI evaluation and testing 
Microsoft Research Blog | June 2025 

Responsible AI: Ethical policies and practices | Microsoft AI 

Transcript

[MUSIC] 

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan. 

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use. 

[MUSIC ENDS] 

For our final episode of the series, I’m thrilled to once again be joined by Amanda Craig Deckard, senior director of public policy in Microsoft’s Office of Responsible AI. 

Amanda, welcome back to the podcast!


AMANDA CRAIG DECKARD: Thank you so much.

SULLIVAN: In our intro episode, you really helped set the stage for this series. And it’s been great, because since then, we’ve had the pleasure of speaking with governance experts about genome editing, pharma, medical devices, cybersecurity, and we’ve also gotten to spend some time with our own Microsoft responsible AI leaders and hear reflections from them.

And here’s what stuck with me, and I’d love to hear from you on this, as well: testing builds trust; context is shaping risk; and every field is really thinking about striking its own balance between pre-deployment testing and post-deployment monitoring.

So drawing on what you’ve learned from the workshop and the case studies, what headline insights do you think matter the most for AI governance?

CRAIG DECKARD: It’s been really interesting to learn from all of these different domains, and there are, you know, lots of really interesting takeaways. 

I think a starting point for me is actually pretty similar to where you landed, which is just that testing is really important for trust, and it’s also really hard [LAUGHS] to figure out exactly, you know, how to get it right, how to make sure that you’re addressing risks, that you’re not constraining innovation, that you are recognizing that a lot of the industry that’s impacted is really different. You have small organizations, you have large organizations, and you want to enable that opportunity that is enabled by the technology across the board. 

And so it’s just difficult to, kind of, get all of these dynamics right, especially when, you know, I think we heard from other domains, testing is not some, sort of, like, oh, simple thing, right. There’s not this linear path from, like, A to B where you just test the one thing and you’re done. 

SULLIVAN: Right.

CRAIG DECKARD: It’s complex, right. Testing is multistage. There’s a lot of testing by different actors. There are a lot of different purposes for which you might test. As I think it was Dan Carpenter who talked about it’s not just about testing for safety. It’s also about testing for efficacy and building confidence in the right dosage for pharmaceuticals, for example. And that’s across the board for all of these domains, right. That you’re really thinking about the performance of the technology. You’re thinking about safety. You’re trying to also calibrate for efficiency.

And so those tradeoffs, every expert shared that navigating those is really challenging. And also that there were real impacts to early choices in the, sort of, governance of risk in these different domains and the development of the testing, sort of, expectations, and that in some cases, this had been difficult to reverse, which also just layers on that complexity and that difficulty in a different way. So that’s the super high-level takeaway. But maybe if I could just quickly distill, like, three takeaways that I think really are applicable to AI in a bit more of a granular way.

You know, one is about, how is the testing exactly used? For what purpose? And the second is what emphasis there is on this pre- versus post-deployment testing and monitoring. And then the third is how rigid versus adaptive the, sort of, testing regimes or frameworks are in these different domains. 

So on the first—how is testing used?—so is testing something that impacts market entry, for example? Or is it something that might be used more for informing how risk is evolving in the domain and how broader risk management strategies might need to be applied? We have examples, like the pharmaceutical or medical device industry experts with whom you spoke, that’s really, you know, testing … there is a pre-deployment requirement. So that’s one question. 

The second is this emphasis on pre- versus post-deployment testing and monitoring, and we really did see across domains that in many cases, there is a desire for both pre- and post-deployment, sort of, testing and monitoring, but also that, sort of, naturally in these different domains, a degree of emphasis on one or the other had evolved and that had a real impact on governance and tradeoffs. 

And the third is just how rigid versus adaptive these testing and evaluation regimes or frameworks are in these different domains. We saw, you know, in some domains, the testing requirements were more rigid as you might expect in more of the pharmaceutical or medical devices industries, for example. And in other domains, there was this more, sort of, adaptive approach to how testing might get used. So, for example, in the case of our other general-purpose technologies, you know, you spoke with Alta Charo on genome editing, and in our case studies, we also explored this in the context of nanotechnology. In those general-purpose technology domains, there is more emphasis on downstream or application-context testing that is more, sort of, adaptive to the use scenario of the technology and, you know, having that work in conjunction with testing more at the, kind of, level of the technology itself.

SULLIVAN: I want to double-click on a number of the things we just talked about. But actually, before we go too much deeper, a question on if there’s anything that really surprised you or challenged maybe some of your own assumptions in this space from some of the discussions that we had over the series. 

CRAIG DECKARD: Yeah. You know, I know I’ve already just mentioned this pre- versus post-deployment testing and monitoring issue, but it was something that was very interesting to me and in some ways surprised me or made me just realize something that I hadn’t fully connected before, about how these, sort of, regimes might evolve in different contexts and why. And in part, I couldn’t help but bring the context I have from cybersecurity policy into this, kind of, processing of what we learned and reflection because there was a real contrast for me between the pharmaceutical industry and the cybersecurity domain when I think about the emphasis on pre- versus post-deployment monitoring.

And on the one hand, we have in the pharmaceutical domain a real emphasis that has developed around pre-market testing. And there is also an expectation in some circumstances in the pharmaceutical domain for post-deployment testing, as well. But as we learned from our experts in that domain, there has naturally been a real, kind of, emphasis on the pre-market portion of that testing. And in reality, even where post-market monitoring is required and post-market testing is required, it does not always actually happen. And the experts really explained that, you know, part of it is just the incentive structure around the emphasis around, you know, the testing as a pre-market, sort of, entry requirement. And also just the resources that exist among regulators, right. There’s limited resources, right. And so there are just choices and tradeoffs that they need to make in their own, sort of, enforcement work.

And then on the other hand, you know, in cybersecurity, I never thought about the, kind of, emphasis on things like coordinated vulnerability disclosure and bug bounties that have really developed in the cybersecurity domain. But it’s a really important part of how we secure technology and enhance cybersecurity over time, where we have these norms that have developed where, you know, security researchers are doing really important research. They’re finding vulnerabilities in products. And we have norms developed where they report those to the companies that are in a position to address those vulnerabilities. And in some cases, those companies actually pay, through bug bounties, the researchers. And perhaps in some ways, the role of coordinated vulnerability disclosure and bug bounties has evolved the way that it has because there hasn’t been as much emphasis on the pre-market testing across the board at least in the context of software.

And so you look at those two industries and it was interesting to me to study them to some extent in contrast with each other as this way that the incentives and the resources that need to be applied to testing, sort of, evolve to address where there’s, kind of, more or less emphasis.

SULLIVAN: It’s a great point. I mean, I think what we’re hearing—and what you’re saying—is just exactly this choice … like, is there a binary choice between focusing on pre-deployment testing or post-deployment monitoring? And, you know, I think our assumption is that we need to do both. But I’d love to hear from you on that. 

CRAIG DECKARD: Absolutely. I think we need to do both. I’m very persuaded by this inclination always that there’s value in trying to really do it all in a risk management context. 

And also, we know one of the principles of risk management is you have to prioritize because there are finite resources. And I think that’s where we get to this challenge in really thinking deeply, especially as we’re in the early days of AI governance, and we need to be very thoughtful about, you know, tradeoffs that we may not want to be making but we are because, again, these are finite choices and we, kind of, can’t help but put our finger on the dial in different directions with our choices that, you know, it’s going to be very difficult to have, sort of, equal emphasis on both. And we need to invest in both, but we need to be very deliberate about the roles of each and how they complement each other and who does which and how we use what we learn from pre- versus post-deployment testing and monitoring.

SULLIVAN: Maybe just spending a little bit more time here … you know, a lot of attention goes into testing models upstream, but risk often shows up once they’re wired into real products and workflows. How much does deployment context change the risk picture from your perspective? 

CRAIG DECKARD: Yeah, I … such an important question. I really agree that there has been a lot of emphasis to date on, sort of, testing models upstream, the AI model evaluation. And it’s also really important that we bring more attention into evaluation at the system or application level. And I actually see that in governance conversations, this is actually increasingly raised, this need to have system-level evaluation. We see this across regulation. We also see it in the context of just organizations trying to put in governance requirements for how their organization is going to operate in deploying this technology. 

And there’s a gap today in terms of best practices around system-level testing, perhaps even more than model-level evaluation. And it’s really important because in a lot of cases, the deployment context really does impact the risk picture, especially with AI, which is a general-purpose technology, and we really saw this in our study of other domains that represented general-purpose technology. 

So in the case study that you can find online on nanotechnology, you know, there’s a real distinction between the risk evaluation and the governance of nanotechnology in different deployment contexts. So the chapter that our expert on nanotechnology wrote really goes into incredibly interesting detail around, you know, deployment of nanotechnology in the context of, like, chemical applications versus consumer electronics versus pharmaceuticals versus construction and how the way that nanoparticles are basically delivered in all those different deployment contexts, as well as, like, what the risk of the actual use scenario is just varies so much. And so there’s a real need to do that kind of risk evaluation and testing in the deployment context, and this difference in terms of risks and what we learned in these other domains where, you know, there are these different approaches to trying to really think about and gain efficiencies and address risks at a horizontal level versus, you know, taking a real sector-by-sector approach. And to some extent, it seems like it’s more time intensive to do that sectoral deployment-specific work. And at the same time, perhaps there are efficiencies to be gained by actually doing the work in the context in which, you know, you have a better understanding of the risk that can result from really deploying this technology. 

And ultimately, [LAUGHS] really what we also need to think about here is probably, in the end, just like pre- and post-deployment testing, you need both. Not probably; certainly!

So effectively we need to think about evaluation at the model level and the system level as being really important. And it’s really important to get system evaluation right so that we can actually get trust in this technology in deployment context so we enable adoption in low- and in high-risk deployments in a way that means that we’ve done risk evaluation in each of those contexts in a way that really makes sense in terms of the resources that we need to apply and ultimately we are able to unlock more applications of this technology in a risk-informed way.

SULLIVAN: That’s great. I mean, I couldn’t agree more. I think these contexts, the approaches are so important for trust and adoption, and I’d love to hear from you, what do we need to advance AI evaluation and testing in our ecosystem? What are some of the big gaps that you’re seeing, and what role can different stakeholders play in filling them? And maybe an add-on, actually: is there some sort of network effect that could 10x our testing capacity? 

CRAIG DECKARD: Absolutely. So there’s a lot of work that needs to be done, and there’s a lot of work in process to really level up our whole evaluation and testing ecosystem. We learned, across domains, that there’s really a need to advance our thinking and our practice in three areas: rigor of testing; standardization of methodologies and processes; and interpretability of test results. 

So what we mean by rigor is that we are ensuring that what we are ultimately evaluating in terms of risks is defined in a scientifically valid way and we are able to measure against that risk in a scientifically valid way. 

By standardization, what we mean is that there’s really an accepted and well-understood and, again, a scientifically valid methodology for doing that testing and for actually producing artifacts out of that testing that are meeting those standards. And that sets us up for the final portion on interpretability, which is, like, really the process by which you can trust that the testing has been done in this rigorous and standardized way and that then you have artifacts that result from the testing process that can really be used in the risk management context because they can be interpreted, right. 

We understand how to, like, apply weight to them for our risk-management decisions. We actually are able to interpret them in a way that perhaps they inform other downstream risk mitigations that address the risks that we see through the testing results and that we actually understand what limitations apply to the test results and why they may or may not be valid in certain, sort of, deployment contexts, for example, and especially in the context of other risk mitigations that we need to apply. So there’s a need to advance all three of those things—rigor, standardization, and interpretability—to level up the whole testing and evaluation ecosystem. 

And when we think about what actors should be involved in that work … really everybody, which is both complex to orchestrate but also really important. And so, you know, you need to have the entire value chain involved in really advancing this work. You need the model developers, but you also need the system developers and deployers that are really engaged in advancing the science of evaluation and advancing how we are using these testing artifacts in the risk management process. 

When we think about what could actually 10x our testing capacity—that’s the dream, right? We all want to accelerate our progress in this space. You know, I think we need work across all three of those areas of rigor, standardization, and interpretability, but I think one that will really help accelerate our progress across the board is that standardization work, because ultimately, you’re going to need to have these tests be done and applied across so many different contexts, and ultimately, while we want the whole value chain engaged in the development of the thinking and the science and the standards in this space, we also need to realize that not every organization is necessarily going to have the capacity to, kind of, contribute to developing the ways that we create and use these tests. And there are going to be many organizations that are going to benefit from there being standardization of the methodologies and the artifacts that they can pick up and use.

One thing that I know we’ve heard throughout this podcast series from our experts in other domains, including Timo [Minssen] in the medical devices context and Ciaran [Martin] in the cybersecurity context, is that there’s been a recognition, as those domains have evolved, that there’s a need to calibrate our, sort of, expectations for different actors in the ecosystem and really understand that small businesses, for example, just cannot apply the same degree of resources that others may be able to, to do testing and evaluation and risk management. And so the benefit of having standardized approaches is that those organizations are able to, kind of, integrate into the broader supply chain ecosystem and apply their own, kind of, risk management practices in their own context in a way that is more efficient. 

And finally, the last stakeholder that I think is really important to think about in terms of partnership across the ecosystem to really advance the whole testing and evaluation work that needs to happen is government partners, right, and thinking beyond the value chain, the AI supply chain, and really thinking about public-private partnership. That’s going to be incredibly important to advancing this ecosystem.

You know, I think there’s been real progress already in the AI evaluation and testing ecosystem in the public-private partnership context. We have been really supportive of the work of the International Network of AI Safety and Security Institutes[1] and the Center for AI Standards and Innovation that all allow for that kind of public-private partnership on actually testing and advancing the science and best practices around standards. 

And there are other innovative, kind of, partnerships, as well, in the ecosystem. You know, Singapore has recently launched their Global AI Assurance Pilot findings. And that effort really paired application deployers and testers so that consequential impacts at deployment could really be tested. And that’s a really fruitful, sort of, effort that complements the work of these institutes and centers that are more focused on evaluation at the model level, for example.

And in general, you know, I think that there’s just really a lot of benefits for us thinking expansively about what we can accomplish through deep, meaningful public-private partnership in this space. I’m really excited to see where we can go from here with building on, you know, partnerships across AI supply chains and with governments and public-private partnerships. 

SULLIVAN: I couldn’t agree more. I mean, this notion of more engagement across the ecosystem and value chain is super important for us and informs how we think about the space completely. 

If you could invite any other industry to the next workshop, maybe quantum safety, space tech, even gaming, who’s on your wish list? And maybe what are some of the things you’d want to go deeper on? 

CRAIG DECKARD: This is something that we really welcome feedback on if anyone listening has ideas about other domains that would be interesting to study. I will say, I think I shared at the outset of this podcast series, the domains that we added in this round of our efforts in studying other domains actually all came from feedback that we received from, you know, folks we’d engaged with our first study of other domains and multilateral, sort of, governance institutions. And so we’re really keen to think about what other domains could be interesting to study. And we are also keen to go deeper, building on what we learned in this round of effort going forward. 

One of the areas that I am particularly really interested in is going deeper on, what, sort of, transparency and information sharing about risk evaluation and testing will be really useful to share in different contexts? So across the AI supply chain, what is the information that’s going to be really meaningful to share between developers and deployers of models and systems and those that are ultimately using this technology in particular deployment contexts? And, you know, I think that we could have much to learn from other general-purpose technologies like genome editing and nanotechnology and cybersecurity, where we could learn a bit more about the kinds of information that they have shared across the development and deployment life cycle and how that has strengthened risk management in general as well as provided a really strong feedback loop around testing and evaluation. What kind of testing is most useful to do at what point in the life cycle, and what artifacts are most useful to share as a result of that testing and evaluation work?

I’ll say, as Microsoft, we have been really investing in how we are sharing information with our various stakeholders. We also have been engaged with others in industry in reporting what we’ve done in the context of the Hiroshima AI Process, or HAIP, Reporting Framework. This is an effort that is really just in its first round of really exploring how this kind of reporting can be really additive to risk management understanding. And again, I think there’s real opportunity here to look at this kind of reporting and understand, you know, what’s valuable for stakeholders and where is there opportunity to go further in really informing value chains and policymakers and the public about AI risk and opportunity and what can we learn again from other domains that have done this kind of work over decades to really refine that kind of information sharing. 

SULLIVAN: It’s really great to hear about all the advances that we’re making on these reports. I’m guessing a lot of the metrics in there are technical, but sociotechnical impacts—jobs, maybe misinformation, well-being—are harder to score. What new measurement ideas are you excited about, and do you have any thoughts on, like, who needs to pilot those?

CRAIG DECKARD: Yeah, it’s an incredibly interesting question that I think also just speaks to, you know, the breadth of, sort of, testing and evaluation that’s needed at different points along that AI life cycle and really not getting lost in one particular kind of testing or another pre- or post-deployment and thinking expansively about the risks that we’re trying to address through this testing. 

You know, for example, even with the UK’s AI Security Institute that has just recently launched a new program, a new team, that’s focused on societal resilience research. I think it’s going to be a really important area from a sociotechnical impact perspective to bring some focus into as this technology is more widely deployed. Are we understanding the impacts over time as different people and different cultures adopt and use this technology for different purposes? 

And I think that’s an area where there really is opportunity for greater public-private partnership in this research. Because we all share this long-term interest in ensuring that this technology is really serving people and we have to understand the impacts so that we understand, you know, what adjustments we can actually pursue sooner upstream to address those impacts and make sure that this technology is really going to work for all of us and in a way that is consistent with the societal values that we want. 

SULLIVAN: So, Amanda, looking ahead, I would love to hear just what’s going to be on your radar? What’s top of mind for you in the coming weeks?

CRAIG DECKARD: Well, we are certainly continuing to process all the learnings that we’ve had from studying these domains. It’s really been a rich set of insights that we want to make sure we, kind of, fully take advantage of. And, you know, I think these hard questions and, you know, real opportunities to be thoughtful in these early days of AI governance are not, sort of, going away or being easily resolved soon. And so I think we continue to see value in really learning from others, thinking about what’s distinct in the AI context, but also what we can apply in terms of what other domains have learned.

SULLIVAN: Well, Amanda, it has been such a special experience for me to help illuminate the work of the Office of Responsible AI and our team in Microsoft Research, and [MUSIC] it’s just really special to see all of the work that we’re doing to help set the standard for responsible development and deployment of AI. So thank you for joining us today, and thanks for your reflections and discussion.

And to our listeners, thank you so much for joining us for the series. We really hope you enjoyed it! To check out all of our episodes, visit aka.ms/AITestingandEvaluation, and if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI.

See you next time! 

[MUSIC FADES] 


[1] Since the launch of the International Network of AI Safety Institutes, the UK renamed its institute the AI Security Institute.

The post AI Testing and Evaluation: Reflections appeared first on Microsoft Research.

Read More

CollabLLM: Teaching LLMs to collaborate with users


Large language models (LLMs) can solve complex puzzles in seconds, yet they sometimes struggle with simple conversations. When these AI tools make assumptions, overlook key details, or neglect to ask clarifying questions, the result can erode trust and derail real-world interactions, where nuance is everything.

A key reason these models behave this way lies in how they’re trained and evaluated. Most benchmarks use isolated, single-turn prompts with clear instructions. Training methods tend to optimize for the model’s next response, not its contribution to a successful, multi-turn exchange. But real-world interaction is dynamic and collaborative. It relies on context, clarification, and shared understanding.

User-centric approach to training 

To address this, we’re exploring ways to train LLMs with users in mind. Our approach places models in simulated environments that reflect the back-and-forth nature of real conversations. Through reinforcement learning, these models improve through trial and error, for example, learning when to ask questions and how to adapt tone and communication style to different situations. This user-centric approach helps bridge the gap between how LLMs are typically trained and how people actually use them.  

This is the concept behind CollabLLM, recipient of an ICML Outstanding Paper Award. This training framework helps LLMs improve through simulated multi-turn interactions, as illustrated in Figure 1. The core insight behind CollabLLM is simple: in a constructive collaboration, the value of a response isn’t just in its immediate usefulness, but in how it contributes to the overall success of the conversation. A clarifying question might seem like a delay but often leads to better outcomes. A quick answer might appear useful but can create confusion or derail the interaction.

Figure 1. Diagram comparing two training approaches for LLMs. (a) The standard method lacks user-agent collaboration and uses single-turn rewards, leading to an inefficient conversation. (b) In contrast, CollabLLM simulates multi-turn user-agent interactions during training, enabling it to learn effective collaboration strategies and produce more efficient dialogues.

CollabLLM puts this collaborative approach into practice with a simulation-based training loop, illustrated in Figure 2. At any point in a conversation, the model generates multiple possible next turns by engaging in a dialogue with a simulated user.

Figure 2: Simulation-based training process used in CollabLLM. For a given conversational input, the LLM and a user simulator sample conversation continuations, which are scored by a reward model using multiturn-aware rewards; those rewards are then used to update the LLM’s parameters.

The system uses a sampling method to extend conversations turn by turn, choosing likely responses for each participant (the AI agent or the simulated user), while adding some randomness to vary the conversational paths. The goal is to expose the model to a wide variety of conversational scenarios, helping it learn more effective collaboration strategies.


To each simulated conversation, we applied multiturn-aware reward (MR) functions, which assess how the model’s response at the given turn influences the entire trajectory of the conversation. We sampled multiple conversational follow-ups from the model, such as statements, suggestions, and questions, and used MR to assign a reward to each based on how well the conversation performed in later turns. We based these scores on automated metrics that reflect key factors like goal completion, conversational efficiency, and user engagement.

To score the sampled conversations, we used task-specific metrics and metrics from an LLM-as-a-judge framework, which supports efficient and scalable evaluation. For metrics like engagement, a judge model rates each sampled conversation on a scale from 0 to 1.
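As a rough illustration of the LLM-as-a-judge scoring described above, the sketch below asks a judge model to rate one sampled conversation for engagement on a 0-to-1 scale. The call_judge_model callable, the prompt wording, and the parsing logic are assumptions for illustration, not the framework’s actual prompts.

def format_conversation(conversation):
    # Render a list of (speaker, text) turns as plain text for the judge.
    return "\n".join(f"{speaker.upper()}: {text}" for speaker, text in conversation)


def score_engagement(conversation, call_judge_model):
    # Ask an LLM judge for a 0-1 engagement score for one sampled conversation.
    # `call_judge_model` is a hypothetical callable that takes a prompt string
    # and returns the judge model's text response.
    prompt = (
        "Rate how engaging and collaborative the assistant is in the "
        "conversation below, on a scale from 0 to 1. Reply with only a number.\n\n"
        + format_conversation(conversation)
    )
    reply = call_judge_model(prompt)
    try:
        score = float(reply.strip())
    except ValueError:
        score = 0.0  # fall back if the judge returns unparsable text
    return min(max(score, 0.0), 1.0)  # clamp to the [0, 1] range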

The MR of each model response was computed by averaging the scores of the sampled conversations that originate from that response. Based on these scores, the model updates its parameters using established reinforcement learning and preference optimization algorithms such as Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO).
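Putting these pieces together, a simplified version of the multiturn-aware reward computation might look like the sketch below: it averages the scores of the rollouts that start from a candidate response and, as one possible use of those rewards, builds preference pairs that a DPO-style update could consume. The metric names and the equal weighting are illustrative assumptions.

def conversation_score(conversation, metric_fns, weights=None):
    # Combine automated metrics (e.g., goal completion, efficiency, engagement)
    # into a single score for one sampled conversation. `metric_fns` maps
    # metric names to callables returning values in [0, 1].
    weights = weights or {name: 1.0 / len(metric_fns) for name in metric_fns}
    return sum(weights[name] * fn(conversation) for name, fn in metric_fns.items())


def multiturn_aware_reward(rollouts, metric_fns):
    # Average the scores of all sampled conversations that begin with the
    # same candidate model response.
    scores = [conversation_score(r, metric_fns) for r in rollouts]
    return sum(scores) / len(scores)


def build_preference_pairs(candidates):
    # Turn (response_text, reward) tuples for the same conversational context
    # into chosen/rejected pairs that a DPO-style trainer could consume;
    # higher-reward responses are preferred over lower-reward ones.
    pairs = []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for i, (chosen, reward_chosen) in enumerate(ranked):
        for rejected, reward_rejected in ranked[i + 1:]:
            if reward_chosen > reward_rejected:
                pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs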

We tested CollabLLM through a combination of automated and human evaluations, detailed in the paper. One highlight is a user study involving 201 participants in a document co-creation task, shown in Figure 3. We compared CollabLLM to a baseline trained with single-turn rewards and to a second, more proactive baseline prompted to ask clarifying questions and take other proactive steps. CollabLLM outperformed both, producing higher-quality documents, better interaction ratings, and faster task completion times.

Figure 3 shows the main results of our user study on a document co-creation task, comparing a baseline, a proactive baseline, and CollabLLM. CollabLLM outperformed both baselines. Relative to the best baseline, CollabLLM yields a higher document quality rating (+0.12), a higher interaction rating (+0.14), and a reduction in average time spent by the user (-129 seconds).
Figure 3: Results of the user study in a document co-creation task comparing CollabLLM to the two baselines.

Designing for real-world collaboration

Much of today’s AI research focuses on fully automated tasks, with models working without input from or interaction with users. But many real-world applications depend on people in the loop: as users, collaborators, or decision-makers. Designing AI systems that treat user input not as a constraint, but as essential, leads to systems that are more accurate, more helpful, and ultimately more trustworthy.

This work is driven by a core belief: the future of AI depends not just on intelligence, but on the ability to collaborate effectively. And that means confronting the communication breakdowns in today’s systems.

We see CollabLLM as a step in that direction, training models to engage in meaningful multi-turn interactions, ask clarifying questions, and adapt to context. In doing so, we can build systems designed to work with people—not around them.

The post CollabLLM: Teaching LLMs to collaborate with users appeared first on Microsoft Research.

Read More

AI Testing and Evaluation: Learnings from cybersecurity

Illustrated images of Kathleen Sullivan, Ciaran Martin, and Tori Westerhoff for the Microsoft Research podcast

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In this episode, Sullivan speaks with Professor Ciaran Martin of the University of Oxford about risk assessment and testing in the field of cybersecurity. They explore the importance of differentiated standards for organizations of varying sizes, the role of public-private partnerships, and the opportunity to embed security into emerging technologies from the outset. Later, Tori Westerhoff, a principal director on the Microsoft AI Red Team, joins Sullivan to talk about identifying vulnerabilities in AI products and services. Westerhoff describes AI security in terms she’s heard cybersecurity professionals use for their work—a team sport—and points to cybersecurity’s establishment of a shared language and understanding of risk as a model for AI security.

Transcript

[MUSIC]

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan.

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use.

[MUSIC ENDS]

Today, I’m excited to welcome Ciaran Martin to the podcast to explore testing and risk assessment in cybersecurity. Ciaran is a professor of practice in the management of public organizations at the University of Oxford. He had previously founded and served as chief executive of the National Cyber Security Centre within the UK’s intelligence, security, and cyber agency.

And after our conversation, we’ll talk to Microsoft’s Tori Westerhoff, a principal director on Microsoft’s AI Red Team, about how we should think about these insights in the context of AI.

Hi, Ciaran. Thank you so much for being here today.


CIARAN MARTIN: Well, thanks so much for inviting me. It’s great to be here.

SULLIVAN: Ciaran, before we get into some regulatory specifics, it’d be great to hear a little bit more about your origin story, and just take us to that day—who tapped you on the shoulder and said, “Ciaran, we need you to run a national cyber center! Do you fancy building one?”

MARTIN: You could argue that I owe my job to Edward Snowden. Not an obvious thing to say. So the National Cyber Security Centre, which didn’t exist at the time—I was invited to join the British government’s cybersecurity effort in a leadership role—is now a subset of GCHQ. That’s the digital intelligence agency. The equivalent in the US obviously is the NSA [National Security Agency]. It had been convulsed by the Snowden disclosures. It was an unprecedented challenge.

I was a 17-year career government fixer with some national security experience. So I was asked to go out and help with the policy response, the media response, the legal response. But I said, look, any crisis, even one as big as this, is over one way or the other in six months. What should I do long term? And they said, well, we were thinking of asking you to try to help transform our cybersecurity mission. So the National Cyber Security Centre was born, and I was very proud to lead it, and all in all, I did it for seven years from startup to handing it on to somebody else.

SULLIVAN: I mean, it’s incredible. And just building on that, people spend a significant portion of their lives online now with a variety of devices, and maybe for listeners who are newer to cybersecurity, could you give us the 90-second lightning talk? Kind of, what does risk assessment and testing look like in this space?

MARTIN: Well, risk assessment and testing, I think, are two different things. You can’t defend everything. If you defend everything, you’re defending nothing. So broadly speaking, organizations face three threats. One is complete disruption of their systems. So just imagine not being able to access your system. The second is data protection, and that could be sensitive customer information. It could be intellectual property. And the third is, of course, you could be at risk of just straightforward being stolen from. I mean, you don’t want any of them to happen, but you have to have a hierarchy of harm.

SULLIVAN: Yes.

MARTIN: So that’s your risk assessment.

The testing side, I think, is slightly different. One of the paradoxes, I think, of cybersecurity is for such a scientific, data-rich subject, the sort of metrics about what works are very, very hard to come by. So you’ve got boards and corporate leadership and senior governmental structures, and they say, “Look, how do I run this organization safely and securely?” And a cybersecurity chief within the organization will say, “Well, we could get this capability in.” Well, the classic question for a leadership team to ask is, well, what risk and harm will this reduce, by how much, and what’s the cost-benefit analysis? And we find that really hard.

So that’s really where testing and assurance comes in. And also as technology changes so fast, we have to figure out, well, if we’re worried about post-quantum cryptography, for example, what standards does it have to meet? How do you assess whether it’s meeting those standards? So it’s a huge issue in cybersecurity and one that we’re always very conscious of. It’s really hard.

SULLIVAN: Given the scope of cybersecurity, are there any differences in testing, let’s say, for maybe a small business versus a critical infrastructure operator? Are there any, sort of, metrics we can look at in terms of distinguishing risk or assessment?

MARTIN: There have to be. One of the reasons I think why we have to be is that no small business can be expected to take on a hostile nation-state that’s well equipped. You have to be realistic.

If you look at government guidance, certainly in the UK 15 years ago on cybersecurity, you were telling small businesses that are living hand to mouth, week by week, trying to make payments at the end of each month, we were telling them they needed sort of nation-state-level cyber defenses. That was never going to happen, even if they could afford it, which they couldn’t. So you have to have some differentiation. So again, you’ve got assessment frameworks and so forth where you have to meet higher standards. So there absolutely has to be that distinction. Otherwise, you end up in a crazy world of crippling small businesses with just unmanageable requirements which they’re never going to meet.

SULLIVAN: It’s such a great point. You touched on this a little bit earlier, as well, but just cybersecurity governance operates in a fast-moving technology and threat environment. How have testing standards evolved, and where do new technical standards usually originate?

MARTIN: I keep saying this is very difficult, and it is. [LAUGHTER] So I think there are two challenges. One is actually about the balance, and this applies to the technology of today as well as the technology of tomorrow. This is about, how do you make sure things are good enough without crowding out new entrants? You want people to be innovative and dynamic. You want disruptors in this business.

But if you say to them, “Look, well, you have to meet these 14 impossibly high technical standards before you can even sell to anybody or sell to the government,” whatever, then you’ve got a problem. And I think we’ve wrestled with that, and there’s no perfect answer. You just have to try and go to … find the sweet spot between two ends of a spectrum. And that’s going to evolve.

The second point, which in some respects if you’ve got the right capabilities is slightly easier but still a big call, is around, you know, those newer and evolving technologies. And here, having, you know, been a bit sort of gloomy and pessimistic, here I think is actually an opportunity. So one of the things we always say in cybersecurity is that the internet was built and developed without security in mind. And that was kind of true in the ’90s and the noughties, as we call them over here.

But I think as you move into things like post-quantum computing, applied use of AI, and so on, you can actually set the standards at the beginning. And that’s really good because it’s saying to people that these are the things that are going to matter in the post-quantum age. Here’s the outline of the standards you’re going to have to meet; start looking at them. So there’s an opportunity actually to make technology safer by design, by getting ahead of it. And I think that’s the era we’re in now.

SULLIVAN: That makes a lot of sense. Just building on that, do businesses and the public trust these standards? And I guess, which standard do you wish the world would just adopt already, and what’s the real reason they haven’t?

MARTIN: Well, again, where do you start? I mean, most members of the public quite rightly haven’t heard of any of these standards. I think public trust and public capital in any society matters. But I think it is important that these things are credible.

And there’s quite a lot of convergence between, you know, the top-level frameworks. And obviously in the US, you know, the NIST [National Institute of Standards and Technology] framework is the one that’s most popular for cybersecurity, but it bears quite a strong resemblance to the international one, ISO[/IEC] 27001, and there are others, as well. But fundamentally, they boil down to kind of five things. Do a risk assessment; work out what your crown jewels are. Protect your perimeter as best you can. Those are the first two.

The third one then is when your perimeter’s breached, be able to detect it more times than not. And when you can’t do that, you go to the fourth one, which is, can you mitigate it? And when all else fails, how quickly can you recover and manage it? I mean, all the standards are expressed in way more technical language than that, but fundamentally, if everybody adopted those five things and operated them in a simple way, you wouldn’t eliminate the harm, but you would reduce it quite substantially.

SULLIVAN: Which policy initiatives are most promising for incentivizing companies to undertake, you know, these cybersecurity testing parameters that you’ve just outlined? Governments, including the UK, have used carrots and sticks, but what do you think will actually move the needle?

MARTIN: I think there are two answers to that, and it comes back to your split between smaller businesses and critically important businesses. In the critically important services, I think it’s easier because most industries are looking for a level playing field. In other words, they realize there have to be rules and they want to apply them to everyone.

We had a fascinating experience when I was in government back in around 2018 where the telecom sector, they came to us and they said, we’ve got a very good cooperative relationship with the British government, but it needs to be put on a proper legal footing because you’re just asking us nicely to do expensive things. And in a regulated sector, if you actually put in some rules—and please develop them jointly with us; that’s the crucial part—then that will help because it means that we’re not going to our boards and saying, or our shareholders, and saying that we should do this, and they’re saying, “Well, do you have to do it? Are our competitors doing it?” And if the answer to that is, yes, we have to, and, yes, our competitors are doing it, then it tends to be OK.

The harder nut to crack is the smaller business. And I think there’s a real mystery here: why has nobody cracked a really good and easy solution for small business? We need to be careful about this because, you know, you can’t throttle small businesses with onerous regulation. At the same time, we’re not brilliant, I think, in any part of the world at using the normal corporate governance rules to try and get people to figure out how to do cybersecurity.

There are initiatives there that are not the sort of pretty heavy stick that you might have to take to a critical function, but they could help. But that is a hard nut to crack. And I look around the world, and, you know, I think if this was easy, somebody would have figured it out by now. I think most of the developed economies around the world really struggle with cybersecurity for smaller businesses.

SULLIVAN: Yeah, it’s a great point. Actually building on one of the comments you made on the role of, kind of, government, how do you see the role of private-public partnerships scaling and strengthening, you know, robust cybersecurity testing?

MARTIN: I think they’re crucial, but they have to be practical. I’ve got a slight, sort of, high horse on this, if you don’t mind, Kathleen. It’s sort of … [LAUGHS]

SULLIVAN: Of course.

MARTIN: I think that there are two types of public-private partnership. One involves committees saying that we should strengthen partnerships and we should all work together and collaborate and share stuff. And we tried that for a very long time, and it didn’t get us very far. There are other types.

We had some at the National Cyber Security Centre where we paid companies to do spectacularly good technical work that the market wouldn’t provide. So I think it’s sort of partnership with a purpose. I think sometimes, and I understand the human instinct to do this, particularly in governments and big business, they think you need to get around a table and work out some grand strategy to fix everything, and the scale of the … not just the problem but the scale of the whole technology is just too big to do that.

So pick a bit of the problem. Find some ways of doing it. Don’t over-lawyer it. [LAUGHTER] I think sometimes people get very nervous. Oh, well, is this our role? You know, should we be doing this, that, and the other? Well, you know, sometimes certainly in this country, you think, well, who’s actually going to sue you over this, you know? So I wouldn’t over-programmatize it. Just get stuck practically into solving some problems.

SULLIVAN: I love that. Actually, [it] made me think, are there any surprising allies that you’ve gained—you know, maybe someone who you never expected to be a cybersecurity champion—through your work?

MARTIN: Ooh! That’s a … that’s a… what a question! To give you a slightly disappointing answer, but it relates to your previous question. In the early part of my career, I was working in institutions like the UK Treasury long before I was in cybersecurity, and the treasury and the British civil service in general, but the treasury in particular sort of trained you to believe that the private sector was amoral, not immoral, amoral. It just didn’t have values. It just had bottom line, and, you know, its job essentially was to provide employment and revenue then for the government to spend on good things that people cared about. And when I got into cybersecurity and people said, look, you need to develop relations with this cybersecurity company, often in the US, actually. I thought, well, what’s in it for them?

And, sure, sometimes you were paying them for specific services, but other times, there was a real public spiritedness about this. There was a realization that if you tried to delineate public-private boundaries, that it wouldn’t really work. It was a shared risk. And you could analyze where the boundaries fell or you could actually go on and do something about it together. So I was genuinely surprised at the allyship from the cybersecurity sector. Absolutely, I really, really was. And I think it’s a really positive part of certainly the UK cybersecurity ecosystem.

SULLIVAN: Wonderful. Well, we’re coming to the end of our time here, but is there any maybe last thoughts or perhaps requests you have for our listeners today?

MARTIN: I think that standards, assurance, and testing really matter, but it’s a bit like the discussion we’re having over AI. Get all these things to take you 80, 90% of the way and then really apply your judgment. There’s been some bad regulation under the auspices of standards and assurance. First of all, it’s, have you done this assessment? Have you done that? Have you looked at this? Well, fine. And you can tick that box, but what does it actually mean when you do it? What bits that you know in your heart of hearts are really important to the defense of your organization that may not be covered by this and just go and do those anyway. Because sure it helps, but it’s not everything.

SULLIVAN: No. Great, great closing sentiment. Well, Ciaran, thank you for joining us today. This has been just a super fun conversation and really insightful. Just really enjoyed the conversation. Thank you.

MARTIN: My pleasure, Kathleen, thank you.

[TRANSITION MUSIC]

SULLIVAN: Now, I’m happy to introduce Tori Westerhoff. As a principal director on the Microsoft AI Red Team, Tori leads all AI security and safety red team operations, as well as dangerous capability testing, to directly inform C-suite decision-makers.

So, Tori, welcome!

TORI WESTERHOFF: Thanks. I am so excited to be here.

SULLIVAN: I’d love to just start a little bit more learning about your background. You’ve worn some very intriguing hats. I mean, cognitive neuroscience grad from Yale, national security consultant, strategist in augmented and virtual reality … how do those experiences help shape the way you lead the Microsoft AI Red Team?

WESTERHOFF: I always joke this is the only role I think will always combine the entire patchwork LinkedIn résumé. [LAUGHS]

I think I use those experiences to help me understand the really broad approach that AI Red Team—artist also known as AIRT; I’m sure I’ll slip into our acronym—how we frame up the broad security implications of AI. So I think the cognitive neuroscience element really helped me initially approach AI hacking, right. There’s a lot of social engineering and manipulation within chat interfaces that are enabled by AI. And also, kind of, this, like, metaphor for understanding how to find soft spots in the way that you see human heuristics show up, too. And so I think that was actually my personal “in” to getting hooked into AI red teaming generally.

But my experience in national security and I’d also say working through the AR/VR/metaverse space at the time where I was in it helped me balance both how our impact is framed, how we’re thinking about critical industries, how we’re really trying to push our understanding of where security of AI can help people the most. And also do it in a really breakneck speed in an industry that’s evolving all of the time, that’s really pushing you to always be at the bleeding edge of your understanding. So I draw a lot of the energy and the mission criticality and the speed from those experiences as we’re shaping up how we approach it.

SULLIVAN: Can you just give us a quick rundown? What does the Red Team do? What actually, kind of, is involved on a day-to-day basis? And then as we think about, you know, our engagements with large enterprises and companies, how do we work alongside some of those companies in terms of testing?

WESTERHOFF: The way I see our team is almost like an indicator light that works really part and parcel with product development. So the way we’ve organized our expert red teaming efforts is that we work with product development before anything ships out to anyone who can use it. And our job is to act as expert AI manipulators, AI hackers. And we are supposed to take the theories and methods and new research and harness it to find examples of vulnerabilities or soft spots in products to enable product teams to harden those soft spots before anything actually reaches someone who wants to use it.

So if we’re the indicator light, we are also not the full workup, right. I see that as measurement and evals. And we also are not the mechanic, which is that product development team that’s creating mitigations. It’s platform-security folks who are creating mitigations at scale. And there’s a really great throughput of insights from those groups back into our area where we love to inform about them, but we also love to add on to, how do we break the next thing, right? So it’s a continuous cycle.

And part of that is just being really creative and thinking outside of a traditional cybersecurity box. And part of that is also really thinking about how we pull in research—we have a research function within our AI Red Team—and how we automate and scale. This year, we’ve pulled a lot of those assets and insights into the Azure [AI] Foundry AI Red Teaming Agent. And so folks can now access a lot of our mechanisms through that. So you can get a little taste of what we do day to day in the AI Red Teaming Agent.

SULLIVAN: You recently—actually, with your team—published a report that outlined lessons from testing over a hundred generative AI products. But could you share a bit about what you learned? What were some of the important lessons? Where do you see opportunities to improve the state of red teaming as a method for probing AI safety?

WESTERHOFF: I think the most important takeaway from those lessons is that AI security is truly a team sport. You’ll hear cybersecurity folks say that a lot. And part of the rationale there is that defense in depth, and integrating a view toward AI security through the entire development of AI systems, is really the way that we’re going to approach this with intentionality and responsibility.

So in our space, we really focus on novel harm categories. We are pushing bleeding edge, and we also are pushing iterative and, like, contextually based red teaming in product dev. So outside of those hundred that we’ve done, there’s a community [LAUGHS] through the entire, again, multistage life cycle of a product that is really trying to push the cost of attacking those AI systems higher and higher with all of the expertise they bring. So we may be, like, the experts in AI hacking in that line, but there are also so many partners in the Microsoft ecosystem who are thinking about their market context or they really, really know the people who love their products. How are they using it?

And then when you bubble out, you also have industry and government who are working together to push towards the most secure AI implementation for people, right? And I think our team in particular, we feel really grateful to be part of the big AI safety and security ecosystem at Microsoft and also to be able to contribute to the industry writ large.

SULLIVAN: As you know, we had a chance to speak with Professor Ciaran Martin from the University of Oxford about the cybersecurity industry and governance there. What are some of the ideas and tools from that space that are surfacing in how we think about approaching red teaming and AI governance broadly?

WESTERHOFF: Yeah, I think it’s such a broad set of perspectives to bring in, in the AI instance. Something that I’ve noticed interjecting into security at the AI junction, right, is that cybersecurity has so many decades of experience of working through how to build trustworthy computing, for example, or bring an entire industry to bear in that way. And I think that AI security and safety can learn a lot of lessons of how to bring clarity and transparency across the industry to push universal understanding of where the threats really are.

So frameworks coming out of NIST, coming out of MITRE that help us have a universal language that inform governance, I think, are really important because it brings clarity irrespective of where you are looking into AI security, irrespective of your company size, what you’re working on. It means you all understand, “Hey, we are really worried about this fundamental impact.” And I think cybersecurity has done a really good job of driving towards impact as their organizational vector. And I am starting to see that in the AI space, too, where we’re trying to really clarify terms and threats. And you see it in updates of those frameworks, as well, that I really love.

So I think that the innovation is in transparency to folks who are really innovating and doing the work so we all have a shared language, and from that, it really creates communal goals across security instead of a lot of people being worried about the same thing and talking about it in a different way.

SULLIVAN: Mm-hmm. In the cybersecurity context, Ciaran really stressed matching risk frameworks to an organization’s role and scale. Microsoft plays many roles, including building models and shipping applications. How does your red teaming approach shift across those layers? 

WESTERHOFF: I love this question also because I love it as part of our work. So one of the most fascinating things about working on this team has been the diversity of the technology that we end up red teaming and testing. And it feels like we’re in the crucible in that way. Because we see AI applied to so many different architectures, tech stacks, individual features, models, you name it.

Part of my answer is that we still care about the highest-impact things. And so irrespective of the iteration, which is really fascinating and I love, I still think that our team drives to say, “OK, what is that critical vulnerability that is going to affect people in the largest ways, and can we battle test to see if that can occur?”

So in some ways, the task is always the same. I think in the ways that we change our testing, we customize a lot to the access to systems and data and also people’s trust almost as different variables that could affect the impact, right.

So a good example is if we’re thinking through agentic frameworks that have access to functions and tools and preferential ability to act on data, it’s really different to spaces where that action may not be feasible, right. And so I think the tailoring of the way to get to that impact is hyper-custom every time we start an engagement. And part of it is very thesis driven and almost mechanizing empathy.

You almost need to really focus on how people could use, or misuse, a system in such a way that you can emulate it beforehand to give a really great signal to product development, to say this is truly what people could do, and we want to deliver the highest-impact scenarios so you can solve for those and also solve the underlying patterns, actually, that could contribute to maybe that one piece of evidence but also all the related pieces of evidence. So singular drive but like hyper-, hyper-customization to what that piece of tech could do and has access to.

SULLIVAN: What are some of the unexplored testing approaches or considerations from cybersecurity that you think we should encourage AI technologists, policymakers, and other stakeholders to focus on?

WESTERHOFF: I do love that AI humbles us each and every day with new capabilities and the potential for new capabilities. It’s not just saying, “Hey, there’s one test that we want to try,” but more, “Hey, can we create a methodology that we feel really, really solid about so that when we are asked a question we haven’t even thought of, we feel confident that we have the resources and the system?”

So part of me is really intrigued by the process that we’re asked to make without knowing what those capabilities are really going to bring. And then I think tactically, AIRT is really pushing on how we create new research methodologies. How are we investing in, kind of, these longer-term iterations of red teaming? So we’re really excited about pushing out those insights in an experimental and longer-term way.

I think another element is a little bit of that evolution of how industry standards and frameworks are updating to the AI moment and really articulating where AI is either furthering adversarial ability to create those harms or threats or identifying where AI has a net new harm. And I think that demystifies a little bit about what we talked about in terms of the lessons learned, that fundamentally, a lot of the things that we talk about are traditional security vulnerabilities, and we are standing on kind of that cybersecurity shoulder. And I’m starting to see those updates translate in spaces that are already considered trustworthy and kind of the basis on which not only cybersecurity folks build their work but also business decision-makers make decisions on those frameworks.

So to me, integration of AI into those frameworks by those same standards means that we’re evolving security to include AI. We aren’t creating an entirely new industry of AI security and that, I think, really helps anchor people in the really solid foundation that we have in cybersecurity anyways.

I think there’s also some work around how the cyber, like, defenses will actually benefit from AI. So we think a lot about threats because that’s our job. But the other side of cybersecurity is offense. And I’m seeing a ton of people come out with frameworks and methodologies, especially in the research space, on how defensive networks are going to be benefited from things like agentic systems.

Generally speaking, I think the best practice is to realize that we’re fundamentally still talking about the same impacts, and we can use the same avenues, conversations, and frameworks. We just really want them to be crisply updated with that understanding of AI applications.

SULLIVAN: How do you think about bringing others into the fold there? I think those standards and frameworks are often informed by technologists. But I’d love for you to expand [that to] policymakers or other kind of stakeholders in our ecosystem, even, you know, end consumers of these products. Like, how do we communicate some of this to them in a way that resonates and it has an impactful meaning?

WESTERHOFF: I’ve found the AI security-safety space to be one of the more collaborative. I actually think the fact that I’m talking to you today is probably evidence that a ton of people are bringing in perspectives that don’t only come from a long-term cybersecurity view. And I see that as a trend in how AI is being approached opposed to how those areas were moving earlier. So I think that speed and the idea of conversations and not always having the perfect answer but really trying to be transparent with what everyone does know is kind of a communal energy in the communities, at least, where we’re playing. [LAUGHS] So I am pretty biased but at least the spaces where we are.

SULLIVAN: No, I think we’re seeing that across the board. I mean, I’d echo [that] sitting in research, as well, like, that ability to have impact now and at speed to getting the amazing technology and models that we’re creating into the hands of our customers and partners and ecosystem is just underscored.

So on the note of speed, let’s shift gears a little bit to just a quick lightning round. I’d love to get maybe some quick thoughts from you, just 30-second answers here. I’ll start with one.

Which headline-grabbing AI threat do you think is mostly hot air?

WESTERHOFF: I think we should pay attention to it all. I’m a red team lead. I love a good question to see if we can find an answer in real life. So no hot air, just questions.

SULLIVAN: Is there some sort of maybe new tool that you can’t wait to sneak into the red team arsenal?

WESTERHOFF: I think there are really interesting methodologies that break our understanding of cybersecurity by looking at the intersection between different layers of AI and how you can manipulate AI-to-AI interaction, especially now when we’re looking at agentic systems. So I would say a method, not a tool.

SULLIVAN: So maybe ending on a little bit of a lighter note, do you have a go-to snack during an all-night red teaming session?

WESTERHOFF: Always coffee. I would love it to be a protein smoothie, but honestly, it is probably Trader Joe’s elote chips. Like the whole bag. [LAUGHTER] It’s going to get me through. I’m going to not love that I did it.

[MUSIC]

SULLIVAN: Amazing. Well, Tori, thanks so much for joining us today, and just a huge thanks also to Ciaran for his insights, as well.

WESTERHOFF: Thank you so much for having me. This was a joy.

SULLIVAN: And to our listeners, thanks for tuning in. You can find resources related to this podcast in the show notes. And if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI.

See you next time! 

[MUSIC FADES]

The post AI Testing and Evaluation: Learnings from cybersecurity appeared first on Microsoft Research.

Read More

How AI will accelerate biomedical research and discovery

Illustrated images of Peter Lee, Daphne Koller, Noubar Afeyan, and Dr. Eric Topol for the Microsoft Research Podcast

In November 2022, OpenAI’s ChatGPT kick-started a new era in AI. This was followed less than a half year later by the release of GPT-4. In the months leading up to GPT-4’s public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

In this episode, Daphne Koller, Noubar Afeyan, and Dr. Eric Topol, leaders in AI-driven medicine, join Lee to explore the rapidly evolving role of AI across the biomedical and healthcare landscape. Koller, founder and CEO of Insitro, shares how machine learning is transforming drug discovery, especially target identification for complex diseases like ALS, by uncovering biological patterns across massive datasets. Afeyan, founder and CEO of Flagship Pioneering and co-founder and chairman of Moderna, discusses how AI is being applied across biotech research and development, from protein design to autonomous science platforms. Topol, executive vice president of Scripps Research and founder and director of the Scripps Research Translational Institute, highlights how AI can today help mitigate and prevent the core diseases that erode our health and the possibility of realizing a virtual cell. Through his conversations with the three, Lee investigates how AI is reshaping the discovery, deployment, and delivery of medicine.

Transcript 

[MUSIC] [BOOK PASSAGE] 

PETER LEE: “Can GPT-4 indeed accelerate the progression of medicine … ? It seems like a tall order, but if I had been told six months ago that it could rapidly summarize any published paper, that alone would have satisfied me as a strong contribution to research productivity. … But now that I’ve seen what GPT-4 can do with the healthcare process, I expect a lot more in the realm of research.” 

[END OF BOOK PASSAGE] [THEME MUSIC]

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.

[THEME MUSIC FADES]

The book passage I read at the top was from “Chapter 8: Smarter Science,” which was written by Zak.

In writing the book, we were optimistic about AI’s potential to accelerate biomedical research and help get new and much-needed treatments and drugs to patients sooner. One area we explored was generative AI as a designer of clinical trials. We looked at generative AI’s adeptness at summarizing and how it could help speed up pre-trial triage and research. We even went so far as to predict the arrival of a large language model that could serve as a central intellectual tool.

For a look at how AI is impacting biomedical research today, I’m excited to welcome Daphne Koller, Noubar Afeyan, and Eric Topol. 


Daphne Koller is the CEO and founder of Insitro, a machine learning-driven drug discovery and development company that recently made news for its identification of a novel drug target for ALS and its collaboration with Eli Lilly to license Lilly’s biochemical delivery systems. Prior to founding Insitro, Daphne was the co-founder, co-CEO, and president of the online education platform Coursera.

Noubar Afeyan is the founder and CEO of Flagship Pioneering, which creates biotechnology companies focused on transforming human health and environmental sustainability. He is also co-founder and chairman of the messenger RNA company Moderna. An entrepreneur and biochemical engineer, Noubar has numerous patents to his name and has co-founded many startups in science and technology.

Dr. Eric Topol is the executive vice president of the biomedical research non-profit Scripps Research, where he founded and now directs the Scripps Research Translational Institute. One of the most cited researchers in medicine, Eric has focused on promoting human health and individualized medicine through the use of genomic and digital data and AI. 

These three are likely to have an outsized influence on how drugs and new medical technologies will soon be developed.

[TRANSITION MUSIC] 

Here’s my interview with Daphne Koller:

LEE: Daphne, I’m just thrilled to have you join us. 

DAPHNE KOLLER: Thank you for having me, Peter. It’s a pleasure to be here. 

LEE: Well, you know, you’re quite well-known across several fields. But maybe for some audience members of this podcast, they might not have encountered you before. So where I’d like to start is a question I’ve been asking all of our guests.

How would you describe what you do? And the way I kind of put it is, you know, how do you explain to someone like your parents what you do for a living? 

KOLLER: So that answer obviously has shifted over the years.

What I would say now is that we are working to leverage the incredible convergence of very powerful technologies, of which AI is one but not the only one, to change the way in which we discover and develop new treatments for diseases for which patients are currently suffering and even dying. 

LEE: You know, I think I’ve known you for a long time. 

KOLLER: Longer than I think either of us care to admit. 

LEE: [LAUGHS] In fact, I think I remember you even when you were still a graduate student. But of course, I knew you best when you took up your professorship at Stanford. And I always, in my mind, think of you as a computer scientist and a machine learning person. And in fact, you really made a big name for yourself in computer science research in machine learning.

But now you’re, you know, leading one of the most important biotech companies on the planet. How did that happen?

KOLLER: So people often think that this is a recent transition. That is, after I left Coursera, I looked around and said, “Hmm. What should I do next? Oh, biotech seems like a good thing,” but that’s actually not the way it transpired.

This goes all the way back to my early days at Stanford, where, in fact, I was, you know, as a young faculty member in machine learning, because I was the first machine learning hire into Stanford’s computer science department, I was looking for really exciting places in which this technology could be deployed, and applications back then, because of scarcity of data, were just not that inspiring.

And so I looked around, and this was around the late ’90s, and realized that there was interesting data emerging in biology and medicine. My first application actually was, interestingly, in epidemiology—patient tracking and tuberculosis. You know, you can think of it as a tiny microcosm of the very sophisticated models that COVID then enabled in a much later stage.

LEE: Right. 

KOLLER: And so initially, this was based almost entirely on just technical interest. It’s kind of like, oh, this is more interesting as a question to tackle than spam filtering. But then I became interested in biology in its own right, biology and medicine, and ended up having a bifurcated existence as a Stanford professor where half my lab continued to do core computer science research published in, you know, NeurIPS and ICML. And the other half actually did biomedical research that was published in, you know, Nature, Cell, [and] Science. So that was back in, you know, the early, early 2000s, and for most of my Stanford career, I continued to have both interests.

And then the Coursera experience kind of took me out of Stanford and put me in an industry setting for the first time in my life actually. But then when my time at Coursera came to an end, you know, I’d been there for five years. And if you look at the timeline, I left Stanford in early 2012, right as the machine learning revolution was starting. So I missed the beginning.

And it was only in like 2016 or so that, as I picked my head up over the trenches, like, “Oh my goodness, this technology is going to change the world.” And I wanted to deploy that big thing towards places where it would have beneficial impact on the world, like to make the world a better place.

LEE: Yeah. 

KOLLER: And so I decided that one of the areas where I could make a unique, differentiated impact was in really bringing AI and machine learning to the life sciences, having spent, you know, the majority of my career at the boundary of those two disciplines. And notice I say “boundary” with deliberation because there wasn’t very much of an intersection.

LEE: Right. 

KOLLER: I felt like I could do something that was unique. 

LEE: So just to stick on you for a little bit longer, you know, we have been sort of getting into your origin story about what we call AI today—but machine learning, so deep learning. 

And, you know, there has always been a kind of an emotional response for people like you and me and now the general public about their first encounters with what we now call generative AI. I’d love to hear what your first encounter was with generative AI and how you reacted to this. 

KOLLER: I think my first encounter was actually an indirect one. Because, you know, the earlier generations of generative AI didn’t directly touch our work at Insitro.

And yet at the same time, I had always had an interest in computer vision. That was a large part of my non-bio work when I was at Stanford. 

And so some of my earlier even presentations, when I was trying to convey to people back in 2016 how this technology was going to transform the world, I was talking about the incredible progress in image recognition that had happened up until that point. 

So my first interaction was actually in the generative AI for images, where you are able to go the other way … 

LEE: Yes. 

KOLLER: … where you can take a verbal description of an image and create one—and this was back in the days when the images weren’t particularly photorealistic, but still, going from a natural language description to an image was magic given that only two or three years before that, we were barely able to look at an image and write a short phrase saying, “This is a dog on the beach.” And so that arc, that hockey curve, was just mind blowing to me.

LEE: Did you have moments of skepticism? 

KOLLER: Yeah, I mean the early, you know, early versions of ChatGPT, where it was more like parlor tricks and poking it a little bit revealed all of the easy ways that one could break it and make it do really stupid things. I was like, yeah, OK, this is kind of cute, but is it going to actually make a difference? Is it going to solve a problem that matters? 

And I mean, obviously, I think now everyone agrees that the answer is yes, although there are still people who are like, yeah, but maybe it’s around the edges. I’m not among them, by the way, but … yeah, so initially there were like, “Yeah, this is cute and very impressive, but is it going to make a difference to a problem that matters?” 

LEE: Yeah. So now, maybe this is a good time to get into what you’ve been doing with ALS [amyotrophic lateral sclerosis]. You know, there’s a knee-jerk reaction from the technology side to focus on designing small molecules, on predicting, you know, their properties, you know, maybe binding affinity or aspects of ADME [absorption, distribution, metabolism, and excretion], you know, like absorption or dispersion or whatever. 

And all of that is very useful, but if I understand the work on ALS, you went to a much harder place, which is to actually identify and select targets. 

KOLLER: That’s right. 

LEE: So first off, just for the benefit of the standard listeners of this podcast, explain what that problem is in general. 

KOLLER: No, for sure. And I think maybe I’ll start by just very quickly talking about the drug discovery and development arc, …

LEE: Yeah.

KOLLER: … which, by and large, consists of three main phases. That’s the standard taxonomy. The first is what’s called sometimes target discovery or identifying a therapeutic hypothesis, which looks like: if I modulate this target in this disease, something beneficial will happen. 

Then, you have to take that target and turn it into a molecule that you can actually put into a person. It could be a small molecule. It could be a large molecule like an antibody, whatever. And then you have that construct, that molecule. And the last piece is you put it into a person in the context of a clinical trial, and you measure what has happened. And there’s been AI deployed towards each of those three stages in different ways. 

The last one is mostly like an efficiency gain. You know, the trial is kind of already defined, and you want to deploy technology to make it more efficient and effective, which is great because those are expensive operations. 

LEE: Yep. 

KOLLER: The middle one is where I would say the vast majority of efforts so far has been deployed in AI because it is a nice, well-defined problem. It doesn’t mean it’s easy, but it’s one where you can define the problem. It is, I need to inhibit this protein by this amount, and the molecule needs to be soluble and whatever and go past the blood-brain barrier. And you know probably within a year and a half or so, or two, if you succeeded or not. 

The first stage is the one where I would say the least amount of energy has gone because when you’re uncovering a novel target in the context of an indication, you don’t know that you’ve been successful until you go all the way to the end, which is the clinical trial, which is what makes this a long and risky journey. And not a lot of people have the appetite or the capital to actually do that. 

However, in my opinion, and that of, I think, quite a number of others, it is where the biggest impact can be made. And the reason is that while pharma has its deficiencies, making good molecules is actually something they’re pretty good at. 

It might take them longer than it should, maybe it’s not as efficient as it could be, but at the end of the day, if you tell them to drug a target, pharma is actually pretty good at generating those molecules. However, when you put those molecules into the clinic, 90% of them fail. And the reason they fail is not by and large because the molecule wasn’t good. In the majority of cases, it’s because the target you went after didn’t do anything useful in the context of the patient population in which you put it.

And so in order to fix the inefficiency of this industry, which is an incredible inefficiency, you need to address the problem at the root, and the root is picking the right targets to go after. And so that is what we elected to do.

It doesn’t mean we don’t make molecules. I mean, of course, you can’t just end up with a target because a target is not actionable. You need to turn it into a molecule. And we absolutely do that. And by the way, the partnership with Lilly is actually one where they help us make a molecule.

LEE: Yes. 

KOLLER: I mean, it’s our target. It’s our program. But Lilly is deploying its very state-of-the-art molecule-making capabilities to help us turn that target into a drug. 

LEE: So let’s get now into the machine learning of this. Again, this just strikes me as such a difficult problem to solve. 

KOLLER: Yeah. 

LEE: So how does machine learning … how does AI help you? 

KOLLER: So I think when you look at how people currently select targets, it’s a combination of oftentimes at this point, with an increasing respect for the power of human genetics, some search for a genetic association, oftentimes with a human-defined, highly subjective, highly noisy clinical outcome, like some ICD [International Classification of Diseases] code. 

And those are often underpowered and very difficult to deconvolute the underlying biology. You combine that with some mechanistic interrogation in a highly reductionist model system looking at a small number of readouts, biochemical readouts, that a biologist thinks are relevant to the disease. Like does this make this, whatever, cholesterol go up or amyloid beta go down? Or whatever. And then you take that as the second stage, and you pick, based on typically human intuition about, Oh, this one looks good to me, and then you take that forward. 

What we’re doing is an attempt to be as unbiased and holistic as possible. So, first of all, rather than rely on human-defined clinical endpoints, like this person has been diagnosed with diabetes or fatty liver, we try to measure, as much as we can, a holistic physiological state and then use machine learning to find structure, patterns, in those human physiological readouts—imaging readouts and omics readouts from blood, from tissue, different kinds of imaging—and say, these are different vectors that this disease takes in this group of individuals, and here’s a different group of individuals that maybe from a diagnostic perspective are all called the same thing, but they are actually exhibiting a very different underlying biology.

And so that is something that doesn’t emerge when a human being takes a reductionist view to looking at this high-content data, and oftentimes, they don’t even look at it and produce an ICD code. 

LEE: Right. Yep. 

KOLLER: The same approach, actually even the same code base, is taken in the cellular data. So we don’t just say, “Well, the thing that matters is, you know, the total amount of lipid in the cell or whatever.” Rather, we say, “Let’s look at multiple readouts, multiple ways of looking at the cells, combine them using the power of machine learning.” And again, looking at imaging readouts where a human’s eyes just glaze over looking at even a few dozen cells, far less a few hundreds of millions of cells, and understand what are the different biological processes that are going on. What are the vectors that the disease might take you in this direction, in this group of cells, or in that direction? 

And then importantly, we take all of that information from the human side, from the cellular side, across these different readouts, and we combine them using an integrative approach that looks at the combined weight of evidence and says, these are the targets that I have the greatest amount of conviction about by looking across all of that information. Whereas we know, and we know this, I’m sure you’ve seen this analysis done for clinicians, a human being typically is able to keep three or four things in their head at the same time. 

LEE: Right. 

KOLLER: A really good human being who’s really expert at what they do can maybe get to six to eight. 

LEE: Yeah. 

KOLLER: The machine learning has no problem doing a few hundred. 

LEE: Right. 

KOLLER: And so you put that together, and that allows you, to your earlier question, to really select the targets around which you have the highest conviction. And then those are the ones that we then prioritize for interrogation in more expensive systems like mice and monkeys and then at the end of the day pick the small handful that one can afford to actually take into clinical trials.
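For readers who want to see the shape of that kind of integrative scoring, here is a minimal sketch in Python. It is purely illustrative rather than Insitro’s actual pipeline: the evidence channels, weights, and numbers are invented, and the aggregation is simply a weighted sum of standardized scores followed by a ranking.

```python
from statistics import mean, stdev

# Hypothetical evidence scores per candidate target, one column per readout.
# Higher means stronger support; all values below are made up for illustration.
evidence = {
    "TARGET_A": {"genetics": 2.1, "human_imaging": 0.9, "cell_phenotype": 1.7},
    "TARGET_B": {"genetics": 0.4, "human_imaging": 1.8, "cell_phenotype": 0.2},
    "TARGET_C": {"genetics": 1.5, "human_imaging": 1.2, "cell_phenotype": 1.4},
}
weights = {"genetics": 0.5, "human_imaging": 0.2, "cell_phenotype": 0.3}

def zscore(values):
    """Standardize a list of raw scores so channels are comparable."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

# Standardize each channel across targets, then combine with channel weights.
targets = list(evidence)
channels = list(weights)
standardized = {
    ch: dict(zip(targets, zscore([evidence[t][ch] for t in targets])))
    for ch in channels
}
combined = {
    t: sum(weights[ch] * standardized[ch][t] for ch in channels) for t in targets
}

# Rank candidates by combined weight of evidence.
for t in sorted(combined, key=combined.get, reverse=True):
    print(f"{t}: combined evidence score = {combined[t]:+.2f}")
```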

LEE: So now, Insitro recently received $25 million in milestone payments from Bristol Myers Squibb after discovering and selecting a novel drug target for ALS. Can you tell us a little bit more about that? 

KOLLER: We are incredibly excited about the first novel target, and there are a couple of others just behind it in line that seem, you know, quite efficacious, as well, that truly seem to reverse, albeit in a cellular system, what we now understand to be ALS pathology across multiple different dimensions. There’s been obviously many attempts made to try and address ALS, which, by the way, is a horrible, horrible disease, worse than most cancers. It kills you almost inevitably in three to five years in a particularly horrific way.

And what we have in our hands is a target that seems to revert a lot of the pathologies that are associated with the disease, which we now understand has to do with the mis-splicing of multiple proteins within the cell and creating defective versions of those proteins that are just not operational. And we are seeing reversion of many of those. 

So can I tell you for sure it’ll work in a human? No, there’s many steps between now and then. But we couldn’t be more excited about the opportunity to provide what we hope will be a disease-modifying intervention for these patients who really desperately need something. 

LEE: Well, it’s certainly been making waves in the biotech and biomedical world. 

KOLLER: Thank you. 

LEE: So we’ll be really watching very closely. 

So, you know, I think just reflecting on, you know, what we missed and what we got right in our book, I think in our book, we did have the insight that there would be an ability to connect, say, genotypic and phenotypic data and, you know, just broadly the kinds of clinical measurements that get made on real patients and that these things could be brought together. And I think the work that you’re doing really illustrates that in a very, very sophisticated, very ambitious way. 

But the fact that this could be connected all the way down to the biology, to the biochemistry, I think we didn’t have any clue what would happen, at least not this quickly. 

KOLLER: Well, I think the … 

LEE: And I realize, you’ve been at this for quite a few years, but still, it’s quite amazing. 

KOLLER: The thread that connects them is human genetics. And I think that has, to us, been, sort of, the, kind of, the connective tissue that allows you to translate across different systems and say, “What does this gene do? What does this gene do in this organ and in that organ? What does it do in this type of cell and in that type of cell?” 

And then use that as sort of the thread, if you will, that follows the impact of modulating this gene all the way from the simple systems where you can do the experiment to the complex systems where you can’t do the experiment until the very end, but you have the human genetics as a way of looking at the statistics and understanding what the impact might be. 

LEE: So I’d like to now switch gears and take … I want to take two steps in the remainder of this conversation towards the future. So one step into that future, of course, we’re living through now, which is just all of the crazy pace of work and advancement in generative AI generally, you know, just the scale of transformers, of post-training, and now inference scaling and reasoning models and so on. And where do you see all of that going with respect to the goals that you have and that Insitro has?

KOLLER: So I think first and foremost is the parallel, if you will, to the predictions that you focused on in your book, which is this will transform a lot of the core data processing tasks, the information tasks. And sure, the doctors and nurses is one thing. But if you just think of clinical trial operations or the submission of regulatory documents, these are all kind of simple data … they’re not simple, obviously, but they’re data processing tasks. They involve natural language. That’s not going to be our focus, but I hope that others will use that to make clinical trials faster, more efficient, less expensive. 

There’s already a lot of progress that’s happening on the molecular design side of things and taking hypotheses and turning them quickly and effectively into molecules. As I said, this is part of our work that we absolutely do and we don’t talk about it very much, simply because it’s a very crowded landscape and a lot of companies are engaged on that. But I think it’s really important to be able to take biological insights and turn them into new molecules. 

And then, of course, the transformer models and their likes play a very significant role in that sort of turning insights into molecules because you can have foundation models for proteins. There are increasing efforts to create foundation models for other categories of molecules. And so that will undoubtedly accelerate the process by which you can quickly generate different molecular hypotheses and test them and learn from what you did so that you can do fewer iterations … 

LEE: Right. 

KOLLER: … before you converge on a successful molecule. 

I do think that arguably the biggest impact as yet to be had is in that understanding of core human biology and what are the right ways to intervene in it. And that plays a role in a couple different ways. First of all, it certainly plays a role in which … if we are able to understand the human physiological state and, you know, the state of different systems all the way down to the cell level, that will inform our ability to pick hypotheses that are more likely to actually impact the right biologies underneath. 

LEE: Yep. Yeah. 

KOLLER: And the more data we’re able to collect about humans and about cells, the more successful our models will be at representing that human physiological state or the cell biological state and making predictions reliably on the impact of these interventions. 

The other side of it, though, and this comes back, I think, to themes that were very much in your book, is this will impact not only the early stages of which hypotheses we interrogate, which molecules we move forward, but also hopefully at the end of the day, which molecule we prescribe to which patient. 

LEE: Right. 

KOLLER: And I think there’s been obviously so much narrative over the years about precision medicine, personalized medicine, and very little of that has come to fruition, with the exception of, you know, certain islands in oncology, primarily on genetically driven cancers. 

But I think the opportunity is still there. We just haven’t been able to bring it to life because of the lack of the right kind of data. And I think with the increasing amount of human, kind of, foundational data that we’re able to acquire, things that are not sort of distilled through the eye of a clinician, for example, … 

LEE: Yes. 

KOLLER: … but really measurements of human pathology, we can start to get to some of that precision, carving out of the human population and then get to a world where we can prescribe the right medicine to the right patient and not only in cancer but also in other diseases that are also not a single disease. 

LEE: All right, so now to wrap up this time together, I always try to ask one more provocative last question. One of the dreams that comes naturally to someone like me or any of my colleagues, probably even to you, is this idea of, you know, wouldn’t it be possible someday to have a foundation model for biology or for human biology or foundation model for the human cell or something along these lines? 

And in fact, there are, of course, you and I are both aware of people who are taking that idea seriously and chasing after it. I have people in our labs that think hard about this kind of thing. Is it a reasonable thought at all? 

KOLLER: I have learned over the years to avoid saying the word never because technology proceeds in ways that you often don’t expect. And so will we at some point be able to measure the cell in enough different ways across enough different channels at the same time that you can piece together what a cell does? I think that is eminently feasible, not today, but over time. 

I don’t think it’s feasible using today’s technology, although the efforts to get there may expose where the biggest opportunities lie to, you know, build that next layer. So I think it’s good that people are working on really hard problems. I would also point out that even if one were to solve that really challenging problem of creating a model of a cell, there are thousands of different types of cells within the human body.

They’re very different. They also talk to each other … 

LEE: Yep. 

KOLLER: … both within the cell type and across different cell types. So the combinatorial complexity of that system is, I think, unfathomable to many people. I mean, I would say to all of us. 

LEE: Yeah. 

KOLLER: And so even from that very lofty goal, there are multiple big steps that would need to be taken to get to a mechanistic model of the full organism. So will we ever get there? Again, you know, I don’t see a reason why this is impossible to do. So I think over time, technology will get better and will allow us to build more and more elaborate models of more and more complex systems.

Patients can’t wait …

LEE: Right. Yeah. 

KOLLER: … for that to happen in order for us to get them better medicines. So I think there is a great basic science initiative on that side of things. And, in parallel, we need to make do with the data that we have or can collect or can print. We print a lot of data in our internal wet labs and get to drugs that are effective even though they don’t benefit from having a full-blown mechanistic model. 

LEE: Last question: where do you think we’ll be in five years? 

KOLLER: Phew. If I had answered that question five years ago, I would have been very badly embarrassed at the inaccuracy of my answer. [LAUGHTER] So I will not answer it today either. 

I will say that the thing about exponential curves is that they are very, very tricky, and they move in unexpected ways. I would hope that in five years, we will have made a sufficient investment in the generation of scientific data that we will be able to move beyond data that was generated entirely by humans and therefore insights that are derivative of what people already know to things that are truly novel discoveries. 

And I think in order to do that in, you know, math, maybe because math is entirely conceptual, maybe you can do that today. Math is effectively a construct of the human mind. I don’t think biology is a construct of the human mind, and therefore one needs to collect enough data to really build those models that will give rise to those novel insights. 

And that’s where I hope we will have made considerable progress in five years. 

LEE: Well, I’m with you. I hope so, too. Well, you know, thank you, Daphne, so much for this conversation. I learn a lot talking to you, and it was great to, you know, connect again on this. And congratulations on all of this success. It’s really groundbreaking. 

KOLLER: Thank you very much, Peter. It was a pleasure chatting with you, as well. 

[TRANSITION MUSIC] 

LEE: I still think of Daphne first and foremost as an AI researcher. And for sure, her research work in machine learning continues to be incredibly influential to this day. But it’s her work on AI-enhanced drug development that now is on the verge of making a really big difference on some of the most difficult diseases afflicting people today. 

In our book, Carey, Zak, and I predicted that AI might be a meaningful accelerant in biomedical research, but I don’t know that we foresaw the incredible potential specifically in drug development. 

Today, we’re seeing a flurry of activity at companies, universities, and startups on generative AI systems that aid and maybe even completely automate the design of new molecules as drug candidates. But now, in our conversation with Daphne, seeing AI go even further than that to do what one might reasonably have assumed to be impossible, to identify and select novel drug targets, especially for a neurodegenerative disease like ALS, it’s just, well, mind blowing. 

Let’s continue our deep dive on AI and biomedical research with this conversation with Noubar Afeyan: 

LEE: Noubar, thanks so much for joining. I’m really looking forward to this conversation. 

NOUBAR AFEYAN: Peter, thanks. Thrilled to be here. 

LEE: While I think most of the listeners to this podcast have heard of Flagship Pioneering, it’s still worth hearing from you, you know, what is Flagship? And maybe a little bit about your background. And finally, you’ve found a way to balance science and business creation. And so, you know, your approach and philosophy to all of that.

AFEYAN: Well, great. So maybe I’ll just start out by way of quick background. You know, my … and since we’re going to talk about AI, I’ll also highlight my first contact with the topic of AI. So as an undergraduate in 1980 up at McGill University, I was an engineering student, but I was really captivated by, at that time, the talk on the campus around the expert system, heuristic-based, rule-based kind of programs.

LEE: Right. 

AFEYAN: And so actually I had the dubious distinction of writing my one and only college newspaper article. [LAUGHTER] That was a short career. And it was all about how artificial intelligence would be impacting medicine, would be impacting, you know, speech capture, translation, and some of the ideas that were there that it’s interesting to see now 45 years later re-emerge with some of the new learning-based models. 

My journey after college ended up taking me into biotechnology. In the early ’80s, I came to MIT to do a PhD. At the time, the field was brand new. I ended up being the first PhD graduate from MIT in this combination biology and engineering degree. And since then, I’ve basically been—so since 1987—a founder, a technologist in the space of biotechnology for human health and as well for planetary health. 

And then in 1999/2000 I formed what is now Flagship Pioneering, which essentially was an attempt to bring together the three elements of what we know are important in startups. That is scientific capital, human capital, and financial capital. Right now, startups get that from different places. The science in our fields mostly comes from academia, research hospitals. The human capital comes from other startups …

LEE: Yeah. 

AFEYAN: … or large companies or some academics leave. And then the financial capital is usually venture capital, but there’s also now more and more other deeper pockets of money. 

What we thought was, what if all that existed in one entity and instead of having to convince each other how much they should believe the other if we just said, “Let’s use that power to go work on much further out things”? But in a way where nobody would believe it in the beginning, but we could give ourselves a little bit of time to do impactful big things. 

Twenty-five years later, that’s the road we’ve stayed on. 

LEE: OK. So let’s get into AI. Now, you know, what I’ve been asking guests is kind of an origin story. And there’s the origin story of contact with AI, you know, before the emergence of generative AI and afterwards. I don’t think there’s much of a point to asking you about the pre-ChatGPT era. But … so let’s focus on your first encounter with ChatGPT or generative AI. When did that happen, and what went through your head?

AFEYAN: Yeah. So, if you permit me, Peter, just for very briefly, let me actually say I had the interesting opportunity over the last 25 years to actually stay pretty close to the machine learning world … 

LEE: Yeah. Yeah. 

AFEYAN: … because one, as you well know, among the most prolific users of machine learning has been the bioinformatics and computational biology world because it’s been so data rich that anything that can be done, people have thrown at these problems because unlike most other things, we’re not working on man-made data. We’re looking at data that comes from nature, the complexity of which far exceeds our ability to comprehend.

So you could imagine that any approach to statistically reduce complexity, get signal out of scant data—that’s a problem that’s been around. 

The other place where I’ve been exposed to this, which I’m going to come back to because that’s where it first felt totally different to me, is that some 25 years ago, actually the very first company we started was a company that attempted to use evolutionary algorithms to essentially iteratively evolve consumer-packaged goods online. Literally, we tried to, you know, consider features of products as genes and create little genomes of them. And by recombination and mutation, we could create variety. And then we could get people through panels online—this was 2002/2003 timeframe—we could essentially get people through iterative cycles of voting to create a survival of the fittest. And that’s a company that was called Affinnova. 

The reason I say that is that I knew that there’s a much better way to do this if only: one, you can generate variety … 

LEE: Yeah. 

AFEYAN: … without having to prespecify genes. We couldn’t do that before. And, two, which we’ve come back to nowadays, you can actually mimic how humans think about voting on things and just get rid of that element of it. 
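To make the loop Afeyan describes concrete—product features treated as genes, recombination and mutation generating variety, and panel votes acting as the fitness function—here is a minimal sketch in Python. It is not Affinnova’s system; the features, the simulated “panel,” and the parameters are all invented for illustration.

```python
import random

random.seed(0)

# Hypothetical product "genome": one gene per feature, each with a few variants.
FEATURES = {
    "flavor": ["citrus", "berry", "vanilla", "mint"],
    "package": ["bottle", "can", "pouch"],
    "size": ["small", "medium", "large"],
}

# Stand-in for a consumer panel: a hidden preferred product plus some noise.
# In the real setting, fitness would come from iterative rounds of online votes.
HIDDEN_PREFERENCE = {"flavor": "berry", "package": "can", "size": "medium"}

def panel_score(product):
    matches = sum(product[f] == HIDDEN_PREFERENCE[f] for f in FEATURES)
    return matches + random.gauss(0, 0.3)       # noisy "votes"

def random_product():
    return {f: random.choice(variants) for f, variants in FEATURES.items()}

def recombine(a, b):
    return {f: random.choice([a[f], b[f]]) for f in FEATURES}

def mutate(product, rate=0.1):
    return {
        f: (random.choice(FEATURES[f]) if random.random() < rate else product[f])
        for f in FEATURES
    }

population = [random_product() for _ in range(20)]
for generation in range(10):
    ranked = sorted(population, key=panel_score, reverse=True)
    parents = ranked[:6]                         # "survival of the fittest"
    population = parents + [
        mutate(recombine(*random.sample(parents, 2))) for _ in range(14)
    ]

print("best candidate after 10 generations:", max(population, key=panel_score))
```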

So then to your question of when does this kind of begin to feel different? So you could imagine that in biotechnology, you know, as an engineer by background, I always wanted to do CAD, and I picked the one field in which CAD doesn’t exist, which is biology. Computer-aided design is kind of a notional thing in that space. But boy, have we tried. For a long time, …

LEE: Yep. 

AFEYAN: … people would try to do, you know, hidden Markov models of genomes to try to figure out what should be the next, you know, base that you may want to or where genes might be, etc. But the notion of generating in biology has been something we’ve tried for a while. And in the late teens, so kind of 2018, ’17, ’18, because we saw deep learning come along, and you could basically generate novelty with some of the deep learning models … and so we started asking, “Could you generate a protein basically by training a correspondence table, if you will, between protein structures and their underlying DNA sequence?” Not their protein sequence, but their DNA sequence. 

LEE: Yeah. 

AFEYAN: So that’s a big leap. So ’17/’18, we started this thing. It was called 56. It was FL56, Flagship Labs 56, our 56th project. 

By the way, we started this parallel one called “57” that did it in a very different way. So one of them did pure black box model-building. The other one said, you know what, we don’t want to do the kind of … at that time, AlphaFold was in its very early embodiments. And we said, “Is there a way we could actually take little, you know, multi amino acid kind of almost grammars, if you will, a little piece, and then see if we could compose a protein that way?” So we were experimenting. 

And what we found was that actually, if you show enough instances and you could train a transformer model—back in the day, that’s what we were using—you could actually, say, predict another sequence that should have the same activity as the first one. 

LEE: Yeah. 

AFEYAN: So we trained on green fluorescent proteins. Now, we’re talking about seven years ago. We trained on enzymes, and then we got to antibodies. 

With antibodies, we started seeing that, boy, this could be a pretty big deal because it has big market impact. And we started bringing in some of the diffusion models that were beginning to come along at that time. And so we started getting much more excited. This was all done in a company that subsequently got renamed from FL56 to Generate:Biomedicines (opens in new tab), … 

LEE: Yep, yep. 

AFEYAN: … which is one of the leaders in protein design using the generative techniques. It was interesting because Generate:Biomedicines is a company that was called that before generative AI was a thing, [LAUGHTER] which was kind of very ironic. 

And, of course, that team, which operates today very, very kind of at the cutting edge, has published their models. They came up with this first Chroma model, which is a diffusion-based model, and then started incorporating a lot of the LLM capabilities and fusing them.

Now we’re doing atomistic models and many other things. The point being, that gave us a glimpse of how quickly the capability was gaining, … 

LEE: Yeah. Yeah. 

AFEYAN: … just like evolution shows you. Sometimes evolution is super silent, and then all of a sudden, all hell breaks loose. And that’s what we saw. 
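The idea described above—show a model enough sequences and it can propose another sequence with similar properties—can be sketched as a toy next-token model. The PyTorch code below is purely illustrative, using a made-up DNA alphabet and random data; it is not Flagship’s or Generate:Biomedicines’ model, and a real system would train on curated sequences tied to measured activity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB = ["<pad>", "A", "C", "G", "T"]            # toy DNA alphabet
PAD = 0

class TinySeqModel(nn.Module):
    """A very small causal transformer for next-base prediction."""
    def __init__(self, d_model=64, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.encoder(self.embed(tokens), mask=mask)
        return self.head(hidden)                 # next-token logits

# Toy "training set": random length-40 sequences. A real run would use
# sequences whose products share a measured activity.
data = torch.randint(1, len(VOCAB), (256, 40))

model = TinySeqModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    batch = data[torch.randint(0, len(data), (32,))]
    logits = model(batch[:, :-1])                # predict base t+1 from bases <= t
    loss = loss_fn(logits.reshape(-1, len(VOCAB)), batch[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Sample a new sequence one base at a time from the trained model.
seq = torch.tensor([[1]])                        # start from "A"
for _ in range(39):
    with torch.no_grad():
        probs = torch.softmax(model(seq)[0, -1], dim=-1)
    probs[PAD] = 0.0                             # never emit padding
    next_base = torch.multinomial(probs, 1).unsqueeze(0)
    seq = torch.cat([seq, next_base], dim=1)

print("sampled sequence:", "".join(VOCAB[i] for i in seq[0].tolist()))
```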

LEE: Right. One of the things that I reflect on just in my own journey through this is there are other emotions that come up. One that was prominent for me early on was skepticism. Were there points when even in your own work, transformer-based work on this early on, that you had doubts or skepticism that these transformer architectures would be or diffusion-based approaches would be worth anything? 

AFEYAN: You know, it’s interesting, I think that, I’m going to say this to you in a kind of a friendly way, but you’ll understand what I mean. In the world I live in, it’s kind of like the slums of innovation, [LAUGHTER] kind of like just doing things that are not supposed to work. The notion of skepticism is a luxury, right. I assume everything we do won’t work. And then once in a while I’m wrong. 

And so I don’t actually try to evaluate whether before I bring something in, like just think about it. We, some hundred or so times a year, ask “what if” questions that lead us to totally weird places of thought. We then try to iterate, iterate, iterate to come up with something that’s testable. Then we go into a lab, and we test it. 

So in that world, right, sitting there going, like, “How do I know this transformer is going to work?” The answer is, “For what?” Like, it’s going to work. To make something up … well, guess what? We knew early on with LLMs that hallucination was a feature, not a bug for what we wanted to do. 

So it’s just such a different use that, of course, I have trained scientific skepticism, but it’s a little bit like looking at a competitive situation in an ecology and saying, “I bet that thing’s going to die.” Well, you’d be right—most of the time, you’d be right. [LAUGHTER] 

So I just don’t … like, it … and that’s why—I guess, call me an early adopter—for us, things that could move the needle even a little, but then upon repetition a lot, let alone this, … 

LEE: Yeah. 

AFEYAN: … you have to embrace. You can’t wait there and say, I’ll embrace it once it’s ready. And so that’s what we did. 

LEE: Hmm. All right. So let’s get into some specifics and what you are seeing either in your portfolio companies or in the research projects or out in the industry. What is going on today with respect to AI really being used for something meaningful in the design and development of drugs? 

AFEYAN: In companies that are doing as diverse things as—let me give you a few examples—a project that’s now become a named company called ProFound Therapeutics that literally discovered, three, four years ago, and would not have been able to without some of the big data-model-building capabilities, that our cells make literally thousands, if not tens of thousands, more proteins than we were aware of, full stop.

We had done the human genome sequence; there were 20,000 genes; we thought that there were … 

LEE: Wow. 

AFEYAN: … maybe 70-80,000, 100,000 proteins, and that’s that. And it turns out that our cells have a penchant to express themselves in the form of proteins, and they have many other ways than we knew to do that. 

Now, so what does that mean? That means that we have generated a massive amount of data, the interpretation of which, the use of which to guide what you do and what these things might be involved with is purely being done using the most cutting-edge data-trained models that allow you to navigate such complexity. 

LEE: Wow. Hmm. 

AFEYAN: That’s just one example. Another example: a company called Quotient Therapeutics, again three, four years old. I can talk about the ones that are three, four years old because we’ve kind of gotten to a place where we’ve decided that it’s not going to fail yet, [LAUGHTER] so we can talk about it.

You know, we discovered—our team discovered—that in our cells, right, so we know that when we get cancer, our cells have genetic mutations in them or DNA mutations that are correlated and often causal to the hyperproliferative stages of cancer. But what we assume is that all the other cells in our body, pretty much, have one copy of their genes from our mom, one copy from our dad, and that’s that. 

And when very precise deep sequencing came along, we always asked the question, “How much variation is there cell to cell?” 

LEE: Right. 

AFEYAN: And the answer was it’s kind of noise, random variation. Well, our team said, “Well, what if it’s not really that random?” because upon cell division cycles, there’s selection happening on these cells. And so not just in cancer but in liver cells, in muscle cells, in skin cells … 

LEE: Oh, interesting. 

AFEYAN: … can you imagine that there’s an evolutionary experiment that is favoring either compensatory mutations that are helping you avoid disease or disease-caused mutations that are gaining advantage as a way to understand the mechanism? Sure enough—I wouldn’t be telling you otherwise—with a massive amount of single-cell sequencing from individual patient samples, we’ve now discovered that the human genome is mutated on average in our bodies 10,000 times, like over every base, like, it’s huge numbers.

And we’re finding very interesting big signals come out of this massive amount of data. By the way, data of the sort that the human mind, if it tries to assign causal explanations to what’s happening … 

LEE: Right. 

AFEYAN: … is completely inadequate. 

LEE: When you think about a language model, we’re learning from human language, and the totality of human language—at least relative to what we’re able to compute today in terms of constructing a model—the totality of human language is actually pretty limited. And in fact, you know, as is always written about in click-baity titles, you know, the big model builders are actually starting to run short. 

AFEYAN: Running out, running out, yes. [LAUGHTER] 

LEE: But one of the things that perplexes me and maybe even worries me—like these two examples—are generally in the realm of cellular biology and the complexity. Let’s just take the example of your company, ProFound. You know, the complexity of what’s going on and the potential genetic diversity is such that, can we ever have enough data? You know, because there just aren’t that many human beings. There just aren’t that many samples. 

AFEYAN: Well, it depends on what you want to train, right. So if you want to train a de novo evolutionary model that could take you from bacteria to human mammalian cells and the like, there may not be—and I’m not an expert in that—but that’s a question that we often kind of think about. 

But if you’re trying to train a … like you know what the proteins we know about, how they interact with pathways and disease mechanisms and the like. Now all of a sudden you find out that there’s a whole continent of them missing in your explanations. But there are things you can reason, in quotations, through analogy, functional analogy, sequence analogy, homology. So there’s a lot of things that we could do to essentially make use of this, even though you may not have the totality of data needed to, kind of, predict, based on a de novo sequence, exactly what it’s going to do. 

So I agree with the comparison. But … but you’re right. The complexity is … just keep in mind, on average, a protein may be interacting with 50 to 100 other proteins. 

LEE: Right. 

AFEYAN: So if you find thousands of proteins, you’ve found a massive interaction space through which information is being processed in a living cell. 

LEE: But do you find in your AI companies that access to data ends up being a key challenge? Or, you know, how central is that? 

AFEYAN: Access to data is a key challenge for the companies we have that are trying to build just models. But that’s the minority of things we do. The majority of things we do is to actually co-develop the data and the models. And as you know well, because you guys, you know, have given us some ideas around this space, you could generate data and then think about what you’re going to do with it, which is the way biotech has operated with bioinformatics.

LEE: Right, right. 

AFEYAN: Or you could generate bespoke data that is used to train the model that’s quite separate from what you would have done in the natural course of biology. So we’re doing much more of the latter of late, and I think that’ll continue. So, but these things are proliferating. 

I mean, it’s hard to find a place where we’re not using this. And the “this” is any and all data-driven model building, generative, LLM-based, but also every other technique to make progress. 

LEE: Sure. So now moving away from the straight biochemistry applications, what about AI in the process of building a business, of making investment decisions, of actually running an operation? What are you seeing there? 

AFEYAN: So, well, you know, Moderna, which is a company that I’m quite proud of being a founder and chairman of, has adopted a significant, significant amount of AI embedded into their operations in all aspects: from the manufacturing, quality control, the clinical monitoring, the design—every aspect. And in fact, they’ve had a partnership that they’ve had for a little while here with OpenAI, and they’ve tried many different ways to stay at the cutting edge of that. 

So we see that play out at some scale. That’s a 5,000-, 6,000-person organization, and what they’re doing is a good example of what early adopters would do, at least in our kind of biotechnology company. 

But then, you know, in our space, I would say the efficiency impact is kind of no different than, you know, anywhere else in academia you might adopt it or in other kinds of companies. But where I find it an interesting kind of maybe segue is the degree to which it may fundamentally change the way we think about how to do science, which is a whole other use, right?

LEE: Right. 

AFEYAN: So it’s not an efficiency gain per se, although it’s maybe an effectiveness gain when it comes to science, but can you just fundamentally train models to generate hypotheses? 

LEE: Yep. 

AFEYAN: And we have done that, and we’ve been doing this for the last three years. And now it’s getting better and better, the better these reasoning engines are getting and kind of being able to extrapolate and train for novelty. Can you convert that to the world’s best experimental protocol to very precisely falsify your hypothesis, on and on? 

That closing of that loop, kind of what we call autonomous science, which we’ve been trying to do for the last two, three years and are making some progress in, that to me is another kind of bespoke use of these things, not to generate molecules in its chemistry, but to change the behavior of how science is done. 

LEE: Yeah. So I always end with a couple of provocative questions, but I need—before we do that, while we’re on this subject—to get your take on Lila Sciences.

And there is a vision there that I think is very interesting. It’d be great to hear it described by you. 

AFEYAN: Sure. So Lila, after operating for two to three years in kind of a preparatory kind of stealth mode, we’ve now had a little bit more visibility around, and essentially what we’re trying to do there is to create what we call automated science factories, and such a factory would essentially be able to take problems, either computationally specified or human-specified, and essentially do the experimental work in order to either make an optimization happen or enable something that just didn’t exist. And it’s really, at this point, we’ve shown proof of concept in narrow areas. 

LEE: Yep. 

AFEYAN: But it’s hard to say that if you can do this, you can’t do some other things, so we’re just expanding it that way. We don’t think we need a complete proof or complete demonstration of it for every aspect. 

LEE: Right. 

AFEYAN: So we’re just kind of being opportunistic. The idea for Lila is to partner with a number of companies. The good news is, within Flagship, there’s 48 of them. And so there’s a whole lot of them they can partner with to get their learning cycles. But eventually they want to be a real alternative to every time somebody has an idea, having to kind of go into a lab and manually do this. 

I do want to say one thing we touched on, Peter, though, just on that front, which is … 

LEE: Yep. 

AFEYAN: … if you say, like, “What problem is this going to solve?” It’s several, but an important one is just the flat-out limit of human capacity to reason on this much data and this much complexity, which is real. Because nature doesn’t try to abstract itself in a human-understandable form.

LEE: Right. Yeah. 

AFEYAN: In biology, since it’s kind of like progress happens through evolutionary kind of selections, the evidence of which [has] long been lost, and so therefore, you just see what you have, and then it has a behavior. I really do think that there’s something to be said, and I want to—just for your audience—lay out a provocative, at least, thought on all this, which Lila is a beginning embodiment of, which is that I really think that what’s going to happen over the next five, 10 years, even while we’re all fascinated with the impending arrival of AGI [artificial general intelligence] is really what I call poly-intelligence, which is the combination of human intelligence, machine intelligence, AI, and nature’s intelligence. 

We’re all fascinated at the human-machine interface. We know the human-nature interface, but imagine the machine-nature interface—that is, actually letting loose a digital kind of information processing life form through the algorithms that are being developed and the commensurately complex, maybe much more complex. We’ll see. And so now the question becomes, what does the human do? 

And we’re living in a world which is human dominated, which means the humans say, “If I don’t understand it, it’s not real, basically. And if I don’t understand it, I can’t regulate it.” And we’re going to have to make peace with the fact that we’re not going to be able to predictably affect things without necessarily understanding them the way we could if we just forced ourselves to only work on problems we can understand. And that world we’re not ready for at all. 

LEE: Yeah. All right. So this one I predict is going to be a little harder for you because I think while you think about the future, you live very much in the present. But I’d like you to make some predictions about what the biotech and biopharmaceutical industries are going to be able to do two years from now, five years from now, 10 years from now. 

AFEYAN: Yeah, well, it’s hard for me because you know my nature, which is that I think this is all emergent. 

LEE: Right. 

AFEYAN: And so I would avoid the conceit of predicting. So I would say, with a positive predictive value of less than 10%, I’m happy to answer your question. So I’m not trying to score high [LAUGHTER] because I really think that my job is to envision it, not to predict it. And that’s a little bit different, right?

LEE: Yeah, I actually was trying to pick what would be the hardest possible question I could ask you, [LAUGHTER] and this is what I came up with. 

AFEYAN: Yeah, no, no, I’m kidding here. So now look, I think that we will cross this threshold of understandability. And of course you’re seeing that in a lot of LLM things today. And of course, people are trying to train for things that are explainers and all that—there’s a whole world of that. But I think at some point we’re going to have to kind of let go and get comfortable working on things that, you know … 

I sometimes tell people, you know, and I’m not the first, but scientists and engineers are different, it’s said, in that engineers work on things that they don’t wait until they get a full understanding of before they work with them. Well, now scientists are going to have to get used to that, too, right? 

LEE: Yeah. Yeah. 

AFEYAN: Because insisting that it’s only valid if it’s understandable. So, I would say, look, I hope that the time … for example, I think major improvements will be made in patient selection. If we can test drugs on patients that are more synchronized as to the stage of their disease … 

LEE: Yep. 

AFEYAN: … I think the answer will be much better. We’re working on that. It’s a company called Etiome, very, very early stage. It’s really beautiful data, very early data that shows that when we talk about MASH [metabolic dysfunction-associated steatohepatitis], liver disease, when we talk about Parkinson’s, there’s such a heterogeneity, not only of the subset type of the disease, but the stage of the disease, that this notion that you have stage one cancer, stage two cancer, again, nobody told nature there’s stages of that kind. It’s a continuum.

But if you can synchronize based on training, kind of, the ability to detect who are the patients that are in enough of a close proximity that should be treated so that the trial—much smaller a trial size—could give you a drug, then afterwards, you can prescribe it using these approaches. 

We’re kind of going to find that what we thought is one disease is more like 15 diseases. That’s bad news because we’re not going to be able to claim that we can treat everything, which we can now. It’s good news in that there are going to be people who are going to start making much more specific solutions to things.

LEE: Right. 

AFEYAN: So I can imagine that. I can imagine a generation of, kind of, students who are going to be able to play in this space without having 25 years of graduate education on the subject. So what is deemed knowledge sufficient to do creative things will change. I can go on and on, but I think all this is very close by and it’s very exciting. 

LEE: Noubar, I just always have so much fun, and I learn really a lot. It’s high-density learning when I talk to you. And so I hope our listeners feel the same way. It’s something I really appreciate. 

AFEYAN: Well, Peter, thanks for this. And I think your listeners know that if I was asking you questions, you would be answering them with equal if not more fascinating stuff. So, thanks for giving me the chance to do that today. 

[TRANSITION MUSIC] 

LEE: I’m always fascinated by Noubar’s perspectives on fundamental research and how it connects to human health and the building of successful companies. I see him as a classic “systems thinker,” and by that, I mean he builds impressive things like Flagship Pioneering itself, which he created as a kind of biomedical innovation system. 

In our conversation, I was really struck by the fact that he’s been thinking about the potential impact of transformers—transformers being the fundamental building block of large language models—as far back as 2017, when the first paper on the attention mechanism in transformers was published by Google. 

But, you know, it isn’t only about using AI to do things like understand and design molecules and antibodies faster. It’s interesting that he is also pushing really hard towards a future where AI might “close the loop” from hypothesis generation, to experiment design, to analysis, and so on. 

Now, here’s my conversation with Dr. Eric Topol: 

LEE: Eric, it’s really great to have you here. 

ERIC TOPOL: Oh, Peter, I’m thrilled to be here with you here at Microsoft. 

LEE: You’re a super famous person. Extremely well known to researchers, even in computer science, like the ones we have here at Microsoft Research.

But the question I’d like to ask is, how would you explain to your parents what you do every day? 

TOPOL: [LAUGHS] That’s a good question. If I was just telling them I’m trying to come up with better ways to keep people healthy, that probably would be the easiest way to do it because if I ever got in deeper, I would lose them real quickly. They’re not around, but just thinking about what they could understand. 

LEE: Right. 

TOPOL: I think as long as they knew it was work centered on innovative paths to promoting and preserving human health, that would get to them, I think. 

LEE: OK, so now, kind of the second topic, and then we let the conversation flow, is about origin stories with respect to AI. And with most of our guests, you know, I factor that into two pieces: the encounters with AI before ChatGPT and what we call generative AI and then the first contacts after. 

And, of course, you have extensive contact with both now. But let’s start with how you got interested in machine learning and AI prior to ChatGPT. How did that happen? 

TOPOL: Yeah, it was out of necessity. So back, you know, when I started at Scripps at the end of ’06, we started accumulating, you know, massive datasets. First, it was whole genomes. We did one of the early big cohorts of 1,400 people of healthy aging; we called it the Wellderly whole genome sequence.

And then we started big in the sensor world, and then we started saying, what are we going to do with all this data, with electronic health records and all those sensors? And now we got whole genomes. 

And basically, what we were doing, we were in hoarding mode. We didn’t have a way to meaningfully analyze it. 

LEE: Right. 

TOPOL: You would read about how, you know, data is the new oil and, you know, gold and whatnot. But we just didn’t have a way to extract the juice. And even when we wanted to analyze genomes, it was incredibly laborious. 

LEE: Yeah. 

TOPOL: And we weren’t extracting a lot of the important information. So that’s why … not having any training in computer science, when I was doing the … about three years of work to do the book Deep Medicine, I started out really as an autodidact about, you know, machine learning. And then I started contacting a lot of the real top people in the field and hanging out with them, and learning from them, getting their views as to, you know, where we are today, what models are coming in the future.

And then I said, “You know what? We are going to be able to fix this mess.” [LAUGHS] We’re going to get out of the hoarding phase, and we’re going to get into, you know, really making a difference. 

So that’s when I embraced the future of AI. And I knew, you know, back—that was six years ago when it was published and probably eight or nine years ago when I was doing the research, and I knew that we weren’t there yet. 

You know, at the time, we were seeing the image interpretation. That was kind of the early promise. But really, the models that were transformative, the transformer models, they were incubating back in 2017. So people knew something was brewing. 

LEE: Right. Yes. 

TOPOL: And everyone said we’re going to get there. 

LEE: So then, ChatGPT comes out November of 2022; there’s GPT-4 in 2023, and now a lot has happened. Do you remember what your first encounter with that technology was? 

TOPOL: Oh, sure. First, ChatGPT. You know, in the last days of November ’22, I was just blown away. I mean, I’m having a conversation. I’m having fun. And this is humanoid responding to me. I said, “What?” You know? So that was to me, a moment I’ll never forget. And so I knew that the world was, you know, at a very kind of momentous changing point. 

Of course, knowing, too, that this is going to be built on, and built on quickly. Of course, I didn’t know how soon GPT-4 and all the others were going to come forward, but that was a wake-up call that the capabilities of AI had just made a humongous jump, which seemingly was all of a sudden, although I did know this had been percolating … 

LEE: Right. 

TOPOL: … you know, for what, at least five years, that, you know, it really was getting into its position to do this. 

LEE: I know one of the things that was challenging psychologically and emotionally for me is, it made me rethink a lot of things that were going on in Microsoft Research in areas like causal reasoning, natural language processing, speech processing, and so on. 

I’m imagining you must have had some emotional struggles too because you have this amazing book, Deep Medicine. Did you have to … did it go through your mind to rethink what you wrote in Deep Medicine in light of this or, or, you know, how did that feel? 

TOPOL: It’s funny you ask that because in this one chapter I have on the virtual health coach, I wrote a whole bunch of scenarios … 

LEE: Yeah. 

TOPOL: … that were very kind of futuristic. You know, about how the AI interacts with the person’s health and schedules their appointment for this and their scan and tells them what lab tests they should tell their doctor to have, and, you know, all these things. And I sent a whole bunch of these, thinking that they were a little too far-fetched. 

LEE: Yes. 

TOPOL: And I sent them to my editor when I wrote the book, and he says, “Oh, these are great. You should put them all in.” [LAUGHTER] What I didn’t realize is they weren’t that, you know, they were all going to happen. 

LEE: Yeah. They weren’t that far-fetched at all. 

TOPOL: Not at all. If there’s one thing I’ve learned from all this, is our imagination isn’t big enough. 

LEE: Yeah. 

TOPOL: We think too small. 

LEE: Now in our book that Carey, Zak, and I wrote, you know, we made, you know, we sort of guessed that GPT-4 might help biomedical researchers, but I don’t think that any of us had the thought in mind that the architecture around generative AI would be so directly applicable to, you know, say, protein structures or, you know, to clinical health records and so on. 

And so a lot of that seems much more obvious today. But two years ago, it wasn’t. But we did guess that biomedical researchers would find this interesting and be helped along. 

So as you reflect over the past two years, you know, do you have things that you think are very important, kind of, meaningful applications of generative AI in the kinds of research that Scripps does? 

TOPOL: Yeah. I mean, I think for one, you pointed out how the term generative AI is a misnomer. 

LEE: Yeah. 

TOPOL: And so it really was prescient about how, you know, it had a pluripotent capability in every respect, you know, of editing and creating. So that was something that I think was telling us, an indicator that this is, you know, a lot bigger than how it’s being labeled. And our expectations can actually be more than what we had seen previously with the earlier version. 

So I think what’s happened is that now, we keep jumping. It’s so quick that we can’t … you know, first we think, oh, well, we’ve gone into the agentic era, and then we could pass that with reasoning. [LAUGHTER] And, you know, we just can’t … 

LEE: Right. 

TOPOL: It’s just wild. 

LEE: Yeah. 

TOPOL: So I think so many of us now will put in prompts that will necessitate or ideally result in a not-immediate gratification, but rather one that requires, you know, quite a bit of combing through the corpus of knowledge … 

LEE: Yeah. 

TOPOL: … and getting, with all the citations, a report or a response. And I think now this has been a reset because to do that on our own, it takes, you know, many, many hours. And it’s usually incomplete. 

But one of the things that was so different in the beginning was you would get the references from up to a year and a half previously. 

LEE: Yep. 

TOPOL: And that’s not good enough. [LAUGHS] 

LEE: Right. 

TOPOL: And now you get references, like, from the day before. 

LEE: Yes. Yeah. 

TOPOL: And so, you say, “Why would you do a regular search for anything when you could do something like this?” 

LEE: Yeah. 

TOPOL: And then, you know, the reasoning power. And a lot of people who are not using this enough still are talking about, “Well, there’s no reasoning.” 

LEE: Yeah.

TOPOL: Which you dealt with really well in the book. But what, of course, you couldn’t have predicted is the new dimensions. 

LEE: Right. 

TOPOL: I think you nailed it with GPT-4. But it’s all these just, kind of, stepwise progressions that have been occurring because of the velocity that’s unprecedented. I just can’t believe it. 

LEE: We were aware of the idea of multi-modality, but we didn’t appreciate, you know, what that would mean. Like AlphaFold [protein structure database], you know, the ability for AI to understand—or crystal structures—to really start understanding something more fundamental about biochemistry or medicinal chemistry.

I have to admit, when we wrote the book, we really had no idea. 

TOPOL: Well, I feel the same way. I still today can’t get over it because the reason AlphaFold and Demis [Hassabis] and John Jumper [AlphaFold’s co-creators] were so successful is that there was the Protein Data Bank.

LEE: Yes. 

TOPOL: And it had been kept for decades. And so, they had the substrate to work with. 

LEE: Right. 

TOPOL: So, you say, “OK, we can do proteins.” But then how do you do everything else? 

LEE: Right. 

TOPOL: And so this whole, what I call, “large language of life model” work, which has gone into high gear like I’ve never seen. 

LEE: Yeah. 

TOPOL: You know, now to this holy grail of a virtual cell, and … 

LEE: Yeah. 

TOPOL: You know, it’s basically … it’s … it was inspired by proteins. But now it’s hitting on, you know, ligands and small molecules, cells. I mean, nothing is being held back here. 

LEE: Yeah. 

TOPOL: So how could anybody have predicted that? 

LEE: Right. 

TOPOL: I sure wouldn’t have thought it would be possible at this point. 

LEE: Yeah. So just to challenge you, where do you think that is going to be two years from now? Five years from now? Ten years from now? Like, so you talk about a virtual cell. Is that achievable within 10 years, or is that still too far out? 

TOPOL: No, I think within 10 years for sure. You know the group that got assembled that Steve Quake pulled together? 

LEE: Right. 

TOPOL: I think [it] has 42 authors in a paper in Cell. The fact that he could get these 42 experts in life science and some in computer science to come together and all agree … 

LEE: Yeah. 

TOPOL: … that not only is this a worthy goal, but it’s actually going to be realized, that was impressive. 

I challenged him about that. How did you get these people all to agree? So many of them were naysayers. And by the time the workshop finished, they were fully convinced. I think that what we’re seeing is so much progress happening so quickly. And then all the different models, you know, across DNA, RNA, and everything are just zooming forward. 

LEE: Yeah. 

TOPOL: And it’s just a matter of pulling this together. Now when we have that, and I think it could easily be well before a decade and possibly, you know, between the five- and 10-year mark—that’s just a guess—but then we’re moving into another era of life science because right now, you know, this whole buzz about drug discovery. 

LEE: Yep. 

TOPOL: It’s not… with the ability to do all these perturbations at a cellular level. 

LEE: Right. 

TOPOL: Or the cell of interest. 

LEE: Yeah. 

TOPOL: Or the cell-to-cell interactions or the intra-cell interaction. So once you nail that, yeah, it takes it to a kind of another predictive level that we haven’t really fathomed. So, yes, there’s going to be drug discovery that’s accelerated. But this would make that and also the underpinnings of diseases. 

LEE: Yeah. 

TOPOL: So the idea that there’s so many diseases we don’t understand now. And if you had virtual cell, … 

LEE: Yeah. 

TOPOL: … you would probably get to that answer … 

LEE: Yeah. 

TOPOL: … much more quickly. So whether it’s underpinnings of diseases or what it’s going to take to really come up with far better treatments—preventions—I think that’s where virtual cell will get us. 

LEE: There’s a technical question … I wonder if you have an opinion. You may or may not. There are, sort of, what I would refer to as ab initio approaches to this. You know, you start from the fundamental physics and chemistry, and we know the laws, we have the math and, you know, we can try to derive from there … in fact, we can even run simulations of that math to generate training data to build generative models and work up to a cell, or forget all of that and just take as many observations and measurements of, say, living cells as possible, and just have faith that hidden amongst all of the observational data, there is structure and language that can be derived. 

So that’s sort of bottom-up versus top-down approaches. Do you have an opinion about which way? 

TOPOL: Oh, I think you go after both. And clearly whenever you’re positing that you’ve got a virtual cell model that’s working, you’ve got to do the traditional methods as well to validate it, and … so all that. You know, I think if you’re going to go out after this seriously, you have to pull out all the stops. Both approaches, I think, are going to be essential. 

LEE: You know, if what you’re saying is true, and it is amazing to hear the confidence, the one thing I tried to explain to someone nontechnical is that for a lot of problems in medicine, we just don’t have enough data in a really profound way. And the most profound way to say that is, since Adam and Eve, there have only been an estimated 106 billion people who have ever lived. 

So even if we had the DNA of every human being, every individual of Homo sapiens, there are certain problems for which we would not have enough data. 

TOPOL: Sure. 

LEE: And so I think another thing that seems profound to me, if we can actually have a virtual cell, is we can actually make trillions of virtual … 

TOPOL: Yeah 

LEE: … human beings. The true genetic diversity could be realized for our species. 

TOPOL: I think you nailed it. The ability to have that type of data, no less synthetic data, I mean, it’s just extraordinary. 

LEE: Yeah. 

TOPOL: We will get there someday. I’m confident of that. We may be wrong in projections. And I do think [science writer] Philip Ball won’t be right that it will never happen, though. [LAUGHTER] No, I think that if there’s a holy grail of biology, this is it. 

LEE: Yeah. 

TOPOL: And I think you’re absolutely right about where that will get us. 

LEE: Yeah. 

TOPOL: Transcending the beginning of the species. 

LEE: Yeah. 

TOPOL: Of our species. 

LEE: Yeah. All right. So now, we’re starting to run short on time here. And so I wanted to ask you about … I’m in my 60s, so I actually think about this a lot more. [LAUGHTER] And I know you’ve been thinking a lot about longevity. And, of course, your new book, Super Agers. 

And one of the reasons I’m so eager to read it is that it’s a topic very top of mind for me and actually for a lot of people. Where is this going? Because this is another area where you hear so much hype. At the same time, you see Nobel laureate scientists … 

TOPOL: Yeah. 

LEE: … working on this. 

TOPOL: Yeah. 

LEE: So what’s real there? 

TOPOL: Yeah. Well, it’s really … the real deal is the science of aging is zooming forward. 

And that’s exciting. But I see it bifurcating. On the one hand, all these new ideas, strategies to reverse aging are very ambitious. Like cell reprogramming and senolytics and, you know, the rejuvenation of our thymus gland, and it’s a long list. 

LEE: Yeah. 

TOPOL: And they’re really cool science, and it used to be the mouse lived longer. Now it’s the old mouse looks really young. 

LEE: Yeah. Yeah. 

TOPOL: All the different features. A blind mouse with cataracts, and all of a sudden there are no cataracts. I mean, so these things are exciting, but none of them are proven in people, and they all have significant risk, no less, you know, the expense that might be attached. 

LEE: Right. 

TOPOL: And some people are jumping the gun. They’re taking rapamycin, which can really knock out their immune system. So they all carry a lot of risk. And people are just getting a little carried away. We’re not there yet. 

But the other side, which is what I emphasize in the book, which is exciting, is that we have all these new metrics that came out of the science of aging. 

LEE: Yes. 

TOPOL: So we have clocks of the body. Our biological clock versus our chronological clock, and we have organ clocks. So I can say, you know, Peter, we’ve assessed all your organs and your immune system. And guess what? Every one of them is either at or less than your actual age. 

LEE: Right. 

TOPOL: And that’s very reassuring. And by the way, your methylation clock is also … I don’t need to worry about you so much. And then I have these other tests that I can do now, like, for example, the brain. We have an amazing protein, p-Tau217. We can say, over 20 years in advance of you developing Alzheimer’s, … 

LEE: Yeah. 

TOPOL: … we can look at that, and it’s modifiable by lifestyle, bringing it down, so you can change the natural history. So what we’ve seen is an explosion of knowledge of metrics, proteins, no less, you know, our understanding at the gene level, the gut microbiome, the immune system. So that’s what’s so exciting. How our immune system ages. Immunosenescence. How we have more inflammation—inflammaging—with aging. So basically, we have three diseases that kill us, that take away our health: heart, cancer, and neurodegenerative. 

LEE: Yep. 

TOPOL: And they all take more than 20 years. They all have a defective immune system inflammation problem, and they’re all going to be preventable. 

LEE: Yeah. 

TOPOL: That’s what’s so exciting. So we don’t have to have reverse aging. We can actually work on … 

LEE: Just prevent aging in the first place. 

TOPOL: the age-related diseases. So basically, what it means is: I got to find out if you have a risk, if you’re in this high-risk group for this particular condition, because if you are—and we have many levels, layers, orthogonal ways to check—we don’t just bank it all on one polygenic test. We’re going to have several ways, say this is the one we are going … 

And then we go into high surveillance, where, let’s say, if it’s your brain, we do more p-Tau or, if we need to, brain imaging—whatever it takes. And also, we do preventive treatments on top of the lifestyle [changes]. One of the problems we have today is that a lot of people know, generally, what good lifestyle factors are. Although, I go through a lot more than people generally acknowledge. 

But they don’t incorporate them because they don’t know that they’re at risk and that they could change their … extend their health span and prevent that disease. So what I at least put out there is a blueprint for how we can use AI, because it’s multimodal AI, with all these layers of data and then temporally. It’s like, today you could say, if you have two protein tests, not only are you going to have Alzheimer’s but within a two-year time frame of when … 

LEE: Yep. 

TOPOL: … and if you don’t change things, if we don’t gear up … you know, we can … we can completely prevent this, or at least defer it for a decade or more. So that’s why I’m excited: we’ve made these strides in the science of aging. But we haven’t acknowledged the part that doesn’t require reversing aging. There’s this much less flashy, attainable, less risky approach … 

LEE: Yeah. 

TOPOL: than the one that … when you reverse aging, you’re playing with the hallmarks of cancer. They are like, if you look at the hallmarks of cancer … 

LEE: That has been one of the primary challenges. 

TOPOL: They’re lined up. 

LEE: Yeah. 

TOPOL: They’re all the same, you know, whether it’s telomeres, or whether it’s … you know … so this is the problem. I actually say in the book, I do think one of these—we have so many shots on goal—one of these reverse aging things will likely happen someday. But we’re nowhere close. 

On the other hand, let’s gear up. Let’s do what we can do. Because we have these new metrics that … people don’t … like, when I read the organ clock paper from Tony Wyss-Coray at Stanford. It was published at the end of ’23; it was the cover of Nature. It blew me away. 

LEE: Yeah. 

TOPOL: And I wrote a Substack [article] on it. And Tony said, “Well, that’s so nice of you.” I said, “So nice? This is revolutionary, you know.” [LAUGHTER] So … 

LEE: By the way, what’s so interesting is how these things, this kind of understanding and AI, are coming together.

TOPOL: Yes. 

LEE: It’s almost eerie the timing of these things. 

TOPOL: Absolutely. Because you couldn’t take all these layers of data, just like we were talking about data hoarding.

LEE: Yep.

TOPOL: Now we have data hoarding on individuals, with no way to make these assessments of what level of risk, when, and what we are going to do for this individual to prevent that. We can do that now. 

We can do it today. And we could keep building on that. So I’m really excited about it. I think that, you know, when I wrote the last book, Deep Medicine, it was that our overarching goal should be to bring back the patient-doctor relationship. I’m an old dog, and I know what it used to be when I got out of medical school. 

It’s totally … you couldn’t imagine how much erosion from the ’70s, ’80s to now. But now I have a new overarching goal. I’m thinking that that still is really important—humanity in medicine—but let’s prevent these three … big three diseases because it’s an opportunity that we’re not … you know, in medicine, all my life we’ve been hearing and talking about how we need to prevent diseases. 

Curing is much harder than prevention. And the economics. Oh my gosh. But we haven’t done it. 

LEE: Yeah. 

TOPOL: Now we can do it. Primary prevention. We do really well when somebody’s had a heart attack. 

LEE: Yeah. 

TOPOL: Oh, we’re going to get all over it. Why did they have a heart attack in the first place? 

LEE: Well, the thing that makes so much sense in what you’re saying is that we have an understanding, both economically and medically, that prevention is a good thing. And extending the concept of prevention to these age-related conditions, I think, makes all the sense in the world. 

You know, Eric, maybe on that optimistic note, it’s time to wrap up this conversation. Really appreciate you coming. Let me just brag in closing that I’m now the proud owner of an autographed copy of your latest book, and, really, thank you for that. 

TOPOL: Oh, thank you. I could spend the rest of the day talking to you. I’ve really enjoyed it. Thanks. 

[TRANSITION MUSIC] 

LEE: For me, the biggest takeaway from our conversation was Eric’s supremely optimistic predictions about what AI will allow us to do in much less than 10 years. 

You know, for me personally, I started off several years ago with the typical techie naivete that if we could solve protein folding using machine learning, we would solve human biology. But as I’ve gotten smarter, I’ve realized that things are way, way more complicated than that, and so hearing Eric’s techno-optimism on this is really both heartening and so interesting. 

Another thing that really caught my attention was Eric’s views on AI in medical diagnosis. That really stood out to me because within our labs here at Microsoft Research, we have been doing a lot of work on this, for example in creating foundation models for whole-slide digital pathology. 

The bottom line, though, is that biomedical research and development is really changing and changing quickly. It’s something that we thought about and wrote briefly about in our book, but just hearing it from these three people gives me reason to believe that this is going to create tremendous benefits in the diagnosis and treatment of disease. 

And in fact, I wonder now how regulators, such as the Food and Drug Administration here in the United States, will be able to keep up with what might become a really big increase in the number of animal and human studies that need to be approved. On this point, it’s clear that the FDA and other regulators will need to use AI to help process the likely rise in the pace of discovery and experimentation. And so stay tuned for more information about that. 

[THEME MUSIC] 

I’d like to thank Daphne, Noubar, and Eric again for their time and insights. And to our listeners, thank you for joining us. There are several episodes left in the series, including discussions on medical students’ experiences with AI and AI’s influence on the operation of health systems and public health departments. We hope you’ll continue to tune in. 

Until next time. 

[MUSIC FADES] 

The post How AI will accelerate biomedical research and discovery appeared first on Microsoft Research.


AI Testing and Evaluation: Learnings from pharmaceuticals and medical devices


Illustrated headshots of Daniel Carpenter, Timo Minssen, Chad Atalla, and Kathleen Sullivan.

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In this episode, Daniel Carpenter, the Allie S. Freed Professor of Government and chair of the Department of Government at Harvard University, explains how the US Food and Drug Administration’s rigorous, multi-phase drug approval process serves as a gatekeeper that builds public trust and scientific credibility, while Timo Minssen, professor of law and founding director of the Center for Advanced Studies in Bioscience Innovation Law at the University of Copenhagen, explores the evolving regulatory landscape of medical devices with a focus on the challenges of balancing innovation with public safety. Later, Microsoft’s Chad Atalla, an applied scientist in responsible AI, discusses the sociotechnical nature of AI models and systems, their team’s work building an evaluation framework inspired by social science, and where AI researchers, developers, and policymakers might find inspiration from the approach to governance and testing in pharmaceuticals and medical devices.


Learn more:

Learning from other Domains to Advance AI Evaluation and Testing: The History and Evolution of Testing in Pharmaceutical Regulation
Case study | January 2025 

Learning from other Domains to Advance AI Evaluation and Testing: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance
Case study | January 2025 

Learning from other domains to advance AI evaluation and testing 
Microsoft Research Blog | June 2025   

Evaluating Generative AI Systems is a Social Science Measurement Challenge 
Publication | November 2024  

STAC: Sociotechnical Alignment Center

Responsible AI: Ethical policies and practices | Microsoft AI

AI and Microsoft Research 

Transcript

[MUSIC]

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan.

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use.

[MUSIC ENDS]

SULLIVAN: Today, I’m excited to welcome Dan Carpenter and Timo Minssen to the podcast to explore testing and risk assessment in the areas of pharmaceuticals and medical devices, respectively.

Dan Carpenter is chair of the Department of Government at Harvard University. His research spans the sphere of social and political science, from petitioning in democratic society to regulation and government organizations. His recent work includes the FDA Project, which examines pharmaceutical regulation in the United States.

Timo is a professor of law at the University of Copenhagen, where he is also director of the Center for Advanced Studies in Bioscience Innovation Law. He specializes in legal aspects of biomedical innovation, including intellectual property law and regulatory law. He’s exercised his expertise as an advisor to such organizations as the World Health Organization and the European Commission.

And after our conversations, we’ll talk to Microsoft’s Chad Atalla, an applied scientist in responsible AI, about how we should think about these insights in the context of AI.

Daniel, it’s a pleasure to welcome you to the podcast. I’m just so appreciative of you being here. Thanks for joining us today.


DANIEL CARPENTER: Thanks for having me. 

SULLIVAN: Dan, before we dissect policy, let’s rewind the tape to your origin story. Can you take us to the moment that you first became fascinated with regulators rather than, say, politicians? Was there a spark that pulled you toward the FDA story? 

CARPENTER: At one point during graduate school, I was studying a combination of American politics and political theory, and I spent a summer interning at the Department of Housing and Urban Development. And I began to think, why don’t people study these administrators more and the rules they make, the, you know, inefficiencies, the efficiencies? Really more from, kind of, a descriptive standpoint, less from a normative standpoint. And I was reading a lot that summer about the Food and Drug Administration and some of the decisions it was making on AIDS drugs. That was, sort of, a major, …

SULLIVAN: Right. 

CARPENTER: … sort of, you know, moment in the news, in the global news as well as the national news during, I would say, what? The late ’80s, early ’90s? And so I began to look into that.

SULLIVAN: So now that we know what pulled you in, let’s zoom out for our listeners. Give us the whirlwind tour. I think most of us know pharma involves years of trials, but what’s the part we don’t know?

CARPENTER: So I think when most businesses develop a product, they all go through some phases of research and development and testing. And I think what’s different about the FDA is, sort of, two- or three-fold.

First, a lot of those tests are much more stringently specified and regulated by the government, and second, one of the reasons for that is that the FDA imposes not simply safety requirements upon drugs in particular but also efficacy requirements. The FDA wants you to prove not simply that it’s safe and non-toxic but also that it’s effective. And the final thing, I think, that makes the FDA different is that it stands as what I would call the “veto player” over R&D [research and development] to the marketplace. The FDA basically has, sort of, this control over entry to the marketplace.

And so what that involves is usually first, a set of [Phase 1] human trials where people who have no disease take it. And you’re only looking for toxicity generally. Then there’s a set of Phase 2 trials, where they look more at safety and a little bit at efficacy, and you’re now examining people who have the disease that the drug claims to treat. And you’re also basically comparing people who get the drug, often with those who do not.

And then finally, Phase 3 involves a much more direct and large-scale attack, if you will, or assessment of efficacy, and that’s where you get the sort of large randomized clinical trials that are very expensive for pharmaceutical companies, biomedical companies to launch, to execute, to analyze. And those are often the sort of core evidence base for the decisions that the FDA makes about whether or not to approve a new drug for marketing in the United States.

SULLIVAN: Are there differences in how that process has, you know, changed through other countries and maybe just how that’s evolved as you’ve seen it play out? 

CARPENTER: Yeah, for a long time, I would say that the United States had probably the most stringent regime of regulation for biopharmaceutical products until, I would say, about the 1990s and early 2000s. It used to be the case that a number of other countries, especially in Europe but around the world, basically waited for the FDA to mandate tests on a drug and only after the drug was approved in the United States would they deem it approvable and marketable in their own countries. And then after the formation of the European Union and the creation of the European Medicines Agency, gradually the European Medicines Agency began to get a bit more stringent.  

But, you know, over the long run, there’s been a lot of, sort of, heterogeneity, a lot of variation over time and space, in the way that the FDA has approached these problems. And I’d say in the last 20 years, it’s begun to partially deregulate, namely, you know, trying to find all sorts of mechanisms or pathways for really innovative drugs for deadly diseases without a lot of treatments to basically get through the process at lower cost. For many people, that has not been sufficient. They’re concerned about the cost of the system. Of course, then the agency also gets criticized by those who believe it’s too lax. It is potentially letting ineffective and unsafe therapies on the market.

SULLIVAN: In your view, when does the structured model genuinely safeguard patients and where do you think it maybe slows or limits innovation?

CARPENTER: So I think the worry is that if you approach pharmaceutical approval as a world where only things can go wrong, then you’re really at a risk of limiting innovation. And even if you end up letting a lot of things through, if by your regulations you end up basically slowing down the development process or making it very, very costly, then there’s just a whole bunch of drugs that either come to market too slowly or they come to market not at all because they just aren’t worth the kind of cost-benefit or, sort of, profit analysis of the firm. You know, so that’s been a concern. And I think it’s been one of the reasons that the Food and Drug Administration as well as other world regulators have begun to basically try to smooth the process and accelerate the process at the margins.

The other thing is that they’ve started to basically make approvals on the basis of what are called surrogate endpoints. So the idea is that a cancer drug, we really want to know whether that drug saves lives, but if we wait to see whose lives are saved or prolonged by that drug, we might miss the opportunity to make judgments on the basis of, well, are we detecting tumors in the bloodstream? Or can we measure the size of those tumors in, say, a solid cancer? And then the further question is, is the size of the tumor basically a really good correlate or predictor of whether people will die or not, right? Generally, the FDA tends to be less stringent when you’ve got, you know, a remarkably innovative new therapy and the disease being treated is one that just doesn’t have a lot of available treatments, right.

The one thing that people often think about when they’re thinking about pharmaceutical regulation is they often contrast, kind of, speed versus safety …

SULLIVAN: Right.  

CARPENTER: … right. And that’s useful as a tradeoff, but I often try to remind people that it’s not simply about whether the drug gets out there and it’s unsafe. You know, you and I as patients and even doctors have a hard time knowing whether something works and whether it should be prescribed. And the evidence for knowing whether something works isn’t just, well, you know, Sally took it or Dan took it or Kathleen took it, and they seem to get better or they didn’t seem to get better.  

The really rigorous evidence comes from randomized clinical trials. And I think it’s fair to say that if you didn’t have the FDA there as a veto player, you wouldn’t get as many randomized clinical trials and the evidence probably wouldn’t be as rigorous for whether these things work. And as I like to put it, basically there’s a whole ecology of expectations and beliefs around the biopharmaceutical industry in the United States and globally, and to some extent, it’s undergirded by all of these tests that happen.  

SULLIVAN: Right.  

CARPENTER: And in part, that means it’s undergirded by regulation. Would there still be a market without regulation? Yes. But it would be a market in which people had far less information in and confidence about the drugs that are being taken. And so I think it’s important to recognize that kind of confidence-boosting potential of, kind of, a scientific regulation base. 

SULLIVAN: Actually, if we could double-click on that for a minute, I’d love to hear your perspective on what happens once testing has been completed and there are results. Can you walk us through how those results actually shape the next steps and decisions for a particular drug, and how regulators actually think about using that data to influence what happens next with it?

CARPENTER: Right. So it’s important to understand that every drug is approved for what’s called an indication. It can have a first primary indication, which is the main disease that it treats, and then others can be added as more evidence is shown. But a drug is not something that just kind of exists out there in the ether. It has to have the right form of administration. Maybe it should be injected. Maybe it should be ingested. Maybe it should be administered only at a clinic because it needs to be kind of administered in just the right way. As doctors will tell you, dosage is everything, right.  

And so one of the reasons that you want those trials is not simply a, you know, yes or no answer about whether the drug works, right. It’s not simply if-then. It’s literally what goes into what you might call the dose response curve. You know, how much of this drug do we need to basically, you know, get the benefit? At what point does that fall off significantly that we can basically say, we can stop there? All that evidence comes from trials. And that’s the kind of evidence that is required on the basis of regulation.  

Because it’s not simply a drug that’s approved. It’s a drug and a frequency of administration. It’s a method of administration. And so the drug isn’t just, there’s something to be taken off the shelf and popped into your mouth. I mean, sometimes that’s what happens, but even then, we want to know what the dosage is, right. We want to know what to look for in terms of side effects, things like that.

SULLIVAN: Going back to that point, I mean, it sounds like we’re making a lot of progress from a regulation perspective in, you know, sort of speed and getting things approved but doing it in a really balanced way. I mean, any other kind of closing thoughts on the tradeoffs there or where you’re seeing that going?

CARPENTER: I think you’re going to see some move in the coming years—there’s already been some of it—to say, do we always need a really large Phase 3 clinical trial? And to what degree do we need the, like, you know, all the i’s dotted and the t’s crossed or a really, really large sample size? And I’m open to innovation there. I’m also open to the idea that we consider, again, things like accelerated approvals or pathways for looking at different kinds of surrogate endpoints. I do think, once we do that, then we also have to have some degree of follow-up.

SULLIVAN: So I know we’re getting close to out of time, but maybe just a quick rapid fire if you’re open to it. Biggest myth about clinical trials?

CARPENTER: Well, some people tend to think that the FDA performs them. You know, it’s companies that do it. And the only other thing I would say is the company that does a lot of the testing and even the innovating is not always the company that takes the drug to market, and it tells you something about how powerful regulation is in our system, in our world, that you often need a company that has dealt with the FDA quite a bit and knows all the regulations and knows how to dot the i’s and cross the t’s in order to get a drug across the finish line.

SULLIVAN: If you had a magic wand, what’s the one thing you’d change in regulation today?

CARPENTER: I would like people to think a little bit less about just speed versus safety and, again, more about this basic issue of confidence. I think it’s fundamental to everything that happens in markets but especially in biopharmaceuticals.

SULLIVAN: Such a great point. This has been really fun. Just thanks so much for being here today. We’re really excited to share your thoughts out to our listeners. Thanks.

[TRANSITION MUSIC] 

CARPENTER: Likewise. 

SULLIVAN: Now to the world of medical devices, I’m joined by Professor Timo Minssen. Professor Minssen, it’s great to have you here. Thank you for joining us today. 

TIMO MINSSEN: Yeah, thank you very much, it’s a pleasure.

SULLIVAN: Before getting into the regulatory world of medical devices, tell our audience a bit about your personal journey or your origin story, as we’re asking our guests. How did you land in regulation, and what’s kept you hooked in this space?

MINSSEN: So I started out as a patent expert in the biomedical area, starting with my PhD thesis on patenting biologics in Europe and in the US. So during that time, I was mostly interested in patent and trade secret questions. But at the same time, I also developed and taught courses in regulatory law and held talks on regulating advanced medical therapy medicinal products. I then started to lead large research projects on legal challenges in a wide variety of health and life science innovation frontiers. I also started to focus increasingly on AI-enabled medical devices and software as a medical device, resulting in several academic articles in this area and also in the regulatory area and a book on the future of medical device regulation.  

SULLIVAN: Yeah, what’s kept you hooked in the space?

MINSSEN: It’s just incredibly exciting, in particular right now with everything that is going on, you know, in the software arena, in the marriage between AI and medical devices. And this is really challenging not only societies but also regulators and authorities in Europe and in the US.

SULLIVAN: Yeah, it’s a super exciting time to be in this space. You know, we talked to Daniel a little earlier and, you know, I think similar to pharmaceuticals, people have a general sense of what we mean when we say medical devices, but most listeners may picture like a stethoscope or a hip implant. The word “medical device” reaches much wider. Can you give us a quick, kind of, range from perhaps very simple to even, I don’t know, sci-fi and then your 90-second tour of how risk assessment works and why a framework is essential?

MINSSEN: Let me start out by saying that the WHO [World Health Organization] estimates that today there are approximately 2 million different kinds of medical devices on the world market, and as of the FDA’s latest update that I’m aware of, the FDA has authorized more than 1,000 AI- and machine learning-enabled medical devices, and that number is rising rapidly.

So in that context, I think it is important to understand that medical devices can be any instrument, apparatus, implement, machine, appliance, implant, reagent for in vitro use, software, material, or other similar or related articles that are intended by the manufacturer to be used alone or in combination for a medical purpose. And the spectrum of what constitutes a medical device can thus range from very simple devices such as tongue depressors, contact lenses, and thermometers to more complex devices such as blood pressure monitors, insulin pumps, MRI machines, implantable pacemakers, and even software as a medical device or AI-enabled monitors or drug device combinations, as well.

So talking about regulation, I think it is also very important to stress that medical devices are used in many diverse situations by very different stakeholders. And testing has to take this variety into consideration, and it is intrinsically tied to regulatory requirements across various jurisdictions.

During the pre-market phase, medical testing establishes baseline safety and effectiveness metrics through bench testing, performance standards, and clinical studies. And post-market testing ensures that real-world data informs ongoing compliance and safety improvements. So testing is indispensable in translating technological innovation into safe and effective medical devices. And while particular details of pre-market and post-market review procedures may slightly differ among countries, most developed jurisdictions regulate medical devices similarly to the US or European models. 

So most jurisdictions with medical device regulation classify devices based on their risk profile, intended use, indications for use, technological characteristics, and the regulatory controls necessary to provide a reasonable assurance of safety and effectiveness.

SULLIVAN: So medical devices face a pretty prescriptive multi-level testing path before they hit the market. From your vantage point, what are some of the downsides of that system and when does it make the most sense?

MINSSEN: One primary drawback is, of course, the lengthy and expensive approval process. High-risk devices, for example, often undergo years of clinical trials, which can cost millions of dollars, and this can create a significant barrier for startups and small companies with limited resources. And even for moderate-risk devices, the regulatory burden can slow product development and lengthen time to market.

And the approach can also limit flexibility. Prescriptive requirements may not accommodate emerging innovations like digital therapeutics or AI-based diagnostics in a feasible way. And in such cases, the framework can unintentionally [stifle] innovation by discouraging creative solutions or iterative improvements, which as a matter of fact can also put patients at risk when you don’t use new technologies and AI. And additionally, the same level of scrutiny may be applied to low-risk devices, where the extensive testing and documentation may also be disproportionate to the actual patient risk.

However, the prescriptive model is highly appropriate where we have high testing standards for high-risk medical devices, in my view, particularly those that are life-sustaining, implanted, or involve new materials or mechanisms.

I also wanted to say that I think that these higher compliance thresholds can be OK and necessary if you have a system where authorities and stakeholders also have the capacity and funding to enforce, monitor, and achieve compliance with such rules in a feasible, time-effective, and straightforward manner. And this, of course, requires resources, novel solutions, and investments.

SULLIVAN: A range of tests are undertaken across the life cycle of medical devices. How do these testing requirements vary across different stages of development and across various applications?

MINSSEN: Yes, that’s a good question. So I think first it is important to realize that testing is conducted by various entities, including manufacturers, independent third-party laboratories, and regulatory agencies. And it occurs throughout the device life cycle, beginning with iterative testing during the research and development stage, advancing to pre-market evaluations, and continuing into post-market monitoring. And the outcomes of these tests directly impact regulatory approvals, market access, and device design refinements, as well. So the testing results are typically shared with regulatory authorities and in some cases with healthcare providers and the broader public to enhance transparency and trust.

So if you talk about the different phases that play a role here … so let’s turn to the pre-market phase, where manufacturers must demonstrate that the device conforms to safety and performance benchmarks defined by regulatory authorities. Pre-market evaluations include functional bench testing, biocompatibility assessments, and software validation, for example, all of which are integral components of a manufacturer’s submission. 

But, yes, testing also, and we already touched upon that, extends into the post-market phase, where it continues to ensure device safety and efficacy. Post-market surveillance relies on testing to monitor real-world performance and identify emerging risks. By integrating real-world evidence into ongoing assessments, manufacturers can address unforeseen issues, update devices as needed, and maintain compliance with evolving regulatory expectations. And I think this is particularly important in this new generation of medical devices that are AI-enabled or machine learning-enabled.

I think we have to understand that in this AI-enabled medical devices field, you know, the devices and the algorithms that are working with them, they can improve over the lifetime of a product. So actually, not only can you assess them and make sure that they remain safe, you can also sometimes lower the risk category by finding evidence that these devices are actually becoming more precise and safer. So it can, you know, either heighten the risk category or lower it, and that’s why this continuous testing is so important.

SULLIVAN: Given what you just said, how should regulators handle a device whose algorithm keeps updating itself after approval?

MINSSEN: Well, it has to be an iterative process that is feasible and straightforward and that is based on very efficient, both time-efficient and performance-efficient, communication between the regulatory authorities and the medical device developers, right. We need to have the sensors in place that spot potential changes, and we need to have the mechanisms in place that allow us to quickly react to these changes, both regulatory-wise and technologically. 

So I think communication is important, and we need to have the pathways and the feedback loops in the regulation that quickly allow us to monitor these self-learning algorithms and devices.

SULLIVAN: It sounds like it’s just … there’s such a delicate balance between advancing technology and really ensuring public safety. You know, if we clamp down too hard, we stifle that innovation. You already touched upon this a bit. But if we’re too lax, we risk unintended consequences. And I’d just love to hear how you think the field is balancing that and any learnings you can share.

MINSSEN: So this is very true, and you just touched upon a very central question also in our research and our writing. And this is also the reason why medical device regulation is so fascinating and continues to evolve in response to rapid advancements in technologies, particularly dual technologies regarding digital health, artificial intelligence, for example, and personalized medicine.

And finding the balance is tricky because a related major future challenge is the increasing regulatory jungle and the complex interplay between evolving regulatory landscapes that regulate AI more generally.

We really need to make sure that the regulatory authorities that deal with this, and that need to find the right balance between promoting innovation and mitigating and preventing risks, have the capacity to do this. So this requires investments, and it also requires new ways to regulate this technology more flexibly, for example through regulatory sandboxes and so on.

SULLIVAN: Could you just expand upon that a bit and double-click on what it is you’re seeing there? What excites you about what’s happening in that space?

MINSSEN: Yes, well, the research of my group at the Center for Advanced Studies in Bioscience Innovation Law is very broad. I mean, we are looking into gene editing technologies. We are looking into new biologics. We are looking into medical devices, as well, obviously, but also other technologies in advanced medical computing.

And what we see across the line here is that there is an increasing demand for more adaptive and flexible regulatory frameworks for these new technologies, in particular when they have new uses: regulations that focus more on the product rather than the process. And I have recently written a report for the EU Commission, for example, on emerging biotechnologies and bio-solutions. And even in that area, regulatory sandboxes are increasingly important, increasingly considered.

So this idea of regulatory sandboxes has been developing originally in the financial sector, and it is now penetrating into other sectors, including synthetic biology, emerging biotechnologies, gene editing, AI, quantum technology, as well. This is basically creating an environment where actors can test new ideas in close collaboration and under the oversight of regulatory authorities.

But to implement this in the AI sector now also leads us to a lot of questions and challenges. For example, you need to have the capacities of authorities that are governing and monitoring and deciding on these regulatory sandboxes. There are issues relating to competition law, for example, which you call antitrust law in the US, because the question is, who can enter the sandbox and how may they compete after they exit the sandbox? And there are many questions relating to, how should we work with these sandboxes and how should we implement these sandboxes?

[TRANSITION MUSIC] 

SULLIVAN: Well, Timo, it has just been such a pleasure to speak with you today.

MINSSEN: Yes, thank you very much. 

SULLIVAN: And now I’m happy to introduce Chad Atalla.

Chad is a senior applied scientist in Microsoft Research New York City’s Sociotechnical Alignment Center, where they contribute to foundational responsible AI research and practical responsible AI solutions for teams across Microsoft.

Chad, welcome!

CHAD ATALLA: Thank you.

SULLIVAN: So we’ll kick off with a couple questions just to dive right in. So tell me a little bit more about the Sociotechnical Alignment Center, or STAC. I know it was founded in 2022. I’d love to just learn a little bit more about what the group does, how you’re thinking about evaluating AI, and maybe just give us a sense of some of the projects you’re working on.

ATALLA: Yeah, absolutely. The name is quite a mouthful.

SULLIVAN: It is! [LAUGHS] 

ATALLA: So let’s start by breaking that down and seeing what that means.

SULLIVAN: Great.

ATALLA: So modern AI systems are sociotechnical systems, meaning that the social and technical aspects are deeply intertwined. And we’re interested in aligning the behaviors of these sociotechnical systems with some values. Those could be societal values; they could be regulatory values, organizational values, etc. And to make this alignment happen, we need the ability to evaluate the systems.

So my team is broadly working on an evaluation framework that acknowledges the sociotechnical nature of the technology and the often-abstract nature of the concepts we’re actually interested in evaluating. As you noted, it’s an applied science team, so we split our time between some fundamental research and time to bridge the work into real products across the company. And I also want to note that to power this sort of work, we have an interdisciplinary team drawing upon the social sciences, linguistics, statistics, and, of course, computer science.

SULLIVAN: Well, I’m eager to get into our takeaways from the conversation with both Daniel and Timo. But maybe just to double-click on this for a minute, can you talk a bit about some of the overarching goals of the AI evaluations that you noted? 

ATALLA: So evaluation is really the act of making valuative judgments based on some evidence, and in the case of AI evaluation, that evidence might be from tests or measurements, right. And the goal of why we’re doing this in the first place is to make decisions and claims most often.

So perhaps I am going to make a claim about a model that I’m producing, and I want to say that it’s better than this other model. Or we are asking whether a certain product is safe to ship. All of these decisions need to be informed by good evaluation and therefore good measurement or testing. And I’ll also note that in the regulatory conversation, risk is often what we want to evaluate. So that is a goal in and of itself. And I’ll touch more on that later.

SULLIVAN: I read a recent paper that you had put out with some of our colleagues from Microsoft Research, the University of Michigan, and Stanford, and you were arguing that evaluating generative AI is a social science measurement challenge. Maybe for those who haven’t read the paper, what does this mean? And can you tell us a little bit more about what motivated you and your coauthors? 

ATALLA: So the measurement tasks involved in evaluating generative AI systems are often abstract and contested. So that means they cannot be directly measured and must instead [be] indirectly measured via other observable phenomena. So this is very different than the older machine learning paradigm, where, let’s say, for example, I had a system that took a picture of a traffic light and told you whether it was green, yellow, or red at a given time. 

If we wanted to evaluate that system, the task is much simpler. But with the modern generative AI systems that are also general purpose, they have open-ended output, and language in a whole chat or multiple paragraphs being outputted can have a lot of different properties. And as I noted, these are general-purpose systems, so we don’t know exactly what task they’re supposed to be carrying out.

So then the question becomes, if I want to make some decision or claim—maybe I want to make a claim that this system has human-level reasoning capabilities—well, what does that mean? Do I have the same impression of what that means as you do? And how do we know whether the downstream, you know, measurements and tests that I’m conducting actually will support my notion of what it means to have human-level reasoning, right? Difficult questions. But luckily, social scientists have been dealing with these exact sorts of challenges for multiple decades in fields like education, political science, and psychometrics. So we’re really attempting to avoid reinventing the wheel here and trying to learn from their past methodologies.

And so the rest of the paper goes on to delve into a four-level framework, a measurement framework, that’s grounded in the measurement theory from the quantitative social sciences that takes us all the way from these abstract and contested concepts through processes to get much clearer and eventually reach reliable and valid measurements that can power our evaluations.

SULLIVAN: I love that. I mean, that’s the whole point of this podcast, too, right: to really build on those other learnings and frameworks that we’re taking from industries that have been thinking about this for much longer. Maybe from your vantage point, what are some of the biggest day-to-day hurdles in building solid AI evaluations? And, I don’t know, do we need more shared standards, or are bespoke methods the way to go? I would love to just hear your thoughts on that.

ATALLA: So let’s talk about some of those practical challenges. And I want to briefly go back to what I mentioned about risk before, all right. Oftentimes, some of the regulatory environment is requiring practitioners to measure the risk involved in deploying one of their models or AI systems. Now, risk is importantly a concept that includes both event and impact, right. So there’s the probability of some event occurring. For the case of AI evaluation, perhaps this is us seeing a certain AI behavior exhibited. Then there’s also the severity of the impacts, and this is a complex chain of effects in the real world that happen to people, organizations, systems, etc., and it’s a lot more challenging to observe the impacts, right.

So if we’re saying that we need to measure risk, we have to measure both the event and the impacts. But realistically, right now, the field is not doing a very good job of actually measuring the impacts. This requires vastly different techniques and methodologies where if I just wanted to measure something about the event itself, I can, you know, do that in a technical sandbox environment and perhaps have some automated methods to detect whether a certain AI behavior is being exhibited. But if I want to measure the impacts? Now, we’re in the realm of needing to have real people involved, and perhaps a longitudinal study where you have interviews, questionnaires, and more qualitative evidence-gathering techniques to truly understand the long-term impacts. So that’s a significant challenge.

Another is that, you know, let’s say we forget about the impacts for now and we focus on the event side of things. Still, we need datasets, we need annotations, and we need metrics to make this whole thing work. When I say we need datasets, if I want to test whether my system has good mathematical reasoning, what questions should I ask? What are my set of inputs that are relevant? And then when I get the responses from the system, how do I annotate them? How do I know if it was a good response that did demonstrate mathematical reasoning or if it was a mediocre response? And then once I have an annotation of all of these outputs from the AI system, how do I aggregate those all up into a single informative number?
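To make that pipeline concrete, here is a minimal sketch, in Python, of the dataset-annotation-aggregation loop described above, plus a toy risk score that multiplies an event probability by an assumed impact severity. The prompts, the get_system_response placeholder, the annotate rule, and the severity weight are all hypothetical illustrations of the concepts discussed, not part of any actual Microsoft evaluation framework.

```python
# Minimal sketch of an AI evaluation pipeline: dataset -> annotation -> aggregation.
# Everything below (prompts, scoring rule, severity weight) is a hypothetical
# illustration, not a real evaluation framework.

from statistics import mean

# 1. Dataset: inputs chosen to probe the concept we claim to measure
#    (here, a toy stand-in for "mathematical reasoning").
dataset = [
    {"prompt": "What is 17 * 24?", "expected": "408"},
    {"prompt": "A train travels 60 km in 45 minutes. What is its speed in km/h?", "expected": "80"},
]

def get_system_response(prompt: str) -> str:
    """Placeholder for a call to the AI system under evaluation."""
    return "408" if "17" in prompt else "75"

def annotate(response: str, expected: str) -> int:
    """Annotation: 1 if the response contains the expected answer, else 0.
    Real annotation is usually far richer (rubrics, human raters, etc.)."""
    return int(expected in response)

# 2. Collect an annotation for every item in the dataset.
scores = [
    annotate(get_system_response(item["prompt"]), item["expected"])
    for item in dataset
]

# 3. Aggregate the annotations into a single informative number.
accuracy = mean(scores)

# 4. Toy risk estimate: probability of an undesired event times an assumed
#    impact severity on a 0-1 scale (the severity value is a made-up stand-in
#    for the much harder problem of measuring real-world impact).
event_probability = 1 - accuracy      # chance of an incorrect answer
assumed_impact_severity = 0.3         # hypothetical severity weight
risk_score = event_probability * assumed_impact_severity

print(f"accuracy={accuracy:.2f}, risk={risk_score:.2f}")
```

Even in this toy form, the open questions from the conversation are visible: which inputs belong in the dataset, what counts as a good annotation, and whether a simple mean is the right aggregation.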

SULLIVAN: Earlier in this episode, we heard Daniel and Timo walk through the regulatory frameworks in pharma and medical devices. I’d be curious what pieces of those mature systems are already showing up or at least may be bubbling up in AI governance.

ATALLA: Great question. You know, Timo was talking about the pre-market and post-market testing difference. Of course, this is similarly important in the AI evaluation space. But again, these have different methodologies and serve different purposes.

So within the pre-deployment phase, we don’t have evidence of how people are going to use the system. And when we have these general-purpose AI systems, to understand what the risks are, we really need to have a sense of what might happen and how they might be used. So there are significant challenges there where I think we can learn from other fields and how they do pre-market testing. And the difference in that pre- versus post-market testing also ties to testing at different stages in the life cycle.

For AI systems, we already see some regulations saying you need to start with the base model and do some evaluation of the base model, some basic attributes, some core attributes, of that base model before you start putting it into any real products. But once we have a product in mind, we have a user base in mind, we have a specific task—like maybe we’re going to integrate this model into Outlook and it’s going to help you write emails—now we suddenly have a much crisper picture of how the system will interact with the world around it. And again, at that stage, we need to think about another round of evaluation.

Another part that jumped out to me in what they were saying about pharmaceuticals is that sometimes approvals can be based on surrogate endpoints. So this is like we’re choosing some heuristic. Instead of measuring the long-term impact, which is what we actually care about, perhaps we have a proxy that we feel like is a good enough indicator of what that long-term impact might look like.  

This is occurring in the AI evaluation space right now and is often perhaps even the default here since we’re not seeing that many studies of the long-term impact itself. We are seeing, instead, folks constructing these heuristics or proxies and saying if I see this behavior happen, I’m going to assume that it indicates this sort of impact will happen downstream. And that’s great. It’s one of the techniques that was used to speed up and reduce the barrier to innovation in the other fields. And I think it’s great that we are applying that in the AI evaluation space. But special care is, of course, needed to ensure that those heuristics and proxies you’re using are reasonable indicators of the greater outcome you’re looking for.

SULLIVAN: What are some of the promising ideas from maybe pharma or med device regulation that maybe haven’t made it to AI testing yet and maybe should? And where would you urge technologists, policymakers, and researchers to focus their energy next?

ATALLA: Well, one of the key things that jumped out to me in the discussion about pharmaceuticals was driving home the emphasis that there is a holistic focus on safety and efficacy. These go hand in hand and decisions must be made while considering both pieces of the picture. I would like to see that further emphasized in the AI evaluation space.

Often, we are seeing evaluations of risk being separated from evaluations of performance or quality or efficacy, but these two pieces of the puzzle really are not enough for us to make informed decisions independently. And that ties back into my desire to really also see us measuring the impacts.

So we see Phase 3 trials as something that occurs in the medical devices and pharmaceuticals field. That’s not something that we are doing an equivalent of in the AI evaluation space at this time. These are really cost intensive. They can last years and really involve careful monitoring of that holistic picture of safety and efficacy. And realistically, we are not going to be able to put that on the critical path to getting specific individual AI models or AI systems vetted before they go out into the world. However, I would love to see a world in which this sort of work is prioritized and funded or required. Think of how, with social media, it took quite a long time for us to understand that there are some long-term negative impacts on mental health, and we have the opportunity now, while the AI wave is still building, to start prioritizing and funding this sort of work. Let it run in the background and as soon as possible develop a good understanding of the subtle, long-term effects.

More broadly, I would love to see us focus on reliability and validity of the evaluations we’re conducting because trust in these decisions and claims is important. If we don’t focus on building reliable, valid, and trustworthy evaluations, we’re just going to continue to be flooded by a bunch of competing, conflicting, and largely meaningless AI evaluations.

SULLIVAN: In a number of the discussions we’ve had on this podcast, we talked about how it’s not just one entity that really needs to ensure safety across the board, and I’d just love to hear from you how you think about some of those ecosystem collaborations, and you know, from across … where we think about ourselves as more of a platform company or places that these AI models are being deployed more at the application level. Tell me a little bit about how you think about, sort of, stakeholders in that mix and where responsibility lies across the board.

ATALLA: It’s interesting. In this age of general-purpose AI technologies, we’re often seeing one company or organization being responsible for building the foundational model. And then many, many other people will take that model and build it into specific products that are designed for specific tasks and contexts.

Of course, in that, we already see that there is a responsibility of the owners of that foundational model to do some testing of the central model before they distribute it broadly. And then again, there is responsibility of all of the downstream individuals digesting that and turning it into products to consider the specific contexts that they are deploying into and how that may affect the risks we’re concerned with or the types of quality and safety and performance we need to evaluate.

Again, because that field of risks we may be concerned with is so broad, some of them also require an immense amount of expertise. Let’s think about whether AI systems can enable people to create dangerous chemicals or dangerous weapons at home. It’s not that every AI practitioner is going to have the knowledge to evaluate this, so in some of those cases, we really need third-party experts, people who are experts in chemistry, biology, etc., to come in and evaluate certain systems and models for those specific risks, as well.

So I think there are many reasons why multiple stakeholders need to be involved, partly from who owns what and is responsible for what and partly from the perspective of who has the expertise to meaningfully construct the evaluations that we need.

SULLIVAN: Well, Chad, this has just been great to connect, and in a few of our discussions, we’ve done a bit of a lightning round, so I’d love to just hear your 30-second responses to a few of these questions. Perhaps favorite evaluation you’ve run so far this year? 

ATALLA: So I’ve been involved in trying to evaluate some language models for whether they infer sensitive attributes about people. So perhaps you’re chatting with a chatbot, and it infers your religion or sexuality based on things you’re saying or how you sound, right. And in working to evaluate this, we encounter a lot of interesting questions. Like, what is a sensitive attribute? What makes these attributes sensitive, and what are the differences that make it inappropriate for an AI system to infer these things about a person? Whereas realistically, whenever I meet a person on the street, my brain is immediately forming first impressions and some assumptions about them. So it’s a very interesting and thought-provoking evaluation to conduct and think about the norms that we place upon people interacting with other people and the norms we place upon AI systems interacting with people.

SULLIVAN: That’s fascinating! I’d love to hear the AI buzzword you’d retire tomorrow. [LAUGHTER]

ATALLA: I would love to see the term “bias” being used less when referring to fairness-related issues in AI systems. Bias happens to be a highly overloaded term in statistics and machine learning and has a lot of technical meanings and just fails to perfectly capture what we mean in the AI risk sense.

SULLIVAN: And last one. One metric we’re not tracking enough.

ATALLA: I would say over-blocking, and this comes into that connection between the holistic picture of safety and efficacy. It’s too easy to produce systems that throw safety to the wind and focus purely on utility or achieving some goal, but simultaneously, the other side of the picture is possible, where we can clamp down too hard and reduce the utility of our systems and block even benign and useful outputs just because they border on something sensitive. So it’s important for us to track that over-blocking and actively track that tradeoff between safety and efficacy.

SULLIVAN: Yeah, we talk a lot about this on the podcast, too, of how do you both make things safe but also ensure innovation can thrive, and I think you hit the nail on the head with that last piece.

[MUSIC] 

Well, Chad, this was really terrific. Thanks for joining us and thanks for your work and your perspectives. And another big thanks to Daniel and Timo for setting the stage earlier in the podcast.

And to our listeners, thanks for tuning in. You can find resources related to this podcast in the show notes. And if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI. 

See you next time! 

[MUSIC FADES]


AI Testing and Evaluation: Learnings from genome editing


illustration of R. Alta Charo, Kathleen Sullivan, and Daniel Kluttz for the Microsoft Research Podcast

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In this episode, Alta Charo, emerita professor of law and bioethics at the University of Wisconsin–Madison, joins Sullivan for a conversation on the evolving landscape of genome editing and its regulatory implications. Drawing on decades of experience in biotechnology policy, Charo emphasizes the importance of distinguishing between hazards and risks and describes the field’s approach to regulating applications of technology rather than the technology itself. The discussion also explores opportunities and challenges in biotech’s multi-agency oversight model and the role of international coordination. Later, Daniel Kluttz, a partner general manager in Microsoft’s Office of Responsible AI, joins Sullivan to discuss how insights from genome editing could inform more nuanced and robust governance frameworks for emerging technologies like AI.

Transcript

[MUSIC]

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan.

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use.

[MUSIC ENDS]

Today I’m excited to welcome R. Alta Charo, the Warren P. Knowles Professor Emerita of Law and Bioethics at the University of Wisconsin–Madison, to explore testing and risk assessment in genome editing.

Professor Charo has been at the forefront of biotechnology policy and governance for decades, advising former President Obama’s transition team on issues of medical research and public health, as well as serving as a senior policy advisor at the Food and Drug Administration. She consults on gene therapy and genome editing for various companies and organizations and has held positions on a number of advisory committees, including for the National Academy of Sciences. Her committee work has spanned women’s health, stem cell research, genome editing, biosecurity, and more.

After our conversation with Professor Charo, we’ll hear from Daniel Kluttz, a partner general manager in Microsoft’s Office of Responsible AI, about what these insights from biotech regulation could mean for AI governance and risk assessment and his team’s work governing sensitive AI uses and emerging technologies.

Alta, thank you so much for being here today. I’m a follower of your work and have really been looking forward to our conversation.


ALTA CHARO: It’s my pleasure. Thanks for having me.

SULLIVAN: Alta, I’d love to begin by stepping back in time a bit before you became a leading figure in bioethics and legal policy. You’ve shared that your interest in science was really inspired by your brothers’ interest in the topic and that your upbringing really helped shape your perseverance and resilience. Can you talk to us about what put you on the path to law and policy?

CHARO: Well, I think it’s true that many of us are strongly influenced by our families and certainly my family had, kind of, a science-y, techy orientation. My father was a refugee, you know, escaping the Nazis, and when he finally was able to start working in the United States, he took advantage of the G.I. Bill to learn how to repair televisions and radios, which were really just coming in in the 1950s. So he was, kind of, technically oriented.

My mother retrained from being a talented amateur artist to becoming a math teacher, and not surprisingly, both my brothers began to aim toward things like engineering and chemistry and physics. And our form of entertainment was to watch PBS or Star Trek. [LAUGHTER]

And so the interest comes from that background coupled with, in the 1960s, this enormous surge of interest in the so-called nature-versus-nurture debate about the degree to which we are destined by our biology or shaped by our environments. It was a heady debate, and one that perfectly combined the two interests in politics and science.

SULLIVAN: For listeners who are brand new to your field in genomic editing, can you give us what I’ll call a “90-second survey” of the space in perhaps plain language and why it’s important to have a framework for ensuring its responsible use.

CHARO: Well, you know, genome editing is both very old and very new. At base, what we’re talking about is a way to either delete sections of the genome, our collection of genes, or to add things or to alter what’s there. The goal is simply to be able to take what might not be healthy and make it healthy, whether it’s a plant, an animal, or a human.

Many people have compared it to a word processor, where you can edit text by swapping things in and out. You could change the letter g to the letter h in every word, and in our genomes, you can do similar kinds of things.

But because of this, we have a responsibility to make sure that whatever we change doesn’t become dangerous and that it doesn’t become socially disruptive. Now the earliest forms of genome editing were very inefficient, and so we didn’t worry that much. But with the advances that were spearheaded by people like Jennifer Doudna and Emmanuelle Charpentier, who won the Nobel Prize for their work in this area, genome editing has become much easier to do.

It’s become more efficient. It doesn’t require as much sophisticated laboratory equipment. It’s moved from being something that only a few people can do to something that we’re going to be seeing in our junior high school biology labs. And that means you have to pay attention to who’s doing it, why are they doing it, what are they releasing, if anything, into the environment, what are they trying to sell, and is it honest and is it safe?

SULLIVAN: How would you describe the risks, and are there, you know, sort of, specifically inherent risks in the technology itself, or do those risks really emerge only when it’s applied in certain contexts, like CRISPR in agriculture or CRISPR for human therapies?

CHARO: Well, to answer that, I’m going to do something that may seem a little picky, even pedantic. [LAUGHTER] But I’m going to distinguish between hazards and risks. So there are certain intrinsic hazards. That is, there are things that can go wrong.

You want to change one particular gene or one particular portion of a gene, and you might accidentally change something else, a so-called off-target effect. Or you might change something in a gene expecting a certain effect but not necessarily anticipating that there’s going to be an interaction between what you changed and what was there, a gene-gene interaction, that might have an unanticipated kind of result, a side effect essentially.

So there are some intrinsic hazards, but risk is a hazard coupled with the probability that it’s going to actually create something harmful. And that really depends upon the application.

If you are doing something that is making a change in a human being that is going to be a lifelong change, that enhances the significance of that hazard. It amplifies what I call the risk because if something goes wrong, then its consequences are greater.

It may also be that in other settings, what you’re doing is going to have a much lower risk because you’re working with a more familiar substance, your predictive power is much greater, and it’s not going into a human or an animal or into the environment. So I think that you have to say that the risk and the benefits, by the way, all are going to depend upon the particular application.

SULLIVAN: Yeah, I think on this point of application, there’s many players involved in that, right. Like, we often hear about this puzzle of who’s actually responsible for ensuring safety and a reasonable balance between risks and benefits or hazards and benefits, to quote you. Is it the scientists, the biotech companies, government agencies? And then if you could touch upon, as well, maybe how does the nature of genome editing risks … how do those responsibilities get divvied up?

CHARO: Well, in the 1980s, we had a very significant policy discussion about whether we should regulate the technology—no matter how it’s used or for whatever purpose—or if we should simply fold the technology in with all the other technologies that we currently have and regulate its applications the way we regulate applications generally. And we went for the second, the so-called coordinated framework.

So what we have in the United States is a system in which if you use genome editing in purely laboratory-based work, then you will be regulated the way we regulate laboratories.

There’s also, at most universities because of the way the government works with this, something called Institutional Biosafety Committees, IBCs. If you want to do research that involves recombinant DNA and modern biotechnology, including genome editing but not limited to it, you have to go first to your IBC, and they look and see what you’re doing to decide if there’s a danger there that you have not anticipated that requires special attention.

If what you’re doing is going to get released into the environment or it’s going to be used to change an animal that’s going to be in the environment, then there are agencies that oversee the safety of our environment, predominantly the Environmental Protection Agency and the U.S. Department of Agriculture.

If you’re working with humans and you’re doing medical therapies, like the gene therapies that have just been developed for things like sickle cell anemia, then you have to go through a very elaborate regulatory process that’s overseen by the Food and Drug Administration and, at the research stages, also overseen locally by institutional review boards that make sure the people who are being recruited into research understand what they’re getting into, that they’re the right people to be recruited, etc.

So we do have this kind of Jenga game …

SULLIVAN: [LAUGHS] Yeah, sounds like it.

CHARO: … of regulatory agencies. And on top of all that, most of this involves professionals who’ve had to be licensed in some way. There may be state laws specifically on licensing. If you are dealing with things that might cross national borders, there may be international treaties and agreements that cover this.

And, of course, the insurance industry plays a big part because they decide whether or not what you’re doing is safe enough to be insured. So all of these things come together in a way that is not at all easy to understand if you’re not, kind of, working in the field. But the bottom-line thing to remember, the way to really think about it is, we don’t regulate genome editing; we regulate the things that use genome editing.

SULLIVAN: Yeah, that makes a lot of sense. Actually, maybe just following up a little bit on this notion of a variety of different, particularly like government agencies being involved. You know, in this multi-stakeholder model, where do you see gaps today that need to be filled, some of the pros and cons to keep in mind, and, you know, just as we think about distributing these systems at a global level, like, what are some of the considerations you are keeping in mind on that front?

CHARO: Well, certainly there are times where the way the statutes were written that govern the regulation of drugs or the regulation of foods did not anticipate this tremendous capacity we now have in the area of biotechnology generally or genome editing in particular. And so you can find that there are times where it feels a little bit ambiguous, and the agencies have to figure out how to apply their existing rules.

So an example. If you’re going to make alterations in an animal, right, we have a system for regulating drugs, including veterinary drugs. But we didn’t have something that regulated genome editing of animals. But in a sense, genome editing of an animal is the same thing as using a veterinary drug. You’re trying to affect the animal’s physical constitution in some fashion.

And it took a long time within the FDA to, sort of, work out how the regulation of veterinary drugs would apply if you think about the genetic construct that’s being used to alter the animal as the same thing as injecting a chemically based drug. And on that basis, they now know here’s the regulatory path—here are the tests you have to do; here are the permissions you have to get; here’s the surveillance you have to do after it goes on the market.

Even there, sometimes, it was confusing. What happens when it’s not the kind of animal you’re thinking about when you think about animal drugs? Like, we think about pigs and dogs, but what about mosquitoes?

Because there, you’re really thinking more about pests, and if you’re editing the mosquito so that it can’t, for example, transmit dengue fever, right, it feels more like a public health thing than it is a drug for the mosquito itself, and it, kind of, fell in between the agencies that possibly had jurisdiction. And it took a while for the USDA, the Department of Agriculture, and the Food and Drug Administration to work out an agreement about how they would share this responsibility. So you do get those kinds of areas in which you have at least ambiguity.

We also have situations where frankly the fact that some things can move across national borders means you have to have a system for harmonizing or coordinating national rules. If you want to, for example, genetically engineer mosquitoes that can’t transmit dengue, mosquitoes have a tendency to fly. [LAUGHTER] And so … they can’t fly very far. That’s good. That actually makes it easier to control.

But if you’re doing work that’s right near a border, then you have to be sure that the country next to you has the same rules for whether it’s permitted to do this and how to surveil what you’ve done in order to be sure that you got the results you wanted to get and no other results. And that also is an area where we have a lot of work to be done in terms of coordinating across government borders and harmonizing our rules.

SULLIVAN: Yeah, I mean, you’ve touched on this a little bit, but there is such this striking balance between advancing technology, ensuring public safety, and sometimes, I think it feels just like you’re walking a tightrope where, you know, if we clamp down too hard, we’ll stifle innovation, and if we’re too lax, we risk some of these unintended consequences. And on a global scale like you just mentioned, as well. How has the field of genome editing found its balance?

CHARO: It’s still being worked out, frankly, but it’s finding its balance application by application. So in the United States, we have two very different approaches on regulation of things that are going to go into the market.

Some things can’t be marketed until they’ve gotten an approval from the government. So you come up with a new drug, you can’t sell that until it’s gone through FDA approval.

On the other hand, for most foods that are made up of familiar kinds of things, you can go on the market, and it’s only after they’re on the market that the FDA can act to withdraw it if a problem arises. So basically, we have either pre-market controls: you can’t go on without permission. Or post-market controls: we can take you off the market if a problem occurs.

How do we decide which one is appropriate for a particular application? It’s based on our experience. New drugs typically are both less familiar than existing things on the market and also have a higher potential for injury if they, in fact, are not effective or they are, in fact, dangerous and toxic.

If you have foods, even bioengineered foods, that are basically the same as foods that are already here, it can go on the market with notice but without a prior approval. But if you create something truly novel, then it has to go through a whole long process.

And so that is the way that we make this balance. We look at the application area. And we’re just now seeing in the Department of Agriculture a new approach on some of the animal editing, again, to try and distinguish between things that are simply a more efficient way to make a familiar kind of animal variant and those things that are genuinely novel and to have a regulatory process that is more rigid the more unfamiliar it is and the more that we see a risk associated with it.

SULLIVAN: I know we’re at the end of our time here and maybe just a quick kind of lightning-round of a question. For students, young scientists, lawyers, or maybe even entrepreneurs listening who are inspired by your work, what’s the single piece of advice you give them if they’re interested in policy, regulation, the ethical side of things in genomics or other fields?

CHARO: I’d say be a bio-optimist and read a lot of science fiction. Because it expands your imagination about what the world could be like. Is it going to be a world in which we’re now going to be growing our buildings instead of building them out of concrete?

Is it going to be a world in which our plants will glow in the evening so we don’t need to be using batteries or electrical power from other sources but instead our environment is adapting to our needs?

You know, expand your imagination with a sense of optimism about what could be and see ethics and regulation not as an obstacle but as a partner to bringing these things to fruition in a way that’s responsible and helpful to everyone.

[TRANSITION MUSIC]

SULLIVAN: Wonderful. Well, Alta, this has been just an absolute pleasure. So thank you.

CHARO: It was my pleasure. Thank you for having me.

SULLIVAN: Now, I’m happy to bring in Daniel Kluttz. As a partner general manager in Microsoft’s Office of Responsible AI, Daniel leads the group’s Sensitive Uses and Emerging Technologies program.

Daniel, it’s great to have you here. Thanks for coming in.

DANIEL KLUTTZ: It’s great to be here, Kathleen.

SULLIVAN: Yeah. So maybe before we unpack Alta Charo’s insights, I’d love to just understand the elevator pitch here. What exactly is [the] Sensitive Uses and Emerging Tech program, and what was the impetus for establishing it?

KLUTTZ: Yeah. So the Sensitive Uses and Emerging Technologies program sits within our Office of Responsible AI at Microsoft. And inherent in the name, there are two real core functions. There’s the sensitive uses and emerging technologies. What does that mean?

Sensitive uses, think of that as Microsoft’s internal consulting and oversight function for our higher-risk, most impactful AI system deployments. And so my team is a team of multidisciplinary experts who engages in sort of a white-glove-treatment sort of way with product teams at Microsoft that are designing, building, and deploying these higher-risk AI systems, and where that sort of consulting journey culminates is in a set of bespoke requirements tailored to the use case of that given system that really implement and apply our more standardized, generalized requirements that apply across the board.

Then the emerging technologies function of my team faces a little bit further out, trying to look around corners to see what new and novel and emerging risks are coming out of new AI technologies with the idea that we work with our researchers, our engineering partners, and, of course, product leaders across the company to understand where Microsoft is going with those emerging technologies, and we’re developing sort of rapid, quick-fire early-steer guidance that implements our policies ahead of that formal internal policymaking process, which can take a bit of time. So it’s designed to, sort of, both afford that innovation speed that we like to optimize for at Microsoft but also integrate our responsible AI commitments and our AI principles into emerging product development.

SULLIVAN: That segues really nicely, actually, as we met with Professor Charo and she was, you know, talking about the field of genome editing and the governing at the application level. I’d love to just understand how similar or not is that to managing the risks of AI in our world?

KLUTTZ: Yeah. I mean, Professor Charo’s comments were music to my ears because, you know, where we make our bread and butter, so to speak, in our team is in applying to use cases. AI systems, especially in this era of generative AI, are almost inherently multi-use, dual use. And so what really matters is how you’re going to apply that more general-purpose technology. Who’s going to use it? In what domain is it going to be deployed? And then tailor that oversight to those use cases. Try to be risk proportionate.

Professor Charo talked a little bit about this, but if it’s something that’s been done before and it’s just a new spin on an old thing, maybe we’re not so concerned about how closely we need to oversee and gate that application of that technology, whereas if it’s something new and novel or some new risk that might be posed by that technology, we take a little bit closer look and we are overseeing that in a more sort of high-touch way.

SULLIVAN: Maybe following up on that, I mean, how do you define sensitive use or maybe like high-impact application, and once that’s labeled, what happens? Like, what kind of steps kick in from there?

KLUTTZ: Yeah. So we have this Sensitive Uses program that’s been at Microsoft since 2019. I came to Microsoft in 2019 when we were starting this program in the Office of Responsible AI, and it had actually been incubated in Microsoft Research with our Aether community of colleagues who are experts in sociotechnical approaches to responsible AI, as well. Once we put it in the Office of Responsible AI, I came over. I came from academia. I was a researcher myself …

SULLIVAN: At Berkeley, right?

KLUTTZ: At Berkeley. That’s right. Yep. Sociologist by training and a lawyer in a past life. [LAUGHTER] But that has helped sort of bridge those fields for me.

But Sensitive Uses, we force all of our teams when they’re envisioning their system design to think about, could the reasonably foreseeable use or misuse of the system that they’re developing in practice result in three really major, sort of, risk types. One is, could that deployment result in a consequential impact on someone’s legal position or life opportunity? Another category we have is, could that foreseeable use or misuse result in significant psychological or physical injury or harm? And then the third really ties in with a longstanding commitment we’ve had to human rights at Microsoft. And so could that system in its reasonably foreseeable use or misuse result in human rights impacts and injurious consequences to folks along different dimensions of human rights?

Once you decide, we have a process for reporting that project into my office, and we will triage that project, working with the product team, for example, and our Responsible AI Champs community, which are folks who are dispersed throughout the ecosystem at Microsoft and educated in our responsible AI program, and then determine, OK, is it in scope for our program? If it is, we say, OK, we’re going to go along for that ride with you, and then we get into that whole sort of consulting arrangement that then culminates in this set of bespoke use-case-based requirements applying our AI principles.

SULLIVAN: That’s super fascinating. What are some of the approaches in the governance of genome editing are you maybe seeing happening in AI governance or maybe just, like, bubbling up in conversations around it?

KLUTTZ: Yeah, I mean, I think we’ve learned a lot from fields like genome editing that Professor Charo talked about and others. And again, it gets back to this, sort of, risk-proportionate-based approach. It’s a balancing test. It’s a tradeoff of trying to, sort of, foster innovation and really look for the beneficial uses of these technologies. I appreciated her speaking about that. What are the intended uses of the system, right? And then getting to, OK, how do we balance trying to, again, foster that innovation in a very fast-moving space, a pretty complex space, and a very unsettled space contrasting to other, sort of, professional fields or technological fields that have a long history and are relatively settled from an oversight and regulatory standpoint? This one is not, and for good reason. It is still developing.

And I think, you know, there are certain oversight and policy regimes that exist today that can be applied. Professor Charo talked about this, as well, where, you know, maybe you have certain policy and oversight regimes that, depending on how that technology is applied, apply there versus some horizontal, overarching regulatory sort of framework. And I think that applies from an internal governance standpoint, as well.

SULLIVAN: Yeah. It’s a great point. So what isn’t being explored from genome editing that, you know, maybe we think could be useful to AI governance, or as we think about the evolving frameworks …

KLUTTZ: Yeah.

SULLIVAN: … what maybe we should be taking into account from what Professor Charo shared with us?

KLUTTZ: So one of the things I’ve thought about and took from Professor Charo’s discussion was she had just this amazing way of framing up how genome editing regulation is done. And she said, you know, we don’t regulate genome editing; we regulate the things that use genome editing. And while it’s not a one-to-one analogy with the AI space because we do have this sort of very general model level distinction versus application layer and even platform layer distinctions, I think it’s fair to say, you know, we don’t regulate AI applications writ large. We regulate the things that use AI in a very similar way. And that’s how we think of our internal policy and oversight process at Microsoft, as well.

And maybe there are things that we regulated and oversaw internally at the first instance and the first time we saw it come through, and it graduates into more of a programmatic framework for how we manage that. So one good example of that is some of our higher-risk AI systems that we offer out of Azure at the platform level. When I say that, I mean APIs that you call that developers can then build their own applications on top of. We were really deep in evaluating and assessing mitigations on those platform systems in the first instance, but we also graduated them into what we call our Limited Access AI services program.

And some of the things that Professor Charo discussed really resonated with me. You know, she had this moment where she was mentioning how, you know, you want to know who’s using your tools and how they’re being used. And it’s the same concepts. We want to have trust in our customers, we want to understand their use cases, and we want to apply technical controls that, sort of, force those use cases or give us signal post-deployment that use cases are being done in a way that may give us some level of concern, to reach out and understand what those use cases are.

SULLIVAN: Yeah, you’re hitting on a great point. And I love this kind of layered approach that we’re taking and that Alta highlighted, as well. Maybe to double-click a little bit just on that post-market control and what we’re tracking, kind of, once things are out and being used by our customers. How do we take some of that deployment data and bring it back in to maybe even better inform upfront governance or just how we think about some of the frameworks that we’re operating in?

KLUTTZ: It’s a great question. The number one thing is for us at Microsoft, we want to know the voice of our customer. We want our customers to talk to us. We don’t want to just understand telemetry and data. But it’s really getting out there and understanding from our customers and not just our customers. I would say our stakeholders is maybe a better term because that includes civil society organizations. It includes governments. It includes all of these non, sort of, customer actors that we care about and that we’re trying to sort of optimize for, as well. It includes end users of our enterprise customers. If we can gather data about how our products are being used and try to understand areas we didn’t foresee in how customers or users might be using those things, then we can tune those systems to better align with what customers and users want as well as with our own AI principles and policies and programs.

SULLIVAN: Daniel, before coming to Microsoft, you led social science research and sociotechnical applications of AI-driven tech at Berkeley. What do you think some of the biggest challenges are in defining and maybe even just, kind of, measuring at, like, a societal level some of the impacts of AI more broadly?

KLUTTZ: Measuring social phenomena is a difficult thing. And one of the things that, as social scientists, you’re very interested in is scientifically observing and measuring social phenomena. Well, that sounds great. It sounds also very high level and jargony. What do we mean by that? You know, it’s very easy to say that you’re collecting data and you’re measuring, I don’t know, trust in AI, right? That’s a very fuzzy concept.

SULLIVAN: Right. Definitely.

KLUTTZ: It is a concept that we want to get to, but we have to unpack that, and we have to develop what we call measurable constructs. What are the things that we might observe that could give us an indication toward what is a very fuzzy and general concept. And there’s challenges with that everywhere. And I’m extremely fortunate to work at Microsoft with some of the world’s leading sociotechnical researchers and some of these folks who are thinking about—you know, very steeped in measurement theory, literally PhDs in these fields—how to both measure and allow for a scalable way to do that at a place the size of Microsoft. And that is trying to develop frameworks that are scalable and repeatable and put into our platform that then serves our product teams. Are we providing, as a platform, a service to those product teams that they can plug in and do their automated evaluations at scale as much as possible and then go back in over the top and do some of your more qualitative targeted testing and evaluations.

SULLIVAN: Yeah, makes a lot of sense. Before we close out, if you’re game for it, maybe we do a quick lightning round. Just 30-second answers here. Favorite real-world sensitive use case you’ve ever reviewed.

KLUTTZ: Oh gosh. Wow, this is where I get to be the social scientist.

SULLIVAN: [LAUGHS] Yes.

KLUTTZ: It’s like, define favorite, Kathleen. [LAUGHS] Most memorable, most painful.

SULLIVAN: Let’s do most memorable.

KLUTTZ: We’ll do most memorable.

SULLIVAN: Yeah.

KLUTTZ: You know, I would say the most memorable project I worked on was when we rolled out the new Bing Chat, which is no longer called Bing Chat, because that was the first really big cross-company effort to deploy GPT-4, which was, you know, the next step up in AI innovation from our partners at OpenAI. And I really value working hand in hand with engineering teams and with researchers and that was us at our best and really sort of turbocharged the model that we have.

SULLIVAN: Wonderful. What’s one of the most overused phrases that you have in your AI governance meetings?

KLUTTZ: Gosh. [LAUGHS] If I hear “We need to get aligned; we need to align on this more” …

SULLIVAN: [LAUGHS] Right.

KLUTTZ: But, you know, it’s said for a reason. And I think it sort of speaks to that clever nature. That’s one that comes to mind.

SULLIVAN: That’s great. And then maybe, maybe last one. What are you most excited about in the next, I don’t know, let’s say three months? This world is moving so fast!

KLUTTZ: You know, the pace of innovation, as you just said, is just staggering. It is unbelievable. And sometimes it can feel overwhelming in my space. But what I am most excited about is how we are building up this Emerging … I mentioned this Emerging Technologies program in my team; as a, sort of, formal program, it is relatively new. And I really enjoy being able to take a step back and think a little bit more about the future and a little bit more holistically. And I love working with engineering teams and sort of strategic visionaries who are thinking about what we’re doing a year from now or five years from now, or even 10 years from now, and I get to be a part of those conversations. And that really gives me energy and helps me … helps keep me grounded and not just dealing with the day-to-day and, you know, various fire drills that you may run. It’s thinking strategically and having that foresight about what’s to come. And it’s exciting.

SULLIVAN: Great. Well, Daniel, just thanks so much for being here. I had such a wonderful discussion with you, and I think the thoughtfulness in our discussion today I hope resonates with our listeners. And again, thanks to Alta for setting the stage and sharing her really amazing, insightful thoughts here, as well. So thank you.

[MUSIC]

KLUTTZ: Thank you, Kathleen. I appreciate it. It’s been fun.

SULLIVAN: And to our listeners, thanks for tuning in. You can find resources related to this podcast in the show notes. And if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI.

See you next time! 

[MUSIC FADES]


PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays


In our ever-evolving journey to enhance healthcare through technology, we’re announcing a unique new benchmark for grounded radiology report generation—PadChest-GR. The world’s first multimodal, bilingual sentence-level radiology report dataset, developed by the University of Alicante with Microsoft Research, University Hospital Sant Joan d’Alacant and MedBravo, is set to redefine how AI and radiologists interpret radiological images. Our work demonstrates how collaboration between humans and AI can create powerful feedback loops—where new datasets drive better AI models, and those models, in turn, inspire richer datasets. We’re excited to share this progress in NEJM AI, highlighting both the clinical relevance and research excellence of this initiative.

A new frontier in radiology report generation 

It is estimated that over half of people visiting hospitals have radiology scans that must be interpreted by a clinical professional. Traditional radiology reports often condense multiple findings into unstructured narratives. In contrast, grounded radiology reporting demands that each finding be described and localized individually.

This can mitigate the risk of AI fabrications and enable new interactive capabilities that enhance clinical and patient interpretability. PadChest-GR is the first bilingual dataset to address this need, with 4,555 chest X-ray studies complete with Spanish and English sentence-level descriptions of both positive and negative findings and precise spatial (bounding box) annotations for the positive findings. It is the first public benchmark that enables us to evaluate generation of fully grounded radiology reports in chest X-rays.

Figure 1: A chest X-ray overlaid with numbered bounding boxes, next to a matching list of structured radiological findings in Spanish and English.
Figure 1. Example of a grounded report from PadChest-GR. The original free-text report in Spanish was “Motivo de consulta: Preoperatorio. Rx PA tórax: Impresión diagnóstica: Ateromatosis aórtica calcificada. Engrosamiento pleural biapical. Atelectasia laminar basal izquierda. Elongación aórtica. Sin otros hallazgos radiológicos significativos.” (In English: “Reason for consultation: Preoperative. PA chest X-ray: Diagnostic impression: Calcified aortic atheromatosis. Biapical pleural thickening. Left basal laminar atelectasis. Aortic elongation. No other significant radiological findings.”)
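
To make the structure of these grounded reports concrete, here is a minimal, hypothetical Python sketch of what a single study’s annotations might look like. The field names, coordinate convention, and record layout are illustrative assumptions only and do not reflect the dataset’s actual schema or file format.

```python
# Hypothetical example of one PadChest-GR study's grounded report.
# Field names and box values are invented for illustration; consult the
# released dataset for the real schema.
example_study = {
    "study_id": "example_0001",
    "findings": [
        {
            "sentence_es": "Engrosamiento pleural biapical.",
            "sentence_en": "Biapical pleural thickening.",
            "positive": True,
            # Assumed convention: normalized [x_min, y_min, x_max, y_max] boxes.
            "boxes": [[0.10, 0.08, 0.32, 0.25], [0.68, 0.07, 0.90, 0.24]],
        },
        {
            "sentence_es": "Sin otros hallazgos radiológicos significativos.",
            "sentence_en": "No other significant radiological findings.",
            "positive": False,
            "boxes": [],  # negative findings are described but not localized
        },
    ],
}

# A grounded-reporting model is expected to produce both the finding sentences
# and their spatial grounding, which is the pairing this structure captures.
for finding in example_study["findings"]:
    print(finding["sentence_en"], finding["boxes"])
```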

This benchmark isn’t standing alone—it plays a critical role in powering our state-of-the-art multimodal report generation model, MAIRA-2. Leveraging the detailed annotations of PadChest-GR, MAIRA-2 represents our commitment to building more interpretable and clinically useful AI systems. You can explore our work on MAIRA-2 on our project web page, including recent user research conducted with clinicians in healthcare settings.

PadChest-GR is a testament to the power of collaboration. Aurelia Bustos at MedBravo and Antonio Pertusa at the University of Alicante published the original PadChest dataset in 2020, with the help of Jose María Salinas from Hospital San Juan de Alicante and María de la Iglesia Vayá from the Center of Excellence in Biomedical Imaging at the Ministry of Health in Valencia, Spain. We started to look at PadChest and were deeply impressed by the scale, depth, and diversity of the data.

As we worked more closely with the dataset, we realized the opportunity to develop this for grounded radiology reporting research and worked with the team at the University of Alicante to determine how to approach this together. Our complementary expertise was a nice fit. At Microsoft Research, our mission is to push the boundaries of medical AI through innovative, data-driven solutions. The University of Alicante, with its deep clinical expertise, provided critical insights that greatly enriched the dataset’s relevance and utility. The result of this collaboration is the PadChest-GR dataset.

A significant enabler of our annotation process was Centaur Labs. The team of senior and junior radiologists from the University Hospital Sant Joan d’Alacant, coordinated by Joaquin Galant, used this HIPAA-compliant labeling platform to perform rigorous study-level quality control and bounding box annotations. The annotation protocol implemented ensured that each annotation was accurate and consistent, forming the backbone of a dataset designed for the next generation of grounded radiology report generation models. 

Accelerating PadChest-GR dataset annotation with AI 

Our approach integrates advanced large language models with comprehensive manual annotation: 

Data Selection & Processing: Leveraging Microsoft Azure OpenAI Service with GPT-4, we extracted sentences describing individual positive and negative findings from raw radiology reports, translated them from Spanish to English, and linked each sentence to the existing expert labels from PadChest (a minimal sketch of this step follows the list). This was done for a selected subset of the full PadChest dataset, carefully curated to reflect a realistic distribution of clinically relevant findings.

Manual Quality Control & Annotation: The processed studies underwent meticulous quality checks on the Centaur Labs platform by radiologists from Hospital San Juan de Alicante. Each positive finding was then annotated with bounding boxes to capture critical spatial information.

Standardization & Integration: All annotations were harmonized into coherent grounded reports, preserving the structure and context of the original findings while enhancing interpretability. 
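
As a rough illustration of the first step above, the sketch below shows one way to prompt an Azure OpenAI GPT-4 deployment to split a raw Spanish report into finding-level sentences and translate them to English. The deployment name, prompt wording, and JSON output format are assumptions made for this example; it is not the project’s actual extraction code.

```python
# Minimal sketch of LLM-assisted finding extraction (illustrative only).
import json
import os

from openai import AzureOpenAI  # requires openai>=1.0

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumed API version
)

PROMPT = (
    "Split the following Spanish chest X-ray report into individual findings. "
    "Return a JSON array of objects with keys 'sentence_es', 'sentence_en' "
    "(English translation), and 'positive' (true if the finding is present, "
    "false if it is explicitly absent).\n\nReport:\n"
)

def extract_findings(report_es: str, deployment: str = "gpt-4") -> list:
    """Ask the model for per-finding sentences, translated to English."""
    response = client.chat.completions.create(
        model=deployment,  # name of your GPT-4 deployment (assumption)
        messages=[{"role": "user", "content": PROMPT + report_es}],
        temperature=0,
    )
    # Assumes the model returns plain JSON; production code would validate it.
    return json.loads(response.choices[0].message.content)

# Example usage with a short synthetic report:
# findings = extract_findings("Engrosamiento pleural biapical. Sin otros hallazgos.")
# for f in findings:
#     print(f["positive"], f["sentence_en"])
```

In a real pipeline, each extracted sentence would then be matched against the existing PadChest expert labels before moving on to the manual quality-control stage described above.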

Figure 2: A detailed block diagram illustrating the flow of data between various stages of AI processing and manual annotation.
Figure 2. Overview of the data curation pipeline.

Impact and future directions 

PadChest-GR not only sets a new benchmark for grounded radiology reporting, but also serves as the foundation for our MAIRA-2 model, which already showcases the potential of highly interpretable AI in clinical settings. While we developed PadChest-GR to help train and validate our own models, we believe the research community will greatly benefit from this dataset for many years to come. We look forward to seeing the broader research community build on this—improving grounded reporting AI models and using PadChest-GR as a standard for evaluation. We believe that by fostering open collaboration and sharing our resources, we can accelerate progress in medical imaging AI and ultimately improve patient care together with the community.
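
As a toy illustration of one ingredient of grounded-report evaluation, the snippet below measures how well a predicted bounding box matches a reference box using intersection-over-union. This is only a simple sketch for intuition; it is not the RadFact evaluation used with PadChest-GR and MAIRA-2.

```python
# Toy intersection-over-union (IoU) check for box-level grounding (illustrative).
def iou(box_a: list, box_b: list) -> float:
    """IoU of two [x_min, y_min, x_max, y_max] boxes in the same coordinate frame."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted finding might be counted as correctly grounded when its box
# overlaps a reference box above some threshold, for example IoU >= 0.5.
print(round(iou([0.1, 0.1, 0.4, 0.4], [0.2, 0.2, 0.5, 0.5]), 2))  # 0.29
```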

The collaboration between Microsoft Research and the University of Alicante highlights the transformative power of working together across disciplines. With our publication in NEJM AI and the integral role of PadChest-GR in the development of MAIRA-2 and RadFact, we are excited about the future of AI-empowered radiology. We invite researchers and industry experts to explore PadChest-GR and MAIRA-2, contribute innovative ideas, and join us in advancing the field of grounded radiology reporting.

Papers already using PadChest-GR:

For further details or to download PadChest-GR, please visit the BIMCV PadChest-GR Project.

Models in the Azure Foundry that can do Grounded Reporting: 

