Microsoft AI – Page 6

Research Focus: Week of January 13, 2025

January 17, 2025

by Brenda Potts Microsoft AI

In this edition:

We introduce privacy enhancements for multiparty deep learning, a framework using smaller, open-source models to provide relevance judgments, and other notable new research.
We congratulate Yasuyuki Matsushita, who was named an IEEE Computer Society Fellow.
We’ve included a recap of the extraordinary, far-reaching work done by researchers at Microsoft in 2024.

Decorative graphic with wavy shapes in the background in blues and purples. Text overlay in center left reads: “Research Focus: January 17, 2024”

NEW RESEARCH

AI meets materials discovery

Two of the transformative tools that play a central role in Microsoft’s work on AI for science are MatterGen and MatterSim. In the world of materials discovery, each plays a distinct yet complementary role in reshaping how researchers design and validate new materials.

Read the story

NEW RESEARCH

Communication Efficient Secure and Private Multi-Party Deep Learning

Distributed training enables multiple parties to jointly train a machine learning model on their respective datasets, which can help address the challenges posed by requirements in modern machine learning for large volumes of diverse data. However, this can raise security and privacy issues – protecting each party’s data during training and preventing leakage of private information from the model after training through various inference attacks.

In a recent paper, Communication Efficient Secure and Private Multi-Party Deep Learning, researchers from Microsoft address these concerns simultaneously by designing efficient Differentially Private, secure Multiparty Computation (DP-MPC) protocols for jointly training a model on data distributed among multiple parties. This DP-MPC protocol in the two-party setting is 56-to-794 times more communication-efficient and 16-to-182 times faster than previous such protocols. This work simplifies and improves on previous attempts to combine techniques from secure multiparty computation and differential privacy, especially in the context of training machine learning models.

Read the paper

NEW RESEARCH

JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

Training and evaluating retrieval systems requires significant relevance judgments, which are traditionally collected from human assessors. This process is both costly and time-consuming. Large language models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM. While effective, this approach can be expensive and prone to intra-model biases that can favor systems leveraging similar models.

In a recent paper: JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment, researchers from Microsoft we introduce a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark, they compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. This research shows that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.

Read the paper

NEW RESEARCH

Convergence to Equilibrium of No-regret Dynamics in Congestion Games

Congestion games are used to describe the behavior of agents who share a set of resources. Each player chooses a combination of resources, which may become congested, decreasing utility for the players who choose them. Players can avoid congestion by choosing combinations that are less popular. This is useful for modeling a range of real-world scenarios, such as traffic flow, data routing, and wireless communication networks.

In a recent paper: Convergence to Equilibrium of No-regret Dynamics in Congestion Games; researchers from Microsoft and external colleagues propose CongestEXP, a decentralized algorithm based on the classic exponential weights method. They evaluate CongestEXP in a traffic congestion game setting. As more drivers use a particular route, congestion increases, leading to higher travel times and lower utility. Players can choose a different route every day to optimize their utility, but the observed utility by each player may be subject to randomness due to uncertainty (e.g., bad weather). The researchers show that this approach provides both regret guarantees and convergence to Nash Equilibrium, where no player can unilaterally improve their outcome by changing their strategy.

Read the paper

NEW RESEARCH

RD-Agent: An open-source solution for smarter R&D

Research and development (R&D) plays a pivotal role in boosting industrial productivity. However, the rapid advance of AI has exposed the limitations of traditional R&D automation. Current methods often lack the intelligence needed to support innovative research and complex development tasks, underperforming human experts with deep knowledge.

LLMs trained on vast datasets spanning many subjects are equipped with extensive knowledge and reasoning capabilities that support complex decision-making in diverse workflows. By autonomously performing tasks and analyzing data, LLMs can significantly increase the efficiency and precision of R&D processes.

In a recent article, researchers from Microsoft introduce RD-Agent, a tool that integrates data-driven R&D systems and harnesses advanced AI to automate innovation and development.

At the heart of RD-Agent is an autonomous agent framework with two key components: a) Research and b) Development. Research focuses on actively exploring and generating new ideas, while Development implements these ideas. Both components improve through an iterative process, illustrated in Figure 1 of the article, ensures the system becomes increasingly effective over time.

Read the article

Microsoft Research | In case you missed it

Microsoft Research 2024: A year in review

December 20, 2024

Microsoft Research did extraordinary work this year, using AI and scientific research to make progress on real-world challenges like climate change, food security, global health, and human trafficking. Here’s a look back at the broad range of accomplishments and advances in 2024.

AIOpsLab: Building AI agents for autonomous clouds

December 20, 2024

AIOpsLab is a holistic evaluation framework for researchers and developers, to enable the design, development, evaluation, and enhancement of AIOps agents, which also serves the purpose of reproducible, standardized, interoperable, and scalable benchmarks.

Yasuyuki Matsushita, IEEE Computer Society 2025 Fellow

December 19, 2024

Congratulations to Yasuyuki Matsushita, Senior Principal Research Manager at Microsoft Research, who was named a 2025 IEEE Computer Society Fellow. Matsushita was recognized for contributions to photometric 3D modeling and computational photography.

View more news and awards

The post Research Focus: Week of January 13, 2025 appeared first on Microsoft Research.

Ideas: AI for materials discovery with Tian Xie and Ziheng Lu

January 16, 2025

by Brenda Potts Microsoft AI

Ideas podcast | illustration of Tian Xie and Ziheng Lu

Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets.

In this episode, guest host Lindsay Kalter talks with Principal Research Manager Tian Xie and Principal Researcher Ziheng Lu about their groundbreaking AI tools for materials discovery. Xie introduces MatterGen, which can generate new materials tailored to the specific needs of an application, such as materials with powerful magnetic properties or those that efficiently conduct lithium ions for better batteries. Lu explains how MatterSim accelerates simulations to validate and refine these discoveries. Together, these tools act as a “copilot” for scientists, proposing creative hypotheses and exploring vast material spaces far beyond traditional methods. The conversation highlights the challenges of bridging AI and experimental science and the potential of these tools to drive advancements in energy, manufacturing, and sustainability. At the cutting edge of AI research, Xie and Lu share their vision for the future of materials design and how these technologies could transform the field.

Learn more:

MatterSim: A deep-learning model for materials under real-world conditions
Microsoft Research blog, May 2024

MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures
Publication, March 2024

MatterSim (opens in new tab)
GitHub repo

A generative model for inorganic materials design (opens in new tab)
Publication, January 2025

MatterGen: A Generative Model for Materials Design
Video, Microsoft Research Forum, June 2024

MatterGen: Property-guided materials design
Microsoft Research blog, December 2023

MatterGen (opens in new tab)
GitHub repo

Crystal Diffusion Variational Autoencoder for Periodic Material Generation
Publication, October 2021

Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties
Publication, April 2018

Transcript

[TEASER]

[MUSIC PLAYS UNDER DIALOGUE]

TIAN XIE: Yeah, so the problem of generating materials from properties is actually a pretty old one. I still remember back in 2018, when I was giving a talk about property-prediction models, right, one of the first questions people asked is, instead of going from material structure to properties, can you, kind of, inversely generate the materials directly from their property conditions? So in a way, this is, kind of, like a dream for material scientists because, like, the end goal is really about finding materials property, right, [that] will satisfy your application.

ZIHENG LU: Previously, a lot of people are using this atomistic simulator and this generative models alone. But if you think about it, now that we have these two foundation models together, it really can make things different, right. You have a very good idea generator. And you have a very good goalkeeper. And you put them together. They form a loop. And now you can use this loop to design materials really quickly.

[TEASER ENDS]

LINDSAY KALTER: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.

[MUSIC FADES]

I’m your guest host, Lindsay Kalter. Today I’m talking to Microsoft Principal Research Manager Tian Xie and Microsoft Principal Researcher Ziheng Lu. Tian is doing fascinating work with MatterGen, an AI tool for generating new materials guided by specific design requirements. Ziheng is one of the visionaries behind MatterSim, which puts those new materials to the test through advanced simulations. Together, they’re redefining what’s possible in materials science. Tian and Ziheng, welcome to the podcast.

TIAN XIE: Very excited to be here.

ZIHENG LU: Thanks, Lindsay, very excited.

KALTER: Before we dig into the specifics of MatterGen and MatterSim, let’s give our audience a sense of how you, as researchers, arrived at this moment. Materials science, especially at the intersection of computer science, is such a cutting-edge and transformative field. What first drew each of you to this space? And what, if any, moment or experience made you realize this was where you wanted to innovate? Tian, do you want to start?

XIE: So I started working on AI for materials back in 2015, when I started my PhD. So I come as a chemist and materials scientist, but I was, kind of, figuring out what I want to do during my PhD. So there is actually one moment really drove me into the field. That was AlphaGo. AlphaGo was, kind of, coming out in 2016, where it was able to beat the world champion in go in 2016. I was extremely impressed by that because I, kind of, learned how to do go, like, in my childhood. I know how hard it is and how much effort those professional go players have spent, right, in learning about go. So I, kind of, have the feeling that if AI can surpass the world-leading go players, one day, it will too surpass material scientists, right, in their ability to design novel materials. So that’s why I ended up deciding to focus my entire PhD on working on AI for materials. And I have been working on that since then. So it was actually very interesting because it was a very small field back then. And it’s great to see how much progress has been made, right, in the past 10 years and how much bigger a field it is now compared with 10 years ago.

LU: That’s very interesting, Tian. So, actually, I think I started, like, two years before you as a PhD student. So I, actually, I was trained as a computational materials scientist solely, not really an AI expert. But at that time, the computational materials science did not really work that well. It works but not working that well. So after, like, two or three years, I went back to experiments for, like, another two or three years because, I mean, the experiment is always the gold standard, right. And I worked on this experiments for a few years, and then about three years ago, I went back to this field of computation, especially because of AI. At that time, I think GPT and these large AI models that currently we’re using is not there, but we already have their prior forms like BERT, so we see the very large potential of AI. We know that these large AIs might work. So one idea is really to use AI to learn the entire space of materials and really grasp the physics there, and that really drove me to this field and that’s why I’m here working on this field, yeah.

KALTER: We’re going to get into what MatterGen and MatterSim mean for materials science—the potential, the challenges, and open questions. But first, give us an overview of what each of these tools are, how they do what they do, and—as this show is about big ideas—the idea driving the work. Ziheng, let’s have you go first.

LU: So MatterSim is a tool to do in silico characterizations of materials. If you think about working on materials, you have several steps. You first need to synthesize it, and then you need to characterize this. Basically, you need to know what property, what structures, whatever stuff about these materials. So for MatterSim, what we want to do is to really move the characterization process, a lot of these processes, into using computations. So the idea behind MatterSim is to really learn the fundamentals of physics. So we learn the energies and forces and stresses from these atomic structures and the charge densities, all of these things, and then with these, we can really simulate any sort of materials using our computational machines. And then with these, we can really characterize a lot of these materials’ properties using our computer, that is very fast. It’s much faster than we do experiments so that we can accelerate the materials design. So just in a word, basically, you input your material into your computer, a structure into your computer, and MatterSim will try to simulate these materials like what you do in a furnace or with an XRD (x-ray diffraction) and then you get your properties out of that, and a lot of times it’s much faster than you do experiments.

KALTER: All right, thank you very much. Tian, why don’t you tell us about MatterGen?

XIE: Yeah, thank you. So, actually, Ziheng, once you start with explaining MatterSim, it makes it much easier for me to explain MatterGen. So MatterGen actually represents a new way to design materials with generative AI. Material discovery is like finding needles in a haystack. You’re looking for a material with a very specific property for a material application. For example, like finding a room-temperature superconductor or finding a solid that can conduct a lithium ion very well inside a battery. So it’s like finding one very specific material from a million, kind of, candidates. So the conventional way of doing material discovery is via screening, where you, kind of, go over millions of candidates to find the one that you’re looking for, where MatterSim is able to significantly accelerate that process by making the simulation much faster. But it’s still very inefficient because you need to go through this million candidates, right. So with MatterGen, you can, kind of, directly generate materials given the prompts of the design requirements for the application. So this means that you can discover materials—discover useful materials— much more efficiently. And it also allows us to explore a much larger space beyond the set of known materials.

KALTER: Thank you, Tian. Can you tell us a little bit about how MatterGen and MatterSim work together?

XIE: So you can really think about MatterSim and MatterGen accelerating different parts of materials discovery process. MatterSim is trying to accelerate the simulation of material properties, while MatterGen is trying to accelerate the search of novel material candidates. It means that they can really work together as a flywheel and you can compound the acceleration from both models. They are also both foundation AI models, meaning they can both be used for a broad range of materials design problems. So we’re really looking forward to see how they can, kind of, working together iteratively as a tool to design novel materials for a broad range of applications.

LU: I think that’s a very good, like, general introduction of how they work together. I think I can provide an example of how they really fit together. If you want a material with a specific, like, bulk modulus or lithium-ion conductivity or thermal conductivity for your CPU chips, so basically what you want to do is start with a pool of material structures, like some structures from the database, and then you compute or you characterize your wanted property from that stack of materials. And then what you do, you’ve got these properties and structure pairs, and you input these pairs into MatterGen. And MatterGen will be able to give you a lot more of these structures that are highly possible to be real. But the number will be very large. For example, for the bulk modulus, I don’t remember the number we generated in our work … was that like thousands, tens of thousands?

XIE: Thousands, tens of thousands.

LU: Yeah, that would be a very large number pool even with MatterGen, so then the next step will be, how would you like to screen that? You cannot really just send all of those structures to a lab to synthesize. It’s too much, right. That’s when MatterSim again comes in. So MatterSim comes in and screen all those structures again and see which ones are the most likely to be synthesized and which ones have the closest property you wanted. And then after screening, you probably get five, 10 top candidates and then you send to a lab. Boom, everything goes down. That’s it.

KALTER: I’m wondering if there’s any prior research or advancements that you drew from in creating MatterGen and MatterSim. Were there any specific breakthroughs that influenced your approaches at all?

LU: Thanks, Lindsay. I think I’ll take that question first. So interestingly for MatterSim, a very fundamental idea was drew from Chi Chen, who was a previous lab mate of mine and now also works for Microsoft at Microsoft Quantum. He made this fantastic model named M3GNet, which is a prior form of a lot of these large-scale models for atomistic simulations. That model, M3GNet, actually resolves the near ground state prediction problem. I mean, the near ground state problem sounds like a fancy but not realistic word, but what that actually means is that it can simulate materials at near-zero covalent states. So basically at very low temperatures. So at that time, we were thinking since the models are now able to simulate materials at their near ground states, it’s not a very large space. But if you also look at other larger models, like GPT whatever, those models are large enough to simulate entire human language. So it’s possible to really extend the capability from these such prior models to very large space. Because we believe in the capability of AI, then it really drove us to use MatterSim to learn the entire space of materials. I mean, the entire space really means the entire periodic table, all the temperatures and the pressures people can actually grasp.

XIE: Yeah, I still remember a lot of the amazing works from Chi Chen whenever we’re, kind of, back working on property-prediction models. So, yeah, so the problem of generating materials from properties is actually a pretty old one. I still remember back in 2018, when I was, kind of, working on CGCNN (crystal graph convolutional neural networks) and giving a talk about property-prediction models, right, one of the first questions people asked is, OK, can you inverse this process? Instead of going from material structure to properties, can you, kind of, inversely generate the materials directly from their property conditions? So in a way, this is, kind of, like a dream for material scientists—some people even call it, like, holy grail—because, like, the end goal is really about finding materials property, right, [that] will satisfy your application. So I’ve been, kind of, thinking about this problem for a while and also there has been a lot of work, right, over the past few years in the community to build a generative model for materials. A lot of people have tried before, like 2020, using ideas like VAEs or GANs. But it’s hard to represent materials in this type of generative model architecture, and many of those models generated relatively poor candidates. So I thought it was a hard problem. I, kind of, know it for a while. But there is no good solutions back then. So I started to focus more on this problem during my postdoc, when I studied that in 2020 and I keep working on that in 2021. At the beginning, I wasn’t really sure exactly what approach to take because it’s, kind of, like open question and really tried a lot of random ideas. So one day actually in my group back then with Tommi Jaakkola and Regina Barzilay at MIT’s CSAIL (Computer Science & Artificial Intelligence Laboratory), we, kind of, get to know this method called diffusion model. It was a very early stage of a diffusion model back then, but it already began to show very promising signs, kind of, achieving state of art in many problems like 3D point cloud generation and the 3D molecular conformer generation. So the work that really inspired me a lot is two works that was for molecular conformer generation. One is ConfGF, and one is GeoDiff. So they, kind of, inspired me to, kind of, focus more on diffusion models. That actually lead to CDVAE (crystal diffusion variational autoencoder). So it’s interesting that we, kind of, spend like a couple of weeks in trying all this diffusion idea, and without that much work, it actually worked quite out of box. And at that time, CDVAE achieves much better performance than any previous models in materials generation, and we’re, kind of, super happy with that. So after CDVAE, I, kind of, joined Microsoft, now working with more people together on this problem of generative model for materials. So we, kind of, know what the limitations of CDVAE are, is that it can do unconditional material generation well means it can generate novel material structures, but it is very hard to use CDVAE to do property-guided generations. So basically, it uses an architecture called a variational autoencoder, where you have a latent space. So the way that you do property-guided generation there was to do a, kind of, a gradient update inside the latent space. But because the latent space wasn’t learned very well, so it actually … you cannot do, kind of, good property-guided generation. We only managed to do energy-guided generation, but it wasn’t successful in going beyond energy. So that comes us to really thinking, right, how can we make the property-guided generation much better? So I remember like one day, actually, my colleague, Daniel Zügner, who actually really showed me this blog which basically explains this idea of classifier-free guidance, which is the powerhouse behind the text-image generative models. And so, yeah, then we began to think about, can we actually make the diffusion model work for classifier-free guidance? That lead us to remove the, kind of, the variational autoencoder component from CDVAE and begin to work on a pure diffusion architecture. But then there was, kind of, a lot of development around that. But it turns out that classifier-free guidance is the key really to make property-guided generation work, and then combined with a lot more effort in, kind of, improving architecture and also generating more data and also trying out all these different downstream tasks that end up leading into MatterGen as we see today.

KALTER: Yeah, I think you’ve both done a really great job of explaining how MatterGen and MatterSim work together and how MatterGen can offer a lot in terms of reducing the amount of time and work that goes into finding new materials. Tian, how does the process of using MatterGen to generate materials translate into real-world applications?

XIE: Yeah, that’s a fantastic question. So one way that I think about MatterGen, right, is that you can think about it as like a copilot for materials scientists, right. So they can help you to come up with, kind of, potential good hypothesis for the materials design problems that you’re looking for. So say you’re trying to design a battery, right. So you may have some ideas over, OK, what candidates you want to make, but this is, kind of, based on your own experience, right. Depths of experience as a researcher. But MatterGen is able to, kind of, learn from a very broad set of data, so therefore, it may be able to come up with some good suggestions, even surprising suggestions, for you so that you can, kind of, try this out, right, both with computation or even one day in wet lab and experimentally synthesize it. But I also want to note that this, in a way, this is still an early stage in generative AI for materials means that I don’t expect all the candidates MatterGen generates will be, kind of, suits your needs, right. So you still need to, kind of, look into them with expertise or with some kind of computational screening. But I think in the future, as this model keep improving themselves, they will become a key component, right, in the design process of many of the materials we’re seeing today, like designing new batteries, new solar cells, or even computer chips, right, so that like Ziheng mentioned earlier.

KALTER: I want to pivot a little bit to the MatterSim side of things. I know identifying new combinations of compounds is key to meeting changing needs for things like sustainable materials. But testing them is equally important to developing materials that can be put to use. Ziheng, how does MatterSim handle the uncertainty of how materials behave under various conditions, and how do you ensure that the predictions remain robust despite the inherent complexity of molecular systems?

LU: Thanks. That’s a very, very good question. So uncertainty quantification is a key to make sure all these predictions and simulations are trustworthy. And that’s actually one of the questions we got almost every time after a presentation. So people will ask, well—especially those experimentalists—would ask, well, I’ve been using your model; how do I know those predictions are true under the very complex conditions I’m using in my experiments? So to understand how we deal with uncertainty, we need to know how MatterSim really functions in predicting an arbitrary property, especially under the condition you want, like the temperature and pressure. That would be quite complex, right? So in the ideal case, we would hope that by using MatterSim, you can directly simulate the properties you want using molecular dynamics combined with statistical mechanics. So if so, it would be easy to really quantify the uncertainty because there are just two parts: the error from the model and the error from the simulations, the statistical mechanics. So the error from the model will be able to be measured by, what we call, an ensemble. So basically you start with different random seeds when you train the model, and then when you predict your property, you use several models from the ensemble and then you get different numbers. If the variance from the numbers are very large, you’ll say the prediction is not that trustworthy. But a lot of times, we will see the variance is very small. So basically, an ensemble of several different models will give you almost exactly the same number; you’re quite sure that the number is somehow very, like, useful. So that’s one level of the way we want to get our property. But sometimes, it’s very hard to really directly simulate the property you want. For example, for catalytic processes, it’s very hard to imagine how you really get those coefficients. It’s very hard. The process is just too complicated. So for that process, what we do is to really use the, what we call, embeddings learned from the entire material space. So basically that vector we learned for any arbitrary material. And then start from that, we build a very shallow layer of a neural network to predict the property, but that also means you need to bring in some of your experimental or simulation data from your side. And for that way of predicting a property to measure the uncertainty, it’s still like the two levels, right. So we don’t really have the statistical error anymore, but what we have is, like, only the model error. So you can still stick to the ensemble, and then it will work, right. So to be short, so MatterSim can provide you an uncertainty to make sure the prediction tells you whether it’s true or not.

KALTER: So in many ways, MatterSim is the realist in the equation, and it’s there to sort of be a gatekeeper for MatterGen, which is the idea generator.

XIE: I really like the analogy.

LU: Yeah.

KALTER: As is the case with many AI models, the development of MatterGen and MatterSim relies on massive amounts of data. And here you use a simulation to create the needed training data. Can you talk about that process and why you’ve chosen that approach, Tian?

XIE: So one advantage here is that we can really use large-scale simulation to generate data. So we have a lot of compute here at Microsoft on our Azure platform, right. So how we generate the data is that we use a method called density functional theory, DFT, which is a quantum mechanical method. And we use a simulation workflow built on top with DFT to simulate the stability of materials. So what we do is that we curate a huge amount of material structures from multiple different sources of open data, mostly including Materials Project and Alexandria database, and in total, there are around 3 million materials candidates coming from these two databases. But not all of these structures, they are stable. So therefore, we try to use DFT to compute their stability and try to filter down the candidates such that we are making sure that our training data only have the most stable ones. This leads into around 600,000 training data, which was used to train the base model of MatterGen. So I want to note that actually we also use MatterSim as part of the workflow because MatterSim can be used to prescreen unstable candidates so that we don’t need to use DFT to compute all of them. I think at the end, we computed around 1 million DFT calculations where two-thirds of them, they are already filtered out by MatterSim, which saves us a lot of compute in generating our training data.

LU: Tian, you have a very good description of how we really get those ground state structures for the MatterGen model. Actually, we’ve been also using MatterGen for MatterSim to really get the training data. So if you think about the simulation space of materials, it’s extremely large. So we would think it in a way that it has three axis, so basically the elements, the temperature, and the pressure. So if you think about existing databases, they have pretty good coverage of the elements space. Basically, we think about Materials Project, NOMAD, they really have this very good coverage of lithium oxide, lithium sulfide, hydrogen sulfide, whatever, those different ground-state structures. But they don’t really tell you how these materials behave under certain temperature and pressure, especially under those extreme conditions like 1,600 Kelvin, which you really use to synthesize your materials. That’s where we really focused on to generate the data for MatterSim. So it’s really easy to think about how we generate the data, right. You put your wanted material into a pressure cooker, basically, molecular dynamics; it can simulate the materials behavior on the temperature and pressure. So that’s it. Sounds easy, right? But that’s not true because what we want is not one single material. What we want is the entire material space. So that will be making the effort almost impossible because the space is just so large. So that’s where we really develop this active learning pipeline. So basically, what we do is, like, we generate a lot of these structures for different elements and temperatures, pressures. Really, really a lot. And then what we do is, like, we ask the active learning or the uncertainty measurements to really say whether the model knows about this structure already. So if the model thinks, well, I think I know the structure already. So then, we don’t really calculate this structure using density function theory, as Tian just said. So this will really save us like 99% of the effort in generating the data. So in the end, by combining this molecular dynamics, basically pressure cooker, together with active learning, we gathered around 17 million data for MatterSim. So that was used to train the model. And now it can cover the entire periodic table and a lot of temperature and pressures.

KALTER: Thank you, Ziheng. Now, I’m sure this is not news to either one of you, given that you’re both at the forefront of these efforts, but there are a growing number of tools aimed at advancing materials science. So what is it about MatterGen and MatterSim in their approach or capabilities that distinguish them?

XIE: Yeah, I think I can start. So I think there is, in the past one year, there is a huge interest in building up generative AI tools for materials. So we have seen lots and lots of innovations from the community published in top conferences like NeurIPS, ICLR, ICML, etc. So I think what distinguishes MatterGen, in my point of view, are two things. First is that we are trained with a very big dataset that we curated very, very carefully, and we also spent quite a lot of time to refining our diffusion architecture, which means that our model is capable of generating very, kind of, high-quality, highly stable and novel materials. We have some kind of bar plot in our paper showcasing the advantage of our performance. I think that’s one key aspect. And I think the second aspect, which in my point of view is even more important, is that it has the ability to do property-guided generation. Many of the works that we saw in the community, they are more focused on the problem of crystal structure prediction, which MatterGen can also do, but we focus more on really property-guided generation because we think this is one of the key problems that really materials scientists care about. So the ability to do a very broad range of property-guided generation—and we have, kind of, both computational and now experimental result to validate those—I think that’s the second strong point for MatterGen.

KALTER: Ziheng, do you want to add to that?

LU: Yeah, thanks, Lindsay. So on the MatterSim side, I think it’s really the diverse condition it can handle that makes a difference. We’ve been talking about, like, the training data we collected really covers the entire periodic table and also, more importantly, the temperatures from 0 Kelvin to 5,000 Kelvin and the pressures from 0 gigapascal to 1,000 gigapascal. That really covers what humans can control nowadays. I mean, it’s very hard to go beyond that. If you know anyone [who] can go beyond that, let me know. So that really makes MatterSim different. Like, it can handle the realistic conditions. I think beyond that, I would say the combo between MatterSim and MatterGen really makes these set of tools really different. So previously, a lot of people are using this atomistic simulator and this generative models alone. But if you think about it, now that we have these two foundation models together, they really can make things different, right. So we have predictor; we have the generator; you have a very good idea generator. And you have a very good goalkeeper. And you put them together. They form a loop. And now you can use this loop to design materials really quickly. So I would say to me, now, when I think about it, it’s really the combo that makes these set of tools different.

KALTER: I know that I’ve spoken with both of you recently about how there’s so much excitement around this, and it’s clear that we’re on the precipice of this—as both of you have called it—a paradigm shift. And Microsoft places a very strong emphasis on ensuring that its innovations are grounded in reality and capable of addressing real-world problems. So with that in mind, how do you balance the excitement of scientific exploration with the practical challenges of implementation? Tian, do you want to take this?

XIE: Yeah, I think this is a very, very important point, because … as there are so many hypes around AI that is happening right now, right. We must be very, very careful about the claims that we are making so that people will not have unrealistic expectations, right, over how these models can do. So for MatterGen, we’re pretty careful about that. We’re trying to, basically, we’re trying to say that this is an early stage of generative AI in materials where this model will be improved over time quite significantly, but you should not say, oh, all the materials generated by MatterGen is going to be amazing. That’s not what is happening today. So we try to be very careful to understand how far MatterGen is already capable of designing materials with real-world impact. So therefore, we went all the way to synthesize one material that was generated by MatterGen. So this material we generated is called tantalum chromium oxide¹. So this is a new material. It has not been discovered before. And it was generated by MatterGen by conditioning a bulk modulus equal to 200 gigapascal. Bulk modulus is, like, the compressiveness of the material. So we end up measuring the experimental synthesized material experimentally, and the measured bulk modulus is 169 gigapascal, which is within 20% of error. So this is a very good proof concept, in our point of view, to show that, oh, you can actually give it a prompt, right, and then MatterGen can generate a material, and the material actually have the property that is very close to your target. But it’s still a proof of concept. And we’re still working to see how MatterGen can design materials that are much more useful with a much broader range of applications. And I’m sure that there will be more challenges we are seeing along the way. But we’re looking forward to further working with our experimental partners to, kind of, push this further. And also working with MatterSim, right, to see how these two tools can be used to design really useful materials and bringing this into real-world impact.

LU: Yeah, Tian, I think that’s very well said. It’s not really only for MatterGen. For MatterSim, we’re also very careful, right. So we really want to make sure that people understand how these models really behave under their instructions and understand, like, what they can do and they cannot do. So I think one thing that we really care about is that in the next few, maybe one or two years, we want to really work with our experimental partners to make this realistic materials, like, in different areas so that we can, even us, can really better understand the limitations and at the same time explore the forefront of materials science to make this excitement become true.

KALTER: Ziheng, could you give us a concrete example of what exactly MatterSim is capable of doing?

LU: Now MatterSim can really do, like, whatever you have on a potential energy surface. So what that means is, like, anything that can be simulated with the energy and forces, stresses alone. So to give you an example, we can compute … the first example would be the stability of a material. So basically, you input a structure, and from the energies of the relaxed structures, you can really tell whether the material is likely to be stable, like, the composition, right. So another example would be the thermal conductivity. Thermal conductivity is like a fundamental property of materials that tells you how fast heat can transfer in the material, right. So for MatterSim, it can really simulate how fast this heat can go through your diamond, your graphene, your copper, right. So basically, those are two examples. So these examples are based on energies and forces alone. But there are things MatterSim cannot do—at least for now. For example, you cannot really do anything related to electronic structures. So you cannot really compute the light absorption of a semitransparent material. That would be a no-no for now.

KALTER: It’s clear from speaking with researchers, both from MatterSim and MatterGen, that despite these very rapid advancements in technology, you take very seriously the responsibility to consider the broader implications of the challenges that are still ahead. How do you think about the ethical considerations of creating entirely new materials and simulating their properties, particularly in terms of things like safety, sustainability, and societal impact?

XIE: Yeah, that’s a fantastic question. So it’s extremely important that we are making sure that these AI tools, they are not misused. A potential misuse, right, as you just mentioned, is that people begin to use these AI tools—MatterGen, MatterSim—to, kind of, design harmful materials. There was actually extensive discussion over how generative AI tools that was originally purposed for drug design can be then misused to create bioweapons. So at Microsoft, we take this very seriously because we believe that when we create new technologies, you must also ensure that the technology is used responsibly. So we have an extensive process to ensure that all of our models respect those ethical considerations. In the meantime, as you mentioned, maybe sustainability and the societal impact, right, so there’s a huge amount these AI tools—MatterGen, MatterSim—can do for sustainability because a lot of the sustainability challenges, they are really, at the end, materials design challenges, right. So therefore, I think that MatterGen and MatterSim can really help with that in solving, in helping us to alleviate climate change and having positive societal impact for the broader society.

KALTER: And, Ziheng, how about from a simulation standpoint?

LU: Yeah, I think Tian gave a very good, like, description. At Microsoft, we are really careful about these ethical, like, considerations. So I would add a little bit on the more, like, the bright side of things. Like, so for MatterSim, like, it really carries out these simulations at atomic scales. So one thing you can think about is really the educational purpose. So back in my bachelor and PhD period, so I would sit, like, at the table and really grab a pen to really deal with those very complex equations and get into those statistics using my pen. It’s really painful. But now with MatterSim, these simulation tools at atomic level, what you can do is to really simulate the reactions, the movement of atoms, at atomic scale in real time. You can really see the chemical reactions and see the statistics. So you can get really the feeling, like very direct feeling, of how the system works instead of just working on those toy systems with your pen. I think it’s going to be a very good educational tool using MatterSim, yeah. Also MatterGen. MatterGen as, like, a generative tool and generating those i.i.d. (independent and identically distributed) distributions, it will be a perfect example to show the students how the Boltzmann distribution works. I think, Tian, you will agree with that, right?

XIE: 100%. Yeah, I really, really like the example that Ziheng mentioned about the educational purposes. I still remember, like, when I was, kind of, learning material simulation class, right. So everything is DFT. You, kind of, need to wait for an hour, right, for getting some simulation. Maybe then you’ll make some animation. Now you can do this in real time. This is, like, a huge step forward, right, for our young researchers to, kind of, gaining a sense, right, about how atoms interact at an atomic level.

LU: Yeah, and the results are really, I mean, true; not really those toy models. I think it’s going to be very exciting stuff.

KALTER: And, Tian, I’m directing this question to you, even though, Ziheng, I’m sure you can chime in, as well. But, Tian, I know that you and I have previously discussed this specifically. I know that you said back in, you know, 2017, 2018, that you knew an AI-based approach to materials science was possible but that even you were surprised by how far the technology has come so fast in aiding this area. What is the status of these tools right now? Are they in use? And if so, who are they available to? And, you know, what’s next for them?

XIE: Yes, this is a fantastic question, right. So I think for AI generative tools like MatterGen, as I said many times earlier, it’s still in its early stages. MatterGen is the first tool that we managed to show that generative AI can enable very broad property-guided generation, and we have managed to have experimental validation to show it’s possible. But it will take more work to show, OK, it can actually design batteries, can design solar cells, right. It can design really useful materials in these broader domains. So this is, kind of, exactly why we are now taking a pretty open approach with MatterGen. We make our code, our training data, and model weights available to the general public. We’re really hoping the community can really use our tools to the problem that they care about and even build on top of that. So in terms of what next, I always like to use what happened with generative AI for drugs, right, to kind of predict how generative AI will impact materials. Three years ago, there is a lot of research around generative model for drugs, first coming from the machine learning community, right. So then all the big drug companies begin to take notice, and then there are, kind of, researchers in these drug companies begin to use these tools in actual drug design processes. From my colleague, Marwin Segler, because he, kind of, works together with Novartis in Microsoft and Novartis collaboration, he has been basically telling me that at the beginning, all the chemists in the drug companies, they’re all very suspicious, right. The molecules generated by these generative models, they all look a bit weird, so they don’t believe this will work. But once these chemists see one or two examples that actually turns out to be performing pretty well from the experimental result, then they begin to build more trust, right, into these generative AI models. And today, these generative AI tools, they are part of the standard drug discovery pipeline that is widely used in all the drug companies. That is today. So I think generative AI for materials is going through a very similar period. People will have doubts; people will have suspicions at the beginning. But I think in three years, right, so it will become a standard tool over how people are going to design new solar cells, design new batteries, and many other different applications.

KALTER: Great. Ziheng, do you have anything to add to that?

LU: So actually for MatterSim, we released the model, I think, back in last year, December. I mean, both the weights and the models, right. So we’re really grateful how much the community has contributed to the repo. And now, I mean, we really welcome the community to contribute more to both MatterSim and MatterGen via our open-source code bases. So, I mean, the community effort is really important, yeah.

KALTER: Well, it has been fascinating to pick your brains, and as we close, you know, I know that you’re both capable of quite a bit, which you have demonstrated. I know that asking you to predict the future is a big ask, so I won’t explicitly ask that. But just as a fun thought exercise, let’s fast-forward 20 years and look back. How have MatterGen and MatterSim and the big ideas behind them impacted the world, and how are people better off because of how you and your teams have worked to make them a reality? Tian, you want to start?

XIE: Yeah, I think one of the biggest challenges our human society is going to face, right, in the next 20 years is going to be climate change, right, and there are so many materials design problems people need to solve in order to properly handle climate change, like finding new materials that can absorb CO₂ from atmosphere to create a carbon capture industry or have a battery materials that is able to do large-scale energy grid storage so that we can fully utilizing all the wind powers and the solar power, etc., right. So if you want me to make one prediction, I really believe that these AI tools, like MatterGen and MatterSim, is going to play a central role in our human’s ability to design these new materials for climate problems. So therefore in 20 years, I would like to see we have already solved climate change, right. We have large-scale energy storage systems that was designed by AI that is … basically that we have removed all the fossil fuels, right, from our energy production, and for the rest of the carbon emissions that is very hard to remove, we will have a carbon capture industry with materials designed by AI that absorbs the CO₂ from the atmosphere. It’s hard to predict exactly what will happen, but I think AI will play a key role, right, into defining how our society will look like in 20 years.

LU: Tian, very well said. So I think instead of really describing the future, I would really quote a science fiction scene in Iron Man. So basically in 20 years, I will say when we want to really get a new material, we will just sit in an office and say, “Well, J.A.R.V.I.S., can you design us a new material that really fits my newest MK 7 suit?” That will be the end. And it will run automatically, and we get this auto lab running, and all those MatterGen and MatterSim, these AI models, running, and then probably in a few hours, in a few days, we get the material.

KALTER: Well, I think I speak for many people from several industries when I say that I cannot wait to see what is on the horizon for these projects. Tian and Ziheng, thank you so much for joining us on Ideas. It’s been a pleasure.

[MUSIC]

XIE: Thank you so much.

LU: Thank you.

[MUSIC FADES]

1 Learn more about MatterGen and the new material tantalum chromium oxide in the Nature paper “A generative model for inorganic materials design (opens in new tab).”

The post Ideas: AI for materials discovery with Tian Xie and Ziheng Lu appeared first on Microsoft Research.

MatterGen: A new paradigm of materials design with generative AI

January 16, 2025

by Alyssa Hughes (2ADAPTIVE LLC dba 2A Consulting) Microsoft AI

A grid of colorful, abstract shapes on a black background. Each cell in the grid features a unique three-dimensional geometric pattern, showcasing a variety of colors including green, red, blue, and purple.

Materials innovation is one of the key drivers of major technological breakthroughs. The discovery of lithium cobalt oxide in the 1980s laid the groundwork for today’s lithium-ion battery technology. It now powers modern mobile phones and electric cars, impacting the daily lives of billions of people. Materials innovation is also required for designing more efficient solar cells, cheaper batteries for grid-level energy storage, and adsorbents to recycle CO2 from atmosphere.

Finding a new material for a target application is like finding a needle in a haystack. Historically, this task has been done via expensive and time-consuming experimental trial-and-error. More recently, computational screening of large materials databases has allowed researchers to speed up this process. Nonetheless, finding the few materials with the desired properties still requires the screening of millions of candidates.

Today, in a paper published in Nature (opens in new tab), we share MatterGen, a generative AI tool that tackles materials discovery from a different angle. Instead of screening the candidates, it directly generates novel materials given prompts of the design requirements for an application. It can generate materials with desired chemistry, mechanical, electronic, or magnetic properties, as well as combinations of different constraints. MatterGen enables a new paradigm of generative AI-assisted materials design that allows for efficient exploration of materials, going beyond the limited set of known ones.

An illustration comparing screening and generation at the task of finding shapes that have a given number of edges and color. A blue pentagon is shown with a question mark at the top of the illustration, denoting this as the target for the task. To the left, a collection of colored shapes that does not include a blue pentagon is poured into a screening funnel. Two green pentagons pass through the funnel. To the right of the illustration, a laptop representing MatterGen inputs a target of 5 edges and the color blue. Three green and one blue pentagon are produced in addition to a single blue hexagon. — *Figure 1: Schematic representation of screening and generative approaches to materials design*

A novel diffusion architecture

MatterGen is a diffusion model that operates on the 3D geometry of materials. Much like an image diffusion model generates pictures from a text prompt by modifying the color of pixels from a noisy image, MatterGen generates proposed structures by adjusting the positions, elements, and periodic lattice from a random structure. The diffusion architecture is specifically designed for materials to handle specialties like periodicity and 3D geometry.

An illustration showing a two-dimensional crystal structure at various states in the reverse diffusion process from a random to a stable material (left to right). Three additional illustrations are shown for denoising processes that are conditioned on the chemistry, symmetry and magnetic density of the material. — Figure 2: Schematic representation of MatterGen: a diffusion model to generate novel and stable materials. MatterGen can be fine-tuned to generate materials under different design requirements such as specific chemistry, crystal symmetry, or materials’ properties.

The base model of MatterGen achieves state-of-the-art performance in generating novel, stable, diverse materials (Figure 3). It is trained on 608,000 stable materials from the Materials Project (opens in new tab) (MP) and Alexandria (opens in new tab) (Alex) databases. The performance improvement can be attributed to both the architecture advancements, as well as the quality and size of our training data.

A figure comparing the percentage of samples generated that are stable, novel and unique for several methods. From most performant to least performant, the figure ranks methods in order of MatterGen (alex-mp), MatterGen (mp), DiffCSP (mp), CDVAE (mp), P-G-SchNet (mp), G-SchNet (mp), FTCP (mp). — Figure 3: Performance of MatterGen and other methods in the generation of stable, unique, and novel structures. The training dataset for each method is indicated in parentheses. The purple bar highlights performance improvements due to MatterGen’s architecture alone, while the teal bar highlights performance improvements that come also from the larger training dataset.

MatterGen can be fine-tuned with a labelled dataset to generate novel materials given any desired conditions. We demonstrate examples of generating novel materials given a target’s chemistry and symmetry, as well as electronic, magnetic, and mechanical property constraints (Figure 2).

Outperforming screening

A figure comparing MatterGen and traditional screening in the task of generating stable, unique and novel structures with a bulk modulus greater than 400 giga pascal. The figure shows that the number of such structures discovered with screening plateaus at approximately 40, while for MatterGen this number continues to increase to above 100 for 175 density functional theory calculations. — Figure 4: Performance of MatterGen (teal) and traditional screening (yellow) in finding novel, stable, and unique structures that satisfy the design requirement of having bulk modulus greater than 400 GPa.

The key advantage of MatterGen over screening is its ability to access the full space of unknown materials. In Figure 4, we show that MatterGen continues to generate more novel candidate materials with high bulk modulus above 400 GPa, for example, which are hard to compress. In contrast, screening baseline saturates due to exhausting known candidates.

Handling compositional disorder

An illustration of a two-dimensional cubic crystal lattice containing two distinct atom types. The primitive cell is ordered and each atomic site is occupied by a single atom type. Another crystal lattice is shown to the right and is compositionally disordered such that each atom site contains either atom type with a probability of one half. — Figure 5: Illustration of compositional disorder. Left: a perfect crystal without compositional disorder and with a repeating unit cell (black dashed). Right: crystal with compositional disorder, where each site has 50% probability of yellow and teal atoms.

Compositional disorder (Figure 5) is a commonly observed phenomenon where different atoms can randomly swap their crystallographic sites in a synthesized material. Recently (opens in new tab), the community has been exploring what it means for a material to be novel in the context of computationally designed materials, as widely employed algorithms will not distinguish between pairs of structures where the only difference is a permutation of similar elements in their respective sites.

We provide an initial solution to this issue by introducing a new structure matching algorithm that considers compositional disorder. The algorithm assesses whether a pair of structures can be identified as ordered approximations of the same underlying compositionally disordered structure. This provides a new definition of novelty and uniqueness, which we adopt in our computational evaluation metrics. We also make our algorithm publicly available (opens in new tab) as part of our evaluation package.

Experimental lab verification

A photo that shows a scientist in a laboratory working at a bench and holding a small sample with tweezers. — *Figure 6: Experimental validation of the proposed compound, TaCr2O6*

In addition to our extensive computational evaluation, we have validated MatterGen’s capabilities through experimental synthesis. In collaboration with the team led by Prof Li Wenjie from the Shenzhen Institutes of Advanced Technology (opens in new tab) (SIAT) of the Chinese Academy of Sciences, we have synthesized a novel material, TaCr2O6, whose structure was generated by MatterGen after conditioning the model on a bulk modulus value of 200 GPa. The synthesized material’s structure aligns with the one proposed by MatterGen, with the caveat of compositional disorder between Ta and Cr. Additionally, we experimentally measure a bulk modulus of 169 GPa against the 200 GPa given as design specification, with a relative error below 20%, very close from an experimental perspective. If similar results can be translated to other domains, it will have a profound impact on the design of batteries, fuel cells, and more.

AI emulator and generator flywheel

MatterGen presents a new opportunity for AI accelerated materials design, complementing our AI emulator MatterSim. MatterSim follows the fifth paradigm of scientific discovery, significantly accelerating the speed of material properties’ simulations. MatterGen in turn accelerates the speed of exploring new material candidates with property guided generation. MatterGen and MatterSim can work together as a flywheel to speed up both the simulation and exploration of novel materials.

Making MatterGen available

We believe the best way to make an impact in materials design is to make our model available to the public. We release the source code of MatterGen (opens in new tab) under the MIT license, together with the training and fine-tuning data. We welcome the community to use and build on top of our model.

Looking ahead

MatterGen represents a new paradigm of materials design enabled by generative AI technology. It explores a significantly larger space of materials than screening-based methods. It is also more efficient by guiding materials exploration with prompts. Similar to how generative AI has impacted drug discovery (opens in new tab), it will have profound impact on how we design materials in broad domains including batteries, magnets, and fuel cells.

We plan to continue our work with external collaborators to further develop and validate the technology. “At the Johns Hopkins University Applied Physics Laboratory (APL), we’re dedicated to the exploration of tools with the potential to advance discovery of novel, mission-enabling materials. That’s why we are interested in understanding the impact that MatterGen could have on materials discovery,” said Christopher Stiles, a computational materials scientists leading multiple materials discovery efforts at APL.

Acknowledgement

This work is the result of highly collaborative team efforts at Microsoft Research AI for Science. The full authors include: Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Zilong Wang, Aliaksandra Shysheya, Jonathan Crabbé, Shoko Ueda, Roberto Sordillo, Lixin Sun, Jake Smith, Bichlien Nguyen, Hannes Schulz, Sarah Lewis, Chin-Wei Huang, Ziheng Lu, Yichi Zhou, Han Yang, Hongxia Hao, Jielan Li, Chunlei Yang, Wenjie Li, Ryota Tomioka, Tian Xie.

The post MatterGen: A new paradigm of materials design with generative AI appeared first on Microsoft Research.

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

January 14, 2025

by Alyssa Hughes (2ADAPTIVE LLC dba 2A Consulting) Microsoft AI

The v0.4 update introduces a cohesive AutoGen ecosystem that includes the framework, developer tools, and applications. The framework’s layered architecture clearly defines each layer’s functionality. It supports both first-party and third-party applications and extensions.

Over the past year, our work on AutoGen has highlighted the transformative potential of agentic AI and multi-agent applications. Today, we are excited to announce AutoGen v0.4, a significant milestone informed by insights from our community of users and developers. This update represents a complete redesign of the AutoGen library, developed to improve code quality, robustness, generality, and scalability in agentic workflows.

The initial release of AutoGen generated widespread interest in agentic technologies. At the same time, users struggled with architectural constraints, an inefficient API compounded by rapid growth, and limited debugging and intervention functionality. Feedback highlighted the need for stronger observability and control, more flexible multi-agent collaboration patterns, and reusable components. AutoGen v0.4 addresses these issues with its asynchronous, event-driven architecture.

This update makes AutoGen more robust and extensible, enabling a broader range of agentic scenarios. The new framework includes the following features, inspired by feedback from both within and outside Microsoft.

Asynchronous messaging: Agents communicate through asynchronous messages, supporting both event-driven and request/response interaction patterns.
Modular and extensible: Users can easily customize systems with pluggable components, including custom agents, tools, memory, and models. They can also build proactive and long-running agents using event-driven patterns.
Observability and debugging: Built-in metric tracking, message tracing, and debugging tools provide monitoring and control over agent interactions and workflows, with support for OpenTelemetry for industry-standard observability.
Scalable and distributed: Users can design complex, distributed agent networks that operate seamlessly across organizational boundaries.
Built-in and community extensions: The extensions module enhances the framework’s functionality with advanced model clients, agents, multi-agent teams, and tools for agentic workflows. Community support allows open-source developers to manage their own extensions.
Cross-language support: This update enables interoperability between agents built in different programming languages, with current support for Python and .NET and additional languages in development.
Full type support: Interfaces enforce type checks at build time, improving robustness and maintaining code quality.

New AutoGen framework

As shown in Figure 1, the AutoGen framework features a layered architecture with clearly defined responsibilities across the framework, developer tools, and applications. The framework comprises three layers: core, agent chat, and first-party extensions.

Core: The foundational building blocks for an event-driven agentic system.
AgentChat: A task-driven, high-level API built on the core layer, featuring group chat, code execution, pre-built agents, and more. This layer is most similar to AutoGen v0.2 (opens in new tab), making it the easiest API to migrate to.
Extensions: Implementations of core interfaces and third-party integrations, such as the Azure code executor and OpenAI model client.

Developer tools

In addition to the framework, AutoGen 0.4 includes upgraded programming tools and applications, designed to support developers in building and experimenting with AutoGen.

AutoGen Bench: Enables developers to benchmark their agents by measuring and comparing performance across tasks and environments.

AutoGen Studio: Rebuilt on the v0.4 AgentChat API, this low-code interface enables rapid prototyping of AI agents. It introduces several new capabilities:

Real-time agent updates: View agent action streams in real time with asynchronous, event-driven messages.
Mid-execution control: Pause conversations, redirect agent actions, and adjust team composition. Then seamlessly resume tasks.
Interactive feedback through the UI: Add a UserProxyAgent to enable user input and guidance during team runs in real time.
Message flow visualization: Understand agent communication through an intuitive visual interface that maps message paths and dependencies.
Drag-and-drop team builder: Design agent teams visually using an interface for dragging components into place and configuring their relationships and properties.
Third-party component galleries: Import and use custom agents, tools, and workflows from external galleries to extend functionality.

Magentic-One: A new generalist multi-agent application to solve open-ended web and file-based tasks across various domains. This tool marks a significant step toward creating agents capable of completing tasks commonly encountered in both work and personal contexts.

Migrating to AutoGen v0.4

We implemented several measures to facilitate a smooth upgrade from the previous v0.2 API, addressing core differences in the underlying architecture.

First, the AgentChat API maintains the same level of abstraction as v0.2, making it easy to migrate existing code to v0.4. For example, AgentChat offers an AssistantAgent and UserProxy agent with similar behaviors to those in v0.2. It also provides a team interface with implementations like RoundRobinGroupChat and SelectorGroupChat, which cover all the capabilities of the GroupChat class in v0.2. Additionally, v0.4 introduces many new functionalities, such as streaming messages, improved observability, saving and restoring task progress, and resuming paused actions where they left off.

For detailed guidance, refer to the migration guide (opens in new tab).

Looking forward

This new release sets the stage for a robust ecosystem and strong foundation to drive advances in agentic AI application and research. Our roadmap includes releasing .NET support, introducing built-in, well-designed applications and extensions for challenging domains, and fostering a community-driven ecosystem. We remain committed to the responsible development of AutoGen and its evolving capabilities.

We encourage you to engage with us on AutoGen’s Discord server (opens in new tab) and share feedback on the official AutoGen repository (opens in new tab) via GitHub Issues. Stay up to date with frequent AutoGen updates via X.

Acknowledgments

We would like to thank the many individuals whose ideas and insights helped formalize the concepts introduced in this release, including Rajan Chari, Ece Kamar, John Langford, Ching-An Chen, Bob West, Paul Minero, Safoora Yousefi, Will Epperson, Grace Proebsting, Enhao Zhang, and Andrew Ng.

The post AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness appeared first on Microsoft Research.

AIOpsLab: Building AI agents for autonomous clouds

December 20, 2024

by Brenda Potts Microsoft AI

graphical user interface, application, icon

In our increasingly complex digital landscape, enterprises and cloud providers face significant challenges in the development, deployment, and maintenance of sophisticated IT applications. The broad adoption of microservices and cloud-based serverless architecture has streamlined certain aspects of application development while simultaneously introducing a host of operational difficulties, particularly in fault diagnosis and mitigation. These complexities can result in outages, which have the potential to cause major business disruptions, underscoring the critical need for robust solutions that ensure high availability and reliability in cloud services. As the expectation for five-nines availability grows, organizations must navigate the intricate web of operational demands to maintain customer satisfaction and business continuity.

To tackle these challenges, recent research on using AIOps agents for cloud operations—such as AI agents for incident root cause analysis (RCA) or triaging—has relied on proprietary services and datasets. Other prior works use frameworks specific to the solutions that they are building, or ad hoc and static benchmarks and metrics that fail to capture the dynamic nature of real-world cloud services. Users developing agents for cloud operations tasks with Azure AI Agent Service can evaluate and improve them using AIOpsLab. Furthermore, current approaches do not agree on standard metrics or a standard taxonomy for operational tasks. This calls for a standardized and principled research framework for building, testing, comparing, and improving AIOps agents. The framework should allow agents to interact with realistic service operation tasks in a reproducible manner. It must be flexible in extending to new applications, workloads, and faults. Importantly, it should go beyond just evaluating the AI agents and enabling users to improve the agents themselves; for example, by providing sufficient observability and even serving as a training environment (“gym”) to generate samples to learn on.

We developed AIOpsLab, a holistic evaluation framework for researchers and developers, to enable the design, development, evaluation, and enhancement of AIOps agents, which also serves the purpose of reproducible, standardized, interoperable, and scalable benchmarks. AIOpsLab is open sourced at GitHub (opens in new tab) with the MIT license, so that researchers and engineers can leverage it to evaluate AIOps agents at scale. The AIOpsLab research paper has been accepted at SoCC’24 (the annual ACM Symposium on Cloud Computing).

Flowchart of an AIOpsLab system. The chart is divided into four main sections: AIOps Tasks, Orchestrator, Problem Cache, and Service. AIOps Tasks list various applications like SocialNetwork, HotelReservation, E-Commerce, and others, each with associated Data, Actions, Metrics. These tasks connect to the Orchestrator. The Orchestrator is the central element and interacts with various components: it receives a Problem Query Q, detailing Problem, Task T, Workload W, Fault F, and Solution S. It is responsible for deploying or running the workload and injecting faults, as well as taking actions based on the Service State relayed by an Agent. The Problem Cache connects to a Workload Generator and a Fault Generator, creating Workload W for the Service. The Service component shows observability through Traces, Metrics, and Logs. It communicates with the Orchestrator to provide service state updates. The components are connected with arrows that indicate the flow of data and control between each part of the system. — Figure 1. System architecture of AIOpsLab.

Agent-cloud interface (ACI)

AIOpsLab strictly separates the agent and the application service using an intermediate orchestrator. It provides several interfaces for other system parts to integrate and extend. First, it establishes a session with an agent to share information about benchmark problems: (1) the problem description, (2) instructions (e.g., response format), and (3) available APIs to call as actions.

The APIs are a set of documented tools, e.g., get logs, get metrics, and exec shell, designed to help the agent solve a task. There are no restrictions on the agent’s implementation; the orchestrator poses problems and polls it for the next action to perform given the previous result. Each action must be a valid API call, which the orchestrator validates and carries out. The orchestrator has privileged access to the deployment and can take arbitrary actions (e.g., scale-up, redeploy) using appropriate tools (e.g., helm, kubectl) to resolve problems on behalf of the agent. Lastly, the orchestrator calls workload and fault generators to create service disruptions, which serve as live benchmark problems. AIOpsLab provides additional APIs to extend to new services and generators.

Example shows how to onboard an agent to AIOpsLab

from aiopslab import Orchestrator
class Agent:
    def __init__(self, prob, instructs, apis):
        self.prompt = self.set_prompt(prob, instructs, apis)
        self.llm = GPT4()

    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

#initialize the orchestrator
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

#register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))

Service

AIOpsLab abstracts a diverse set of services to reflect the variance in production environments. This includes live, running services that are implemented using various architectural principles, including microservices, serverless, and monolithic.

We also leverage open-sourced application suites such as DeathStarBench as they provide artifacts, like source code and commit history, along with run-time telemetry. Adding tools like BluePrint can help AIOpsLab scale to other academic and production services.

Workload generator

The workload generator in AIOpsLab plays a crucial role by creating simulations of both faulty and normal scenarios. It receives specifications from the orchestrator, such as the task, desired effects, scale, and duration. The generator can use a model trained on real production traces to generate workloads that align with these specifications. Faulty scenarios may simulate conditions like resource exhaustion, exploit edge cases, or trigger cascading failures, inspired by real incidents. Normal scenarios mimic typical production patterns, such as daily activity cycles and multi-user interactions. When various characteristics (e.g., service calls, user distribution, arrival times) can lead to the desired effect, multiple workloads can be stored in the problem cache for use by the orchestrator. In coordination with the fault generator, the workload generator can also create complex fault scenarios with workloads.

Fault generator

AIOpsLab has a novel push-button fault generator designed for generic applicability across various cloud scenarios. Our approach integrates application and domain knowledge to create adaptable policies and “oracles” compatible with AIOps scenarios. This includes fine-grained fault injection capable of simulating complex failures inspired by production incidents. Additionally, it can inject faults at various system levels, exposing root causes while maintaining semantic integrity and considering interdependencies between cloud microservices. The fault injector’s versatility can enhance the reliability and robustness of cloud systems by enabling thorough testing and evaluation of AIOps capabilities.

Observability

AIOpsLab is equipped with an extensible observability layer designed to provide comprehensive monitoring capabilities across various system layers for any AIOps tool. AIOpsLab collects a wide array of telemetry data, including (1) traces from Jaeger detailing the end-to-end paths of requests through distributed systems, (2) application logs formatted and recorded by Filebeat and Logstash, and (3) system metrics monitored by Prometheus. Additionally, AIOpsLab also captures lower-level system information such as syscall logs and cluster information. As mentioned, we handle potential data overload through flexible APIs to tune the telemetry data relevant to the AIOps tools.

AIOpsLab currently supports four key tasks within the AIOps domain. These tasks include incident detection, localization, root cause diagnosis, and mitigation. Additionally, we support several popular agent frameworks: React, Autogen, and TaskWeaver. Two key insights from the study emphasize the importance of observability and well-designed ACI. Observability is crucial for clear root-cause diagnosis, for instance, pinpointing a misconfigured API gateway can be vital in preventing service downtime.

Flexibility is another critical factor; the ability to execute arbitrary shell commands allowed for effective troubleshooting in real-time scenarios. Lastly, robust error handling is essential—providing agents with high-quality feedback on execution barriers, such as a failed database connection, ensures swift resolution and continuous improvement.

Example of ACI on diagnosis task

Agent: I need to first look at the logs to understand if there are any errors or warnings that could indicate problems. 

Action: 

``` 

get_logs("compose-post-service", "test-social-network") 

```
Service: [2024-Aug-04 23:18:49.365494] <info>: 

… 

Thrift: Sun Aug 4 23:19:19 2024 TSocket::open() connect() 
<Host: user-service Port: 9090>: Connection refusedAgent: … unable to connect to the `user-service` at port 9090, even though the pod is running. This suggests a possible network issue or misconfiguration in service discovery

Next steps

This research project adopts Microsoft security standards and Responsible AI principles, and we envision this research evolving into a vital resource for organizations aiming to optimize their IT operations. Additionally, we plan to collaborate with various generative AI teams to incorporate AIOpsLab as a benchmark scenario for evaluating state-of-the-art models. By doing so, we aim to foster innovation and encourage the development of more advanced AIOps solutions. This research is essential not only for IT professionals but also for anyone invested in the future of technology, as it has the potential to redefine how organizations manage operations, respond to incidents, and ultimately serve their customers in an increasingly automated world.

Acknowledgements

We would like to thank Yinfang Chen, Manish Shetty, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, and Suman Nath, for contributing to this project.

The post AIOpsLab: Building AI agents for autonomous clouds appeared first on Microsoft Research.

Ideas: AI and democracy with Madeleine Daepp and Robert Osazuwa Ness

December 19, 2024

by Brenda Potts Microsoft AI

Illustrated headshots of Ginny Badanes, Madeleine Daepp and Robert Ness

In 2024, with advancements in generative AI continuing to reach new levels and the world experiencing its “biggest election year in history (opens in new tab),” could there possibly be a better time to examine the technology’s emerging role in global democracies? Inspired by the moment, senior researchers Madeleine Daepp (opens in new tab) and Robert Osazuwa Ness (opens in new tab) conducted research in Taiwan, studying the technology’s influence on disinformation, and in India, documenting its impact on digital communications more broadly. In this episode, Daepp and Ness join guest host Ginny Badanes (opens in new tab), general manager of the Democracy Forward program at Microsoft. They discuss how leveraging commonly understood language such as fraud can help people understand potential risks associated with generative AI; the varied ways in which Daepp and Ness saw the tech being deployed to promote or discredit candidates; and the opportunities for the technology to be a force for fortifying democracy.

Learn more: 

Video will kill the truth if monitoring doesn’t improve, argue two researchers (opens in new tab)
The Economist, March 2024

Microsoft Research Special Projects
Group homepage

Democracy Forward
Program homepage, Microsoft Corporate Social Responsibility

As the US election nears, Russia, Iran and China step up influence efforts (opens in new tab)
Microsoft On the Issues blog, October 2024

Combatting AI Deepfakes: Our Participation in the 2024 Political Conventions (opens in new tab)
Microsoft On the Issues blog, July 2024

China tests US voter fault lines and ramps AI content to boost its geopolitical interests (opens in new tab)
Microsoft On the Issues, April 2024

Project Providence (opens in new tab)
Project homepage

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

MADELEINE DAEPP: Last summer, I was working on all of these like pro-democracy applications, trying to build out, like, a social data collection tool with AI, all this kind of stuff. And I went to the elections workshop that the Democracy Forward team at Microsoft had put on, and Dave Leichtman, who, you know, was the MC of that work, was really talking about how big of a global elections year 2024 was going to be. Over 70 countries around the world. And, you know, we’re coming from Microsoft Research, where we were so excited about this technology. And then, all of a sudden, I was at the elections workshop, and I thought, oh no, [LAUGHS] like, this is not good timing.

ROBERT OSAZUWA NESS: What are we really talking about in the context of deepfakes in the political context, elections context? It’s deception, right. I’m trying to use this technology to, say, create some kind of false record of events in order to convince people that something happened that actually did not happen. And so that goal of deceiving, of creating a false record, that’s kind of how I have been thinking about deepfakes in contrast to the broader category of generative AI.

[TEASER ENDS]

GINNY BADANES: Welcome to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.

[MUSIC FADES]

I’m your guest host, Ginny Badanes, and I lead Microsoft’s Democracy Forward program, where we’ve spent the past year deeply engaged in supporting democratic elections around the world, including the recent US elections. We have been working on everything from raising awareness of nation-state propaganda efforts to helping campaigns and election officials prepare for deepfakes to protecting political campaigns from cyberattacks. Today, I’m joined by two researchers who have also been diving deep into the impact of generative AI on democracy.

Microsoft senior researchers Madeleine Daepp and Robert Osazuwa Ness are studying generative AI’s influence in the political sphere with the goal of making AI systems more robust against misuse while supporting the development of AI tools that can strengthen democratic processes and systems. They spent time in Taiwan and India earlier this year, where both had big democratic elections. Madeleine and Robert, welcome to the podcast!

MADELEINE DAEPP: Thanks for having us.

ROBERT OSAZUWA NESS: Thanks for having us.

BADANES: So I have so many questions for you all—from how you conducted your research to what you’ve learned—and I’m really interested in what you think comes next. But first, let’s talk about how you got involved in this in the first place. Could you both start by telling me a little bit about your backgrounds and just what got you into AI research in the first place?

DAEPP: Sure. So I’m a senior researcher here at Microsoft Research in the Special Projects team. But I did my PhD at MIT in urban studies and planning. And I think a lot of folks hear that field and think, oh, you know, housing, like upzoning housing and figuring out transportation systems. But it really is a field that’s about little “d” democracy, right. About how people make choices about shared public spaces every single day. You know, I joined Microsoft first off to run this, sort of, technology deployment in the city of Chicago, running a low-cost air-quality-sensor network for the city. And when GPT-4 came out, you know, first ChatGPT, and then we, sort of, had this big recognition of, sort of, how well this technology could do in summarizing and in representing opinions and in making sense of big unstructured datasets, right. I got actually very excited. Like, I thought this could be used for town planning processes. [LAUGHS] Like, I thought we could … I had a whole project with a wonderful intern, Eva Maxfield Brown, looking at, can we summarize planning documents using AI? Can we build out policies from conversations that people have in shared public spaces? And so that was very much the impetus for thinking about how to apply and build things with this amazing new technology in these spaces.

BADANES: Robert, I think your background is a little bit different, yet you guys ended up in a similar place. So how did you get there?

NESS: Yeah, so I’m also on Special Projects, Microsoft Research. My work is focusing on large language models, LLMs. And, you know, so I focus on making these models more reliable and controllable in real-world applications. And my PhD is in statistics. And so I focus a lot on using just basic bread-and-butter statistical methods to try and control and understand LLM behavior. So currently, for example, I’m leading a team of engineers and running experiments designed to find ways to enhance a graphical approach to combining information retrieval in large language models. I work on statistical tests for testing significance of adversarial attacks on these models.

BADANES: Wow.

NESS: So, for example, if you find a way to trick one of these models into doing something it’s not supposed to do, I make sure that it’s not, like, a random fluke; that it’s something that’s reproducible. And I also work at this intersection between generative AI and, you know, Bayesian stuff, causal inference stuff. And so I came at looking at this democracy work through an alignment lens. So alignment is this task in AI of making sure these models align with human values and goals. And what I was seeing was a lot of research in the alignment space was viewing it as a technical problem. And, you know, as a statistician, we’re trained to consult, right. Like, to go to the actual stakeholders and say, hey, what are your goals? What are your values? And so this democracy work was an opportunity to do that in Microsoft Research and connected with Madeleine. So she was planning to go to Taiwan, and kind of from a past life, I wanted to become a trade economist and learned Mandarin. And so I speak fluent Mandarin and seemed like a good matchup of our skill sets …

BADANES: Yeah.

NESS: … and interests. And so that’s, kind of, how we got started.

BADANES: So, Madeleine, you brought the two of you together, but what started it for you? This podcast is all about big ideas. What sparked the big idea to bring this work that you’ve been doing on generative AI into the space of democracy and then to go out and find Robert and match up together?

DAEPP: Yeah, well, Ginny, it was you. [LAUGHS] It was actually your team.

BADANES: I didn’t plant that! [LAUGHS]

DAEPP: So, you know, I think last summer, I was working on all of these like pro-democracy applications, trying to build out, like, a social data collection tool with AI, all this kind of stuff. And I went to the elections workshop that the Democracy Forward team at Microsoft had put on, and Dave Leichtman, who, you know, was the MC of that work, was really talking about how big of a global elections year 2024 was going to be, that this—he was calling it “Votorama.” You know, that term didn’t take off. [LAUGHTER] The term that has taken off is biggest election year in history, right. Over 70 countries around the world. And, you know, we’re coming from Microsoft Research, where we were so excited about this technology. Like, when it started to pass theory of mind tests, right, which is like the ability to think about how other people are thinking, like, we were all like, oh, this is amazing; this opens up so many cool application spaces, right. When it was, like, passing benchmarks for multilingual communication, again, like, we were so excited about the prospect of building out multilingual systems. And then, all of a sudden, I was at the elections workshop, and I thought, oh no, [LAUGHS] this is not good timing.

BADANES: Yeah …

DAEPP: And because so much of my work focuses on, you know, building out computer science systems like, um, data science systems or AI systems but with communities in the loop, I really wanted to go to the folks most affected by this problem. And so I proposed a project to go to Taiwan and to study one of the … it was the second election of 2024. And Taiwan is known to be subject to more external disinformation than any other place in the world. So if you were going to see something anywhere, you would see it there. Also, it has amazing civil society response so really interesting people to talk to. But I do not speak, Chinese, right. Like, I don’t have the context; I don’t speak the language. And so part of my process is to hire a half-local team. We had an amazing interpreter, Vickie Wang, and then a wonderful graduate student, Ti-Chung Cheng, who supported this work. But then also my team, Special Projects, happened to have this person who, like, not only is a leading AI researcher publishing in NeurIPS, like building out these systems, but who also spoke Chinese, had worked in technology security, and had a real understanding of international studies and economics as well as AI. And so for me, like, finding Robert as a collaborator was kind of a unicorn moment.

BADANES: So it sounds like it was a match made in heaven of skill sets and abilities. Before we get into what you all found there, which I do want to get into, I first think it’s helpful—I don’t know, when we’re dealing with these, like, complicated issues, particularly things that are moving and changing really quickly, sometimes I found it’s helpful to agree on definitions and sort of say, this is what we mean when we say this word. And that helps lead to understanding. So while I know that this research is about more than deepfakes—and we’ll talk about some of the things that are more than deepfakes—I am curious how you all define that term and how you think of it. Because this is something that I think is constantly moving and changing. So how have you all been thinking about the definition of that term?

NESS: So I’ve been thinking about it in terms of the intention behind it, right. We say deepfake, and I think colloquially that means kind of all of generative AI. That’s a bit unfortunate because there are things that are … you know, you can use generative AI to generate cartoons …

BADANES: Right.

NESS: … or illustrations for a children’s book. And so in thinking about what are we really talking about in the context of deepfakes in the political context, elections context, it’s deception, right. I’m trying to use this technology to, say, create some kind of false record of events, say, for example, something that a politician says, in order to convince people that something happened that actually did not happen.

BADANES: Right.

NESS: And so that goal of deceiving, of creating a false record, that’s kind of how I have been thinking about deepfakes in contrast to the broader category of generative AI and deepfakes in terms of being a malicious use case. There are other malicious use cases that don’t necessarily have to be deceptive, as well, as well as positive use cases.

BADANES: Well, that really, I mean, that resonates with me because what we found was when you use the term deception—or another term we hear a lot that I think works is fraud—that resonates with other people, too. Like, that helps them distinguish between neutral uses or even positive uses of AI in this space and the malicious use cases, though to your point, I suppose there’s probably even deeper definitions of what malicious use could look like. Are you finding that distinction showing up in your work between fraud and deception in these use cases? Is that something that has been coming through?

DAEPP: You know, we didn’t really think about the term fraud until we started prepping for this interview with you. As Robert said, so much of what we were thinking about in our definition was this representation of people or events, you know, done in order to deceive and with malicious intent. But in fact, in all of our conversations, no matter who we were talking to, no matter what political bent, no matter, you know, national security, fact-checking, et cetera, you know, they all agreed that using AI for the purposes of scamming somebody financially was not OK, right. That’s fraud. Using AI for the purposes of nudifying, like removing somebody’s clothes and then sextorting them, right, extorting them for money out of fear that this would be shared, like, that was not OK. And those are such clear lines. And it was clear that there’s a set of uses of generative AI also in the political space, you know, of saying this person said something that they didn’t, …

BADANES: Mm-hmm.

DAEPP: … of voter suppression, that in general, there’s a very clear line that when it gets into that fraudulent place, when it gets into that simultaneously deceptive and malicious space, that’s very clearly a no-go zone.

NESS: Oftentimes during this research, I found myself thinking about this dichotomy in cybersecurity of state actors, or broadly speaking, kind of, political actors, versus criminals.

BADANES: Right.

NESS: And it’s important to understand the distinction because criminals are typically trying to target targets of opportunity and make money, while state-sponsored agents are willing to spend a lot more money and have very specific targets and have a very specific definition of success. And so, like, this fraud versus deception kind of feels like that a little bit in the sense that fraud is typically associated with criminal behavior, while, say, I might put out deceptive political messaging, but it might fall within the bounds of free speech within my country.

BADANES: Right, yeah.

NESS: And so this is not to say I disagree with that, but it just, actually, that it could be a useful contrast in terms of thinking about the criminal versus the political uses, both legitimate and illegitimate.

BADANES: Well, I also think those of us who work in the AI space are dealing in very complicated issues that the majority of the world is still trying to understand. And so any time you can find a word that people understand immediately in order to do the, sort of, storytelling: the reason that we are worried about deepfakes in elections is because we do not want voters to be defrauded. And that, we find really breaks through because people understand that term already. That’s a thing that they already know that they don’t want to be; they do not want to be defrauded in their personal life or in how they vote. And so that really, I found, breaks through. But as much as I have talked about deepfakes, I know that you—and I know there’s a lot of interest in talking about deepfakes when we talk about this subject—but I know your research goes beyond that. So what other forms of generative AI did you include in your research or did you encounter in the effort that you were doing both in Taiwan and India?

DAEPP: Yeah. So let me tell you just, kind of, a big overview of, like, our taxonomy. Because as you said, like, so much of this is just about finding a word, right. Like, so much of it is about building a shared vocabulary so that we can start to have these conversations. And so when we looked at the political space, right, elections, so much of what it means to win an election is kind of two things. It’s building an image of a candidate, right, or changing the image of your opposition and telling a story, right.

BADANES: Mm-hmm.

DAEPP: And so if you think about image creation, of course, there are deepfakes. Like, of course, there are malicious representations of a person. But we also saw a lot of what we’re calling auth fakes, like authorized fakes, right. Candidates who would actually go to a consultancy and, like, get their bodies scanned so that videos could be made of them. They’d get their voices, a bunch of snippets of their voices, recorded so that then there could be personalized phone calls, right. So these are authorized uses of their image and likeness. Then we saw a term I’ve heard in, sort of, the ether is soft fakes. So again, likenesses of a candidate, this time not necessarily authorized but promotional. They weren’t … people on Twitter—I guess, X—on Instagram, they were sharing images of the candidate that they supported that were really flattering or silly or, you know, just really sort of in support of that person. So not with malicious intent, right, with promotional intent. And then the last one, and this, I think, was Robert’s term, but in this image creation category, you know, one thing we talked about was just the way that people were also making fun of candidates. And in this case, this is a bit malicious, right. Like, they’re making fun of people; they’re satirizing them. But it’s not deceptive because, …

BADANES: Right …

DAEPP: … you know, often it has that hyper-saturated meme aesthetic. It’s very clearly AI or just, you know, per like, sort of, US standards for satire, like, a reasonable person would know that it was silly. And so Robert said, you know, oh, these influencers, they’re not trying to deceive people; like, they’re not trying to lie about candidates. They’re trying to roast them. [LAUGHTER] And so we called it a deep roast. So that’s, kind of, the images of candidates. I will say we also looked at narrative building, and there, one really important set of things that we saw was what we call text to b-roll. So, you know, a lot of folks think that you can’t really make AI videos because, like, Sora isn’t out yet[1]. But in fact, what there is a lot of is tooling to, sort of, use AI to pull from stock imagery and b-roll footage and put together a 90-second video. You know, it doesn’t look like AI; it’s a real video. So text to b- roll, AI pasta? So if you know the threat intelligence space, there’s this thing called copy pasta, where people just …

BADANES: Sure.

DAEPP: … it’s just a fun word for copy-paste. People just copy-paste terms in order to get a hashtag trending. And we talked to an ex-influencer who said, you know, we’re using AI to do this. And I asked him why. And he said, well, you know, if you just do copy-paste, the fact-checkers catch it. But if you use AI, they don’t. And so AI pasta. And there’s also some research showing that this is potentially more persuasive than copy-paste …

BADANES: Interesting.

DAEPP: … because people think there’s a social consensus. And then the last one, this is my last of the big taxonomy, and, Robert, of course, jump in on anything you want to go deeper on, but Fake News 2.0. You know, I’m sure you’ve seen this, as well. Just this, like, creation of news websites, like entire new newspapers that nobody’s ever heard of. AI avatars that are newscasters. And this is something that was happening before. Like, there’s a long tradition of pretending to be a real news pamphlet or pretending to be a real outlet. But there’s some interesting work out of … Patrick Warren at Clemson has looked at some of these and shown the quality and quantity of articles on these things has gotten a lot better and, you know, improves as a step function of, sort of, when new models come out.

NESS: And then on the flip side, you have people using the same technologies but stated clearly that it’s AI generated, right. So we mentioned the AI avatars. In India, there’s this … there’s Bhoomi, which is a AI news anchor for agricultural news, and it states there in clear terms that she’s not real. But of course, somebody who wanted to be deceptive could use the same technology to portray something that looks like a real news broadcast that isn’t. You know, and, kind of, going back, Madeleine mentioned deep roasts, right, so, kind of, using this technology to create satirical depictions of, say, a political opponent. Somebody, a colleague, sent something across my desk. It was a Douyin account—so Douyin is the version of TikTok that’s used inside China; …

BADANES: OK.

NESS: … same company, but it’s the internal version of TikTok—that was posting AI-generated videos of politicians in Taiwan. And these were excellent, real good-quality AI-generated deepfakes of these politicians. But some of them were, first off, on the bottom of all of them, it said, this is AI-generated content.

BADANES: Oh.

NESS: And some of them were, kind of, obviously meant to be funny and were clearly fake, like still images that were animated to make somebody singing a funny song, for example. A very serious politician singing a very silly song. And it’s a still image. It’s not even, it’s not even …

BADANES: a video.

NESS: …like video.

BADANES: Right, right.

NESS: And so I messaged Puma Shen, who is one of the legislators in Taiwan who was targeted by these attacks, and I said, what do you think about this? And, you know, he said, yeah, they got me. [LAUGHTER] And I said, you know, do you think people believe this? I mean, there are people who are trying to debunk it. And he said, no, our supporters don’t believe it, but, you know, people who support the other side or people who are apolitical, they might believe it, or even if it says it’s fake—they know it’s fake—but they might still say that, yeah, but this is something they would do, right. This is …

BADANES: Yeah, it fits the narrative. Yeah.

NESS: … it fits the narrative, right. And that, kind of, that really, you know, I had thought of this myself, but just hearing somebody, you know, who’s, you know, a politician who’s targeted by these attacks just saying that it’s, like, even if they believe it’s … even if they know it’s fake, they still believe it because it’s something that they would do.

BADANES: Sure.

NESS: That’s, you know, as a form of propaganda, even relative to the canonical idea of deepfake that we have, this could be more effective, right. Like, just say it’s AI and then use it to, kind of, paint the picture of the opponent in any way you like.

BADANES: Sure, and this gets into that, sort of, challenging space I think we find ourselves in right now, which is people don’t know necessarily how to tell what’s real or not. And the case you’re describing, it has labeling, so that should tell you. But a lot of the content we come across online does not have labeling. And you cannot tell just based on your eyes whether images were generated by AI or whether they’re real. One of the things that I get asked a lot is, why can’t we just build good AI to detect bad AI, right? Why don’t we have a solution where I just take a picture and I throw it into a machine and it tells me thumbs-up or thumbs-down if this is AI generated or not? And the question around detection is a really tricky one. I’m curious what you all think about, sort of, the question of, can detection solve this problem or not?

NESS: So I’ll mention one thing. So Madeleine mentioned an application of this technology called text to b-roll. And so what this is, technically speaking, what this is doing is you’re taking real footage, you stick it in a database, it’s quote, unquote “vectorized” into these representations that the AI can understand, and then you say, hey, generate a video that illustrates this narrative for me. And you provide it the text narrative, and then it goes and pulls out a whole bunch of real video from a database and curates them into a short video that you could put on TikTok, for example. So this was a fully AI-generated product, but none of the actual content is synthetic.

BADANES: Ah, right.

NESS: So in that case, your quote, unquote “AI detection tool” is not going to work.

DAEPP: Yeah, I mean, something that I find really fascinating any time that you’re dealing with a sociotechnical system, right—a technical system embedded in social context—is folks, you know, think that things are easy that are hard and things are hard that are easy, right. And so with a lot of the detections work, right, like if you put a deepfake detector out, you make that available to anyone, then what they can do is they can run a bunch of stuff by it, …

BADANES: Yeah.

DAEPP: … add a little bit of random noise, and then the deepfake detector doesn’t work anymore. And so that detection, actually, technically becomes an arms race, you know. And we’re seeing now some detectors that, like, you know, work when you’re not looking at a specific image or a specific piece of text but you’re looking at a lot all at once. That seems more promising. But, just, this is a very, very technically difficult problem, and that puts us as researchers in a really tricky place because, you know, you’re talking to folks who say, why can’t you just solve this? If you put this out, then you have to put the detector out. And we’re like, that’s actually not, that’s not a technically feasible long-term solution in this space. And the solutions are going to be social and regulatory and, you know, changes in norms as well as technical solutions that maybe are about everything outside of AI, right.

BADANES: Yeah.

DAEPP: Not about fixing the AI system but fixing the context within which it’s used.

BADANES: It’s not just a technological solution. There’s more to it. Robert?

NESS: So if somebody were to push back there, they could say, well, great; in the long term, maybe it’s an arms race, but in the short term, right, we can have solutions out there that, you know, at least in the next election cycle, we could maybe prevent some of these things from happening. And, again, kind of harkening back to cybersecurity, maybe if you make it hard enough, only the really dedicated, really high-funded people are going to be doing it rather than, you know, everybody who wants to throw a bunch of deepfakes on the internet. But the problem still there is that it focuses really on video and images, right.

BADANES: Yeah. What about audio?

NESS: What about audio? And what about text? So …

BADANES: Yeah. Those are hard. I feel like we’ve talked a lot about definitions and theoretical, but I want to make sure we talk more about what you guys saw and researched and understood on the ground, in particular, your trips to India and Taiwan and even if you want to reflect on how those compare to the US environment. What did you actually uncover? What surprised you? What was different between those countries?

DAEPP: Yeah, I mean, right, so Taiwan … both of these places are young democracies. And that’s really interesting, right. So like in Taiwan, for example, when people vote, they vote on paper. And anybody can go watch. That’s part of their, like, security strategies. Like, anyone around the world can just come and watch. People come from far. They fly in from Canada and Japan and elsewhere just to watch Taiwanese people vote. And then similarly in India, there’s this rule where you have to be walking distance from your polling place, and so the election takes two months. And, like, your polling places move from place to place, and sometimes, it arrives on an elephant. And so these were really interesting places to, like, I as an American, just, like, found it very, very fascinating to and important to be outside of the American context. You know, we just take for granted that how we do democracy is how other people do it. But Taiwan was very much a joint, like, civil society–government everyday response to this challenge of having a lot of efforts to manipulate public opinion happening with, you know, real-world speeches, with AI, with anything that you can imagine. You know, and I think the Microsoft Threat Analysis Center released a report documenting some of the, sort of, video stuff[2]. There’s a use of AI to create videos the night before the election, things like this. But then India is really thinking of … so India, right, it’s the world’s biggest democracy, right. Like, nearly a billion people were eligible to vote.

BADANES: Yeah.

NESS: And arguably the most diverse, right?

DAEPP: Yeah, arguably the most diverse in terms of languages, contexts. And it’s also positioning itself as the AI laboratory for the Global South. And so folks, including folks at the MSR (Microsoft Research) Bangalore lab, are leaders in thinking about representing low-resource languages, right, thinking about cultural representation in AI models. And so there you have all of these technologists who are really trying to innovate and really trying to think about what’s the next clever application, what’s the next clever use. And so that, sort of, that taxonomy that we talked about, like, I think just every week, every interview, we, sort of, had new things to add because folks there were just constantly trying all different kinds of ways of engaging with the public.

NESS: Yeah, I think for me, in India in particular, you know, India is an engineering culture, right. In terms of, like, the professional culture there, they’re very, kind of, engineering skewed. And so I think one of the bigger surprises for me was seeing people who were very experienced and effective campaign operatives, right, people who would go and, you know, hit the pavement; do door knocking; kind of, segment neighborhoods by demographics and voter block, these people were also, you know, graduated in engineering from an IIT (Indian Institute of Technology), …

BADANES: Sure.

NESS: … right, and so … [LAUGHS] so they were happy to pick up these tools and leverage them to support their expertise in this work, and so some of the, you know, I think a lot of the narrative that we tell ourselves in AI is how it’s going to be, kind of, replacing people in doing their work. But what I saw in India was that people who were very effective had a lot of domain expertise that you couldn’t really automate away and they were the ones who are the early adopters of these tools and were applying it in ways that I think we’re behind on in terms of, you know, ideas in the US.

BADANES: Yeah, I mean, there’s, sort of, this sentiment that AI only augments existing problems and can enhance existing solutions, right. So we’re not great at translation tools, but AI will make us much better at that. But that also can then be weaponized and used as a tool to deceive people, which propaganda is not new, right? We’re only scaling or making existing problems harder, or adversaries are trying to weaponize AI to build on things they’ve already been doing, whether that’s cyberattacks or influence operations. And while the three of us are in different roles, we do work for the same company. And it’s a large technology company that is helping bring AI to the world. At the same time, I think there are some responsibilities when we look at, you know, bad actors who are looking to manipulate our products to create and spread this kind of deceptive media, whether it’s in elections or in other cases like financial fraud or other ways that we see this being leveraged. I’m curious what you all heard from others when you’ve been doing your research and also what you think our responsibilities are as a big tech company when it comes to keeping actors from using our products in those ways.

DAEPP: You know, when I started using GPT-4, one of the things I did was I called my parents, and I said, if you hear me on a phone call, …

BADANES: Yeah.

DAEPP: … like, please double check. Ask me things that only I would know. And when I walk around Building 99, which is, kind of, a storied building in which a lot of Microsoft researchers work, everybody did that call. We all called our parents.

BADANES: Interesting.

DAEPP: Or, you know, we all checked in. So just as, like, we have a responsibility to the folks that we care about, I think as a company, that same, sort of, like, raising literacy around the types of fraud to expect and how to protect yourself from them—I think that gets back to that fraud space that we talked about—and, you know, supporting law enforcement, sharing what needs to be shared, I think that without question is a space that we need to work in. I will say a lot of the folks we talked with, they were using Llama on a local GPU, right.

BADANES: OK.

DAEPP: They were using open-source models. They were sometimes … they were testing out Phi. They would use Phi, Grok, Llama, like anything like that. And so that raises an interesting question about our guardrails and our safety practices. And I think there, we have an, like, our obligation and our opportunity actually is to set the standard, right. To say, OK, like, you know, if you use local Llama and it spouts a bunch of stuff about voter suppression, like, you can get in trouble for that. And so what does it mean to have a safe AI that wins in the marketplace, right? That’s an AI that people can feel confident and comfortable about using and one that’s societally safe but also personally safe. And I think that’s both a challenge and a real opportunity for us.

BADANES: Yeah … oh, go ahead, Robert, yeah …

NESS: Going back to the point about fraud. It was this year, in January, when that British engineering firm Arup, when somebody used a deepfake to defraud that company of about $25 million, …

BADANES: Yeah.

NESS: … their Hong Kong office. And after that happened, some business managers in Microsoft reached out to me regarding a major client who wanted to start red teaming. And by red teaming, I mean intentionally targeting your executives and employees with these types of attacks in order to figure out where your vulnerabilities as an organization are. And I think, yeah, it got me thinking like, wow, I would, you know, can we do this for my dad? [LAUGHS] Because I think that was actually a theme that came out from a lot of this work, which was, like, how can we empower the people who are really on the frontlines of defending democracy in some of these places in terms of the tooling there? So we talked about, say, AI detection tools, but the people who are actually doing fact-checking, they’re looking more than at just the video or the images; they’re actually looking at a, kind of, holistic … taking a holistic view of the news story and doing some proper investigative journalism to see if something is fake or not.

BADANES: Yeah.

NESS: And so I think as a company who creates products, can we take a more of a product mindset to building tools that support that entire workflow in terms of fact-checking or investigative journalism in the context of democratic outcomes …

BADANES: Yeah.

NESS: … where maybe looking at individual deepfake content is just a piece of that.

BADANES: Yeah, you know, I think there’s a lot of parallels here to cybersecurity. That’s also what we’ve found, is this idea that, first of all, the “no silver bullet,” as we were talking about earlier with the detection piece. Like, you can’t expect your system to be secure just because you have a firewall, right. You have to have this, like, defense in-depth approach where you have lots of different layers. And one of those layers has been on the literacy side, right. Training and teaching people not to click on a phishing link, understanding that they should scroll over the URL. Like, these are efforts that have been taken up, sort of, in a broad societal sense. Employers do it. Big tech companies do it. Governments do it through PSAs and other things. So there’s been a concerted effort to get a population who might not have been aware of the fact that they were about to be scammed to now know not to click on that link. I think, you know, you raised the point about literacy. And I think there’s something to be said about media literacy in this space. It’s both AI literacy—understanding what it is—but also understanding that people may try to defraud you. And whether that is in the political sense or in the financial sense, once you have that, sort of, skill set in place, you’re going to be protected. One thing that I’ve heard, though, as I have conversations about this challenge … I’ve heard a couple things back from people specifically in civil society. One is not to put the impetus too much on the end consumer, which I think I’m hearing that we also recognize there’s things that we as technology companies should be focusing on. But the other thing is the concern that in, sort of, the long run, we’re going to all lose trust in everything we see anyway. And I’ve heard some people refer to that as the trust deficit. Have you all seen anything promising in the space to give you a sense around, can we ever trust what we’re looking at again, or are we actually just training everyone to not believe anything they see? Which I hope is not the case. I am an optimist. But I’d love to hear what you all came across. Are there signs of hope here where we might actually have a place where we can trust what we see again?

DAEPP: Yeah. So two things. There is this phenomenon called the liar’s dividend, right, …

BADANES: Sure, yeah.

DAEPP: … which is where that if you educate folks about how AI can be used to create fake clips, fake audio clips, fake videos, then if somebody has a real audio clip, a real video, they can claim that it’s AI. And I think we talk, you know, again, this is, like, in a US-centric space, we talk about this with politicians, but the space in which this is really concerning, I think, is war crimes, right …

BADANES: Oh, yeah.

DAEPP: … I think are these real human rights infractions where you can prevent evidence from getting out or being taken seriously. And we do see that right after invasions, for example, these days. But this is actually a space … like, I just told you, like, oh, like, detection is so hard and not technically, like, that’ll be an arms race! But actually, there is this wonderful project, Project Providence, that is a Microsoft collaboration with a company called Truepic that … it’s, like, an app, right. And what happens is when you take a photo using this app, it encrypts the, you know, hashes the GPS coordinates where the photo was taken, the time, the day, and uploads that with the pixels, with the image, to Azure. And then later, when a journalist goes to use that image, they can see that the pixels are exactly the same, and then they can check the location and they can confirm the GPS. And this actually meets evidentiary standards for the UN human rights tribunal, right.

BADANES: Right.

DAEPP: So this is being used in Ukraine to document war crimes. And so, you know, what if everybody had that app on their phone? That means you don’t … you know, most photos you take, you can use an AI tool and immediately play with. But in that particular situation where you need to confirm provenance and you need to confirm that this was a real event that happened, that is a technology that exists, and I think folks like the C2PA coalition (Coalition for Content Provenance and Authenticity) can make that happen across hardware providers.

NESS: And I think the challenge for me is, we can’t separate this problem from some of the other, kind of, fundamental problems that we have in our media environment now, right. So, for example, if I go on to my favorite social media app and I see videos from some conflicts around the world, and these videos could be not AI generated and I still could be, you know, the target of some PR campaign to promote certain content and suppress other ones. The videos could be authentic videos, but not actually be accurate depictions of what they claim to be. And so I think that this is a … the AI presents a complicating factor in an already difficult problem space. And I think, you know, trying to isolate these different variables and targeting them individually is pretty tricky. I do think that despite the liar’s dividend that media literacy is a very positive area to, kind of, focus energy …

BADANES: Yeah.

NESS: … in the sense that, you know, you mentioned earlier, like, using this term fraud, again, going back to this analogy with cybersecurity and cybercrime, that it tends to resonate with people. We saw that, as well, especially in Taiwan, didn’t we, Madeleine? Well, in India, too, with the sextortion fears. But in Taiwan, a lot of just cybercrime in terms of defrauding people of money. And one of the things that we had observed there was that talking about generative AI in the context of elections was difficult to talk to people about it because people, kind of, immediately went into their political camps, right.

BADANES: Yeah.

NESS: And so you had to, kind of, penetrate … you know, people were trying to, kind of, suss out which side you were on when you’re trying to educate them about this topic.

BADANES: Sure.

NESS: But if you talk to—but everybody’s, like, fraud itself is a lot less partisan.

BADANES: Yeah, it’s a neutral term.

NESS: Exactly. And so it becomes a very useful way to, kind of, get these ideas out there.

BADANES: That’s really interesting. And I love the provenance example because it really gets to the question about authenticity. Like, where did something come from? What is the origin of that media? Where has it traveled over time? And if AI is a component of it, then that’s a noted fact. But it doesn’t put us into the space of AI or not AI, which I think is where a lot of the, sort of, labeling has gone so far. And I understand the instinct to do that. But I like the idea of moving more towards how do you know more about an image of which whether there was AI involved or not is a component but does not have judgment. That does not make the picture good or bad. It doesn’t make it true or false. It’s just more information for you to consume. And then, of course, the media literacy piece, people need to know to look for those indicators and want them and ask for them from the technology company. So I think that’s a good, that’s a good silver lining. You gave me the light at the end of the tunnel I think I was looking for on the post-truth world. So, look, here’s the big question. You guys have been spending this time focusing on AI and democracy in this big, massive global election year. There was a lot of hype. [LAUGHS] There was a lot of hype. Lots of articles written about how this was going to be the AI election apocalypse. What say you? Was it? Was it not?

NESS: I think it was, well, we definitely have documented cases where this happened. And I’m wary of this question, particularly again from the cybersecurity standpoint, which is if you were not the victim of a terrible hack that brought down your entire company, would you say, like, well, it didn’t happen, so it’s not going to happen, right. You would never …

BADANES: Yeah.

NESS: That would be a silly attitude to have, right. And also, you don’t know what you don’t know, right. So, like, a lot of the, you know, we mentioned sextortion; we mentioned these cybercrimes. A lot of these are small-dollar crimes, which means they don’t get reported or they don’t get reported for reasons of shame. And so we don’t even have numbers on a lot of that. And we know that the political techniques are going to mirror the criminal techniques.

BADANES: Yeah.

NESS: And also, I worry about, say, down-ballot elections. Like, so much of, kind of, our election this year, a lot of the focus was on the national candidates, but, you know, if local poll workers are being targeted, if disinformation campaigns are being put out about local candidates, it’s not going get the kind of play in the national media such that you and I might hear about it. And so I’m, you know, so I’ll hand it off to Madeleine, but yeah.

DAEPP: So absolutely agree with Robert’s point, right. If your child was affected by sextortion, if you are a country that had an audio clip go viral, this was the deepfake deluge for you, right. That said, something that happened, you know, in India as in the United States, there were major prosecutions very early on, right.

BADANES: Yeah.

DAEPP: So in India, there was a video. It turned out not to be a deepfake. It turned out to be a “cheap fake,” to your point about, you know, the question isn’t whether there’s AI involved; the question is whether this is an attempt to defraud. And five people were charged for this video.

BADANES: Yeah.

DAEPP: And in the United States, right, those Biden robocalls using Biden’s voice to tell folks not to vote, like, that led to a million-dollar fine, I think, for the telecoms and $6 million for the consultant who created that. And when we talk to people in India, you know, people who work in this space, they said, well, I’m not going to do that; like, I’m going to focus on other things. So internal actors pay attention to these things. That really changes what people do and how they do it. And so that, I do think the work that your team did, right, to educate candidates about looking out for the stuff, the work that the MTAC (Microsoft Threat Analysis Center) did to track usage and report it, all of that, I think, was, actually, those interventions, I think, worked. I think they were really important, and I do think that what we are … this absence of a deluge is actually a huge number of people making a very concerted effort to prevent it from happening.

BADANES: That’s encouraging.

NESS: Madeleine, you made a really important point that this deterrence from prosecution, it’s effective for internal actors, …

BADANES: Yeah.

DAEPP: Yeah, that’s right.

NESS: … right. So for foreign states who are trying to interfere with other people’s elections, the fear of prosecution is not going to be as much of a deterrent.

BADANES: That is true. I will say what we saw in this election cycle, in particular in the US, was a concerted effort by the intelligence community to call out and name nation-state actors who were either doing cyberattacks or influence operations, specific videos that they identified, whether there was AI involved or not. I think that level of communication with the public while maybe doesn’t lead to those actors going to jail—maybe someday—but does in fact lead to a more aware public and therefore hopefully a less effective campaign. If people on the other end … and it’s a little bit into the literacy space, and it’s something that we’ve seen government again in this last cycle do very effectively, to name and shame essentially when they see these things in part, though, to make sure voters are aware of what’s happening. We’re not quite through this big global election year; we have a couple more elections before we really hit the end of the year, but it’s winding down. What is next for you all? Are you all going to continue this work? Are you going build on it? What comes next?

DAEPP: So our research in India actually wasn’t focused specifically on elections. It was about AI and digital communications.

BADANES: Ahh.

DAEPP: Because, you know, again, like India is this laboratory.

BADANES: Sure.

DAEPP: And I think what we learned from that work is that, you know, this is going to be a part of our digital communications and our information system going forward without question. And the question is just, like, what are the viable business models, right? What are the applications that work? And again, that comes back to making sure that whatever AI … you know, people when they build AI into their entire, you know, newsletter-writing system, when they build it into their content production, that they can feel confident that it’s safe and that it meets their needs and that they’re protected when they use it. And similarly, like, what are those applications that really work, and how do you empower those lead users while mitigating those harms and supporting civil society and mitigating those harms? I think that’s an incredible, like, that’s—as a researcher—that’s, you know, that’s a career, right.

BADANES: Yeah.

DAEPP: That’s a wonderful research space. And so I think understanding how to support AI that is safe, that enables people globally to have self-determination in how models represent them, and that is usable and powerful, I think that’s broadly …

BADANES: Where this goes.

DAEPP: … what I want to drive.

BADANES: Robert, how about you?

NESS: You know, so I mentioned earlier on these AI alignment issues.

BADANES: Yeah.

NESS: And I was really fascinated by how local and contextual those issues really are. So to give an example from Taiwan, we train these models on training data that we find from the internet. Well, when it comes to, say, Mandarin Chinese, you can imagine the proportion of content, of just the quantity of content, on the internet that comes from China is a lot more than the quantity that comes from Taiwan. And of course, what’s politically correct in China is different from what’s politically correct in Taiwan. And so when we were talking to Taiwanese, a lot of people had these concerns about, you know, having these large language models that reflected Taiwanese values. We heard the same thing in India about just people on different sides of the political spectrum and, kind of, looking at … a YouTuber in India had walked us through this … how, for example, a founding father of India, there was a disparate literature in favor of this person and some more critical of this person, and he had spent time trying to suss out whether GPT-4 was on one side or the other.

BADANES: Oh. Whose side are you on? [LAUGHS]

NESS: Right, and so I think for our alignment research at Microsoft Research, this becomes the beginning of, kind of, a very fruitful way of engaging with local stakeholders and making sure that we can reflect these concerns in the models that we develop and deploy.

BADANES: Yeah. Well, first, I just want to thank you guys for all the work you’ve done. This is amazing. We’ve really enjoyed partnering with you. I’ve loved learning about the research and the efforts, and I’m excited to see what you do next. I always want to end these kinds of conversations on a more positive note, because we’ve talked a lot about the weaponization of AI and, you know, how … ethical areas that are confusing and … but I am sure at some point in your work, you came across really positive use cases of AI when it comes to democracy, or at least I hope you have. [LAUGHS] Do you have any examples or can you leave us with something about where you see either it going or actively being used in a way to really strengthen democratic processes or systems?

DAEPP: Yeah, I mean, there is just a big paper in Science, right, which, as researchers, when something comes out in Science, you know your field is about to change, right, …

BADANES: Yeah.

DAEPP: … showing that an AI model in, like, political deliberations, small groups of UK residents talking about difficult topics like Brexit, you know, climate crisis, difficult topics, that in these conversations, an AI moderator created, like, consensus statements that represented the majority opinion, still showed the minority opinion, but that participants preferred to a human-written statement and in fact preferred to their original opinion.

BADANES: Wow.

DAEPP: And that this, you know, not only works in these randomized controlled trials but actually works in a real citizens deliberation. And so that potential of, like, carefully fine-tuned, like, carefully aligned AI to actually help people find points of agreement, that’s a really exciting space.

BADANES: So next time my kids are in a fight, I’m going to point them to Copilot and say, work with Copilot to mediate. [LAUGHS] No, that’s really, really interesting. Robert, how about you?

NESS: She, kind of, stole my example. [LAUGHTER] But I’ll take it from a different perspective. So, yes, like how these technologies can enable people to collaborate and ideally, I think, from a democratic standpoint, at a local level, right. So, I mean, I think so much of our politics were, kind of, focused at the national-level campaign, but our opportunity to collaborate is much more … we’re much more easily … we can collaborate much more easily with people who are in our local constituencies. And I think to myself about, kind of, like, the decline particularly of local newspapers, local media.

BADANES: Right.

NESS: And so I wonder, you know, can these technologies help address that problem in terms of just, kind of, information about, say, your local community, as well as local politicians. And, yeah, and to Madeleine’s point, so Madeleine started the conversation talking about her background in urban planning and some of the work she did, you know, working on a local level with local officials to bring technology to the level of cities. And I think, like, well, you know, politics are local, right. So, you know, I think that that’s where there’s a lot of opportunity for improvement.

BADANES: Well, Robert, you just queued up a topic for a whole other podcast because our team also does a lot of work around journalism, and I will say we have seen that AI at the local level with local news is really a powerful tool that we’re starting to see a lot of appetite and interest for in order to overcome some of the hurdles they face right now in that industry when it comes to capacity, financing, you know, not able to be in all of the places they want to be at once to make sure that they’re reporting equally across the community. This is, like, a perfect use case for AI, and we’re starting to see folks who are really using it. So maybe we’ll come back and do this again another time on that topic. But I just want to thank you both, Madeleine and Robert, for joining us today and sharing your insights. This was really a fascinating conversation. I know I learned a lot. I hope that our listeners learned a lot, as well.

[MUSIC]

And, listeners, I hope that you tune in for more episodes of Ideas, where we continue to explore the technologies shaping our future and the big ideas behind them. Thank you, guys, so much.

DAEPP: Thank you.

NESS: Thank you.

[MUSIC FADES] [1] The video generation model Sora was released publicly earlier this month.

[2] For a summary of and link to the report, see the Microsoft On the Issues blog post China tests US voter fault lines and ramps AI content to boost its geopolitical interests.

The post Ideas: AI and democracy with Madeleine Daepp and Robert Osazuwa Ness appeared first on Microsoft Research.

Research Focus: Week of December 16, 2024

December 18, 2024

by Brenda Potts Microsoft AI

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

The Compute Express Link (CXL) open standard interconnect enables integration of diverse types of memory into servers via its byte-addressable SerDes links. To fully utilize CXL-based heterogeneous memory systems (which combine different types of memory with varying access speeds), it’s necessary to implement efficient memory tiering—a strategy to manage data placement across memory tiers for optimal performance. Efficiently managing these memory systems is crucial, but has been challenging due to the lack of precise and efficient tools for understanding how memory is accessed.

In a recent paper: NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering researchers from Microsoft propose a novel solution which features a hardware/software co-design to address this problem. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf, which monitors memory accesses and provides the operating system (OS) with crucial page hotness statistics and other system state information. On the OS kernel side, the researchers designed a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on NeoProf statistics. Implemented on a real FPGA-based CXL memory platform and Linux kernel v6.3, NeoMem demonstrated 32% to 67% geomean speedup over several existing memory tiering solutions.

Read the paper

NEW RESEARCH

Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Planning and conducting chemical syntheses is a significant challenge in the discovery of functional small molecules, which limits the potential of generative AI for molecular inverse design. Although early machine learning-based retrosynthesis models have shown the ability to predict reasonable routes, they are less accurate for infrequent, yet important reactions.

In a recent paper: Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases, researchers from Microsoft and external colleagues address this limitation, with a new framework for building highly accurate reaction models. Chimera incorporates two newly developed models, each achieving state-of-the-art performance in their respective categories. Evaluations by PhD-level organic chemists show that Chimera’s predictions are preferred for their higher quality compared to baseline models.

The researchers further validate Chimera’s robustness by applying its largest-scale model to an internal dataset from a major pharmaceutical company, demonstrating its ability to generalize effectively under distribution shifts. This new framework shows the potential to substantially accelerate the development of even more accurate and versatile reaction prediction models.

Read the paper

NEW RESEARCH

The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution

In bioinformatics and computational biology, data analysis often involves chaining command-line programs developed by specialized teams at different institutions. These tools, which vary widely in age, software stacks, and dependencies, lack a common programming interface, which makes integration, workflow management and reproducibility challenging.

A recent article (opens in new tab) emphasizes the development, adoption and implementation of the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API, created in collaboration with researchers at Microsoft and other institutions. The TES API offers a unified schema and interface for submitting and managing tasks, seamlessly bridging gaps between on-premises high-performance and high-throughput computing systems, cloud platforms, and hybrid infrastructures. Its flexibility and extensibility have already made it a critical asset for applications ranging from federated data analysis to load balancing across multi-cloud systems.

Adopted by numerous service providers and integrated into several workflow engines, TES empowers researchers to execute complex computational tasks through a single, abstracted interface. This eliminates compatibility hurdles, accelerates research timelines, reduces costs and enables “compute to data” solutions—essential for tackling the challenges of distributed data analysis.

Read the paper

NEW RESEARCH

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Increasing use of code agents for AI-assisted coding and software development has brought safety and security concerns, such as generating or executing malicious code, which have become significant barriers to real-world deployment of these agents.

In a recent paper: RedCode: Risky Code Execution and Generation Benchmark for Code Agents, published at NeurIPS 2024, researchers from Microsoft and external colleagues propose comprehensive and practical evaluations on the safety of code agents. RedCode is an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests.

This research evaluated three agents based on various large language models (LLMs), providing insights into code agents’ vulnerabilities. For instance, results showed that agents are more likely to reject executing unsafe operations on the operating system. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additional evaluations revealed that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated harmful software.

These findings highlight the need for stringent safety evaluations for diverse code agents. The underlying dataset and related code are publicly available at https://github.com/AI-secure/RedCode (opens in new tab).

Read the paper

NEW RESEARCH

Towards industrial foundation models: Integrating large language models with industrial data intelligence

Although large language models (LLMs) excel at language-focused tasks like news writing, document summarization, customer service, and supporting virtual assistants, they can face challenges when it comes to learning and inference on numeric and structured industry data, such as tabular and time series data. To address these issues, researchers from Microsoft propose a new approach to building industrial foundation models (IFMs). As outlined in a recent blog post, they have successfully demonstrated the feasibility of cross-domain universal in-context learning on tabular data and the significant potential it could achieve.

The researchers designed Generative Tabular Learning (opens in new tab) (GTL), a new framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, flexibly responding to diverse data science tasks. This technical paradigm has been open-sourced (opens in new tab) to promote broader use.

Read the paper

Microsoft Research in the news

Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king

December 12, 2024

Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors.

Microsoft researcher Ece Kamar discusses the future of AI agents in 2025

Tech Brew | December 12, 2024

With AI agents widely expected to take off in 2025, the director of Microsoft’s AI Frontiers lab weighs in on the future of this technology, the safeguards needed, and the year ahead in AI research.

A new frontier awaits — computing with light

December 12, 2024

In the guts of a new type of computer, a bunch of tiny LEDs emit a green glow. Those lights have a job to do. They’re performing calculations. Right now, this math is telling the computer how to identify handwritten images of numbers. The computer is part of a research program at Microsoft.

View more news and awards

The post Research Focus: Week of December 16, 2024 appeared first on Microsoft Research.

NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

December 17, 2024

by Brenda Potts Microsoft AI

Illustrated headshots of Lidong Zhou and Eliza Strickland

The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits.

Just after his keynote at the 38th annual Conference on Neural Information Processing Systems (NeurIPS), Microsoft Corporate Vice President Lidong Zhou joins guest host Eliza Strickland of IEEE Spectrum at the conference to further explore the topic of his talk: the co-evolution of systems and AI. Zhou, who is also chief scientist of the Microsoft Asia-Pacific Research and Development Group and managing director of Microsoft Research Asia, discusses how rapidly advancing AI impacts the systems supporting it; AI as a tool for improving systems engineering itself; and how budding computer scientists can prepare for innovating in a world where AI and systems grow together.

Learn more: 

Verus: A Practical Foundation for Systems Verification
Publication, November 2024

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Publication, July 2024

BitNet: Scaling 1-bit Transformers for Large Language Models
Publication, October 2023

Transcript

[MUSIC]

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.

I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.

[MUSIC FADES]

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a Microsoft corporate vice president, chief scientist of the Microsoft Asia-Pacific Research and Development Group, and managing director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on the co-evolution of AI and systems engineering.

Lidong, welcome to the podcast.

LIDONG ZHOU: Thank you, Eliza. It’s such a pleasure to be here.

STRICKLAND: You said in your keynote that progress in AI is now outpacing progress in the systems supporting AI. Can you give me some concrete examples of where the current infrastructure is struggling to keep up?

ZHOU: Yeah. So actually, we have been working on supporting AI from the infrastructure perspective, and I can say, you know, there are at least three dimensions where it’s actually posing a lot of challenges. One dimension is that the scale of the AI systems that we have to support. You know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system, you actually have to rethink how to design a system, develop a new methodology, revisit all the assumptions. And it becomes very challenging for the community to keep up. And the other dimension is if you look at AI systems, it’s actually a whole-stack kind of design. You have to understand not only the AI workloads, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are all aligned to deliver the best performance. And the third dimension is the temporal dimension, where you really see accelerated growth and the pace of innovation in AI and not actually only in AI but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate on the systems side because we really have to keep up in that dimension, as well. So all those three dimensions add up. It’s becoming a pretty challenging task for the whole systems community.

STRICKLAND: I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice, and how might it change the way we approach both fields?

ZHOU: Yeah, so I’m actually a big fan of systems community and AI community work together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI. But now increasingly, we’re seeing opportunities where AI can actually help developers to become more productive and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want to see the two communities work together even more closely going forward. You know, I talk about, sort of, the three pillars, right—the efficiency; there’s trust; there’s also the infusion of the two (AI and systems engineering)—that are three ambitions that we are actually working on. And we see very encouraging early results that makes us believe that there’s much more to be achieved going forward with the two communities working together.

STRICKLAND: You mentioned the challenging of scaling. I think everyone at NeurIPS is talking about scaling. And you’ve highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or new ideas in systems engineering could help AI achieve greater efficiencies?

ZHOU: Yeah, that’s another great question. I think there are a couple of aspects to efficiency. So this morning, I talked about some of the innovations in model architecture. So our researchers have been looking into BitNet, which is essentially try to use one bit or, actually, using a ternary representation for the weights in all those AI models rather than using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking about the software and even the hardware stack so that, you know, those innovations that you have in the model architecture can actually have the end-to-end benefits. And that’s, you know, one of the dimensions where we see the co–innovation of AI and underlying system to deliver some efficiency gains for AI models, for example. But there’s another dimension, which I think is also very important. With all the AI infrastructure that we build to support AI, there’s actually a huge room for improvement, as well. And this is where AI can actually be utilized to solve some of the very challenging systems problems, for optimization, for reliability, for trustworthiness. And I use some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward.

STRICKLAND: Yeah. It’s interesting to think about how GPUs and large language models are so intertwined at this point. You can’t really have one without the other. And you said in your talk you sort of see the need to decouple the architectures and the hardware. Is that right?

ZHOU: Yes. Yeah, so this is always, you know, like very system type of thinking where, you know, you really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities, you know, larger design space, for each field. And what we are observing now, which is actually very typical in relatively mature fields, where we have GPUs that are dominating in the hardware land and all the model architecture has to be designed and, you know, proving very efficient on GPUs. And that limits the design space for model architecture. And similarly, you know, if you look at hardware, it’s very hard for hardware innovations to happen because now you have to show that those hardwares are actually great for all the models that have been actually optimized for GPUs. So I think, you know, from a systems perspective, it’s actually possible if you design the right abstraction between the AI and the hardware, it’s possible for this two domains to actually evolve separately and have a much larger design space, actually, to find the best solution for both.

STRICKLAND: And when you think about systems engineering, are there ways that AI can be used to optimize your own work?

ZHOU: Yes, I think there are. Two examples that I gave this morning, one is, you know, in systems there’s this what we call a holy grail of system research because we want to build trustworthy systems that people can depend on. And one of the approach is called verified systems. And this has been a very active research area in systems because there are a lot of advancements in formal methods in how we can infuse the formal method into building real systems. But it’s still very hard for the general system community because, you know, you really have to understand how formal methods works and so on. And so it’s still not within reach. You know, like when we build mission-critical systems, we want to be completely verified so, you know, you don’t have to do a lot of testing to show that there are no bugs. You’ll never be able to show there’s no bugs with testing. But if you …

STRICKLAND: Sorry, can I pause you for one moment? Could you define formal verification for our listeners, just in case they don’t know?

ZHOU: Yeah, that’s a good point. I think the easy way to think about this is formal verification, it uses mathematical logic to describe, say, a program and, you know, it can represent some properties in math, essentially, in logic. And then you can use a proof to show that the program has certain properties that you desire, and a simple form, like, a very preliminary form of formal (specification for) verification is, you know, just assertions in the program, right, where it, say, asserts A is not equal to zero. And that’s a very simple form of logic that must hold (or be proven to hold), and then, you know, the proof system is also much more complicated to talk about more advanced properties of programs, their correctness, and so on.

STRICKLAND: Mm-hm.

ZHOU: So I think that the opportunity that we’re seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems, at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we’re going to see a paradigm shift in systems where some pieces of system components will actually be implemented using AI. [What] is interesting is, you know, system is generally deterministic because, so, you know, when you look at the traditional computer system, you want to know that it’s actually acting as you expected, but AI, you know, it can be stochastic, right. And it might not always give you the same answer. But how you combine these two is another area where I see a lot of opportunities for breakthroughs.

STRICKLAND: Yeah, yeah. I wanted to back up in your career a little bit and talk about the concept of gray failures because you were really instrumental in defining this concept, which for people who don’t know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you’re still thinking about gray failures in the context of your thinking about AI and systems. Are gray failures having an impact on AI today?

ZHOU: Yes, definitely. So when we were looking at cloud systems, we realized the … so in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of methodology we develop in systems for reliability actually no longer applies. One of the reasons is we have those gray failures. And then we moved to looking at AI infrastructure. The problem is actually even worse because what we realize is there’s a lot of built-in redundancy at every level, like in GPUs, memory, or all the communication channels. And because of those built-in redundancies, sometimes the system is experience failures, but they’re being masked because of the redundancies. And that makes it very hard for us to actually maintain the system, debug the system, or to troubleshooting. And for AI infrastructure, what we have developed is a very different approach using proactive validation rather than reactive repair. And this is actually a paper that we wrote recently in USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply in a new meaning.

STRICKLAND: Mm. I like that. Yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about AI-infused systems. And what would that look like?

ZHOU: Yeah, so I think AI is going to transform almost everything, and that includes systems. That’s why I’m so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and, sort of, the combination of AI and that particular domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity where there are a lot of hard system challenges can be addressed by AI. But we need to define the right interface between the system and the AI so that we can leverage the advantage of both, right. Like, AI is creative. It comes up with solutions that, you know, people might not think of, but it’s also a little bit random sometimes. It could, you know, give you wrong answers. But systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds.

STRICKLAND: Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that?

ZHOU: Yeah. This is a pretty interesting project that is actually done in Microsoft Research Asia jointly with the Azure team where we look at collective communication, which is a major part of AI infrastructure. And it turns out, you know, there’s a lot of room for optimization. It was initially done manually. So an expert had to take a look at the system and look at the different configurations and do all kinds of experiments, and, you know, it takes about two weeks to come up with a solution. This is why I say, you know, the productivity is becoming a bottleneck for our AI infrastructure because people are in the loop who have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions. It can actually develop good system insights based on the observations from the system. And so OptiFlow, what it does is it comes up with the, sort of, the algorithm or the schedule of communications for different collective communication primitives. And it turns out to be able to discover algorithms that’s much better than the default one or, you know, for different settings. And it’s giving us the benefits of the productivity; also, efficiency.

STRICKLAND: And you said that this is in production today, right?

ZHOU: Yes. It is in production.

STRICKLAND: That’s exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists?

ZHOU: Yeah, that’s a very deep question. As I mentioned, I think being fluent in AI is very important. But I also believe that domain expertise is probably undervalued in many ways. And I see a lot of needs for this interdisciplinary kind of education where someone who not only understands AI and what AI technology can do but also understands a particular domain very well. And those are the people who will be able to figure out the future for that particular domain with the power AI. And I think for students, certainly it’s no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together, and so you have to be an expert in multiple domains to see new opportunities for innovations.

STRICKLAND: So what advice would you give to a high school student who’s just starting out and thinks, ah, I want to get into AI?

ZHOU: Yeah, I mean certainly there’s a lot of excitement over AI, and it would be great for high school students to, actually, to have the firsthand experience. And I think it’s their world in the future. Because they probably can imagine a lot of things from scratch. I think they probably have the opportunity to disrupt a lot of the things that we take for granted today. So I think just use their imagination. And I don’t think we have really good advice for the young generation. It’s going to be their creativity and their imagination. And AI is definitely going to empower them to do something that’s going to be amazing.

STRICKLAND: Something that we probably can’t even imagine.

ZHOU: Right.

STRICKLAND: Yeah.

ZHOU: I think so.

STRICKLAND: I like that. So as we close, I’m hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together, but also if you have any concerns, what concerns you most?

ZHOU: Yeah, I think in terms of AI systems, I’m certainly pretty excited about what we can do together, you know, with a combination of AI and systems. There are a lot of low-hanging fruit, and there are also a lot of potential grand challenges that we can actually take on. I mentioned a couple in this morning’s talk. And certainly, you know, we also want to look at the risks that could happen, especially when we have systems and AI start to evolve together. And this is also in an area where having some sort of trust foundation is very important so we can have some assurance of the kind of system or AI system that we are going to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences or unintended issues.

[MUSIC]

STRICKLAND: Well, Lidong Zhou, thank you so much for joining us on the podcast. I really enjoyed the conversation.

ZHOU: It’s such a pleasure, Eliza.

STRICKLAND: And to our listeners, thanks for tuning. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at Microsoft.com/research. Until next time.

[MUSIC FADES]

The post NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou appeared first on Microsoft Research.

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

December 17, 2024

by Brenda Potts Microsoft AI

A diagram illustrating the joint optimization process of instructions and in-context examples in PromptWizard. The figure demonstrates how the framework iteratively refines both components, integrating feedback to enhance the overall prompt effectiveness and adaptability across tasks.

The challenge of effective prompting

AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error.

This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks?

Download

PromptWizard

To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase (opens in new tab) to foster collaboration and innovation within the research and development community.

Introducing PromptWizard

PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.

PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance.

Three key insights behind PromptWizard:

Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples.
Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively.
Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.

Fig 1: A diagram providing an overview of the PromptWizard process. It illustrates the main components, including iterative prompt generation, feedback-based refinement, and joint optimization of instructions and examples. The workflow emphasizes modularity and adaptability, demonstrating how PromptWizard evolves prompts to improve performance across diverse tasks. — Figure 1. Overview of PromptWizard

How PromptWizard works

PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.

Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses.

Stage 1: Refinement of prompt instruction

The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.

For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this iterative cycle ensures that the instruction converges to an optimal state.

Fig 2: A visualization of the refinement process for prompt instructions in PromptWizard. The figure highlights iterative improvements, where initial instructions are critiqued, adjusted based on feedback, and fine-tuned to achieve greater accuracy and alignment with task objectives. — Figure 2. Refinement of prompt instruction

Stage 2: Joint optimization of instructions and examples

The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.

This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content.

Fig 3: A diagram illustrating the joint optimization process of instructions and in-context examples in PromptWizard. The figure demonstrates how the framework iteratively refines both components, integrating feedback to enhance the overall prompt effectiveness and adaptability across tasks. — Figure 3. Joint optimization of instructions and examples

Results

PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency.

Comprehensive evaluation across tasks

PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see detailed results in our paper.

Accuracy: PW consistently outperformed other methods, maintaining performance close to the best across all tasks. Figure 4 shows the performance profile curve that highlights PW’s reliability, demonstrating how frequently it achieves near-best accuracy compared to other approaches for BigBench Instruction Induction dataset (BBII).
Efficiency: Beyond accuracy, PW demonstrates its computational efficiency. Unlike many baseline methods that require extensive API calls and computational resources, PW achieves superior results with minimal overhead by striking an effective balance between exploration and exploitation. Table 1 demonstrates PW’s cost-effectiveness, with significantly reduced token usage for input and output while optimizing prompts effectively.

Figure 4. Performance Profile curve on BBII dataset

Methods	API calls	Total tokens
Instinct	1730	115k
PromptBreeder	18600	1488k
EvoPrompt	5000	400k
PW	69	24k

Table 1. Cost analysis on BBII dataset

We also have conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.

Resilience with limited data

Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2).

Datasets	5 Examples	25 Examples
MMLU	80.4	89.5
GSM8k	94	95.4
Ethos	86.4	89.4
PubMedQA	68	78.2
MedQA	80.4	82.9
Average	81.9	87

Table 2. PW’s performance with varying number of examples

Leveraging smaller models for optimization

PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).

Dataset	Prompt Gen: Llama-70B	Prompt Gen: GPT4
GSM8k	94.6	95.4
Ethos	89.2	89.4
Average	91.9	92.4

Table 3. Performance with smaller LLMs for prompt generation

PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.

Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.

The post PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts appeared first on Microsoft Research.

Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users

December 16, 2024

by Brenda Potts Microsoft AI

GraphRAG blog hero - cluster of small circular nodes on a blue/green gradient background

Introducing GraphRAG 1.0

Microsoft debuted (opens in new tab) the pre-release version of GraphRAG (opens in new tab) in July 2024 to advance AI use in complex domains. Since that time, we’ve seen incredible adoption and community engagement (over 20k stars and 2k forks on GitHub as of this writing), with numerous fixes and improvements by the core team and community contributors. We’re deeply grateful for the contributions and feedback we’ve received and are excited to share a number of major ergonomic and structural improvements that culminate in the official release of GraphRAG 1.0.

Ergonomic refactors

Easier setup for new projects

When we first launched GraphRAG, most config was done using environment variables, which could be daunting, given the many options available. We’ve reduced the friction on setup by adding an init command (opens in new tab) that generates a simplified starter settings.yml file with all core required config already set. We recommend developers start here to ensure they get the clearest initial config. With this update, a minimal starting config does not require the user to have expertise with GraphRAG for a quick setup, only an OpenAI API key in their environment.

New and expanded command line interface

We expanded the functionality and ease of use of the command line interface (opens in new tab) (CLI) and adopted Typer (opens in new tab) to provide better inline documentation and a richer CLI experience. The original CLI was intended as a starter demo for users to try GraphRAG on a sample dataset. We’ve since learned from the community that most people actually want to use this as their primary interaction mode for GraphRAG, so as part of this milestone release, we’ve incorporated enhancements that result in a more streamlined experience. From this work, CLI startup times dropped from an average of 148 seconds to 2 seconds.

Consolidated API layer

In August 2024 we introduced a standalone API layer to simplify developer usage. The original CLI contained all the code required to instantiate and execute basic indexing and query commands, which users often needed to replicate. The API layer is still considered provisional as we gather feedback, but is intended to be the primary entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep pipeline or query class customization. In fact, the CLI and Accelerator (opens in new tab) are built entirely on top of the API layer, acting as a documented example of how to interact with the API. We have also added examples of how to use this API to our notebook collection (opens in new tab) that we will continue to update as we iterate in future releases.

Simplified data model

GraphRAG creates several output artifacts to store the indexed knowledge model. The initial model contained a large number of files, fields, and cross-references based on experimental ideas during the early research, which can be overwhelming for both new and routine users. We performed a comprehensive review of the data model and incorporated fixes to add clarity and consistency, remove redundant or unused fields, improve storage space, and simplify the data models. Previously, the output lacked standardization, and relevant outputs could easily be confused with non-critical intermediary output files. Now with GraphRAG 1.0, the output will only include relevant outputs that are easily readable and traceable.

Streamlined vector stores

Embeddings and their vector stores are some of the primary drivers of GraphRAG’s storage needs. Our original data model stored all embeddings within the parquet output files after data ingestion and indexing. This made the files portable, which was convenient for early research, but for many users it became unnecessary as they configured their own vector stores and the scale of data ingestion grew. We have updated the GraphRAG pipeline to create a default vector store during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use. The benefit of this change is that those vectors (which can be quite large) no longer need to be loaded when the output files are read from disk, saving read time and memory during every query. Coupled with the simplified data model, this resulted in output parquet disk savings of 80%, and total disk space (including embeddings in the vector store) reduction of 43%. GraphRAG supports LanceDB and Azure AI Search out-of-the-box for vector stores. For simple startup, LanceDB is used as the default, and is written to a local database alongside the knowledge model artifacts.

Flatter, clearer code structure

A key initiative on the road to version 1.0 has been to simplify the codebase so it is easier to maintain and more approachable for third-party users. We’ve removed much of the code depth from the organization to make it easier to browse, and co-located more code that our own usage patterns indicate was not required to be in separate functional areas.

We have also found that very few users need the declarative configuration that the underlying DataShaper (opens in new tab) engine provides, so we collapsed these 88 verbose workflow definitions into a smaller set of 11 workflows that operate in a functional versus composed manner. This makes the pipeline easier to understand and is a step toward an architecture that is better suited for our future research plans and improves performance across the board. By collapsing workflows, we now have fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. This streamlining has also reduced the in-memory footprint of the pipeline, enabling users to index and analyze larger datasets with GraphRAG.

Incremental ingest

Until now, an evolving dataset needed complete re-indexing every time new information was acquired in order to re-generate the knowledge model. In GraphRAG 1.0 we are including a new update command in the CLI that computes the deltas between an existing index and newly added content and intelligently merges the updates to minimize re-indexing. GraphRAG uses an LLM caching mechanism to save as much cost as possible when re-indexing, so re-runs over a dataset are often significantly faster and cheaper than an initial run. Adding brand new content can alter the community structure such that much of an index needs to be re-computed – the update command (opens in new tab) resolves this while also improving answer quality.

Availability

GraphRAG version 1.0 is now available on GitHub (opens in new tab), and published to PyPI (opens in new tab). Check out the Getting Started (opens in new tab) guide to use GraphRAG 1.0 today. today.

Migrating

We recommend users migrate to GraphRAG 1.0, which offers a streamlined experience including multiple improvements for both users and developers. However, because of the breadth of its updates, version 1.0 is not backwards compatible. If you’ve used GraphRAG prior to version 1.0 and have existing indexes, there are a handful of breaking changes that need to be addressed, but this should be a straightforward process. To support the community in this migration, we’ve created a migration guide (opens in new tab) in the repository with more information.

Future directions

We recently posted about a brand-new approach to GraphRAG called LazyGraphRAG, which performs minimal up-front indexing to avoid LLM usage until user queries are executed. This avoids LLM-based summarization of large volumes of content that may not be interesting to users – and therefore never explored even after expensive processing. This approach shows strong performance at a fraction of the cost of GraphRAG, and will be added to the core GraphRAG codebase in the near future as a new option for users.

Additionally, Microsoft has been active in exploring how GraphRAG can advance the rate of scientific progress, and is in the process of building relevant GraphRAG capabilities to align with our broader work in AI-enabled scientific discovery (opens in new tab).

We continue to refine the codebase and investigate architectural changes that will enable users to use their own language model APIs, storage providers, and vector stores. We’re excited about this major milestone, and the foundation that this refactoring lays for our continued research in the GraphRAG space.

The post Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users appeared first on Microsoft Research.

AI meets materials discovery

NEW RESEARCH

Communication Efficient Secure and Private Multi-Party Deep Learning

NEW RESEARCH

JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

NEW RESEARCH

Convergence to Equilibrium of No-regret Dynamics in Congestion Games

NEW RESEARCH

RD-Agent: An open-source solution for smarter R&D

MedFuzz: Exploring the robustness of LLMs on medical challenge problems

Microsoft Research | In case you missed it

Subscribe to the Microsoft Research Podcast:

Transcript

A novel diffusion architecture

Outperforming screening

Microsoft research copilot experience

Handling compositional disorder

Experimental lab verification

AI emulator and generator flywheel

Making MatterGen available

Looking ahead

Acknowledgement

Collaborators: Silica in space with Richard Black and Dexter Greene

New AutoGen framework

Developer tools

Migrating to AutoGen v0.4

Looking forward

Acknowledgments

Agent-cloud interface (ACI)

Example shows how to onboard an agent to AIOpsLab

Service

Workload generator

Fault generator

Microsoft Research Newsletter

Observability

Example of ACI on diagnosis task

Next steps

Acknowledgements

Subscribe to the Microsoft Research Podcast:

Transcript

NEW RESEARCH

NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

NEW RESEARCH

Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Microsoft research copilot experience

NEW RESEARCH

The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution

NEW RESEARCH

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

NEW RESEARCH

Towards industrial foundation models: Integrating large language models with industrial data intelligence

Microsoft Research in the news

Subscribe to the Microsoft Research Podcast:

Transcript

The challenge of effective prompting

Introducing PromptWizard

How PromptWizard works

Abstracts: August 15, 2024

Results

Comprehensive evaluation across tasks

Resilience with limited data

Leveraging smaller models for optimization

Introducing GraphRAG 1.0

Ergonomic refactors

Easier setup for new projects

New and expanded command line interface

Consolidated API layer

Simplified data model

Abstracts: August 15, 2024

Streamlined vector stores

Flatter, clearer code structure

Incremental ingest

Availability

Migrating

Future directions

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.