Abstracts: November 5, 2024

Outlined illustrations of Chris Hawblitzel and Jay Lorch for the Microsoft Research Podcast, Abstracts series.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Microsoft senior principal researchers Chris Hawblitzel and Jay Lorch join host Amber Tingle to discuss “Verus: A Practical Foundation for Systems Verification,” which received the Distinguished Artifact Award at this year’s Symposium on Operating Systems Principles, or SOSP. In their research, Hawblitzel, Lorch, and their coauthors leverage advances in programming languages and formal verification with two aims. The first aim is to help make software verification more accessible for systems developers so they can demonstrate their code will behave as intended. The second aim is to provide the research community with sound groundwork to tackle the application of formal verification to large, complex systems. 

Transcript 

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Our guests today are Chris Hawblitzel and Jay Lorch. They are both senior principal researchers at Microsoft and two of the coauthors on a paper called “Verus: A Practical Foundation for Systems Verification.” This work received the Distinguished Artifact Award at the 30th Symposium on Operating Systems Principles, also known as SOSP, which is happening right now in Austin, Texas. Chris and Jay, thank you for joining us today for Abstracts and congratulations!

JAY LORCH: Thank you for having us. 

CHRIS HAWBLITZEL: Glad to be here. 

TINGLE: Chris, let’s start with an overview. What problem does this research address, and why is Verus something that the broader research community should know about? 


HAWBLITZEL: So what we’re trying to address is a very simple problem where we’re trying to help developers write software that doesn’t have bugs in it. And we’re trying to provide a tool with Verus that will help developers show that their code actually behaves the way it’s supposed to; it obeys some sort of specification for what the program is supposed to do. 

TINGLE: How does this publication build on or differ from other research in this field, including your previous Verus-related work? 

HAWBLITZEL: So formal verification is a process where you write down what it is that you want your program to do in mathematical terms. So if you’re writing an algorithm to sort a list, for example, you might say that the output of this algorithm should be a new list that is a rearrangement of the elements of the old list, but now this rearrangement should be in sorted order. So you can write that down using standard mathematics. And now given that mathematical specification, the challenge is to prove that your piece of software written in a particular language, like Java or C# or Rust, actually generates an output that meets that mathematical specification. So this idea of using verification to prove that your software obeys some sort of specification, this has been around for a long time, so, you know, even Alan Turing talked about ways of doing this many, many decades ago. The challenge has always been that it’s really hard to develop these proofs for any large piece of software. It simply takes a long time for a human being to write down a proof of correctness of their software. And so what we’re trying to do is to build on earlier work in verification and recent developments in programming languages to try to make this as easy as possible and to try to make it as accessible to ordinary software developers as possible. So we’ve been using existing tools. There are automated theorem provers—one of them from Microsoft Research called Z3—where you give it a mathematical formula and ask it to prove that the formula is valid. We’re building on that. And we’re also taking a lot of inspiration from tools developed at Microsoft Research and elsewhere, like Dafny and F* and so on, that we’ve used in the past for our previous verification projects. And we’re trying to take ideas from those and make them accessible to developers who are using common programming languages. In this case, the Rust programming language is what we’re focusing on. 
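
To make the sorting specification Hawblitzel describes concrete, the two clauses can be written down as a simple check; this is an illustrative sketch, not text or code from the paper, and it also highlights the contrast with verification drawn later in the conversation:

```python
from collections import Counter

def meets_sort_spec(inp, out):
    """Check the two clauses of the specification described above:
    (1) the output is in sorted (non-decreasing) order, and
    (2) the output is a rearrangement of the input (same elements, same counts).

    A verifier such as Verus proves these properties for all possible inputs
    before the program ever runs; a runtime check like this one can only test
    them for a single input at a time.
    """
    is_sorted = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    is_rearrangement = Counter(out) == Counter(inp)
    return is_sorted and is_rearrangement
```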

TINGLE: Jay, could you describe your methodology for us and maybe share a bit about how you and your coauthors tested the robustness of Verus.

LORCH: So the question we really want to answer is, is Verus suitable for systems programming? So that means a variety of things. Is it amenable to a variety of kinds of software that you want to build as part of a system? Is it usable by developers? Can they produce compact proofs? And can they get timely feedback about those proofs? Can the verifier tell you quickly that your proof is correct or, if it’s wrong, that it’s wrong and guide you to fix it? So the main two methodological techniques we used were millibenchmarks and full systems. So the millibenchmarks are small pieces of programs that have been verified by other tools in the past, and we built them in Verus and compared to what other tools would do to find whether we could improve usability. And we found generally that we could verify the same things but with more compact proofs and proofs that would give much snappier feedback. The difference between one second and 10 seconds might not seem a lot, but when you’re writing code and working with the verifier, it’s much nicer to get immediate feedback about what is wrong with your proof so you can say, oh, what about this? And it can say, oh, well, I still see a problem there. And you could say, OK, let me fix that. As opposed to waiting 10, 20 seconds between each such query to the verifier. So the millibenchmarks helped us evaluate that. And the macrobenchmarks, the building entire systems, we built a couple of distributed systems that had been verified before—a key value store and a node replication system—to show that you could do them more effectively and with less verification time. We also built some new systems, a verified OS page table, a memory allocator, and a persistent memory append-only log. 

TINGLE: Chris, the paper mentions that successfully verifying system software has required—you actually use the word heroic to describe the developer effort. Thinking of those heroes in the developer community and perhaps others, what real-world impact do you expect Verus to have? What kind of gains are we talking about here? 

HAWBLITZEL: Yeah, so I think, you know, traditionally verification or this formal software verification that we’re doing has been considered a little bit of a pie-in-the-sky research agenda. Something that people have applied to small research problems but has not necessarily had a real-world impact before. And so I think it’s just, you know, recently, in the last 10 or 15 years, that we started to see a change in this and started to see verified software actually deployed in practice. So on one of our previous projects, we worked on verifying the cryptographic primitives that people use when, say, they browse the web or something and their data is encrypted. So in these cryptographic primitives, there’s a very clear specification for exactly what bytes you’re supposed to produce when you encrypt some data. And the challenge is just writing software that actually performs those operations and does so efficiently. So in one of our previous projects that we worked on called HACL* and EverCrypt, we verified some of the most commonly used and efficient cryptographic primitives for things like encryption and hashing and so on. And these are things that are actually used on a day-to-day basis. So we, kind of, took from that experience that the tools that we’re building are getting ready for prime time here. We can actually verify software that is security critical, reliability critical, and is in use. So some of the things that Jay just mentioned, like verifying, you know, persistent memory storage systems and so on, those are the things that we’re looking at next for software that would really benefit from reliability and where we can formally prove that your data that’s written to disk is read correctly back from disk and not lost during a crash, for example. So that’s the kind of software that we’re looking to verify to try to have a real-world impact. 

LORCH: The way I see the real-world impact, is it going to enable Microsoft to deal with a couple of challenges that are severe and increasing in scale? So the first challenge is attackers, and the second challenge is the vast scale at which we operate. There’s a lot of hackers out there with a lot of resources that are trying to get through our defenses, and every bug that we have offers them purchase, and techniques like this, that can get rid of bugs, allow us to deal with that increasing attacker capability. The other challenge we have is scale. We have billions of customers. We have vast amounts of data and compute power. And when you have a bug that you’ve thoroughly tested but then you run it on millions of computers over decades, those rare bugs eventually crop up. So they become a problem, and traditional testing has a lot of difficulty finding those. And this technology, which enables us to reason about the infinite possibilities in a finite amount of time and observe all possible ways that the system can go wrong and make sure that it can deal with them, that enables us to deal with the vast scale that Microsoft operates on today.

HAWBLITZEL: Yeah, and I think this is an important point that differentiates us from testing. Traditionally, you find a bug when you see that bug happen in running software. With formal verification, we’re catching the bugs before you run the software at all. We’re trying to prove that on all possible inputs, on all possible executions of the software, these bugs will not happen, and it’s much cheaper to fix bugs before you’ve deployed the software that has bugs, before attackers have tried to exploit those bugs. 

TINGLE: So, Jay, ideally, what would you like our listeners and your fellow SOSP conference attendees to tell their colleagues about Verus? What’s the key takeaway here? 

LORCH: I think the key takeaway is that it is possible now to build software without bugs, to build systems code that is going to obey its specification on all possible inputs always. We have that technology. And this is possible now because a lot of technology has advanced to the point where we can use it. So for one thing, there’s advances in programming languages. People are moving from C to Rust. They’ve discovered that you can get the high performance that you want for systems code without having to sacrifice the ability to reason about ownership and lifetimes, concurrency. The other thing that we build on is advances in computer-aided theorem proving. So we can really make compact and quick-to-verify mathematical descriptions of all possible behaviors of a program and get fast answers that allow us to rapidly turn around proof challenges from developers. 

TINGLE: Well, finally, Chris, what are some of the open questions or future opportunities for formal software verification research, and what might you and your collaborators tackle next? I heard a few of the things earlier. 

HAWBLITZEL: Yes, I think despite, you know, the effort that we and many other researchers have put into trying to make these tools more accessible, trying to make them easier to use, there still is a lot of work to prove a piece of software correct, even with advanced state-of-the-art tools. And so we’re still going to keep trying to push to make that easier. Trying to figure out how to automate the process better. There’s a lot of interest right now in artificial intelligence for trying to help with this, especially if you think about artificial intelligence actually writing software. You ask it to write a piece of software to do a particular task, and it generates some C code or some Rust code or some Java code, and then you hope that that’s correct because it could have generated any sort of code that performs the right thing or does total nonsense. So it would be really great going forward if when we ask AI to develop software, we also expect it to create a proof that the software is correct and does what the user asked for. We’ve started working on some projects, and we found that the AI is not quite there yet for realistic code. It can do small examples this way. But I think this is still a very large challenge going forward that could have a large payoff in the future if we can get AI to develop software and prove that the software is correct. 

LORCH: Yeah, I see there’s a lot of synergy between—potential synergy—between AI and verification. Artificial intelligence can solve one of the key challenges of verification, namely making it easy for developers to write that code. And verification can solve one of the key challenges of AI, which is hallucinations, synthesizing code that is not correct, and Verus can verify that that code actually is correct. 

TINGLE: Well, Chris Hawblitzel and Jay Lorch, thank you so much for joining us today on the Microsoft Research Podcast to discuss your work on Verus. 

[MUSIC] 

HAWBLITZEL: Thanks for having us. 

LORCH: Thank you. 

TINGLE: And to our listeners, we appreciate you, too. If you’d like to learn more about Verus, you’ll find a link to the paper at aka.ms/abstracts or you can read it on the SOSP website. Thanks for tuning in. I’m Amber Tingle, and we hope you’ll join us again for Abstracts.

[MUSIC FADES] 

Abstracts: November 4, 2024

Outlined illustrations of Shan Lu and Bogdan Stoica for the Microsoft Research Podcast.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Research Manager Shan Lu and Bogdan Stoica, a PhD candidate at the University of Chicago, join host Gretchen Huizinga to discuss “If At First You Don’t Succeed, Try, Try, Again … ? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” In the paper, which was accepted at this year’s Symposium on Operating Systems Principles, or SOSP, Lu, Stoica, and their coauthors examine typical retry issues and present techniques that leverage traditional program analysis and large language models to help detect them.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Shan Lu, a senior principal research manager at Microsoft Research, and Bogdan Stoica, also known as Bo, a doctoral candidate in computer science at the University of Chicago. Shan and Bogdan are coauthors of a paper called “If at First You Don’t Succeed, Try, Try, Again …? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” And this paper was presented at this year’s Symposium on Operating Systems Principles, or SOSP. Shan and Bo, thanks for joining us on Abstracts today!


SHAN LU: Thank you.

BOGDAN STOICA: Thanks for having us.

HUIZINGA: Shan, let’s kick things off with you. Give us a brief overview of your paper. What problem or issue does it address, and why should we care about it?

LU: Yeah, so basically from the title, we are looking at retry bugs in software systems. So what retry means is that people may not realize for big software like the ones that run in Microsoft, all kinds of unexpected failures—software failure, hardware failure—may happen. So just to make our software system robust, there’s often a retry mechanism built in. So if something unexpected happens, a task, a request, a job will be re-executed. And what this paper talks about is, it’s actually very difficult to implement this retry mechanism correctly. So in this paper, we do a study to understand what are typical retry problems and we offer a solution to detecting these problems.
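
A minimal retry wrapper of the kind Lu describes might look like the following sketch (a generic illustration, not code from the systems studied in the paper); even in this simple form, the hard decisions are already visible: which errors to retry, how long to wait, and when to give up.

```python
import random
import time

class TransientError(Exception):
    """Stands in for an unexpected, recoverable failure (e.g., a brief network outage)."""

def run_with_retry(task, max_attempts=3, base_delay=0.5):
    """Re-execute `task` when it fails transiently, up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure to the caller
            # Exponential backoff with jitter to avoid synchronized retry storms.
            time.sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```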

HUIZINGA: Bo, this clearly isn’t a new problem. What research does your paper build on, and how does your research challenge or add to it?

STOICA: Right, so retry is a well-known mechanism and is widely used. And retry bugs, in particular, have been identified in other papers as root causes for all sorts of failures but never have been studied as a standalone class of bugs. And what I mean by that, nobody looked into, why is it so difficult to implement retry? What are the symptoms that occur when you don’t implement retry correctly? What are the causes of why developers struggle to implement retry correctly? We built on a few key bug-finding ideas that have been looked at by other papers but never in this context. We use fault injection. We repurpose existing unit tests to trigger this type of bugs as opposed to asking developers to write specialized tests to trigger retry bugs. So we’re, kind of, making the developer’s job easier in a sense. And in this pipeline, we also rely on large language models to augment the program and the code analysis that goes behind the fault injection and the reutilization of existing tests.

HUIZINGA: Have large language models not been utilized much in this arena?

LU: I want to say that, you know, actually this work was started about two years ago. And at that time, large language model was really in its infancy and people just started exploring what large language model can help us in terms of improving software reliability. And our group, and together with, you know, actually same set of authors from Microsoft Research, we actually did some of the first things in a workshop paper just to see what kind of things that we were able to do before like, you know, finding bugs can now be replicated by using large language model.

HUIZINGA: OK …

LU: But at that time, we were not very happy because, you know, just use large language model to do something people were able to do using traditional program analysis, I mean, it seems cool, right, but does not add new functionality. So I would say what is new, at least when we started this project, is we were really thinking, hey, are there anything, right, are there some program analysis, are there some bug finding that we were not able to do using traditional program analysis but actually can be enabled by large language model.

HUIZINGA: Gotcha …

LU: And so that was at, you know, what I feel like was novel at least, you know, when we worked on this. But of course, you know, large language model is a field that is moving so fast. People are, you know, finding new ways to using it every day. So yeah.

HUIZINGA: Right. Well, in your paper, you say that retry functionality is commonly undertested and thus prone to problems slipping into production. Why would it be undertested if it’s such a problem?

STOICA: So testing retry is difficult because what you need is to simulate the systemwide conditions that lead to retry. That often means simulating external transient errors that might happen on the system that runs your application. And to do this during testing and capture this in a small unit test is difficult.

LU: I think, actually, Bogdan said this very well. It’s like, why do we need a retry? It’s, like, when unexpected failure happen, right. And this is, like, something like Bogdan mentioned, like external transient error such as my network card suddenly does not work, right. And this may occur, you know, only for, say, one second and then it goes back on. But this one second may cause some job to fail and need retry. So during normal testing, these kind of unexpected things rarely, rarely happen, if at all, and it’s also difficult to simulate. That’s why it’s just not well tested.

HUIZINGA: Well, Shan, let’s talk about methodology. Talk a bit about how you tackled this work and why you chose the approach you did for this particular problem.

LU: Yeah, so I think this work includes two parts. One is a systematic study. We study several big open-source systems to see whether there are retry-related problems in this real system. Of course there are. And then we did a very systematic categorization to understand the common characteristics. And the second part is about, you know, detecting. And in terms of method, we have used, particularly in the detecting part, we actually used a hybrid of techniques of traditional static program analysis. We used this large language model-enabled program analysis. In this case, imagine we just asked a large language model saying, hey, tell us, are there any retry implemented in this code? If there is, where it is, right. And then we also use, as Bogdan mentioned, we repurposed unit test to help us to execute, you know, the part of code that large language model tell us there may be a retry. And addition to that, we also used fault injection, which means we simulate those transient, external, environmental failures such as network failures that very rarely would occur by itself.
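
Sketched in code, that hybrid pipeline might look like the following; the helpers (ask_llm, run_unit_test, inject_fault) are hypothetical placeholders standing in for the LLM-assisted analysis, test harness, and fault-injection hooks described here, not the actual tool from the paper.

```python
def find_candidate_retry_bugs(source_files, unit_tests, ask_llm, run_unit_test, inject_fault):
    """Illustrative pipeline: use an LLM to locate retry logic, then exercise it
    with fault injection while repurposing existing unit tests.

    `ask_llm`, `run_unit_test`, and `inject_fault` are assumed helpers supplied
    by the caller; `inject_fault` is assumed to act as a context manager.
    """
    suspects = []
    for path, code in source_files.items():
        # Ask the model where (if anywhere) a retry mechanism is implemented.
        answer = ask_llm(
            "Does this code implement a retry mechanism? "
            "If so, list the functions involved.\n\n" + code
        )
        if answer.lower().startswith("no"):
            continue
        for test in unit_tests.get(path, []):
            # Re-run an existing test while injecting a transient failure
            # (e.g., a dropped connection) at the locations the model flagged.
            with inject_fault("transient_network_error", locations=answer):
                result = run_unit_test(test)
            if not result.passed:
                suspects.append((path, test, result))
    return suspects
```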

HUIZINGA: Well, Bo, I love the part in every paper where the researchers say, “And what we found was …” So tell us, what did you find?

STOICA: Well, we found that implementing retry is difficult and complex! Not only find new bugs because, yes, that was kind of the end goal of the paper but also try to understand why these bugs are happening. As Shan mentioned, we started this project with a bug study. We looked at retry bugs across eight to 10 applications that are widely popular, widely used, and that the community is actively contributing to them. And the experiences of both users and developers, if we can condense that—what do you think about retries?—is that, yeah, they’re frustrated because it’s a simple mechanism, but there’s so many pitfalls that you have to be aware of. So I think that’s the biggest takeaway. Another takeaway is that when I was thinking about bug-finding tools, I was having this somewhat myopic view of, you know, you instrument at the program statement level, you figure out relationships between different lines of code and anti-patterns, and then you build your tools to find those anti-patterns. Well, with retry, this kind of gets thrown out the window because retry is a mechanism. It’s not just one line of code. It is multiple lines of code that span multiple functions, multiple methods, and multiple files. And you need to think about retry holistically to find these issues. And that’s one of the reasons we used large language models, because traditional static analysis or traditional program analysis cannot capture this. And, you know, large language models turns out to be actually great at this task, and we try to harness the, I would say, fuzzy code comprehension capabilities of large language models to help us find retry bugs.

HUIZINGA: Well, Shan, research findings are important, but real-world impact is the ultimate goal here. So who will this research help most and why?

LU: Yeah, that’s a great question. I would consider several groups of people. One is hopefully, you know, people who actually build, design real systems will find our study interesting. I hope it will resonate with them about those difficulties in implementing retry because we studied a set of systems and there was a little bit of comparison about how different retry mechanisms are actually used in different systems. And you can actually see that, you know, this different mechanism, you know, they have pros and cons, and we have a little bit of, you know, suggestion about what might be good practice. That’s the first group. The second group is, our tool actually did find, I would say, a relatively large number of retry problems in the latest version of every system we tried, and we find these problems, right, by repurposing existing unit tests. So I hope our tool will be used, you know, in the field by, you know, being maybe integrated with future unit testing so that our future system will become more robust. And I guess the third type of, you know, audience I feel like may benefit by reading our work, knowing our work: the people who are thinking about how to use large language model. And as I mentioned, I think a takeaway is large language model can repeat, can replace some of things we were able to do using traditional program analysis and it can do more, right, for those fuzzy code comprehension–related things. Because for traditional program analysis, we need to precisely describe what I want. Like, oh, I need a loop. I need a WRITE statement, right. For large language model, it’s imprecise by nature, and that imprecision sometimes actually match with the type of things we’re looking for.

HUIZINGA: Interesting. Well, both of you have just, sort of, addressed nuggets of this research. And so the question that I normally ask now is, if there’s one thing you want our listeners to take away from the work, what would it be? So let’s give it a try and say, OK, in a sentence or less, if I’m reading this paper and it matters to me, what’s my big takeaway? What is my big “aha” that this research helps me with?

STOICA: So the biggest takeaway of this paper is not to be afraid to integrate large language models in your bug-finding or testing pipelines. And I’m saying this knowing full well how imprecise large language models can be. But as long as you can trust but verify, as long as you have a way of checking what these models are outputting, you can effectively insert them into your testing framework. And I think this paper is showing one use case and bring us closer to, you know, having it integrated more ubiquitously.

HUIZINGA: Well, Shan, let’s finish up with ongoing research challenges and open questions in this field. I think you’ve both alluded to the difficulties that you face. Tell us what’s up next on your research agenda in this field.

LU: Yeah, so for me, personally, I mean, I learned a lot from this project and particularly this idea of leveraging large language model but also as a way to validate its result. I’m actually working on how to leverage large language model to verify the correctness of code, code that may be generated by large language model itself. So it’s not exactly, you know, a follow-up of this work, but I would say at idea, you know, philosophical level, it is something that is along this line of, you know, leverage large language model, leverage its creativity, leverage its … sometimes, you know … leverage its imprecision but has a way, you know, to control it, to verify it. That’s what I’m working on now.

HUIZINGA: Yeah … Bo, you’re finishing up your doctorate. What’s next on your agenda?

STOICA: So we’re thinking of, as Shan mentioned, exploring what large language models can do in this bug-finding/testing arena further and harvesting their imprecision. I think there are a lot of great problems that traditional code analysis has tried to tackle, but it was difficult. So in that regard, we’re looking at performance issues and how large language models can help identify and diagnose those issues because my PhD was mostly focused, up until this point, on correctness. And I think performance inefficiencies are such a wider field and with a lot of exciting problems. And they do have this inherent imprecision and fuzziness to them that also large language models have, so I hope that combining the two imprecisions maybe gives us something a little bit more precise.

HUIZINGA: Well, this is important research and very, very interesting.

[MUSIC]

Shan Lu, Bogdan Stoica, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts. And you can also find it on the SOSP website. See you next time on Abstracts!

[MUSIC FADES]

AI-powered microgrids facilitate energy resilience and equity in regional communities

Three icons that represent (left to right) ecology and environment, economics, and technology for emerging markets.

The rise of affordable small-scale renewable energy, like rooftop solar panels, is reshaping energy systems around the world. This shift away from fossil fuel-powered grids creates new opportunities for energy distribution that prioritize decentralized energy ownership and community empowerment. Despite this progress, centralized energy systems still dominate, often failing to provide vulnerable communities with reliable, affordable renewable energy. In response, Microsoft researchers are collaborating with local communities to explore how AI can enable community-scale energy solutions focused on energy availability and equity as well as decarbonization.

AI-powered microgrids support resilient communities

Microgrids, small and localized energy systems, hold promise as a solution to the challenges of centralized energy systems. These microgrids can operate independently from the larger grid, providing participants with resilience and control. Figure 1 shows how these systems integrate renewable energy sources and storage to efficiently manage local energy needs.

Figure 1: The image shows a microgrid system with interconnected assets, including rooftop solar panels, battery storage locations, electric vehicle chargers, wind turbines, and large solar farms, all supporting a small community and tied to the central power grid.
Figure 1. An example of the decentralized nature of a microgrid power system

AI improves energy reliability by integrating data on energy consumption, market prices, and weather forecasts, the last of which is essential for wind and solar power because both depend on weather conditions. Advanced forecasting predicts renewable energy availability, while AI-driven analytics determine when to generate, store, or sell electricity. This increases efficiency and stabilizes the grid by balancing supply and demand.

When powered by AI, microgrids can also contribute to energy equity. In many rural parts of the US, flat-rate billing models are still common, often leading to unfair pricing. AI-enabled microgrids provide an alternative by allowing communities to pay only for the energy they use. By analyzing consumption patterns, AI can ensure optimized distribution that promotes equitable pricing and access. These systems also improve resilience during crises, enabling communities to manage energy distribution more effectively and reduce reliance on centralized utilities. AI allows microgrids to predict energy demands, identify system vulnerabilities, and recover quickly during outages.

Evaluating AI’s impact on microgrid efficiency and equity

To explore AI’s potential in improving efficiency and equity in energy management, a team of Microsoft researchers collaborated with community organizations on simulations and a case study. They built a tabletop simulator to test whether AI could effectively determine when to generate, store, or sell electricity based on real-time data. The AI model was optimized for resilience and efficiency, using reinforcement learning to control grid and battery processes, enabling microgrids to adapt to changing energy conditions and market dynamics.

This simulation used a theoretical model with external data to show how an AI-driven microgrid could autonomously buy and sell energy based on strategic design parameters. By controlling when the battery is charged and discharged based on energy production and consumption patterns, the model maximized efficiency and maintained local power availability. Figure 2 shows the AI-controlled grid’s optimal decisions using open-source data from the California Independent System Operator (CAISO), serving as a proof of concept (PoC) for AI-driven microgrids operating under real-world conditions.

Figure 2 (A): Graph depicting peak and off-peak net power bought or sold over one week using simulations of the AI controller on historical CAISO data. The graph shows that when solar is available, more power is sold than bought, whereas during nighttime the controller relies on energy stored in the battery to power consumption, making fewer transactions.

Figure 2 (B): The graph shows battery levels from the simulation of the AI controller on historical CAISO data. During peak hours, the battery discharges as reserves are sold, while solar power supplies the load. At night, the battery conserves power, minimizing purchases and optimizing reserves for daytime selling.
Figure 2. (A) Peak and off-peak net power bought or sold over one week using simulations of the AI controller on historical CAISO data. (B) Peak and off-peak battery levels over one week using simulations of the AI controller on historical CAISO data. 
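
A deliberately simplified, rule-based stand-in for such a controller is sketched below; the thresholds, capacities, and rates are illustrative assumptions, not parameters from the study, and a reinforcement-learning controller would replace the fixed rules with a learned policy over the same decision variables (buy, sell, charge level).

```python
def dispatch(solar_kw, load_kw, price, battery_kwh,
             capacity_kwh=10.0, sell_threshold=0.30, step_hours=1.0):
    """Decide how much energy to buy, sell, or store for one time step.

    Simple policy: store surplus solar, sell stored energy when prices are high,
    and buy from the grid only to cover a shortfall.
    """
    net = (solar_kw - load_kw) * step_hours  # surplus (+) or shortfall (-) in kWh
    bought = sold = 0.0
    if net >= 0:
        stored = min(net, capacity_kwh - battery_kwh)
        battery_kwh += stored
        sold = net - stored                      # export whatever the battery cannot hold
    else:
        shortfall = -net
        from_battery = min(shortfall, battery_kwh)
        battery_kwh -= from_battery
        bought = shortfall - from_battery        # top up from the grid if needed
    if price >= sell_threshold and battery_kwh > 0.2 * capacity_kwh:
        extra = battery_kwh - 0.2 * capacity_kwh  # keep a reserve, sell the rest at peak prices
        battery_kwh -= extra
        sold += extra
    return bought, sold, battery_kwh
```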

Case study: AI-powered microgrid for community energy transition

Microsoft researchers, in partnership with community-based organizations Remix: The Soul of Innovation, Maverick IQ and Ayika Solutions, are designing and implementing an AI-powered microgrid system in West Atlanta. Working closely with the Vicars Community Center (VCC) resilience hub, they aim to address challenges faced by the community due to rapid development. West Atlanta, like many Atlanta neighborhoods, faces rising housing prices and energy costs that disproportionately affect long-time residents. Communities relying on centralized grids are more vulnerable to outages, with slow recovery times, highlighting systemic inequalities in energy distribution.

The VCC resilience hub is tackling these issues by helping to establish a solar microgrid for the West Atlanta Watershed Alliance (WAWA) community farm and surrounding neighborhoods. Microsoft researchers and collaborators are integrating AI into the microgrid to achieve energy savings, improve resilience, and create local job opportunities. Figure 3 shows the VCC resilience hub and WAWA community farm powered by the microgrid, highlighting key infrastructure for installing distributed energy resources (DERs).

Figure 3 (A) and 3 (B) show the VCC resilience hub, with solar panels and batteries for energy storage.

Figure 3 (C) and 3 (D) show the community farm and volunteers at WAWA, a key center for the community agriculture that the microgrid will support.
Figure 3. A and B show the VCC resilience hub, with solar panels (left) and batteries for energy storage (right) – photographs by Erica Holloman-Hill. C and D show the WAWA community farm and community members holding freshly harvested crops. 

Project phases

Co-innovation design

Microsoft researchers, architects, and community partners held a participatory design session with state and utility representatives to define the project’s mission and key metrics. The CDC’s Social Vulnerability Index informed the site selection, supporting the project’s diversity, equity, and inclusion goals. 

Renewables and microgrid siting

A renewable siting survey conducted by community partners identified the VCC as a key resilience hub for solar panel and battery installation.

To deliver these benefits, the site first needed upgrades. Older homes required energy-efficiency improvements, such as electrical upgrades and better insulation, before they could be integrated into the microgrid. As a PoC, the team collaborated with community partners to modernize an older home with inefficient energy consumption. Sensors were installed to track energy usage and environmental conditions (Figure 4).

Figure 4: A graph showing the estimated cost of electricity per day for a legacy household in West Atlanta, based on kilowatt-hour usage between July 29 and August 13, 2023. The data validates the family’s experience of high energy bills, inefficient heating and cooling, and high humidity in the basement.
Figure 4. Estimated daily electricity costs based on a home’s kilowatt-hour usage between July 29 and August 13, 2023. The data confirms the residents’ experience of high energy bills, inefficient heating and cooling, and high humidity in the basement. Used by permission from Erica Holloman-Hill.
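
The estimate in Figure 4 is essentially a tariff multiplication over the sensor readings; a minimal sketch, with a made-up flat rate (the actual utility tariff is not given in the post), is:

```python
RATE_PER_KWH = 0.14  # dollars; hypothetical flat tariff for illustration only

def daily_cost(hourly_kwh_readings, rate=RATE_PER_KWH):
    """Estimate one day's electricity cost from 24 hourly kWh sensor readings."""
    return sum(hourly_kwh_readings) * rate

print(daily_cost([1.2] * 24))  # a constant 1.2 kWh per hour comes to roughly $4.03 per day
```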

Students from Morehouse College used this data to create a digital twin of the home, which provided actionable insights (Figure 5). The analysis confirmed issues like high radon levels and energy drains from outdated appliances. Guided by these findings, the team upgraded the house into a “smart home” where AI monitors energy and environmental conditions, enabling it to join the microgrid and making it eligible for LEED certification.

Figure 5: Two figures showing snapshots of the digital twin created for Dr. Erica Holloman-Hill’s home, courtesy of Dr. Erica L. Holloman-Hill, owner of Ayika Solutions Inc. The first figure shows the sensor readings of pollutants and weather in various parts of the home. The second shows the measurements in detail for the basement. The detailed environmental data—including climatic conditions, appliance-level energy usage, and pollutant levels—provide actionable insights for identifying targeted areas for grid modernization.
Figure 5. Smart electrification: Snapshots of digital twin created for the PoC home. Panel A shows the digital twin for the entire home. Panel B shows detailed views for the first floor and basement, respectively. The detailed environmental data—including climatic conditions, appliance-level energy usage, and pollutant levels—provide actionable insights for identifying targeted areas for grid modernization. Used by permission from Erica Holloman-Hill.

Microgrid simulation phase

To prepare the AI-powered microgrid, Microsoft researchers built a simplified tabletop prototype simulating the setup using real data from the design and siting phases. This prototype demonstrated the control mechanism’s ability to manage DERs—solar panels, batteries, and appliances—and the interface between the microgrid and the larger grid. Figure 6 shows the tabletop model during prototyping.

Figure 7 illustrates the results of this simulation, showing power bought and sold and the battery charge-discharge profile. The AI controller made optimal buying and selling decisions, promoting efficiency and reliability.

Figure 7 (A): Graph depicting peak and off-peak net power bought or sold over one week using simulations of the AI controller on data generated during runs of the tabletop microgrid model. The graph shows that when solar is available, more power is sold than bought, whereas during nighttime the controller relies on energy stored in the battery to power consumption, making fewer transactions.

Figure 7 (B): The graph shows battery levels from a simulation of the AI-powered microgrid controller. During peak hours, the battery discharges as reserves are sold, while solar power supplies the load. At night, the battery conserves power, minimizing purchases and optimizing reserves for daytime selling.
Figure 7. (A) Peak and off-peak net power bought or sold over one week using AI-controller simulations. (B) Corresponding battery levels.

Erica Holloman-Hill, director of WAWA, CEO of Ayika Solutions and owner of the PoC home, reflected: “This study helped me understand how our home’s outdated condition affects our quality of life. Upgrading homes like mine could make a significant difference. Thanks to partnerships like this one, controlling and sharing the electricity the community generates is within reach, highlighting the potential of AI-supported technologies like microgrids for communities like ours.”

Building on the simulation’s success, the VCC resilience hub and local organizations are continuing to install solar panels to power the microgrid. AI will play a key role in siting and controlling the system as it expands. Efforts are also underway to establish sustainable financing models and assess homes for modernization to enable broader participation in the microgrid.

AI: A path to equity and resilience

The transition to decentralized microgrids offers new opportunities for energy efficiency, with AI playing a critical role in managing these systems. Yet additional efforts are needed for communities to fully realize these benefits. Residents of aging homes are burdened with outdated wiring, inefficient appliances, and poor insulation—factors that drive up energy costs. Their dependence on centralized grids offers little relief, underscoring the need for community-focused energy solutions. 

The West Atlanta project illustrates AI’s potential to create resilient, equitable, community-driven energy systems, paving the way for a more inclusive and sustainable future. Microsoft researchers are continuing to collaborate with local organizations to promote smarter energy management.

For additional details, please review the project report.

Acknowledgements

I would like to thank all the collaborators on these projects: West Atlanta microgrid: Erica L. Holloman-Hill, John Jordan Jr, Markese Bryant. I also want to thank Karin Strauss for reviewing and providing feedback on this blog post; Andalib Samandari, the intern who supported this project; Vaishnavi Ranganathan for helping to brainstorm throughout the project; AI & Society Fellows program for supporting projects in this domain; and Microsoft’s Datacenter Community Affairs team, Jon McKenley and Kelly Lanier Arnold for supporting the project in West Atlanta. 

Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency

Three icons that represent local and global search and GraphRAG. These icons sit on a blue to pink gradient.

GraphRAG is a technique that uses large language models (LLMs) to create knowledge graphs and summaries from unstructured text documents and leverages them to improve retrieval-augmented generation (RAG) operations on private datasets. It offers comprehensive global overviews of large, private troves of unstructured text documents while also enabling exploration of detailed, localized information. By using LLMs to create comprehensive knowledge graphs that connect and describe entities and relationships contained in those documents, GraphRAG leverages semantic structuring of the data to generate responses to a wide variety of complex user queries. Uncharted, one of Microsoft’s research collaborators, has recently been expanding the frontiers of this technology by developing a new approach to processing local queries: DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal). This approach builds upon Microsoft’s GraphRAG technique, combining characteristics of both global and local search to generate detailed responses in a method that balances computational costs with quality outcomes.

How GraphRAG works

GraphRAG has two primary components, an indexing engine and a query engine.

The indexing engine breaks down documents into smaller chunks, converting them into a knowledge graph with entities and relationships. It then identifies communities within the graph and generates summaries—or “community reports”—that represent the global data structure. 

The query engine utilizes LLMs to build graph indexes over unstructured text and query them in two primary modes: 

  • Global search handles queries that span the entire dataset. This mode synthesizes information from diverse underlying sources to answer questions that require a broad understanding of the whole corpus. For example, in a dataset about tech company research efforts, a global query could be: “What trends in AI research have emerged over the past five years across multiple organizations?” While effective for connecting scattered information, global search can be resource intensive. 
  • Local search optimizes for targeted queries, drawing from a smaller subset of documents that closely match the user’s input. This mode works best when the answer lies within a small number of text units. For example: “What new features and integrations did Microsoft’s Cosmos DB team release on October 4th?”

The creation of these summaries often involves a human in the loop (HITL), as user input shapes how information is summarized (e.g., what kinds of entities and relationships are extracted). To index documents using GraphRAG, a clear description of the intended user persona (as defined in the indexing phase) is needed, as it influences how nodes, edges, and community reports are structured.
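
As a schematic, the two query modes above differ mainly in how much of the index they consult before synthesizing an answer. The sketch below uses hypothetical helper functions and index attributes (community_reports, text_units, ask_llm, embed), not the GraphRAG package’s actual API:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer(query, index, ask_llm, embed, mode="local", k=10):
    """Schematic GraphRAG-style query dispatch (assumed helpers, not the real API).

    `index.community_reports` (global summaries) and `index.text_units`
    (chunk-level content with embeddings) are assumed outputs of the indexing
    engine; `ask_llm` and `embed` wrap the language and embedding models.
    """
    if mode == "global":
        # Global search: map over community reports, then reduce the partial answers.
        partials = [ask_llm(f"Answer '{query}' using only this report:\n{r.text}")
                    for r in index.community_reports]
        return ask_llm(f"Combine these partial answers to '{query}':\n" + "\n".join(partials))
    # Local search: retrieve the text units closest to the query and answer from them.
    q = embed(query)
    hits = sorted(index.text_units, key=lambda u: -cosine(q, u.embedding))[:k]
    context = "\n".join(u.text for u in hits)
    return ask_llm(f"Answer '{query}' using only this context:\n{context}")
```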

Introducing DRIFT Search

DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query’s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions. These follow-ups allow DRIFT to handle queries that may not fully align with the original extraction templates defined by the user at index time.

Answer details | Drift (DS_Default) | Local (LS)
Supply Chain | Traced back to cinnamon in Ecuador and Sri Lanka; [Redacted Brand] and [Redacted Brand] brands impacted | Products sold at [Redacted Brand] and [Redacted Brand]; plants in Ecuador
Contamination Levels | 2000 times higher than FDA max | Blood lead levels ranging from 4 to 29 micrograms per deciliter
Actions | Recalls and health advisories; investigating plant in Ecuador; issued warnings to retailers | Recalls and health advisories
Table 1: An example of summarized responses from two search techniques (DRIFT and Local Search) on a dataset of AP News articles to the query: “Describe what actions are being taken by the U.S. Food and Drug Administration and the Centers for Disease Control and Prevention to address the lead contamination in apple cinnamon fruit puree and applesauce pouches in the United States during November 2023”. As shown in the table, DRIFT search was able to surface details not immediately available through local search alone.

DRIFT Search: A step-by-step process 

  1. Primer: When a user submits a query, DRIFT compares it to the top K most semantically relevant community reports. This generates an initial answer along with several follow-up questions, which act as a lighter version of global search. To do this, we expand the query using Hypothetical Document Embeddings (HyDE), to increase sensitivity (recall), embed the query, look up the query against all community reports, select the top K and then use the top K to try to answer the query. The aim is to leverage high-level abstractions to guide further exploration.
  2. Follow-Up: With the primer in place, DRIFT executes each follow-up using a local search variant. This yields additional intermediate answers and follow-up questions, creating a loop of refinement that continues until the search engine meets its termination criteria, which is currently configured for two iterations (further research will investigate reward functions to guide terminations). This phase represents a globally informed query refinement. Using global data structures, DRIFT navigates toward specific, relevant information within the knowledge graph even when the initial query diverges from the indexing persona. This follow-up process enables DRIFT to adjust its approach based on emerging information. 
  3. Output Hierarchy: The final output is a hierarchy of questions and answers ranked on their relevance to the original query. This hierarchical structure can be customized to fit specific user needs. During benchmark testing, a naive map-reduce approach aggregated all intermediate answers, with each answer weighted equally. 
An image that shows a hierarchical tree with each node represented as a pie chart of weighting.
Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user’s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step.  C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive.
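
Put together, the three phases above can be sketched as a short loop. Every helper named here (hyde_expand, top_k_reports, primer_answer, local_search) is a hypothetical placeholder rather than the actual GraphRAG implementation:

```python
def drift_search(query, index, hyde_expand, top_k_reports, primer_answer, local_search,
                 k=5, max_iterations=2):
    """Illustrative DRIFT-style loop: primer, follow-up refinement, ranked output.

    The helper callables are assumed stand-ins for GraphRAG components and are
    expected to return objects with .answer, .score, and .follow_ups fields.
    """
    # 1. Primer: expand the query (HyDE), retrieve the top-K community reports,
    #    and produce an initial answer plus follow-up questions.
    expanded = hyde_expand(query)
    reports = top_k_reports(index, expanded, k=k)
    primer = primer_answer(query, reports)
    answers = [(query, primer.answer, primer.score)]
    frontier = list(primer.follow_ups)

    # 2. Follow-up: refine each question with local search, collecting intermediate
    #    answers and new follow-ups, for a fixed number of iterations.
    for _ in range(max_iterations):
        next_frontier = []
        for question in frontier:
            result = local_search(index, question)
            answers.append((question, result.answer, result.score))
            next_frontier.extend(result.follow_ups)
        frontier = next_frontier

    # 3. Output hierarchy: rank question/answer pairs by relevance to the original
    #    query (a naive score sort here; the benchmark used a simple map-reduce).
    return sorted(answers, key=lambda qa: -qa[2])
```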

Why DRIFT search is effective

DRIFT search excels by dynamically combining global insights with local refinement, enabling navigation from high-level summaries down to original text chunks within the knowledge graph. This layered approach ensures that detailed, context-rich information is preserved even when the initial query diverges from the persona used during indexing. By decomposing broad questions into fine-grained follow-ups, DRIFT captures granular details and adjusts based on the emerging context, making it adaptable to diverse query types. This makes it particularly effective when handling queries that require both breadth and depth without losing specific details.

Benchmarking DRIFT search

As shown, we tested the effectiveness of DRIFT search by performing a comparative analysis across a variety of use cases against GraphRAG local search and a highly tuned variant of semantic search methods. The analysis evaluated each method’s performance based on key metrics such as:  

  • Comprehensiveness: Does the response answer all aspects of the question?
  • Diversity of responses: Does the response provide different perspectives and insights on the question?

In our results, DRIFT search performed significantly better on both comprehensiveness and diversity. We set up an experiment where we took 5K+ news articles from the Associated Press and ingested them using GraphRAG. After ingestion, we generated 50 “local” questions on this dataset and used both DRIFT and Local Search to generate answers for each of these questions. These “local” questions target specific details in the dataset that can be attributed to a small number of text units containing the answer. The answers were then scored for comprehensiveness and diversity by an LLM judge.

  • On comprehensiveness, DRIFT search outperformed Local Search 78% of the time.
  • On diversity, DRIFT search outperformed Local Search 81% of the time.
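
The pairwise protocol behind those percentages can be sketched as follows, with ask_llm standing in for the judge model; the prompt wording is illustrative, not the exact prompt used in the evaluation:

```python
def judge_win_rate(questions, answers_a, answers_b, ask_llm, criterion="comprehensiveness"):
    """Fraction of questions on which method A beats method B according to an LLM judge.

    `questions` is a list of query strings; `answers_a` and `answers_b` map each
    question to the answer produced by the two methods; `ask_llm` is an assumed
    helper that returns the judge model's text response.
    """
    wins = 0
    for q in questions:
        verdict = ask_llm(
            f"Question: {q}\n\nAnswer A: {answers_a[q]}\n\nAnswer B: {answers_b[q]}\n\n"
            f"Which answer is better on {criterion}? Reply with exactly 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("A"):
            wins += 1
    return wins / len(questions)
```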

Availability

DRIFT search is available now on the GraphRAG GitHub.

Future research directions

A future version of DRIFT will incorporate an improved version of Global Search that will allow it to more directly address questions currently serviced best by global search. The hope is to then move towards a single query interface that can service questions of both local and global varieties. This work will further evolve DRIFT’s termination logic, potentially through a reward model that balances novel information with redundancy. Additionally, executing follow-up queries using either global or local search modes could improve efficiency. Some queries require broader data access, which can be achieved by leveraging a query router and a lite-global search variant that uses fewer community reports, tokens, and overall resources.

DRIFT search is the first of several major optimizations to GraphRAG that are being explored.  It shows how a global index can even benefit local queries. In our future work, we plan to explore more approaches to bring greater efficiency to the system by leveraging the knowledge graph that GraphRAG creates.

Intern Insights: Vaishnavi Ranganathan with Angela Busheska

Outline illustrations of Angela Busheska, an undergraduate engineering student at Lafayette College and Vaishnavi Ranganathan, a Senior Researcher at Microsoft.

Every year, interns from academic institutions around the world apply and grow their knowledge as members of the research community at Microsoft. In this Microsoft Research Podcast series, these students join their internship supervisors to share their experience working alongside some of the leading researchers in their respective fields. 

In this episode, Angela Busheska, an undergraduate engineering student at Lafayette College, talks to Senior Researcher Vaishnavi Ranganathan about her work on TerraTrace, a platform that brings together statistics and large language models to track land use over time for agricultural and forestry applications. Busheska discusses the personal loss that drew her to climate activism, the chain of events that led to a memorable face-to-face meeting with Microsoft’s chief sustainability officer, and her advice for going after the internship you want and making the experience count. 

Angela Busheska standing to the left of the Microsoft sign on the Microsoft campus in Redmond, Washington.
Angela Busheska, pictured on the Microsoft campus in Redmond, Washington, was a part of the Microsoft Research Undergraduate Research Intern Program. During her time in the internship program, she helped develop a platform for tracking land use across time for agricultural and forestry applications. 
Angela Busheska and Melanie Nakagawa standing in front of a fence
During her internship, Busheska met with Microsoft Chief Sustainability Officer Melanie Nakagawa at the Bloomberg Green Festival in Seattle and spoke with the Microsoft executive about her sustainability work. 

[1] For more information, see “Regulation on Deforestation-free products” on the European Commission website.

Research Focus: Week of October 7, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus | October 7, 2024

Securely Training Decision Trees Efficiently

In a recent paper: Securely Training Decision Trees Efficiently, which will appear at ACM CCS 2024, researchers from Microsoft significantly reduce the communication complexity of secure decision tree training. Decision trees are an important class of supervised learning algorithms. In this approach, a classification or regression tree is built based on a set of features or attributes present in the training dataset. As with many learning algorithms, the accuracy of decision trees can be greatly improved with larger volumes of data. However, this can be a challenge, since data may come from multiple independent sources and require attention to data privacy concerns. In this case, the use of a privacy-enhancing technology, such as secure multi-party computation (MPC), can help protect the underlying training data.

When the number of elements in the dataset is N, the number of attributes is m, and the height of the tree to be built is h, the researchers construct a protocol with communication complexity O(mN log N + hmN + hN log N), thereby achieving an improvement of roughly min(h, m, log N) over the previous state of the art. The essential feature is an improved protocol to regroup sorted private elements further into additional groups (according to a flag vector) while maintaining their relative ordering. Implementing this protocol in the MP-SPDZ framework shows that it requires 10× less communication and is 9× faster than existing approaches.
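
Setting the cryptography aside, the regrouping step amounts to a stable partition of already-sorted elements by a flag vector. The sketch below illustrates the operation in the clear; the protocol in the paper performs the equivalent step obliviously on secret-shared data:

```python
def stable_regroup(elements, flags):
    """Stable regrouping: split `elements` into groups according to `flags`
    while preserving the relative order of elements within each group.

    Plaintext illustration only; it is not an MPC protocol.
    """
    groups = {}
    for x, f in zip(elements, flags):
        groups.setdefault(f, []).append(x)
    # Concatenate the groups (here in sorted flag order) into one regrouped sequence.
    return [x for f in sorted(groups) for x in groups[f]]

# Example: elements already sorted by attribute value, regrouped by a 0/1 flag.
print(stable_regroup([3, 5, 8, 9, 12], [1, 0, 1, 0, 0]))  # [5, 9, 12, 3, 8]
```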


Multi-label audio classification with a noisy zero-shot teacher

Improving the real-world accuracy of audio content detection (ACD) is an important problem for streaming platforms, operating systems, and playback devices. ACD is similar to audio tagging, i.e., labeling the sounds present in an audio segment several seconds long or longer. However, ACD may use only a small number of higher-level labels, or super-classes (e.g., speech, music, traffic, machines, animals), where each label can cover a multitude of specific sounds.

In a recent paper: Multi-label audio classification with a noisy zero-shot teacher, researchers from Microsoft propose a novel training scheme using self-label correction and data augmentation to deal with noisy labels and improve real-world accuracy on a polyphonic audio content detection task. The augmentation method reduces label noise by mixing multiple audio clips and joining their labels, while remaining compatible with multiple active labels. The researchers show that performance can be improved by a self-label correction method using the same pretrained model. They also show that it is feasible to use a strong zero-shot model such as CLAP to generate labels for unlabeled data and to improve the results using the proposed training and label-enhancement methods. The resulting model performs similarly to CLAP while providing an efficient, mobile-friendly architecture that can be quickly adapted to unlabeled sound classes. 
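
As an illustration of the mix-and-join augmentation described above, here is a minimal sketch, assuming waveforms are NumPy arrays and labels are sets of super-class names; the gain range and normalization are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

# Overlay several clips and take the union of their label sets, so the
# mixed example remains valid for multi-label (polyphonic) training.

def mix_clips(clips, label_sets, rng=np.random.default_rng()):
    length = min(len(c) for c in clips)
    mixed = np.zeros(length, dtype=np.float32)
    for clip in clips:
        gain = rng.uniform(0.5, 1.0)           # random per-clip gain
        mixed += gain * clip[:length]
    mixed /= max(1.0, np.abs(mixed).max())      # avoid clipping
    labels = set().union(*label_sets)           # join the active labels
    return mixed, labels

speech = np.random.randn(16000).astype(np.float32)
music = np.random.randn(16000).astype(np.float32)
x, y = mix_clips([speech, music], [{"speech"}, {"music"}])
print(y)  # {'speech', 'music'}
```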


Tabularis Revilio: Converting Text to Tables

Tables are commonly used to store and present data. When tables are copied from documents and applications that lack proper tabular support, such as PDF documents, web pages, or images, they are often transferred as free-form text. Users must then rely on manual effort or programming skills to parse this free-form text back into structured tables.

In a recent paper: Tabularis Revilio: Converting Text to Tables, researchers from Microsoft present a novel neurosymbolic system for reconstructing tables whose column boundaries have been lost. Revilio addresses this task by detecting headers, generating an initial table sketch using a large language model (LLM), and using that sketch as a guiding representation during an enumerate-and-test strategy that evaluates syntactic and semantic table structures. Revilio was evaluated on a diverse set of datasets, demonstrating significant improvements over existing table parsing methods. It outperforms traditional techniques in both accuracy and scalability, handling large tables with over 100,000 rows. The researchers’ experiments on publicly available datasets show an increase in reconstruction accuracy of 5.8–11.3% over both neural and symbolic state-of-the-art baselines. 
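
To give a flavor of the enumerate-and-test idea, here is a deliberately simplified sketch that enumerates a few candidate split strategies and keeps the one most consistent with an expected column count. Revilio's actual search is guided by an LLM-generated table sketch and much richer syntactic and semantic checks; the delimiters and scoring below are illustrative assumptions.

```python
import re

# Toy enumerate-and-test: try candidate ways of splitting each text row
# into columns and keep the strategy whose column count best matches
# the expected number of columns (here standing in for the LLM sketch).

def split_candidates(line):
    return {
        "comma": [c.strip() for c in line.split(",")],
        "multispace": re.split(r"\s{2,}", line.strip()),
        "tab": line.rstrip("\n").split("\t"),
    }

def reconstruct(lines, expected_columns):
    best_name, best_score = None, -1.0
    for name in ("comma", "multispace", "tab"):
        rows = [split_candidates(line)[name] for line in lines]
        score = sum(len(r) == expected_columns for r in rows) / len(rows)
        if score > best_score:
            best_name, best_score = name, score
    return [split_candidates(line)[best_name] for line in lines]

text = ["Name  Age  City", "Ada  36  London", "Alan  41  Manchester"]
print(reconstruct(text, expected_columns=3))
```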


Confidential Container Groups: Implementing Confidential Computing on Azure Container Instances

Container-based technologies empower cloud tenants to develop highly portable software and deploy services in the cloud at a rapid pace. Cloud privacy, meanwhile, is important because many container deployments operate on privacy-sensitive data, but it is difficult to guarantee given the increasing frequency and sophistication of attacks. State-of-the-art confidential container designs leverage process-based trusted execution environments (TEEs), but they face security and compatibility issues that limit their practical deployment.

In a recent article in Communications of the ACM: Confidential Container Groups: Implementing Confidential Computing on Azure Container Instances (opens in new tab), researchers from Microsoft and external colleagues present the Parma architecture, which enables lift-and-shift deployment of unmodified containers while providing strong security protection against a powerful attacker who controls the untrusted host and hypervisor. Parma leverages VM-level isolation to execute a container group within a unique VM-based TEE. Besides container integrity and user data confidentiality and integrity, Parma also offers container attestation and execution integrity based on an attested execution policy. This policy, which is specified by the customer, delimits the actions that the cloud service provider is allowed to take on their behalf when managing the container group. 

The result is that customers receive the security protections of TEEs for their container workloads at minimal performance cost. To learn more, check out Confidential Containers on Azure Container Instances (opens in new tab), which is based on Microsoft’s Parma architecture. 
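
To illustrate the role of the execution policy conceptually, here is a purely hypothetical sketch in which the customer enumerates the operations the untrusted host may request and the in-TEE agent rejects everything else. The field names and check below are invented for illustration and do not reflect the actual Azure policy format.

```python
# Hypothetical illustration only: a customer-specified policy listing
# what the untrusted host may ask the guest to do; anything not listed
# is refused by the agent running inside the TEE.

policy = {
    "allowed_containers": [
        {"image_digest": "sha256:<digest-of-approved-image>",
         "command": ["python", "serve.py"]},
    ],
    "allow_debug_shell": False,
}

def host_request_allowed(policy, request):
    if request["kind"] == "start_container":
        return any(
            request["image_digest"] == c["image_digest"]
            and request["command"] == c["command"]
            for c in policy["allowed_containers"]
        )
    if request["kind"] == "exec_shell":
        return policy["allow_debug_shell"]
    return False  # default deny

print(host_request_allowed(policy, {"kind": "exec_shell"}))  # False
```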


AI for Business Transformation with Peter Lee and Vijay Mital

Generative AI is changing how businesses operate and how stakeholders talk to each other. The building blocks for large-scale AI transformation are now in place, but we are only beginning to imagine how it will unfold. Learn what Microsoft research leaders discovered from early AI innovation in healthcare, and how businesses can prepare for what’s ahead.

In this new three-part video series, Microsoft Research President Peter Lee and Corporate Vice President Vijay Mital discuss how Microsoft is helping businesses navigate this transformation, along with the critical role of data and how emerging multimodal AI models could turbocharge business innovation.


The post Research Focus: Week of October 7, 2024 appeared first on Microsoft Research.


Data Formulator: Exploring how AI can help analysts create rich data visualizations 

Data Formulator: Exploring how AI can help analysts create rich data visualizations 


Transforming raw data into meaningful visuals, such as charts, is key to uncovering hidden trends and valuable insights, but even with advances in AI-powered tools, this process remains complex. Integrating AI into the iterative nature of the data visualization process is particularly challenging, as data analysts often struggle to describe complicated tasks in a single text prompt while lacking the direct control of traditional tools. This highlights the need for smarter, more intuitive solutions that combine AI’s precision with the flexibility of hands-on methods.

To address this, we’re excited to release Data Formulator as an open-source research project. This update builds on last year’s release by combining user interface (UI) interactions for designing charts with natural language input for refining details. Unlike the previous version, which required users to choose between the two methods, this unified approach allows them to iteratively solve complex tasks with less effort.

  • Download

    Data Formulator 

    Transform data and create rich visualizations iteratively with AI.

Figure 1. Data Formulator’s UI

Creating and refining charts with the Concept Encoding Shelf and data threads

With Data Formulator, data analysts can now create charts from scratch or select from existing designs through data threads. The UI features a pane called the “Concept Encoding Shelf,” where users build a chart by dragging data fields into it, defining how they are encoded, or creating new fields. A large language model (LLM) on the backend processes this input, generating the code needed to produce the visual and updating the data threads for future use. This process is illustrated in Figure 2.

Figure 2. To create a new chart, users can select a previously created chart from the data threads and then use a combination of UI elements and language to describe their intent.

Data threads enable users to review and modify charts they created previously. This streamlines editing and refinement, as the LLM adapts past code to new contexts; without this feature, users would need to provide more detailed prompts to recreate designs from scratch. The mechanism also allows users to keep updating their charts until they’re satisfied.

Figure 3: Data Formulator’s data threads support complex navigation, quick editing, and the rerunning of previous instructions. 

Data Formulator’s framework

Data Formulator’s architecture separates data transformation from chart configuration, improving both the user experience and AI performance. Upon receiving user specifications, the system follows a three-step process: (1) it generates a Vega-Lite script, which defines how data is visualized; (2) it instructs the AI to handle data transformation; and (3) it creates the chart using the converted data, as illustrated in Figure 4.

Figure 4: Behind the scenes, Data Formulator compiles a Vega-Lite script from the Concept Encoding Shelf (1), prompts the LLM to generate the necessary data preparation code (2), and, once the new data is created, renders the chart (3).
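
For readers curious what this flow looks like in code, here is a hedged sketch of the three steps, assuming a hypothetical placeholder for the LLM call; the prompt layout, function names, and spec shape are illustrative and are not Data Formulator’s internal API.

```python
import json

# Step 1: a Vega-Lite skeleton from the Concept Encoding Shelf.
def build_vegalite_template(encodings):
    return {
        "mark": "line",
        "encoding": {ch: {"field": f} for ch, f in encodings.items()},
    }

# Step 2: the prompt the LLM sees: context plus the user's goal.
def compile_transform_prompt(fields, sample_rows, instruction, expected_fields):
    return (
        f"Data fields: {fields}\n"
        f"Sample rows: {sample_rows}\n"
        f"Instruction: {instruction}\n"
        f"Produce a table with fields: {expected_fields}"
    )

# Step 3: bind the transformed data into the spec (printed here rather
# than handed to a Vega-Lite renderer).
def render_chart(spec, rows):
    print(json.dumps(dict(spec, data={"values": rows}), indent=2))

spec = build_vegalite_template({"x": "Year", "y": "Rank", "color": "Entity"})
prompt = compile_transform_prompt(
    ["Year", "Entity", "Renewable Percentage"],
    [{"Year": 2000, "Entity": "Iceland", "Renewable Percentage": 99.9}],
    "rank by renewable percentage",
    ["Year", "Entity", "Rank"],
)
# rows = run_llm_generated_transform(prompt)   # placeholder for the LLM step
rows = [{"Year": 2000, "Entity": "Iceland", "Rank": 1}]
render_chart(spec, rows)
```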

Implications and looking forward

Refining how users interact with AI-powered tools is essential for improving how they communicate their requirements, paving the way for more efficient and effective collaboration. By integrating UI elements and natural language input, we designed Data Formulator to let users define their visualization needs with precision, leading to better results and reducing the need for multiple clarifications.

While Data Formulator addresses some challenges in data transformation and visualization authoring, others remain. For example, how can AI assist in cleaning unstructured data without losing critical information? And how can it help users define clear data analysis goals when starting with ambiguous or undefined objectives? We’re actively investigating these research questions and invite you to contribute by building on the Data Formulator codebase (opens in new tab).

Learn more about our research efforts on human-AI interaction by exploring how we design dynamic UI widgets (opens in new tab) for visualization editing. You can also view a demo of the Data Formulator project on GitHub Codespace (opens in new tab).

Acknowledgements

We’d like to thank Bongshin Lee, John Thompson, and Gonzalo Ramos for their feedback and contributions to this project. 

The post Data Formulator: Exploring how AI can help analysts create rich data visualizations  appeared first on Microsoft Research.


Stress-testing biomedical vision models with RadEdit: A synthetic data approach for robust model deployment

Stress-testing biomedical vision models with RadEdit: A synthetic data approach for robust model deployment

This paper has been accepted at the 18th European Conference on Computer Vision (ECCV 2024) (opens in new tab), the premier gathering on computer vision and machine learning.


Biomedical vision models are computational tools that analyze medical images, like X-rays, MRIs, and CT scans, and are used to predict medical conditions and outcomes. These models assist medical practitioners in disease diagnosis, treatment planning, disease monitoring, and risk assessment. However, datasets used to train these models can be small and not representative of real-world conditions, which often leads to these models performing worse in actual medical settings. To avoid misdiagnoses and other errors, these models must be rigorously tested and adjusted to perform reliably across different conditions.

To mitigate the challenge of limited and unrepresentative datasets and to improve the testing of biomedical vision models, we developed “RadEdit: Stress-testing biomedical vision models via diffusion image editing,” presented at ECCV 2024. Aligned with the Microsoft Responsible AI principles of reliability and safety, RadEdit helps researchers identify when and how models might fail before they are deployed in a medical setting. RadEdit uses generative image editing to simulate dataset shifts (e.g., a shift in patient demographics), helping researchers identify weaknesses in a model. By employing text-to-image diffusion models trained on a wide array of chest X-ray datasets, RadEdit can generate synthetic yet realistic X-rays.

RadEdit’s approach involves using multiple image masks (binary images representing designated regions of a reference image), as illustrated in Figure 1, to limit changes to specific areas of the image while preserving the integrity of the rest. It generates synthetic datasets free from spurious correlations and artifacts, addressing shortcomings in existing editing techniques. Traditional editing techniques often overlook biases within the generative model, producing synthetic data that perpetuates those biases; others restrict edits so severely that the outputs become unrealistic.


How RadEdit works

RadEdit improves biomedical image editing using three key inputs, as illustrated in Figure 1:

  • Text prompt: Defines the desired modifications. For example, a disease can be added with a description like “Consolidation”
  • Edit mask: A binary mask indicating the main area to be modified, such as the “right lung”
  • Keep mask: A binary mask outlining parts of the original image to be preserved, like the “left lung”
Figure 1: RadEdit’s inputs and outputs. By using separate “edit” and “keep” masks, RadEdit can make the desired modifications to an image with precise spatial control and realistic output.

RadEdit depends on a diffusion model for image editing, where the image is first converted to a latent noise representation by inverting the diffusion generative process. The noise representation is then iteratively denoised over multiple time steps. During each step, RadEdit:

  1. Uses the text prompt to conditionally generate pixels within the edit mask with classifier-free guidance.
  2. Generates the remaining pixels based on the original image and edited area.
  3. Replicates the content of the original image within the “keep” mask, ensuring that this area remains unaltered.

Finally, a quality check ensures that the edited image is faithful to the editing prompt. RadEdit uses Microsoft’s BioViL-T to compute an image-text alignment score that we can then use to filter out low-quality and unfaithful edits.
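
To make the editing loop concrete, here is a schematic sketch of the masked denoising procedure described above, with a stand-in denoiser so the control flow is visible. The released RadEdit implementation uses a latent diffusion model, diffusion inversion, classifier-free guidance, and BioViL-T filtering; the arrays, masks, and the toy denoise_step below are illustrative assumptions only.

```python
import numpy as np

def denoise_step(x_t, t, prompt=None):
    # Stand-in for one reverse-diffusion step: prompt-conditional when a
    # prompt is given, unconditional otherwise. A real model predicts
    # and removes noise here.
    return 0.9 * x_t if prompt else 0.95 * x_t

def radedit_style_edit(original, inverted_noise, edit_mask, keep_mask,
                       prompt, steps=50):
    x = inverted_noise.copy()
    for t in reversed(range(steps)):
        edited = denoise_step(x, t, prompt)   # 1. prompt-guided pixels in the edit region
        context = denoise_step(x, t)          # 2. remaining pixels, consistent with the rest
        x = np.where(edit_mask, edited, context)
        x = np.where(keep_mask, original, x)  # 3. copy the "keep" region from the original
    return x

h = w = 64
original = np.random.rand(h, w).astype(np.float32)
inverted_noise = np.random.randn(h, w).astype(np.float32)  # stands in for the inverted representation
edit_mask = np.zeros((h, w), dtype=bool)
edit_mask[:, : w // 2] = True    # e.g., the "right lung" region
keep_mask = np.zeros((h, w), dtype=bool)
keep_mask[:, w // 2 :] = True    # e.g., the "left lung" region
edited = radedit_style_edit(original, inverted_noise, edit_mask, keep_mask, "Consolidation")
# A final BioViL-T image-text alignment check would filter out unfaithful edits.
```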

Simulating dataset shifts

A key feature of RadEdit is its ability to simulate dataset shifts with precise spatial control for comprehensive model performance evaluation. This includes differences in image acquisition, the appearance of underlying pathologies, and population characteristics.

Particularly notable is RadEdit’s ability to simulate image variations from different sources (e.g., different hospitals), helping researchers identify potential biases in models trained solely on data from one source. For example, in a COVID-19 study, if all positive cases in a dataset come from one hospital and all negative cases come from another, a model trained to detect COVID-19 might over-rely on hospital-specific indicators in the X-ray images. Among other indicators, we considered the laterality markers in the corners of an X-ray (e.g., a highly visible letter “L” on the left side of the image) as well as the amount of black space at the image edges. To test whether a model relies too heavily on such differences in image acquisition, we used RadEdit to create synthetic data in which the COVID-19 features were removed while the hospital-specific indicators were retained. If the COVID-19 detection model still predicts COVID-19 on this synthetic dataset, the model is biased with respect to hospital-specific indicators.

RadEdit can also remove specific diseases, like pneumothorax (collapsed lung), from an image while keeping treatment features like chest drains. This helps researchers understand whether models are relying on such “visual shortcuts.” Because RadEdit maintains the size and location of the main anatomical structures (like the lungs, ribs, and heart), it can also be used to stress-test segmentation models. For example, RadEdit can add rare abnormalities or medical devices to lung images to test how well segmentation models handle new variations, ensuring they generalize accurately across different populations. Figure 2 illustrates these three stress-testing scenarios.

Figure 2: Stress-testing models by simulating dataset shifts via image editing.

Stress-testing multimodal models

We have used RadEdit to stress-test image classification and segmentation models, and we see potential for future applications in complex multimodal tasks like generating radiology reports. RadEdit can help identify limitations in multimodal large language models (MLLMs) like Microsoft’s MAIRA-1 and MAIRA-2, especially when dealing with rare conditions or unusual combinations of findings not well-represented in the training data. These MLLMs take one or more radiological images and relevant clinical information as input to produce detailed text reports.

RadEdit can generate synthetic image-report pairs for challenging scenarios. For example, manually editing a report to describe a rare combination of findings and then using RadEdit to edit the corresponding image creates a valuable test case for the MLLM. This approach allows us to stress-test MLLMs with diverse synthetic data, identifying weaknesses or biases and ensuring the model is more robust in real-world scenarios. This is a crucial step for using these models safely and effectively in clinical settings.

Implications and looking forward

RadEdit offers significant advantages for the biomedical research community. It helps identify biases and blind spots before deployment, helping to ensure that biomedical vision models perform reliably in clinical settings. By simulating dataset shifts, RadEdit reduces the need to collect additional evaluation data, saving time and resources.

RadEdit is applicable to a wide range of settings and can be used to stress-test state-of-the-art foundation models like Microsoft’s Rad-DINO (opens in new tab) and BiomedParse (opens in new tab). By integrating RadEdit into their research workflow, researchers can validate that their biomedical vision models are not only state-of-the-art but also more prepared for the complexities of real-world deployment. In the future, we envision RadEdit being applied to more complex multimodal tasks, such as generating radiology reports.

The code for RadEdit, as well as the weights of the diffusion model we used, can be found at https://huggingface.co/microsoft/radedit (opens in new tab).

Acknowledgments

We would like to thank our paper coauthors: Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, and Ozan Oktay, as well as all our collaborators across Microsoft Cloud for Healthcare and Microsoft Health Futures.

RadEdit is intended for research purposes only and not for any commercial or clinical use.

The post Stress-testing biomedical vision models with RadEdit: A synthetic data approach for robust model deployment appeared first on Microsoft Research.


Microsoft Research Forum Episode 4: The future of multimodal models, a new “small” language model, and other AI updates

Microsoft Research Forum Episode 4: The future of multimodal models, a new “small” language model, and other AI updates

Microsoft Research Forum is a continuous exchange of ideas about science and technology research in the era of general AI. In the latest episode (opens in new tab), researchers discussed the latest multimodal AI models, advanced benchmarks for AI evaluation and model self-improvement, and an entirely new kind of computer for AI inference and hard optimization. Researchers at Microsoft are working to explore breakthrough technology that can help advance everything from weather prediction to materials design. 

Below is a brief recap of the event, including select quotes from the presentations. Register to join future Research Forum episodes and view previous sessions. Transcripts and additional resources can be found in the Research Forum briefing book.

Keynote

Phi-3-Vision: A highly capable and “small” language vision model (opens in new tab)

Research Forum | Episode 4 Keynote | Jianfeng Gao

Jianfeng Gao introduced Phi-3-Vision, an advanced and economical open-source multimodal model. As a member of the Phi-3 model family, Phi-3-Vision enhances language models by integrating multisensory skills, seamlessly combining language and vision capabilities.

“Phi-3-Vision is the first multimodal model in the Phi small model family. It matches and sometimes exceeds some of the capabilities of much larger models … at a much lower cost. And to help everyone build more affordable and accessible AI systems, we have released the model weights into the open-source community.”

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond


Panel Discussion

Beyond language: The future of multimodal models in healthcare, gaming, and AI (opens in new tab)

Research Forum | Episode 4 Panel | John Langford, Hoifung Poon, Katja Hofmann, Jianwei Yang

This discussion examined the transformative potential and core challenges of multimodal models across various domains, including precision health, game intelligence, and foundation models. Microsoft researchers John Langford, Hoifung Poon, Katja Hofmann, and Jianwei Yang shared their thoughts on future directions, bridging gaps, and fostering synergies within the field. 

“One of the really cutting-edge treatments for cancer these days is immunotherapy. That works by mobilizing the immune system to fight the cancer. And then one of the blockbuster drugs is a KEYTRUDA, that really can work miracles for some of the late-stage cancers … Unfortunately, only 20 to 30 percent of the patients actually respond. So that’s … a marquee example of what are the growth opportunity in precision health.”
Hoifung Poon, General Manager, Microsoft Research Health Futures

“We experience the world through vision, touch, and all our other senses before we start to make sense of any of the language that is spoken around us. So, it’s really, really interesting to think through the implications of that, and potentially, as we start to understand more about the different modalities that we can model and the different ways in which we combine them.”
Katja Hofmann, Senior Principal Researcher, Microsoft Research

“To really have a capable multimodal model, we need to encode different information from different modalities, for example, from vision, from language, from even audio, speech, etc. We need to develop a very capable encoder for each of these domains and then … tokenize each of these raw data.”
Jianwei Yang, Principal Researcher, Microsoft Research Redmond


Lightning Talks

Analog optical computing for sustainable AI and beyond (opens in new tab)

Research Forum | Episode 4 Talk 1 | Francesca Parmigiani and Jiaqi Chu

This talk presented a new kind of computer—an analog optical computer—that has the potential to accelerate AI inference and hard optimization workloads by 100x, leveraging hardware-software co-design to improve the efficiency and sustainability of real-world applications. 

“Most likely, you or your loved ones have been inside an MRI scan, not really a great place to be in. Imagine if you can reduce that amount of time from 20 to 40 minutes to less than five minutes.”
Francesca Parmigiani, Principal Researcher, Microsoft Research Cambridge

“I’m really excited to share that we have just completed the second generation of [this] computer. It is much smaller in physical size, and this is a world first in that exactly the same computer is simultaneously solving hard optimization problems and accelerating machine learning inference. Looking ahead, we estimate that at scale, this computer can achieve around 450 tera operations per second per watt, which is a 100-times improvement as compared to state-of-the-art GPUs.”
Jiaqi Chu, Principal Researcher, Microsoft Research Cambridge


Direct Nash Optimization: Teaching language models to self-improve with general preferences (opens in new tab)

Research Forum | Episode 4 Talk 2 | Corby Rosset

This talk explored teaching language models to self-improve using AI preference feedback, challenging the model to play against itself and a powerful teacher until it arrives at a Nash equilibrium, resulting in state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench. 

“The traditional way to fine-tune an LLM for post-training … basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. … Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.”
Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers


Project Aurora: The first large-scale foundation model of the atmosphere (opens in new tab)

Research Forum | Episode 4 Talk 3 | Megan Stanley

This talk presented Aurora, a cutting-edge foundation model that offers a new approach to weather forecasting that could transform our ability to predict and mitigate the impacts of extreme events, air pollution, and the changing climate.

“If we look at Aurora’s ability to predict pollutants such as nitrogen dioxide that are strongly related to emissions from human activity, we can see that the model has learned to make these predictions with no emissions data provided. It’s learned the implicit patterns that cause the gas concentrations, which is very impressive.”
Megan Stanley, Senior Researcher, Microsoft Research AI for Science


A generative model of biology for in-silico experimentation and discovery (opens in new tab)

Research Forum | Episode 4 Talk 4 | Kevin Yang

This talk explored how deep learning enables generation of novel and useful biomolecules, allowing researchers and practitioners to better understand biology. This includes EvoDiff, a general-purpose diffusion framework that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models to generate new proteins, given a protein sequence.

“Often, protein engineers want proteins that perform a similar function to a natural protein, or they want to produce a protein that performs the same function but has other desirable properties, such as stability. By conditioning EvoDiff with a family of related sequences, we can generate new proteins that are very different in sequence space to the natural proteins but are predicted to fold into similar three-dimensional structures. These may be good starting points for finding new functions or for discovering versions of a protein with desirable properties.”
Kevin Yang, Senior Researcher, Microsoft Research New England


Fostering appropriate reliance on AI (opens in new tab)

Research Forum | Episode 4 Talk 5 | Mihaela Vorvoreanu

Since AI systems are probabilistic, they can make mistakes. One of the main challenges in human-AI interaction is avoiding overreliance on AI and empowering people to determine when to accept, or not accept, an AI system’s recommendation. This talk explored Microsoft’s work in this area.

“This is where I think it is our responsibility as people working in UX disciplines—as people researching UX and human-computer interaction—to really, really step up to the front and see how it is our moment to shine and to address this problem.”
Mihaela Vorvoreanu, Director UX Research and Responsible AI Education, Microsoft AI Ethics and Effects in Engineering and Research (Aether)

The post Microsoft Research Forum Episode 4: The future of multimodal models, a new “small” language model, and other AI updates appeared first on Microsoft Research.


Research Focus: Week of September 23, 2024

Research Focus: Week of September 23, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus | September 23, 2024

ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons

Time-series forecasting is a technique used to predict future values based on previously observed data points over time. It has extensive applications for traffic flow, renewable energy, retail, finance, and climate, among other uses. For these applications, it is crucial to provide forecasts across different prediction horizons, addressing both short- and long-term planning needs. Many decision-making processes also require not only point forecasts to quantify planning efficiency but also robust distributional estimations to manage uncertainty effectively. 

Delivering precise point and distributional forecasts across a spectrum of prediction horizons is a significant challenge. Prior research on developing deep learning models for time-series forecasting has often concentrated on isolated aspects, such as long-term point forecasting or short-term probabilistic estimations. This may result in skewed methodological choices and hinder the adaptability of these models to uncharted scenarios. While there is a rising trend in developing universal forecasting models, a thorough understanding of their advantages and drawbacks is still lacking.  

In a recent paper: ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons, researchers from Microsoft and external collaborators present a platform to evaluate these fundamental forecasting needs and to conduct a rigorous comparative analysis of related recent studies. They examine the latest models for universal time-series forecasting and discover that their analyses of methodological strengths and weaknesses are also applicable to these universal models. They then outline the limitations inherent in current research and underscore several avenues for future exploration. 
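
As a concrete illustration of evaluating both forecasting needs, here is a minimal sketch that computes a point metric (MAE) and a sample-based distributional metric (an empirical CRPS estimate) per prediction horizon; the metric choices and array shapes are illustrative and not necessarily the exact set ProbTS reports.

```python
import numpy as np

def mae(y_true, y_point):
    # Mean absolute error per horizon step, averaged over series.
    return np.mean(np.abs(y_true - y_point), axis=0)

def crps_samples(y_true, y_samples):
    # Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    # where y_samples has shape (n_samples, n_series, horizon).
    term1 = np.mean(np.abs(y_samples - y_true), axis=0)
    term2 = 0.5 * np.mean(
        np.abs(y_samples[:, None] - y_samples[None, :]), axis=(0, 1)
    )
    return np.mean(term1 - term2, axis=0)

rng = np.random.default_rng(0)
horizon, n_series, n_samples = 24, 8, 100
y_true = rng.normal(size=(n_series, horizon))
samples = y_true + rng.normal(scale=0.3, size=(n_samples, n_series, horizon))
print("MAE per horizon:", mae(y_true, samples.mean(axis=0)).round(3))
print("CRPS per horizon:", crps_samples(y_true, samples).round(3))
```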


SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Information retrieval (IR) involves identifying and retrieving recorded data that is relevant to an information need. Large-scale test collections play a crucial role in IR research. However, existing IR research studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments – a time-intensive and expensive process. Recent studies have shown the strong capability of large language models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost.

In a recent paper: SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval (opens in new tab), researchers from Microsoft and external colleagues address the lack of a large-scale ad-hoc retrieval test collection. They extend the TREC Deep Learning Track (opens in new tab) test collection with additional synthetic relevance labels from a language model, enabling researchers to test and evaluate their search systems at large scale. The resulting test collection includes more than 1,900 test queries from previous tracks. The researchers compare system evaluation against past human labels and show that their synthetically created large-scale test collection leads to highly correlated system rankings. 
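
The core of that comparison can be illustrated with a small sketch: score each retrieval system once with human judgments and once with synthetic judgments, then measure how similarly the two rank the systems. The scores and the Kendall's tau computation below are illustrative; SynDL's evaluation uses standard TREC metrics over the full collection.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    # Rank correlation from pairwise agreement (no tie handling).
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        a = (xs[i] - xs[j]) * (ys[i] - ys[j])
        concordant += a > 0
        discordant += a < 0
    return (concordant - discordant) / (concordant + discordant)

# Made-up effectiveness scores purely to show the computation.
human_ndcg = {"sysA": 0.52, "sysB": 0.61, "sysC": 0.47, "sysD": 0.58}
synthetic_ndcg = {"sysA": 0.55, "sysB": 0.64, "sysC": 0.49, "sysD": 0.60}

systems = sorted(human_ndcg)
tau = kendall_tau([human_ndcg[s] for s in systems],
                  [synthetic_ndcg[s] for s in systems])
print(f"Kendall's tau between system rankings: {tau:.2f}")  # 1.00 here
```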


Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling

LLMs are used for a wide variety of tasks and scenarios, such as chat, question answering, code generation, summarization and reasoning. These tasks exhibit variations in their input and output characteristics. Requests for different tasks with distinct input and output characteristics are often served concurrently at a single model instance, which can lead to spikes in end-to-end latency, time to generate the first token, and time between tokens (in the case of a streaming request). Understanding the interplay between requests of different characteristics is important for optimizing the end-to-end performance during LLM inference.

In a recent preprint, Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling, researchers from Microsoft propose a heuristic-guided, reinforcement learning-based intelligent router for data-driven, workload-aware scheduling. The router leverages a trainable response-length predictor and a novel formulation for estimating the impact of mixing different workloads to schedule queries across LLM instances, achieving over 11% lower end-to-end latency than existing approaches.
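
To convey the intuition behind workload-aware scheduling, here is a deliberately simplified greedy sketch: predict how long a response will be and route the request to the instance with the least predicted outstanding work. The paper's router is a heuristic-guided reinforcement learning policy with a trained response-length predictor; the rules and names below are illustrative stand-ins.

```python
def predict_response_tokens(prompt):
    # Stand-in for the trainable response-length predictor.
    if "summarize" in prompt.lower():
        return 150
    if "write code" in prompt.lower():
        return 400
    return 80

class Instance:
    def __init__(self, name):
        self.name = name
        self.pending_tokens = 0   # estimated outstanding decode work

def route(request, instances):
    est = predict_response_tokens(request)
    # Greedy placement on the least-loaded instance by predicted work.
    target = min(instances, key=lambda inst: inst.pending_tokens)
    target.pending_tokens += est
    return target.name, est

instances = [Instance("llm-0"), Instance("llm-1")]
for prompt in ["Summarize this article...", "Write code for quicksort",
               "What is the capital of France?"]:
    print(route(prompt, instances))
```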


INTERNSHIP OPPORTUNITY

Apply now: Microsoft Research Undergrad Internship Program – Summer 2025

The Microsoft Research Undergrad Internship Program offers 12-week internships in Redmond, Washington; New York City; or Cambridge, Massachusetts, for rising college juniors and seniors who are passionate about technology and champion diversity and inclusion.

Come work alongside world-class researchers on state-of-the-art projects. Participants will collaborate with an extended network of visiting faculty, postdoctoral researchers, data and applied scientists, engineers, designers, and doctoral students to make important contributions to new and ongoing research. On-the-job learning will be augmented with mentoring, community building, and networking opportunities. Candidates from groups currently underrepresented in engineering and computer science are strongly encouraged to apply.

Applications (opens in new tab) will be accepted until October 21, 2024. Apply now!

The post Research Focus: Week of September 23, 2024 appeared first on Microsoft Research.
