Research Focus: Week of July 3, 2023

Microsoft Research Focus 19 | Week of July 3, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

The Best of Both Worlds: Unlocking the Potential of Hybrid Work for Software Engineers

The era of hybrid work has created new challenges and opportunities for developers. Their ability to choose where they work and the scheduling flexibility that comes with remote work can be offset by the loss of social interaction, reduced collaboration efficiency, and difficulty separating work time from personal time. To maintain a successful and efficient hybrid workforce, companies must accentuate the positive elements of hybrid work while also addressing its challenges.

In a new study, The Best of Both Worlds: Unlocking the Potential of Hybrid Work for Software Engineers, researchers from Microsoft aim to identify which form of work – fully in office, fully at home, or blended – yields the highest productivity and job satisfaction among developers. They analyzed more than 3,400 responses to surveys conducted across 28 companies in seven countries, in partnership with Vista Equity Partners, a leading global asset manager with experience investing in software, data, and technology-enabled organizations.

The study found that developers face many of the same challenges found in other types of hybrid workplaces. The researchers provide recommendations for addressing these challenges and unlocking more productivity while improving employee satisfaction.

Spotlight: Microsoft Research Podcast

AI Frontiers: The Physics of AI with Sébastien Bubeck

What is intelligence? How does it emerge and how do we measure it? Ashley Llorens and machine learning theorist Sébastien Bubeck discuss accelerating progress in large-scale AI and early experiments with GPT-4.

NEW INSIGHTS

Prompt Engineering: Improving our ability to communicate with LLMs

Pretrained natural language generation (NLG) models are powerful, but in the absence of contextual information, responses are necessarily generic. The prompt is the primary mechanism for access to NLG capabilities. It is an enormously effective and flexible tool, yet to be reliably converted into the expected output, a prompt must convey information in a way the model can act on. If the prompt is not accurate and precise, the model is left guessing. Prompt engineering aims to bring more context and specificity to generative AI models, providing enough information in the model instructions that the user gets the exact result they want.

In a recent blog post, Prompt Engineering: Improving our Ability to Communicate with an LLM, researchers from Microsoft explain how they use retrieval augmented generation (RAG) for knowledge grounding, use advanced prompt engineering to properly set context in the input to guide large language models (LLMs), implement a provenance check for responsible AI, and help users deploy scalable NLG services more safely, effectively, and efficiently.
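
To make the idea concrete, here is a minimal sketch of RAG-style prompt construction: retrieved passages are placed into the prompt together with their sources so the model can ground its answer and cite provenance. The retriever, data structure, and prompt wording below are illustrative assumptions, not the implementation described in the blog post.

```python
# Illustrative sketch of building a knowledge-grounded prompt for an LLM (RAG).
# The Passage class, sources, and prompt template are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Passage:
    source: str   # provenance identifier, e.g. a document ID or URL
    text: str     # retrieved snippet used to ground the response

def build_grounded_prompt(question: str, passages: list) -> str:
    """Assemble a prompt that supplies explicit context and provenance to the model."""
    context = "\n".join(
        f"[{i + 1}] ({p.source}) {p.text}" for i, p in enumerate(passages)
    )
    return (
        "Answer the question using only the numbered context passages below.\n"
        "Cite the passage numbers you relied on. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# In a real system the passages would come from a search or vector index;
# here they are hard-coded for illustration.
passages = [Passage("kb://returns-policy", "Items may be returned within 30 days of purchase.")]
print(build_grounded_prompt("How long do customers have to return an item?", passages))
```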


NEW RESOURCE

Overwatch: Learning patterns in code edit sequences

Integrated development environments (IDEs) provide tool support to automate many source code editing tasks. IDEs typically use only the spatial context, i.e., the location where the developer is editing, to generate candidate edit recommendations. However, spatial context alone is often not sufficient to confidently predict the developer’s next edit, and thus IDEs generate many suggestions at a location. Therefore, IDEs generally do not actively offer suggestions. The developer must click on a specific icon or menu and then select from a large list of potential suggestions. As a consequence, developers often miss the opportunity to use the tool support because they are not aware it exists or forget to use it. To better understand common patterns in developer behavior and produce better edit recommendations, tool builders can use the temporal context, i.e., the edits that a developer was recently performing.

To enable edit recommendations based on temporal context, researchers from Microsoft created Overwatch, a novel technique for learning edit sequence patterns from traces of developers’ edits performed in an IDE. Their experiments show that Overwatch achieved 78% precision and that it not only completed edits when developers missed the opportunity to use existing IDE tool support, but also predicted new edits that had no tool support in the IDE.
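
The core idea behind temporal-context recommendations can be illustrated with a toy sketch: keep a short history of the developer's recent edits and, when the tail of that history matches a learned edit-sequence pattern, suggest the pattern's next edit. The pattern representation and edit descriptions below are simplified placeholders; Overwatch itself learns far richer patterns from real edit traces in the IDE.

```python
# Toy sketch of suggesting a next edit from temporal context (recent edit history).
# Patterns and edit descriptions are made-up placeholders, not Overwatch's learned patterns.
from typing import List, Optional, Tuple

# Each "pattern" pairs a prefix of recent edits with the edit to suggest next.
EDIT_PATTERNS: List[Tuple[List[str], str]] = [
    (["add parameter to method signature"], "update call sites with the new argument"),
    (["rename local variable", "update variable in assignment"],
     "update variable in remaining references"),
]

def suggest_next_edit(recent_edits: List[str]) -> Optional[str]:
    """Return a suggestion if the most recent edits match a known pattern prefix."""
    for prefix, suggestion in EDIT_PATTERNS:
        if recent_edits[-len(prefix):] == prefix:
            return suggestion
    return None

print(suggest_next_edit(["open file", "add parameter to method signature"]))
# -> update call sites with the new argument
```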


UPDATED RESOURCE

Qlib updates harness adaptive market dynamics modeling and reinforcement learning to address key challenges in financial markets

Qlib is an open-source framework built by Microsoft Research that empowers research into AI technologies applicable to the financial industry. Qlib initially supported diverse machine learning modeling paradigms, including supervised learning. Now, a series of recent updates have added support for market dynamics modeling and reinforcement learning, enabling researchers and engineers to tap into more sophisticated learning methods for advanced trading system construction.

These updates broaden Qlib’s capabilities and its value proposition for researchers and engineers, empowering them to explore ideas and implement effective quantitative trading strategies. The updates, available on GitHub, make Qlib the first platform to offer diverse learning paradigms aimed at helping researchers and engineers solve key financial market challenges.

A significant update is the introduction of adaptive concept drift technology for modeling the dynamic nature of financial markets. This feature can help researchers and engineers invent and implement algorithms that can adapt to changes in market trends and behavior over time, which is crucial for maintaining a competitive advantage in trading strategies.

Qlib’s support for reinforcement learning enables a new feature designed to model continuous investment decisions. This feature assists researchers and engineers in optimizing their trading strategies by learning from interactions with the environment to maximize some notion of cumulative reward.
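
As a rough illustration of the reinforcement-learning idea here (an agent learning from interactions with its environment to maximize cumulative reward), the sketch below trains a tabular Q-learning agent on a toy order-execution problem, where spreading an order over several steps reduces market-impact cost. The environment, actions, and rewards are invented for illustration; this is not Qlib's API or its RL framework.

```python
# Toy Q-learning sketch of sequential order execution: the agent decides how many
# shares to execute at each step and learns to minimize cumulative impact cost.
# Dynamics, costs, and scales are made up for illustration; this is not Qlib code.
import random

ACTIONS = [0, 1, 2]   # shares executed in one step (toy scale)
HORIZON = 5           # decision steps available to finish the parent order
TARGET = 6            # total shares that must be executed

def step(remaining: int, t: int, shares: int):
    """Toy dynamics: larger single executions incur a quadratic impact cost,
    and any shares left unexecuted at the end incur a penalty."""
    executed = min(shares, remaining)
    cost = executed ** 2
    if t == HORIZON - 1:
        cost += 10 * (remaining - executed)
    return remaining - executed, -cost  # (next remaining, reward)

Q = {}  # (remaining, t) -> list of action values

def q(state):
    return Q.setdefault(state, [0.0] * len(ACTIONS))

random.seed(0)
for _ in range(5000):  # tabular Q-learning with epsilon-greedy exploration
    remaining = TARGET
    for t in range(HORIZON):
        state = (remaining, t)
        if random.random() < 0.1:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q(state)[i])
        nxt, reward = step(remaining, t, ACTIONS[a])
        future = max(q((nxt, t + 1))) if t + 1 < HORIZON else 0.0
        q(state)[a] += 0.1 * (reward + future - q(state)[a])
        remaining = nxt

# The learned policy should spread execution out rather than trading all at once.
remaining = TARGET
for t in range(HORIZON):
    a = max(range(len(ACTIONS)), key=lambda i: q((remaining, t))[i])
    print(f"step {t}: execute {ACTIONS[a]} of {remaining} remaining")
    remaining, _ = step(remaining, t, ACTIONS[a])
```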

Related research:

DDG-DA: Data Distribution Generation for Predictable Concept Drift Adaptation

Universal Trading for Order Execution with Oracle Policy Distillation

Distributional Graphormer: Toward equilibrium distribution prediction for molecular systems

Structure prediction is a fundamental problem in molecular science because the structure of a molecule determines its properties and functions. In recent years, deep learning methods have made remarkable progress and impact on predicting molecular structures, especially for proteins. Methods such as AlphaFold and RoseTTAFold have achieved unprecedented accuracy in predicting the most probable structures for proteins from their amino acid sequences and have been hailed as a game changer in molecular science. However, these methods provide only a single snapshot of a protein structure, and structure prediction alone cannot tell the complete story of how a molecule works.

Proteins are not rigid objects; they are dynamic molecules that can adopt different structures with specific probabilities at equilibrium. Identifying these structures and their probabilities is essential in understanding protein properties and functions, how they interact with other proteins, and the statistical mechanics and thermodynamics of molecular systems. Traditional methods for obtaining these equilibrium distributions, such as molecular dynamics simulations or Monte Carlo sampling (which uses repeated random sampling from a distribution to achieve numerical statistical results), are often computationally expensive and may even become intractable for complex molecules. Therefore, there is a pressing need for novel computational approaches that can accurately and efficiently predict the equilibrium distributions of molecular structures from basic descriptors.

Figure 1. The goal of Distributional Graphormer (DiG). DiG takes the basic descriptor, D, of a molecular system, such as the amino acid sequence for a protein, as input and predicts the structures and their probabilities following the equilibrium distribution.

In this blog post, we introduce Distributional Graphormer (DiG), a new deep learning framework for predicting protein structures according to their equilibrium distribution. It aims to address this fundamental challenge and open new opportunities for molecular science. DiG is a significant advancement from single structure prediction to structure ensemble modeling with equilibrium distributions. Its distribution prediction capability bridges the gap between the microscopic structures and the macroscopic properties of molecular systems, which are governed by statistical mechanics and thermodynamics. Nevertheless, this is a tremendous challenge, as it requires modeling complex distributions in high-dimensional space to capture the probabilities of different molecular states.

DiG achieves a novel solution for distribution prediction through an advancement of our previous work, Graphormer, which is a general-purpose graph transformer that can effectively model molecular structures. Graphormer has shown excellent performance in molecular science research, demonstrated by applications in quantum chemistry and molecular dynamics simulations, as reported in our previous blog posts (see here and here for more details). Now, we have advanced Graphormer to create DiG, which has a new and powerful capability: using deep neural networks to directly predict target distribution from basic descriptors of molecules.

DiG tackles this challenging problem. It is based on the idea of simulated annealing, a classic method in thermodynamics and optimization, which has also motivated the recent development of diffusion models that achieved remarkable breakthroughs in AI-generated content (AIGC). Simulated annealing produces a complex distribution by gradually refining a simple distribution through the simulation of an annealing process, allowing it to explore and settle in the most probable states. DiG mimics this process in a deep learning framework for molecular systems. AIGC models are often based on the idea of diffusion models, which are inspired by statistical mechanics and thermodynamics.

DiG is also based on the idea of diffusion models, but we bring this idea back to thermodynamics research, creating a closed loop of inspiration and innovation. We imagine scientists someday will be able to use DiG like an AIGC model for drawing, inputting a simple description, such as an amino acid sequence, and then using DiG to quickly generate realistic and diverse protein structures that follow equilibrium distribution. This will greatly enhance scientists’ productivity and creativity, enabling novel discoveries and applications in fields such as drug design, materials science, and catalysis.

How does DiG work?

Figure 2. DiG’s design and backbone architecture. A Graphormer model constructs a step-by-step diffusion process that transforms a simple distribution into the equilibrium distribution implied by the system’s energy function. Training is supervised both by samples from a large-scale molecular simulation dataset and by the energy function itself, and a physics-informed diffusion pre-training (PIDP) method allows DiG to be pre-trained from energy functions alone, improving generalization to systems not in the dataset.

DiG is based on the idea of diffusion: it transforms a simple distribution into a complex one using Graphormer. The simple distribution can be a standard Gaussian, and the complex distribution can be the equilibrium distribution of molecular structures. The transformation is done step by step, with the whole process mimicking simulated annealing.

DiG can be trained using different types of data or information. It can use the energy functions of molecular systems to guide the transformation, minimizing the discrepancy between the energy-based probabilities and the probabilities predicted by DiG; this approach leverages prior knowledge of the system and trains DiG without a stringent dependency on data. Alternatively, DiG can use simulation data, such as molecular dynamics trajectories, to learn the distribution by maximizing the likelihood of the data under the DiG model.

DiG generalizes across many molecular systems about as well as deep learning-based structure prediction methods, because it inherits the advantages of advanced deep-learning architectures like Graphormer and applies them to the new and challenging task of distribution prediction. Once trained, DiG can generate molecular structures by reversing the transformation process, starting from a simple distribution and applying the neural networks in reverse order. DiG can also provide a probability estimate for each generated structure by computing the change of probability along the transformation process. DiG is a flexible and general framework that can handle different types of molecular systems and descriptors.
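
For intuition about the generation step, here is a schematic reverse-diffusion sampling loop in the style of standard denoising diffusion models: sampling starts from Gaussian noise, and a learned network is applied step by step to move toward the target distribution. The noise schedule and the placeholder network below are illustrative assumptions; they are not DiG's actual sampler, architecture, or probability estimator.

```python
# Schematic reverse-diffusion (ancestral) sampling: start from a simple Gaussian
# and apply a learned denoising step at every time step. The noise-prediction
# "model" is a zero placeholder standing in for a trained network such as Graphormer.
import numpy as np

T = 50                                # number of diffusion time steps
betas = np.linspace(1e-4, 0.05, T)    # illustrative noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t):
    """Placeholder for a trained noise-prediction network."""
    return np.zeros_like(x)

def sample(shape, rng):
    x = rng.standard_normal(shape)    # draw from the simple distribution
    for t in reversed(range(T)):      # step-by-step reverse transformation
        eps_hat = predict_noise(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
coords = sample((10, 3), rng)         # e.g. ten 3D coordinates of a toy "structure"
print(coords.shape)
```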

Results

We demonstrate DiG’s performance and potential through several molecular sampling tasks covering a broad range of molecular systems, such as proteins, protein-ligand complexes, and catalyst-adsorbate systems. Our results show that DiG not only generates realistic and diverse molecular structures with high efficiency and low computational costs, but it also provides estimations of state densities, which are crucial for computing macroscopic properties using statistical mechanics. Accordingly, DiG presents a significant advancement in statistically understanding microscopic molecules and predicting their macroscopic properties, creating many exciting research opportunities in molecular science.

One major application of DiG is to sample protein conformations, which are indispensable to understanding their properties and functions. Proteins are dynamic molecules that can adopt diverse structures with different probabilities at equilibrium, and these structures are often related to their biological functions and interactions with other molecules. However, predicting the equilibrium distribution of protein conformations is a long-standing and challenging problem due to the complex and high-dimensional energy landscape that governs probability distribution in the conformation space. In contrast to expensive and inefficient molecular dynamics simulations or Monte Carlo sampling methods, DiG generates diverse and functionally relevant protein structures from amino acid sequences at a high speed and a significantly reduced cost.

Figure 3. This illustration shows DiG’s performance when generating multiple conformations of proteins. On the left, DiG-generated structures of the main protease of the SARS-CoV-2 virus are projected into a 2D space spanned by two TICA coordinates. On the right, structures generated by DiG (thin ribbons) are compared with experimentally determined structures (cylindrical figures) in each case.

DiG can generate multiple conformations from the same protein sequence. The left side of Figure 3 shows DiG-generated structures of the main protease of SARS-CoV-2 virus compared with MD simulations and AlphaFold prediction results. The contours (shown as lines) in the 2D space reveal three clusters sampled by extensive MD simulations. DiG generates highly similar structures in clusters II and III, while structures in cluster I are undersampled. In the right panel, DiG-generated structures are aligned to experimental structures for four proteins, each with two distinguishable conformations corresponding to unique functional states. In the upper left, the Adenylate kinase protein has open and closed states, both well sampled by DiG. Similarly, for the drug transport protein LmrP, DiG also generates structures for both states. Here, note that the closed state is experimentally determined (in the lower-right corner, with PDB ID 6t1z), while the other is the AlphaFold predicted model that is consistent with experimental data. In the case of human B-Raf kinase, the major structural difference is localized in the A-loop region and a nearby helix, which are well captured by DiG. The D-ribose binding protein has two separated domains, which can be packed into two distinct conformations. DiG perfectly generated the straight-up conformation, but it is less accurate in predicting the twisted conformation. Nonetheless, besides the straight-up conformation, DiG generated some conformations that appear to be intermediate states.

Another application of DiG is to sample catalyst-adsorbate systems, which are central to heterogeneous catalysis. Identifying active adsorption sites and stable adsorbate configurations is crucial for understanding and designing catalysts, but it is also quite challenging due to the complex surface-molecular interactions. Traditional methods, such as density functional theory (DFT) calculations and molecular dynamics simulations, are time-consuming and costly, especially for large and complex surfaces. DiG predicts adsorption sites and configurations, as well as their probabilities, from the substrate and adsorbate descriptors. DiG can handle various types of adsorbates, from single atoms to molecules, adsorbed onto different types of substrates, such as metals or alloys.

Figure 4. Adsorption prediction results of single C, H, and O atoms on catalyst surfaces. The predicted probability distribution on catalyst surface is compared to the interaction energy between the adsorbate molecules and the catalyst in the middle and bottom rows.

Applying DiG, we predicted the adsorption sites for a variety of catalyst-adsorbate systems and compared the predicted probabilities with energies obtained from DFT calculations. DiG identified all the stable adsorption sites and generated adsorbate configurations similar to the DFT results, with high efficiency and at low cost. It also estimates the probabilities of different adsorption configurations, in good agreement with DFT energies.

Conclusion

In this blog, we introduced DiG, a deep learning framework that aims to predict the distribution of molecular structures. DiG is a significant advancement from single structure prediction toward ensemble modeling with equilibrium distributions, setting a cornerstone for connecting microscopic structures to macroscopic properties under deep learning frameworks.

DiG involves key ML innovations that lead to expressive generative models, which have been shown to have the capacity to sample multimodal distributions within a given class of molecules. We have demonstrated the flexibility of this approach on different classes of molecular systems, including proteins, and we have shown that individual structures generated in this way are chemically realistic. Consequently, DiG enables the development of ML systems that can sample equilibrium distributions of molecules given appropriate training data.

However, we acknowledge that considerably more research is needed to obtain efficient and reliable predictions of equilibrium distributions for arbitrary molecules. We hope that DiG inspires additional research and innovation in this direction, and we look forward to more exciting results and impact from DiG and other related methods in the future.

Collaborators: Holoportation™ communication technology with Spencer Fowers and Kwame Darko

Episode 142 | July 6, 2023 

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

In this episode, host Dr. Gretchen Huizinga welcomes Dr. Spencer Fowers, a member of the Special Projects Technical Staff at Microsoft Research, and Dr. Kwame Darko, a plastic surgeon in the reconstructive plastic surgery and burns center in Ghana’s Korle Bu Teaching Hospital. The two are part of an intercontinental research project to study Holoportation, a Microsoft 3D capture and communication technology, in the medical setting. The goal of the study—led by Korle Bu, NHS Scotland West of Scotland Innovation Hub, and Canniesburn Plastic Surgery and Burns Unit—is to make specialized healthcare more widely available, especially to those in remote or underserved communities. Fowers and Darko break down how the technology behind Holoportation and the telecommunication device being built around it brings patients and doctors together when being in the same room isn’t an easy option and discuss the potential impact of the work.

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

SPENCER FOWERS: I work with a team that does moonshots for a living, so I’m always looking for, what can we shoot for? And our goal really is like, gosh, where can’t we apply this technology? I mean, just anywhere that it is at all difficult to get, you know, medical expertise, we can ease the burden of doctors by making it so they don’t have to travel to provide this specialized care and increase the access to healthcare to these people that normally wouldn’t be able to get access to it.

KWAME DARKO: So yeah, the scope is as far as the mind can imagine it.

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC ENDS]


On this episode, I’m talking to Dr. Spencer Fowers, a Principal Member of the Technical Staff at Microsoft Research, and Dr. Kwame Darko, a plastic surgeon at the National Reconstructive Plastic Surgery and Burns Centre at the Korle Bu Teaching Hospital in Accra, Ghana. Spencer and Kwame are working on 3D telemedicine, a project they hope will increase access to specialized healthcare in rural and underserved communities by using live 3D communication, or Holoportation. We’ll learn much more about that in this episode. But first, let’s meet our collaborators.

Spencer, I’ll start with you. Tell us about the technical staff in the Special Projects division of Microsoft Research. What kind of work do you do, what’s the research model, and what’s your particular role there?

SPENCER FOWERS: Hi, Gretchen. Thanks for having me on here. Yeah, um, so our group at Special Projects was kind of patterned after the Lockheed Martin Skunk Works methodology. You know, we are very much a sort of “try big moonshot projects” type group. Our goal is sort of to focus on any sort of pie-in-the-sky idea that has some sort of a business application. So you can imagine we’ve done things like build the world’s first underwater datacenter or do post-quantum cryptography, things like that. Anything that, uh, is a very ambitious project that we can try to iterate on and see what type of an application we can find for it in the real world. And I’m one of the, as you said, a principal member of the technical staff. That means I’m one of the primary researchers, so I wear a lot of different hats. My job is everything from, you know, managing the project, meeting with people like Kwame and the other surgeons that we’ve worked with, and then interfacing there and finding ways that we can take theoretical research and turn it into applied research, actually find a way that we can bring that theory into reality.

HUIZINGA: You know, that’s a really interesting characterization because normally you think of those things in two different buckets, right? The big moonshot research has got a horizon, a time horizon, that’s pretty far out, and the applied research is get it going so it’s productizable or monetizable fairly quickly, and you’re marrying those two kinds of research models?

FOWERS: Yeah. I mean, we fit kind of a really interesting niche here at Microsoft because we get to operate sort of like a startup, but we have the backing of a very large company, so we get to sort of, yeah, take on these moonshot projects that a, a smaller company might not be able to handle and really attack it with the full resources of, of a company like Microsoft.

HUIZINGA: So it’s a moonshot project, but hurry up, let’s get ’er done. [LAUGHS]

FOWERS: Right, yeah.

HUIZINGA: Well, listen, Kwame, you’re a plastic surgeon at Korle Bu in Ghana. In many circles, that term is associated with nonessential cosmetic surgery, but that’s not what we’re talking about here. What constitutes the bulk of your work, and what prompted you to pursue it in the first place?

KWAME DARKO: All right, thanks, Gretchen, and also thank you for having me on the show. So, um, just as you said, my name is Kwame Darko, and I am a plastic surgeon. I think a lot of my passion to become a plastic surgeon came from the fact that at the time, we didn’t have too many plastic surgeons in Ghana. I mean, at the time that I qualified as a plastic surgeon, I was the eighth person in the country. And at the time, there was a population of 20-something million. Currently, we’re around 33 million, 34 million, and we have … we’re still not up to 30 plastic surgeons. So there was quite a bit of work to be done, and my work scopes from all the way to what everybody tends to [associate] plastic surgery with, the cosmetic stuff, across to burns, across to trauma from people with serious accidents that need some parts of their body reconstructed, to tumors of all sorts. Um, one of my fortes is breast cancer and breast reconstruction, but not limiting to that. We also [do] tumors of the leg. And we also help other surgeons to cover up spaces or defects that may have been created when they’ve taken off some sort of cancer or tumor or whatever it may be. So it’s a wide scope, as well as burn surgery and burn care, as well. So that’s the scope of the kind of work that I do.

HUIZINGA: You know, this wasn’t on my list to ask you, but I’m curious, um, both of you … Spencer, where did you get your training? What … what’s your background?

FOWERS: I actually got my PhD at university in computer engineering, uh, focused on computer vision. So a lot of my academic research was in, uh, you know, embedded systems and low-power systems and how we can get a vision-based stuff to work without using a lot of processing. And it actually fits really well for this application here where we’re trying to find low-cost ways that we can bring really high-end vision stuff, you know, and put it inside a hospital.

HUIZINGA: Yeah. So, Kwame, what about you? Where did you get your training, and did you start out in plastic surgery thinking, “Hey, that’s what I want to be”? Or did you start elsewhere and say, “This is cool”?

DARKO: So my, my background is that I did my medical school training here in Ghana at the medical school in Korle Bu and then started my postgraduate training in surgery. Um, over here, you need to do a number of years in just surgery before you can branch out and do a specific type of surgery. So, after my three, four years in that, I decided to do plastic surgery once again here in Ghana. You spend another up to three years—minimum of three years—training in that, which I did. And then you become a plastic surgeon. But then I went on for a bit of extra training and more exposure from different places around the world. I spent some time in Cape Town in South Africa working in a hospital called Groote Schuur. Um, I’ve also had the opportunity to work in, in Glasgow, where this idea originated from, and various courses in different parts of the world from India, the US, and stuff like that.

HUIZINGA: Wow. You know, I could spend a whole podcast asking you what you’ve seen in your lifetime in terms of trauma and burns and all of that. But I won’t, uh, because let’s talk about how this particular project came about, and I’d like both of your perspectives on it. This is a sort of “how I met your mother” story. As I understand it, there were a lot of people and more than two countries involved. Spencer, how do you remember the meet-up?

FOWERS: Yeah, I mean, Holoportation has been around since 2015, but it was around 2018 that, uh, Steven Lo—he’s a plastic surgeon in Glasgow—he approached us with this idea, saying, “Hey, we want to, we want to use this Holoportation technology in a hospital setting.” At that point, he was already working with Kwame and, and other doctors. They have a partnership between the Canniesburn Plastic Surgery unit in Glasgow and the Korle Bu Teaching Hospital in Accra. And he, he approached us with this idea of saying we want to build a clinic remotely so that people can come and see this. There is, like Kwame mentioned, right, a very drastic lack of surgeons in Ghana for the amount of the population, and so he wanted to find a way that he could provide reconstructive plastic surgery consultation to patients even though they’re very far away. Currently, the, you know, the Canniesburn unit, they do these trips every year, every couple of years, where they fly down to Ghana, perform surgeries. And the way it works is basically the surgeons get on an airplane, they fly down to Ghana, and then, you know, the next day they’re in the hospital, all day long, meeting these people that they’re going to operate on the next day, right? And trying to decide the day before the surgery what they’re going to operate on and what they’re going to do and get the consent from these patients. Is there a better way? Could we actually talk to these patients ahead of time? And 2D video calls just didn’t cut it. It wasn’t good enough, and Kwame can talk more about that. But his idea was, can we use something like this to make a 3D model of a patient, have a live conversation with them in 3D so that the surgeon can evaluate them before they go to Ghana and get an idea of what they’re going to do and be able to explain to the patient what they want to do before the surgery has to happen.

HUIZINGA: Yeah. So Microsoft Research, how did that group get involved?

FOWERS: Well, so we started with this technology back in, you know, 2015. And when he approached us with this idea, we were looking for ways that we could apply Holoportation to different areas, different markets. This came up as like one of those perfect fits for the technology, where we wanted to be able to use the system to image someone, it needed to be a live conversation, not a recording, and so that was, right there … was where we started working with them and designing the cameras that would go into the system they’re using today.

HUIZINGA: Right, right, right. Well, Kwame, in light of, uh, the increase, as we’ve just referred to, in 2D telemedicine, especially during COVID and, and post-COVID, people have gotten pretty used to talking to doctors over a screen as opposed to going in person. But there are drawbacks and shortcomings of 2D in your world. So how does 3D fill in those gaps, and, and what was attractive to you in this particular technology for the application you need?

DARKO: OK. So great, um, just as you’re saying, COVID really did spark the, uh, the spread of 2D telemedicine all over the world. But for myself, as a surgeon and particularly so as a plastic surgeon, we’re trying to think about, how is 2D video going to help me solve my problem or plan towards solving my problem for a patient? And you realize there is a significant shortfall when we’re not just dealing with the human being as a 2D object, uh, but 3D perspective is so important. So one of the most common things we’ve used this system to help us with is when we’re assessing a patient to decide which part of the body we’re going to move and use it to fit in the space that’s going to be created by taking out some form of tumor. And not only taking it out in 3D for us to know that it’s going to fit and be big enough but also demonstrating to the patient so they have a deeper understanding of exactly what is going to go and be used to reconstruct whichever part of their body and what defect is going to be left behind. So as against when you’re just having a straightforward consultation back and forth, answer and response, question and response, in this situation, we get the opportunity and have the ability to actually turn the patient around and then measure out specific problem … parts of the body that we’re going to take off and then transpose that on a different part of the body to make sure that it’s also going to be big enough to switch around and transpose. And when I’m saying transpose, I’m talking about maybe sticking something off from the front part of your thigh and then filling that in with maybe massive parts of your back muscle.

FOWERS: To add on to what Kwame said, you know, for us, for Microsoft Research, when Steven approached us with this, I don’t think we really understood the impact that it could have. You know, we even asked him, why don’t you just use like a cell phone, or why don’t you just use a 2D telemedicine call? Like, why do you need all this technology to do this? And he explained it to us, and we said, OK, like we’re going to take your word for it. It wasn’t until I went over there the first time that it really clicked for me and we had set up the system and he brought in a patient that had had reconstructive plastic surgery. She had had a cancerous tumor that required the amputation of her entire shoulder. So she lost her arm and, you know, this is not something that we think of on a day-to-day basis, but you actually, you can’t wear a shirt if you don’t have a shoulder. And so he was actually taking her elbow and replacing the joint that he was removing with her elbow joint. So he did this entire transpose operation. The stuff that they can do is amazing. But …

HUIZINGA: Right.

FOWERS: … he had done this operation on her probably a year before. And so he was bringing her back in for just the postoperative consult to see how she was doing. He had her in the system, and while she’s sitting in the system, he’s able to rotate the 3D model of her around so that she can see her own back. And he drew on her: “OK, this is where your elbow is now, and this is where we took the material from and what we did.” And during the teleconference, she says, “Oh, that’s what you did. I never knew what you did!” Like … she had had this operation a year ago, never knew what happened to herself because she couldn’t see her own back that way and couldn’t understand it. And it finally clicked to us like, oh my gosh, like, this is why this is important. Like not just because it aids the doctors in planning for surgeries, but the tremendous impact that it has on patient satisfaction with their operation and patient understanding of what’s going to happen.

HUIZINGA: Wow. That’s amazing. Even as you describe that, it’s … ahh … we could go so deep into the strangeness of what they can do with plastic surgery. But let’s talk about technology for a minute. Um, this is a highly visual technology, and we’re just doing a podcast, and we will provide some links in the show notes for people to see this in action, I hope. But in the meantime, Spencer, can you give us a kind of word picture of 3D telemedicine and the technology behind Holoportation? How does it work?

FOWERS: Yeah, the idea behind this technology is, if we can take pictures of a person from multiple angles and we know where those cameras are very, very accurately, we can stitch all those images together to make like a 3D picture of a person. So we’re actually using, for the 3D telemedicine system, we’re using the Azure Kinect. So it’s like Version 3 of the Kinect sensor that was introduced back in the Xbox days. And what that gives us is it gives us not just a color picture like you’re seeing on your normal 2D phone call, but it’s also giving us a depth picture so it can tell how far away you are from the camera. And we take that depth and that color information from 10 different cameras spaced around the room and stitch them all together in real time. So while we’re talking at, you know, normal conversation speed, it’s creating this 3D image of a person that the doctor, in this case, can actually rotate, pan, zoom in, and zoom out and be able to see them from any angle that they want without requiring that patient to get up and move around.

HUIZINGA: Wow. And that speaks to what you just said. The patient can see it as well as the clinician.

FOWERS: Yeah, I mean, you also have this problem with a lot of these patients if they’d had, you know, a leg amputation or something, when we typically talk like we’re talking now on like a, you know, the viewer, the listeners can’t see it, but a typical 2D telemedicine call, you’re looking at me from like my shoulders up. Well, if that person has an amputation of their knee, how do you get it so that you can talk to them in a normal conversation and then look at their knee? You, you just can’t do that on a 2D call. But this system allows them to talk to them and turn and look at their knee and show them—if it’s on their back, wherever it is—what they’re going to do and explain it to them.

HUIZINGA: That’s amazing. Kwame, this project doesn’t just address geographical challenges for remote patients. It also addresses geographical challenges for remote experts. So tell us about the nature and makeup of what you call MDTs—or multidisciplinary teams—that you collaborate with and how 3D telemedicine impacts the care you’re able to provide because of that.

DARKO: All right. So with an MDT, or multidisciplinary team, just as you said, the focus on medicine these days is to take out individual bias in how we’re going to treat a particular patient, an individual knowledge base. So now what we tend to do is we try and get a group of doctors who would be treating a particular ailment—more often than not, it’s a cancer case—and everybody brings their view on what is best to holistically find a solution to the patient’s … the most ideal remedy for the patient. Now let’s take skin cancer, for example. You’re going to need a plastic surgeon if you’re going to cut it out. You’re going to need a dermatologist who is going to be able to manage it. If it’s that severe, you’re also going to need an oncologist. You may even need a radiologist and, of course, a psychologist and your nursing team, as well. So with an MDT, you’d ideally have members from each of these specialties in a room at a time discussing individual patients and deciding what’s best to do for them. What happens when I don’t have a particular specialty? And what happens when, even though I am the representative of my specialty on this group, I may not have as in-depth knowledge as is needed for this particular patient? What do we do? Do we have access to other brains around the world? Well, with this system, yes, we do. And just as we said earlier, that unlike where this is just a regular let’s say Teams meeting or whatever form of, uh, telemedicine meeting, in this one where we have the 3D edge, we can actually have the patient around in the rig. And as we’re discussing and talking about—and people are giving their ideas—we can swing the patient around and say, well, on this aspect, it would work because this is far away from the ear or closer to the ear, or no, the ear is going to have to go with this; it’s too close. So what do we do? Can we get somebody else to do an ear reconstruction in addition? If it’s, um, something on the back, if we’re taking it all out, is this going to involve the muscle, as well? If so, how are we going to replace the muscle? It’s beyond my scope. But oh! What do you know? We have an expert who’s done this kind of things from, let’s say, Korea or Singapore. And then they would log on and be able to see everything and give their input, as well. So this is another application which just crosses boundary … um, borders and gives us so much more scope to the application of this, uh, this new device.

HUIZINGA: So, so when we’re talking about multidisciplinary teams and, and we look at it from an expert point of view of having all these different disciplines in the room from the medical side, uh, Spencer this collaboration includes technologists, as well as medical professionals, but it also includes patients. You, you talk about what you call a participatory development validation. What is the role of patients in developing this technology?

FOWERS: Well, similar to like that story I was mentioning, right, as we started using this system, the initial goal was to give doctors this better ability to be able to see patients in preparation for surgery. What we found as we started to show this to patients was that it drastically increased their satisfaction from the visits with the doctors because they were able to better understand the operation that was going to be performed. It’s surprising how many times like Kwame and Steven will talk to me and they’ll tell us stories about how like they explain a procedure to a patient about what they’re going to do, and the patient says, “Yeah, OK.” And then they get done and the patient’s like, “Wait, what did you do? Like that doesn’t … I didn’t realize you were going to do that,” you know, because it’s hard for them to understand when you’re just talking about them or whether you’re drawing on a piece of paper. But when you actually have a picture of yourself in front of you that’s live and the doctors indicating on you what’s going to happen and what the surgery is going to be, it drastically increases the patient satisfaction. And so that was actually the direction of the randomized controlled trial that we’re conducting in, in Scotland right now is, what kind of improvement in patient satisfaction does this type of a system provide?

HUIZINGA: Hmm. It’s kind of speaking UX to me, like a patient experience as opposed to a user experience. Um, has it—any of this—fed into sort of feedback loop on technology development, or is it more just on the user side of how I feel about it?

FOWERS: Um, as far as like technology that we use for the system, when we started with Holoportation, we were actually using kind of research-grade cameras and building our own depth cameras and stuff like that, which made for a very expensive system that wasn’t easy to use. That’s why we transitioned over to the Azure Kinect because it’s actually like the highest-resolution depth camera you can get on the market today for this type of information. And so, it’s, it’s really pushed us to find, what can we use that’s more of a compact, you know, all-in-one system so that we can get the data that we need?

HUIZINGA: Right, right, right. Well, Kwame, at about this time, I always ask what could possibly go wrong? But when we talked before, you, you kind of came at this from a cup-half-full outlook because of the nature of what’s already wrong in digital healthcare in general, but particularly for rural and underserved communities, um, and you’ve kind of said what’s wrong is why we’re doing this. So what are some of the problems that are already in the mix, and how does 3D telemedicine mitigate any of them? Things like privacy and connectivity and bandwidth and cost and access and hacking, consent—all of those things that we’re sort of like concerned about writ large?

DARKO: All right. So when I was talking about the cup being half full in terms of these, all of these issues, it’s because these problems already exist. So this technology doesn’t present itself and create a new problem. It’s just going to piggyback off the solutions of what is already in existence. All right? So you, you mentioned most of them anyway. I mean, talking about patient privacy, which is No. 1. Um, all of these things are done on a hospital server. They are not done on a public or an ad hoc server of any sort. So whatever fail-safes there are within the hospital in itself, whichever hospital network we’re using, whether here in Ghana, whether in Glasgow, whether somewhere remotely in India or in the US, doesn’t matter where, it would be piggybacking off a hospital server. So those fail-safes are there already. So if anybody can get into the network and observe or steal data from our system, then it’s because the hospital system isn’t secure, not because it’s our system, in a manner of speaking, is not secure. All right? And then when I was saying that it’s half full, it’s because whatever lapses we have already in 2D telemedicine, this supersedes it. And not only does it supersede the 2D lapses, it goes again and gives significant patient feedback like we were saying earlier, what Spencer also alluded to, is that now you have the ability to show the patient exactly what’s going on. And so in previous aspects where, think about it, even if it’s an in-person consultation where I would draw on a piece of paper and explain to them, “Well, I’m going to do this, this, this, and that,” now I actually have the patient’s own body, which they’re watching at the same time, being spun around and indicating that this actually is the spot I was talking about and this is how big my cut is going to be, and this is what I’m going to move out from here and use to fill in this space. So once again, my inclination on this is that, on our side, we can only get good, as against to looking for problems. The problems, I, I admit, will exist, but not as a separate entity from regular 2D medicine that’s … or 2D videography that we’re already encountering.

HUIZINGA: So you’re not introducing new risks with this. You’re just sort of serving on the other risks.

DARKO: We’re adding to the positives, basically.

HUIZINGA: Right. Yeah, Spencer, in the “what could go wrong” bucket on the other side of it, I’m looking at healthcare and the cost of it, uh, especially when you’re dealing with multiple specialists and complicated surgeries and so on. And I know healthcare policy is not on your research roadmap necessarily, but you have to be thinking about that as you’re, um, as you’re going on how this will ultimately be implemented across cultures and so on. So have you given any thought to how this might play out in different countries, or is this just sort of “we’re going to make the technology and let the policy people and the wonks work it out later?”

FOWERS: [LAUGHS] Yeah, it’s a good question, and I think it’s something that we’re really excited to see how it can benefit. Luckily enough, where we’re doing the tests right now, like in, uh, Glasgow and in Ghana, they already have partnerships and so there’s already standards in place for being able to share doctors and technology across that. But yeah, we’ve definitely looked into like, what kind of an impact does this have? And one of the benefits that we see is using something like 3D telemedicine even to provide greater access for specialty doctors in places like rural or remote United States, where they just don’t have access to those specialists that they need. I mean, you know, Washington state, where I am, has a great example where you’ve got people that live out in Eastern Washington, and if they have some need to go see like a pediatric specialist, they’re going to have to drive all the way into Seattle to go to Seattle Children’s to see that person. What if we can provide a clinic that allows them to, you know, virtually, through 3D telemedicine, interface with that doctor without having to make that drive and all that commute until they know what they need to do. And so we actually look at it as being beneficial because this provides greater access to these specialists, to other regions. So it’s actually improving, improving healthcare reach and accessibility for everyone.

HUIZINGA: Yeah. Kwame, can you speak to accessibility of these experts? I mean, you would want them all on your team for a 3D telemedicine call, but how hard is it to get them all on the same … I mean, it’s hard to get people to come to a meeting, let alone, you know, a big consultation. Does that enter the picture at all or, is that … ?

DARKO: It does. It does. And I think, um, COVID is something, is something else that’s really changed how we do everyday, routine stuff. So for example, we here in Ghana have a weekly departmental meeting and, um—within the plastic surgery department and also within the greater department of surgery, weekly meeting—everything became remote. So all of a sudden, people who may not be able to make the meeting for whatever reason are now logging on. So it’s actually made accessibility to them much, much easier and swifter. I mean, where they are, what they’re doing at the time, we have no idea, but it just means that now we have access to them. So extrapolating this on to us getting in touch with specialists, um, if we schedule our timing right, it actually makes it easier for the specialists to log on. Now earlier we spoke about international MDTs, not just local, but then, have we thought about what would have happened if we did not have this ability to have this online international MDT? We’re talking about somebody getting a plane ticket, sitting on a plane, waiting in airports, airport delays, etcetera, etcetera, and flying over just to see the patient for 30 minutes and make a decision that, “Well, I can or cannot do this operation.” So now this jumps over all of this and makes it much, much easier for us. And now when we move on to the next stage of consultation, after the procedure has been done, when I’m talking about the surgery, now the patient doesn’t need to travel great distances for individual specialist review. Now in the case of plastic surgery, this may cover not only the surgeon but also the physiotherapist. And so, it’s not just before the consultation but also after the consultation.

HUIZINGA: Wow. Spencer, what you’re doing with 3D telemedicine through Holoportation is a prime example of how a technology developed for one thing turned out to have incredible uses for another. So give us just a brief history of the application for 3D communication and how it evolved from where it started to where it is now.

FOWERS: Yeah, I mean, 3D communication, at least from what we’re doing, really started with things like the original Xbox Kinect, right? With a gaming console and a way to kind of interact in a different way with your gaming system. What happened was, Microsoft released that initial Kinect and suddenly found that people weren’t buying the Kinect to play games with it. They were buying to put it on robots and buying to use, you know, for different kind of robotics applications and research applications. And that’s why the second Kinect, when it was released, they had an Xbox version and they actually had a Kinect for Windows version because they were expecting people to buy this sensor to plug it in their computers. And if you look at the form factor now with the Azure Kinect that we have, it’s a much more compact unit. It’s meant specifically for using on computers, and it’s built for robotics and computer vision applications, and so it’s been really neat to see how this thing that was developed as kind of a toy has become something that we now use in industrial applications.

HUIZINGA: Right. Yeah. And this … sort of the thing, the serendipitous nature of research, especially with, you know, big moonshot projects, is like this is going to be great for gaming and it actually turns out to be great for plastic surgery! Who’d have thunk? Um, Kwame, speaking to where this lives in terms of use, where is it now on the spectrum from lab to life, as I like to say?

DARKO: So currently, um, we have a rig in our unit, the plastic surgery unit in the Korle Bu Teaching Hospital. There’s a rig in Glasgow, and there’s a rig over in the Microsoft office. So currently what we’ve been able to do is to run a few tests between Ghana, Seattle, and Glasgow. So basically, we’ve been able to run MDTs and we’ve been able to run patient assessments, pre-op assessments, as well as post-operative assessments, as well. So that’s where we are at the moment. It takes quite a bit of logistics to initiate, but we believe once we’re on a steady roll, we’ll be able to increase our numbers that we’ve been able to do this on. I think currently those we’ve operated and had a pre-op assessment and post-op assessment have been about six or seven patients. And it was great, basically. We’ve done MDTs across with them, as well. So the full spectrum of use has been done: pre-op, MDT, and post-op assessments. So yeah, um, we have quite a bit more to do with numbers and to take out a few glitches, especially with remote access and stuff like that. But yeah, I think we’re, we’re, we’re making good progress.

HUIZINGA: Yeah. Spencer, do you see, or do you know of, hurdles that you’re going to have to jump through to get this into wider application?

FOWERS: For us, from a research setting, one of the things that we’ve been very clear about as we do this is that, while it’s being used in a medical setting, 3D telemedicine is actually just a communication technology, right? It’s a Teams call; it’s a communication device. We’re not actually performing surgery with the system, you know, or it’s not diagnosing or anything. So it’s not actually a medical device as much as it’s a telecommunication device that’s being used in a medical application.

HUIZINGA: Well, as we wrap up, I like to give each of you a chance to paint a picture of your preferred future. If your work is wildly successful, what does healthcare look like in five to 10 years? And maybe that isn’t the time horizon. It could be two to three; it could be 20 years. I don’t know. But how have you made a difference in specialized medicine with this communication tool?

FOWERS: Like going off of what Kwame was saying, right, back in November, when we flew down and were present for that first international MDT, it was really an eye-opening experience. I mean, these doctors, normally, they just get on an airplane, they fly down, and they meet these patients for the first time, probably the day before they’ve had surgery. And this time, they were able to meet them and then be able to spend time before they flew down preparing for the surgery. And then they did the surgeries. They flew back. And normally, they would fly back, they wouldn’t see that patient again. With 3D telemedicine, they jumped back on a phone call and there was the person in 3D, and they were able to talk to them, you know, turn them around, show them where the procedure was, ask them questions, and have this interaction that made it so much better of an experience for them and for the doctors involved. So when I look at kind of the future of where this goes, you know, our vision is, where else do we need this? Right now, it’s been showcased as this amazing way to bring international expertise to one operating theater, you know, with specialists from around the world, as needed. And I think that’s great. And I think we can apply that in so many different locations, right? Rural United States is a great example for us. We hope to expand out what we’re doing in Scotland, to rural areas of Scotland that, you know, it’s very hard for people in the Scottish isles to be able to get to their hospitals. You know, other possible applications … like can we make this system mobile? You can imagine like a clinical unit where this system drives out to remote villages and is able to allow people that can’t make it in to a hospital to get that initial consultation, to know whether they should make the trip or whether they need other work done before they can start surgery. So kind of the sky’s the limit, right? I mean, it’s always good to look at like, what’s … I work with a team that does moonshots for a living, so I’m always looking for what can we shoot for? And our goal really is like, gosh, where can’t we apply this technology? I mean, it just anywhere that it is at all difficult to get, you know, medical expertise, we can ease the burden of doctors by making it so they don’t have to travel to provide this specialized care and increase the access to healthcare to these people that normally wouldn’t be able to get access to it.

HUIZINGA: Kwame, what’s your, what’s your take?

DARKO: So to me, I just want to describe what the current situation is and what I believe the future situation will be. So, the current situation—and like, like, um, Spencer was saying, this just doesn’t apply to Ghana alone; it can apply in some parts of the US and some parts of the UK, as well—where a patient has a problem, is seen by a GP in the local area, has to travel close to 24 hours, sometimes sleep over somewhere, just to get access to a specialist to see what’s going on. The specialist now diagnoses, sees what’s happening, and then runs a barrage of tests and makes a decision, “Well, you’re going to have an operation, and the operation is going to be in, let’s say, four weeks, six weeks.” So what happens? The patient goes, spends another 24 hours-plus going all the way back home, waiting for the operation day or operation period, and then traveling all the way back. You can imagine the time and expense. And if this person can’t travel alone, that means somebody else needs to take a day off work to bring the person back and forth. So now what would happen in the future if everything goes the way we’re planning? We’d have a rig in every, let’s say, district or region. The person just needs to travel, assumedly, an hour or two to the rig. Gets the appointment. Everything is seen in 3D. All the local blood tests and stuff that can be done would be done locally, results sent across. Book a theater date. So the only time that the person really needs to travel is when they’re coming for the actual operation. And once again, if an MDT has to be run on this, on this patient, it will be done. And, um, they would be sitting in their rig remotely in the town or wherever it is. Those of us in the teaching hospitals across the country would also be in our places, and we’d run the MDT to be sure. Postoperatively, if it’s a review of the patient, we’d be able to do that, even if it’s an MDT review, as well, we could do that. And the extra application, which I didn’t highlight too much and I mentioned it, but I didn’t highlight it, is if this person needs to have physiotherapy and we need to make sure that they’re succeeding and doing it properly, we can actually do it through a 3D call and actually see the person walking in motion or wrist movement or hand extension or neck movements, whatever it is. We can do all this in 3D. So yeah, the, the scope is, is as far as the mind can imagine it!

HUIZINGA: You know, I’m even imagining it, and I hate to bring up The Jetsons as, you know, my, my anchor analogy, but, you know, at some point, way back, nobody thought they’d have the technology we have all in our rooms and on our bodies now. Maybe this is just like the beginning of everybody having 3D communication everywhere, and no one has to go to the doctor before they get the operation. [LAUGHS] I don’t know. Spencer Fowers, Kwame Darko. This is indeed a mind-blowing idea that has the potential to be a world-changing technology. Thanks for joining me today to talk about it.

DARKO: Thanks for having us, Gretchen.

FOWERS: Thanks.

The post Collaborators: Holoportation™ communication technology with Spencer Fowers and Kwame Darko appeared first on Microsoft Research.

Read More

Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation

Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation

Imagine an AI model that can seamlessly generate high-quality content across text, images, video, and audio, all at once. Such a model would more accurately capture the multimodal nature of the world and human comprehension, seamlessly consolidate information from a wide range of sources, and enable strong immersion in human-AI interactions. This could transform the way humans interact with computers on various tasks, including assistive technology, custom learning tools, ambient computing, and content generation.

In a recent paper: Any-to-Any Generation via Composable Diffusion, Microsoft Azure Cognitive Service Research and UNC NLP present CoDi, a novel generative model capable of processing and simultaneously generating content across multiple modalities. CoDi allows for the synergistic generation of high-quality and coherent outputs spanning various modalities, from assorted combinations of input modalities. CoDi is the latest work of Microsoft’s Project i-Code, which aims to develop integrative and composable multimodal AI. Through extensive experiments, the researchers demonstrate CoDi’s remarkable capabilities.

The challenge of multimodal generative AI

The powerful cross-modal models that have emerged in recent years are mostly capable of generating or processing just one single modality. These models often face limitations in real-world applications where multiple modalities coexist and interact. Chaining modality-specific generative models together in a multi-step generation setting can be cumbersome and slow.

Moreover, independently generated unimodal streams may not be consistent or aligned when stitched together in post-processing, for example, video and audio that should remain synchronized.

To address these challenges, the researchers propose Composable Diffusion (CoDi), the first model capable of simultaneously processing and generating arbitrary combinations of modalities. CoDi employs a novel composable generation strategy that involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio.

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

The power of composable diffusion

Figure 1: CoDi can generate any combination of output modalities from any mixture of input modalities. In the pipelines illustrated, rain audio alone yields the text description “Raining, rain, moderate”; the image of Times Square together with the rain audio yields audio of a rainy street; and the text prompt “Teddy bear on a skateboard, 4k” together with the image and the rain audio yields a video, with synchronized skateboarding and rain sounds, of a teddy bear skateboarding in the rain in Times Square.

Training a model to take any mixture of input modalities and flexibly generate any mixture of outputs presents significant computational and data requirements, as the number of combinations for the input and output modalities scales exponentially. And the scarcity of aligned training data for many groups of modalities makes it infeasible to train with all possible input-output combinations. To address these challenges, the researchers propose to build CoDi in a composable and integrative manner.

They start by training each individual modality-specific latent diffusion model (LDM) independently (these LDMs will be smoothly integrated later for joint generation). This approach ensures exceptional single-modality generation quality using widely available modality-specific training data. To allow CoDi to handle any mixture of inputs, input modalities like images, video, audio, and language are projected into the same semantic space. Consequently, the LDM of each modality can flexibly process any mixture of multimodal inputs. Multi-conditioned generation is then achieved by conditioning each diffuser on a weighted sum of the input modalities’ representations.
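
As an illustration of that weighted-sum conditioning, here is a minimal, hypothetical Python sketch. The stand-in encoders, dimensions, and weighting scheme are assumptions made for readability; they are not CoDi’s actual implementation or API.

```python
import numpy as np

def condition_on_mixture(encoders, inputs, weights):
    """Build one conditioning vector from any subset of input modalities.

    encoders: dict of modality name -> callable projecting a raw input into the
              shared semantic space (all return vectors of the same dimension)
    inputs:   dict of modality name -> raw input (any subset of modalities)
    weights:  dict of modality name -> scalar interpolation weight
    """
    embeddings = {m: encoders[m](x) for m, x in inputs.items()}
    total = sum(weights[m] for m in embeddings)
    # Weighted sum of the aligned representations: the downstream diffuser sees a
    # single conditioning vector no matter which modality combination was supplied.
    return sum((weights[m] / total) * embeddings[m] for m in embeddings)

# Toy usage with stand-in encoders that map inputs into a shared 4-dimensional space.
encoders = {
    "text":  lambda s: np.ones(4) * len(s.split()),
    "image": lambda img: float(np.asarray(img).mean()) * np.ones(4),
}
cond = condition_on_mixture(
    encoders,
    inputs={"text": "teddy bear on a skateboard", "image": [[0.2, 0.8], [0.5, 0.1]]},
    weights={"text": 0.7, "image": 0.3},
)
print(cond)  # one shared-space vector conditioning the chosen diffuser(s)
```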

One of CoDi’s most significant innovations is its ability to handle many-to-many generation strategies, simultaneously generating any mixture of output modalities. To achieve this, CoDi adds a cross-attention module to each diffuser, and an environment encoder to project the latent variable of different LDMs into a shared latent space.

By freezing the parameters of the LDM and training only the cross-attention parameters and the environment encoder, CoDi can seamlessly generate any group of modalities without training on all possible generation modality combinations, reducing the training objectives to a more manageable number.
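
A rough PyTorch-style sketch of that selective training setup is shown below. The placeholder modules stand in for the real pretrained LDMs, and the toy objective is purely illustrative; only the freezing pattern and the choice of trainable parameters reflect the approach described above.

```python
import torch
import torch.nn as nn

# Placeholders standing in for pretrained, modality-specific latent diffusion models.
video_ldm = nn.Linear(64, 64)
audio_ldm = nn.Linear(64, 64)

# Newly added components: a cross-attention module per diffuser and an
# environment encoder projecting each LDM's latent into a shared latent space.
cross_attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
environment_encoder = nn.Linear(64, 64)

# Freeze the pretrained LDMs so that only the new modules receive gradients.
for ldm in (video_ldm, audio_ldm):
    for p in ldm.parameters():
        p.requires_grad_(False)

trainable = list(cross_attention.parameters()) + list(environment_encoder.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)

# One illustrative joint-generation step on random latents (batch of 2, length 8).
video_latent = video_ldm(torch.randn(2, 8, 64))
audio_latent = audio_ldm(torch.randn(2, 8, 64))
shared_video = environment_encoder(video_latent)
shared_audio = environment_encoder(audio_latent)

# The video stream attends to the audio stream in the shared latent space.
attended, _ = cross_attention(shared_video, shared_audio, shared_audio)
loss = attended.pow(2).mean()  # stand-in objective, not the real diffusion loss
loss.backward()
optimizer.step()
```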

Showcasing CoDi’s capabilities

The research demonstrates the novel capacity of joint generation of multiple modalities, such as synchronized video and audio, given separate text, audio, and image prompts. Specifically, in the example shown below, the input text prompt is “teddy bear on a skateboard, 4k, high resolution”, the input image prompt is a picture of Times Square, and the input audio prompt is rain. The generated video, shown in Figure 2, is a teddy bear skateboarding in the rain at Times Square. The generated audio contains the sounds of rain, skateboarding, and street noise, which are synchronized with the video. This shows that CoDi can consolidate information from multiple input modalities and generate coherent and aligned outputs.

Figure 2: The video shows an example of CoDi generating video + audio from text, image and audio input. The input modalities are listed vertically on the left side, including the text “Teddy bear on a skateboard, 4k”, a picture of Times Square, and the waveform of raining ambience. The output is a video with sound. In the video, a Teddy bear is skateboarding in the rain on the street of Times Square. One can also hear synchronized sound of skateboarding and rain.

In addition to its strong joint-modality generation quality, CoDi is also capable of single-to-single modality generation and multi-conditioning generation. It outperforms or matches the unimodal state of the art for single-modality synthesis.

Potential real-world applications and looking forward

CoDi’s development unlocks numerous possibilities for real-world applications requiring multimodal integration. For example, in education, CoDi can generate dynamic, engaging materials catering to diverse learning styles, allowing learners to access information tailored to their preferences, while enhancing understanding and knowledge retention. CoDi can also support more accessible experiences for people with disabilities, such as audio descriptions for people who are blind or have low vision and visual cues for people who are deaf or hard of hearing.

Composable Diffusion marks a significant step towards more engaging and holistic human-computer interactions, establishing a solid foundation for future investigations in generative artificial intelligence.

The post Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation appeared first on Microsoft Research.

Read More

Unlocking the future of computing: The Analog Iterative Machine’s lightning-fast approach to optimization 

Unlocking the future of computing: The Analog Iterative Machine’s lightning-fast approach to optimization 

Analog Iterative Machine (AIM)

Picture a world where computing is not limited by the binary confines of zeros and ones, but instead, is free to explore the vast possibilities of continuous value data. Over the past three years a team of Microsoft researchers has been developing a new kind of analog optical computer that uses photons and electrons to process continuous value data, unlike today’s digital computers that use transistors to crunch through binary data. This innovative new machine has the potential to surpass state-of-the-art digital technology and transform computing in years to come.

The Analog Iterative Machine (AIM) is designed to solve difficult optimization problems, which form the foundation of many industries, such as finance, logistics, transportation, energy, healthcare, and manufacturing. However, traditional digital computers struggle to crack these problems in a timely, energy-efficient and cost-effective manner. This is because the number of possible combinations explodes exponentially as the problem size grows, making it a massive challenge for even the most powerful digital computers. The Traveling Salesman Problem is a classic example. Imagine trying to find the most efficient route for visiting a set of cities just once before returning to the starting point. With only five cities, there are 12 possible routes – but for a 61-city problem, the number of potential routes surpasses the number of atoms in the universe.
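
As a quick sanity check on those numbers, the count of distinct round-trip routes for a symmetric traveling-salesman instance with n cities is (n − 1)!/2, which the short sketch below evaluates.

```python
from math import factorial

def distinct_tours(n_cities: int) -> int:
    """Distinct round-trip routes visiting every city exactly once (symmetric TSP)."""
    return factorial(n_cities - 1) // 2

print(distinct_tours(5))   # 12, as in the five-city example above
print(distinct_tours(61))  # roughly 4e81, more than the estimated number of atoms in the universe
```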

AIM addresses two simultaneous trends. First, it sidesteps the diminishing growth of computing capacity per dollar in digital chips – or the unraveling of Moore’s Law. Second, it overcomes the limitations of specialized machines designed for solving optimization problems. Despite over two decades of research and substantial industry investment, such unconventional hardware-based machines have a limited range of practical applications, because they can only address optimization problems with binary values. This painful realization within the optimization community has driven the team to develop AIM, with a design that combines mathematical insights with cutting-edge algorithmic and hardware advancements. The result? An analog optical computer that can solve a much wider range of real-world optimization problems while operating at the speed of light, offering potential speed and efficiency gains of about a hundred times.

Today, AIM is still a research project, but the cross-disciplinary team has recently assembled the world’s first opto-electronic hardware for mixed – continuous and binary – optimization problems. Though presently operating on a limited scale, the initial results are promising, and the team has started scaling up its efforts. This includes a research collaboration with the UK-based multinational bank Barclays to solve an optimization problem critical to the financial markets on the AIM computer. Separate engagements are aimed at gaining more experience in solving industry-specific optimization problems. In June 2023, the team launched an online service that provides an AIM simulator to allow partners to explore the opportunities created by this new kind of computer.

The technology 

Photons possess a remarkable property of not interacting with one another, which has underpinned the internet era by enabling large amounts of data to be transmitted over light across vast distances. However, photons do interact with the matter through which they propagate, allowing for linear operations such as addition and multiplication, which form the basis for optimization applications. For instance, when light falls on the camera sensor on our smartphones, it adds up the incoming photons and generates the equivalent amount of current. Additionally, data transmission over fiber which brings internet connectivity to homes and businesses relies on encoding zeroes and ones onto light by programmatically controlling its intensity. This scaling of light through light-matter interaction multiplies the light intensity by a specific value – multiplication in the optical domain. Beyond optical technologies for linear operations, various other electronic components prevalent in everyday technologies can perform non-linear operations that are also critical for efficient optimization algorithms.

Analog optical computing thus involves constructing a physical system using a combination of analog technologies – both optical and electronic – governed by equations that capture the required computation. This can be very efficient for specific application classes where linear and non-linear operations are dominant. In optimization problems, finding the optimal solution is akin to discovering a needle in an inconceivably vast haystack. The team has developed a new algorithm that is highly efficient at such needle-finding tasks. Crucially, the algorithm’s core operation involves performing hundreds of thousands or even millions of vector-matrix multiplications – the vectors represent the problem variables whose values need to be determined while the matrix encodes the problem itself. These multiplications are executed swiftly and with low energy consumption using commodity optical and electronic technologies, as shown in Figure 1.

Figure 1: Illustration of the AIM computer, which implements massively parallel vector-matrix multiplication using commodity optical technologies (in the back) and non-linearity applied using analog electronics (front). The vector is represented using an array of light sources, the matrix is embedded into the modulator array (shown in grayscale) and the result is collected into the camera sensor.
Figure 2: The second-generation AIM computer, with 48 variables, is a rack-mounted appliance.
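
The following sketch gives an illustrative, simplified rendering of the style of computation described above and depicted in Figure 1; it is not the team’s actual algorithm. The repeated vector-matrix product plays the role of the optical stage, and a simple clipping non-linearity plays the role of the analog electronics.

```python
import numpy as np

def iterate(Q, c, n_steps=10_000, step_size=0.01, seed=0):
    """Gradient-style fixed-point loop for minimizing 0.5 * x^T Q x + c^T x.

    Each iteration is dominated by a vector-matrix multiplication (the operation
    the optical hardware accelerates), followed by a clipping non-linearity that
    keeps every variable inside its allowed range.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=c.shape)
    for _ in range(n_steps):
        grad = Q @ x + c                             # vector-matrix multiplication
        x = np.clip(x - step_size * grad, -1.0, 1.0)  # non-linearity applied per step
    return x

# Toy two-variable problem encoded in the matrix Q and the vector c.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([-1.0, 0.3])
print(iterate(Q, c))
```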

Thanks to the miniaturization of all these components onto tiny centimeter-scale chips, the entire AIM computer fits into a small rack enclosure – as shown in Figure 2. As light travels incredibly fast – 5 nanoseconds per meter – each iteration within the AIM computer is significantly faster and consumes less electricity than running the same algorithm on a digital computer. Importantly, since the entire problem is embedded into the modulator matrix inside the computer itself, AIM does not require the problem to be transferred back and forth between storage and compute locations. And unlike synchronous digital computers, AIM’s operation is entirely asynchronous. These architectural choices circumvent key historical bottlenecks for digital computers. 

Finally, all technologies used in AIM are already prevalent in consumer products with existing manufacturing ecosystems, which paves the way for a viable computing platform, at full scale, if all the technical challenges can be tamed by the team.

The importance of optimization problems

Optimization problems are mathematical challenges that require finding the best possible solution from a set of feasible alternatives. The modern world relies heavily on efficient solutions to these problems – from managing electricity in our power grids and streamlining goods delivery across sea, air, and land, to optimizing internet traffic routing.

Effectively and efficiently solving optimization problems can significantly improve processes and outcomes across many other industries. Take finance, for example, where portfolio optimization involves selecting the ideal combination of assets to maximize returns while minimizing risks. In healthcare, optimizing patient scheduling can enhance resource allocation and minimize waiting times in hospitals.

For many larger problems, even the world’s biggest supercomputer would take years or even centuries to find the optimal solution. A common workaround is heuristic algorithms – problem-solving techniques that provide approximate solutions by employing shortcuts or “rules of thumb.” Although these algorithms might not guarantee the discovery of an optimal solution, they are the most practical and efficient methods for finding near-optimal solutions in reasonable timeframes. Now, imagine the immense impact of a computer that could deliver better solutions in a significantly shorter timeframe for these critical problems. In some instances, solving these problems in real time could create a domino effect of positive outcomes, revolutionizing entire workflows and industries.
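
A familiar example of such a rule of thumb is the nearest-neighbor heuristic for the traveling salesman problem mentioned earlier: always visit the closest city you have not yet seen. The sketch below shows how little work it takes, and also why it carries no guarantee of optimality.

```python
import math

def nearest_neighbor_tour(cities):
    """Greedy heuristic: repeatedly hop to the closest unvisited city.

    Runs in roughly n^2 distance evaluations, but the tour it returns may be
    noticeably longer than the true optimum.
    """
    unvisited = set(range(1, len(cities)))
    tour = [0]
    while unvisited:
        here = cities[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(here, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

cities = [(0, 0), (2, 1), (5, 0), (3, 4), (1, 3)]
print(nearest_neighbor_tour(cities))  # an approximate tour, found almost instantly
```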

QUMO: a world beyond QUBO

For years, researchers, both in industry and academia, have built impressive specialized machines to efficiently solve optimization problems using heuristic algorithms. This includes an array of custom hardware, such as field programmable gate arrays (FPGAs), quantum annealers, and electrical and optical parametric oscillator systems. However, all of them rely on mapping difficult optimization problems to the same binary representation, often referred to as Ising, Max-Cut or QUBO (quadratic unconstrained binary optimization). Unfortunately, none of these efforts have provided a practical alternative to conventional computers. This is because it is very hard to map real-world optimization problems at scale to the binary abstraction, a common theme in the team’s engagement with practitioners across industry and academia.

With AIM, the team has introduced a more expressive mathematical abstraction called QUMO (quadratic unconstrained mixed optimization), which can represent mixed – binary and continuous – variables and is compatible with hardware implementation, making it the “sweet spot” for many practical, heavily constrained optimization problems. Discussions with industry experts indicate that scaling AIM to 10,000 variables would mean that most of the practical problems discussed earlier are within reach. A problem with 10,000 variables that can be directly mapped to the QUMO abstraction would require an AIM computer with 10,000 physical variables. In contrast, existing specialized machines would need to scale to beyond a million physical variables, well beyond the capabilities of the underlying hardware.
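
To make the contrast concrete, the sketch below evaluates an unconstrained quadratic objective over a mix of binary and continuous variables, which is the general shape of problem the QUMO acronym describes. The specific matrix, variable domains, and conventions are illustrative assumptions, not AIM’s exact formulation.

```python
import numpy as np

def quadratic_objective(x, Q, c):
    """Evaluate x^T Q x + c^T x for a candidate assignment x."""
    return float(x @ Q @ x + c @ x)

# Toy instance: four variables, the first two binary and the last two continuous.
Q = np.array([[ 1.0, -0.5,  0.2,  0.0],
              [-0.5,  2.0,  0.0,  0.3],
              [ 0.2,  0.0,  1.5,  0.1],
              [ 0.0,  0.3,  0.1,  0.8]])
c = np.array([0.5, -1.0, 0.2, -0.4])

binary_part = np.array([0.0, 1.0])       # x0, x1 restricted to {0, 1}
continuous_part = np.array([-0.3, 0.7])  # x2, x3 allowed anywhere in [-1, 1]
x = np.concatenate([binary_part, continuous_part])

print(quadratic_objective(x, Q, c))
# A binary-only (QUBO) solver would have to approximate x2 and x3 with many extra
# binary variables; a mixed (QUMO-style) formulation represents them directly.
```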

AIM also implements a novel and efficient algorithm for solving such QUMO problems that relies on an advanced form of gradient descent, a technique that is also popular in machine learning. The algorithm shows highly competitive performance and accuracy across various industrially inspired problem benchmarks. It even discovered new best-ever solutions to four problems. The first-generation AIM computer, built last year, solves QUMO optimization problems that are represented with an accuracy of up to 7 bits. The team, shown in Figure 3, has also shown good quantitative agreement between the simulated and the hardware version of the AIM computer to gain further confidence in the viability of these efficiency gains as the computer is scaled up. This paper gives more details about the AIM architecture, its implementation, evaluation and scaling roadmap.

Figure 3: AIM’s design involves innovation at the intersection of optical and analog hardware, mathematics and algorithms, and software and system architecture, which is typified in the cross-disciplinary nature of the team working hand-in-hand towards the mission of building a computer that solves practical problems. Photo of the AIM team – Front row (left to right): Doug Kelly, Jiaqi Chu, James Clegg, Babak Rahmani. Back row: Hitesh Ballani, George Mourgias-Alexandris, Daniel Cletheroe, Francesca Parmigiani, Lucinda Pickup, Grace Brennan, Ant Rowstron, Kirill Kalinin, Jonathan Westcott, Christos Gkantsidis. (Greg O’Shea and Jannes Gladrow do not appear in this photo.)

Rethinking optimization with QUMO: A more expressive way of reasoning for experts

AIM’s blueprint for co-designing unconventional hardware with an expressive abstraction and a new algorithm has the potential to spark a new era in optimization techniques, hardware platforms, and automated problem mapping procedures, utilizing the more expressive QUMO abstraction. This exciting journey has already begun, with promising results from mapping problems from diverse domains like finance and healthcare to AIM’s QUMO abstraction. Recent research has already shown that increased expressiveness with continuous variables can substantially expand the real-world business problems that can be tackled. However, to the team’s knowledge, AIM is the first and only hardware to natively support this abstraction.

As we venture into a new abstraction, we must also adopt new ways of thinking. It is crucial for the team to build a strong community to deeply investigate the benefits of embracing QUMO. We invite people who have previously been deterred by the limitations of binary solvers to consider the new opportunities offered by AIM’s QUMO abstraction. To facilitate this, we are releasing our AIM simulator as a service, allowing selected users to get first-hand experience. The initial users are the team’s collaborators at Princeton University and at Cambridge University. They have helped us identify several exciting problems where the AIM computer and its abstraction is a much more natural fit. We are also actively engaging with thought leaders from internal Microsoft divisions and external companies in sectors where optimization is crucial.

Together, we can drive innovation and unlock the true potential of analog optical computing for solving some of the most complex optimization problems across industries.

The post Unlocking the future of computing: The Analog Iterative Machine’s lightning-fast approach to optimization  appeared first on Microsoft Research.

Read More

Research Focus: Week of June 19, 2023

Research Focus: Week of June 19, 2023

Microsoft Research Focus 18 | Week of June 19, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESOURCE

Responsible AI Maturity Model

As the use of AI continues to surge, new government regulations are expected. But the organizations that build and use AI technologies needn’t wait to devise best practices for developing and deploying AI systems responsibly. Many companies have adopted responsible AI (RAI) principles as a form of self-regulation. Yet, effectively translating these principles into practice is challenging.

To help organizations identify their current and desired levels of RAI maturity, researchers at Microsoft have developed the Responsible AI Maturity Model (RAI MM). The RAI MM is a framework containing 24 empirically derived dimensions that are key to an organization’s RAI maturity, and a roadmap of maturity progression so organizations and teams can identify where they are and where they could go next.

Derived from interviews and focus groups with over 90 RAI specialists and AI practitioners, the RAI MM can help organizations and teams navigate their RAI journey, even as RAI continues to evolve.

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

NEW RESEARCH

FoundWright helps people re-find web content they previously discovered

Re-finding information is a common task—most online search requests involve re-finding information. However, this can be difficult when people struggle to express what they seek. People may forget exact details of the information they want to re-find, making it hard to craft a query to locate it. People may also struggle to recover information within web repositories, such as bookmarks or history, as these repositories do not capture enough information or offer an experience that supports ambiguous queries. As a result, people can feel overwhelmed and cognitively exhausted when faced with a re-finding task.

A new paper from Microsoft researchers: FoundWright: A System to Help People Re-find Pages from Their Web-history, introduces a new system to address these problems. FoundWright leverages recent advances in language transformer models to expand people’s ability to express what they seek by defining concepts that can attract documents with semantically similar content. The researchers used FoundWright as a design probe to understand how people create and use concepts; how this expanded ability helps re-finding; and how people engage and collaborate with FoundWright’s machine learning support. The research reveals that this expanded way of expressing re-finding goals complements traditional searching and browsing.
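
The general idea of letting a user-defined concept attract semantically similar pages can be sketched with a toy embedding and a similarity ranking, as below. FoundWright itself relies on transformer language models; the bag-of-words embedding and all function names here are stand-ins invented for illustration.

```python
import numpy as np

def embed(text, vocab):
    """Toy stand-in for a transformer embedding: a normalized bag-of-words vector."""
    v = np.array([text.lower().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def rank_history_by_concept(concept_terms, history, vocab):
    """Rank previously visited pages by similarity to a user-defined concept."""
    concept_vec = embed(" ".join(concept_terms), vocab)
    scored = [(float(embed(page, vocab) @ concept_vec), page) for page in history]
    return sorted(scored, reverse=True)

vocab = ["battery", "recipe", "flow", "storage", "cake", "energy"]
history = [
    "flow battery energy storage review",
    "chocolate cake recipe with frosting",
    "grid energy storage cost analysis",
]
for score, page in rank_history_by_concept(["battery", "storage"], history, vocab):
    print(f"{score:.2f}  {page}")
```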


NEW RESEARCH

Trace-Guided Inductive Synthesis of Recursive Functional Programs

In recent years, researchers have made significant advances in synthesis of recursive functional programs, including progress in inductive synthesis of recursive programs from input-output examples. The latter problem, however, continues to pose several challenges.

In a new paper: Trace-Guided Inductive Synthesis of Recursive Functional Programs, which received a distinguished paper award from the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2023), researchers from Microsoft and Purdue University propose a novel trace-guided approach to tackle the challenges of ambiguity and generalization in synthesis of recursive functional programs from examples. This approach augments the search space of programs with recursion traces consisting of sequences of recursive subcalls of programs. It is based on a new version space algebra (VSA) for succinct representation and efficient manipulation of pairs of recursion traces and programs that are consistent with each other. The researchers implement this approach in a tool called SyRup. Evaluating SyRup on benchmarks from prior work demonstrates that it not only requires fewer examples to achieve a certain success rate than existing synthesizers, but is also less sensitive to the quality of the examples.

These results indicate that utilizing recursion traces to differentiate satisfying programs with similar sizes is applicable to a wide range of tasks.


NEW RESEARCH

Wait-Free Weak Reference Counting

Reference counting is a common approach to memory management. One challenge with reference counting is cycles that prevent objects from being deallocated. Systems such as the C++ and Rust standard libraries introduce two types of reference: strong and weak. A strong reference allows access to the object and prevents the object from being deallocated, while a weak reference only prevents deallocation. A weak reference can be upgraded to provide a strong reference if other strong references to the object exist. Hence, the upgrade operation is partial, and may fail dynamically. The classic implementation of this upgrade operation is not wait-free—it can take arbitrarily long to complete if there is contention on the reference count.
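
For context, the classic upgrade described above can be sketched as a retry loop around a compare-and-swap on the strong count. The Python below emulates an atomic integer with a lock purely for illustration; the point is the unbounded retry loop, which is what the paper’s wait-free algorithm avoids.

```python
import threading

class AtomicInt:
    """Minimal emulation of an atomic integer with compare-and-swap."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set to `new` only if the current value equals `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def upgrade(strong_count):
    """Classic weak-to-strong upgrade: lock-free, but not wait-free.

    The loop retries whenever another thread changes the count between the load
    and the compare-and-swap, so under contention it can run arbitrarily long.
    """
    while True:
        current = strong_count.load()
        if current == 0:
            return False  # object already deallocated; the upgrade fails
        if strong_count.compare_and_swap(current, current + 1):
            return True   # upgrade succeeded; caller now holds a strong reference

count = AtomicInt(3)
print(upgrade(count), count.load())  # True 4
```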

In a new paper: Wait-Free Weak Reference Counting, researchers from Microsoft propose a wait-free algorithm for weak reference counting, which requires the primitive wait-free atomic operations “compare and swap” and “fetch and add”. The paper includes a correctness proof of the algorithm using the Starling verification tool, a full implementation in C++, and a demonstration of the best- and worst-case performance using micro-benchmarks.

The new algorithm is faster than the classic algorithm in the best case, but has an overhead in the worst case. The researchers present a more complex algorithm that effectively combines the classic algorithm and the wait-free algorithm, delivering much better performance in the worst case, while maintaining the benefits of the wait-free algorithm.


NEW RESEARCH

Disaggregating Stateful Network Functions

For security, isolation, metering, and other purposes, public clouds today implement complex network functions at every server. Today’s implementations, in software or on FPGAs and ASICs that are attached to each host, are becoming increasingly complex and costly, creating bottlenecks to scalability.

In a new paper: Disaggregating Stateful Network Functions, researchers from Microsoft present a different design that disaggregates network function processing off the host and into shared resource pools by making novel use of appliances which tightly integrate general-purpose ARM cores with high-speed stateful match processing ASICs. When work is skewed across VMs, such disaggregation can offer better reliability and performance over the state of the art, at a lower per-server cost. The paper, which was published at the 2023 USENIX Symposium on Networked Systems Design and Implementation (NSDI), includes solutions to the consequent challenges and presents results from a production deployment at a large public cloud.


NEW RESEARCH

Industrial-Strength Controlled Concurrency Testing for C# Programs with Coyote

Testing programs with concurrency is challenging because their execution is non-deterministic, making bugs hard to find, reproduce and debug. Non-determinism can cause flaky tests—which may pass or fail without any code changes—creating a significant engineering burden on development teams. As concurrency is essential for building modern multi-threaded or distributed systems, solutions are required to help developers test their concurrent code for correctness.

Testing concurrent programs comes with two main challenges. First is the problem of reproducibility or control, while the second challenge is the state-space explosion problem. A concurrent program, even with a fixed test input, can have an enormous number of possible behaviors.
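
The non-determinism described above is easy to reproduce with a deliberately racy counter; the outcome below varies from run to run, which is exactly what makes such tests flaky. This generic sketch is in Python for consistency with the other examples in this post and does not use Coyote, which targets C#.

```python
import threading

counter = 0

def increment_many(n=100_000):
    global counter
    for _ in range(n):
        tmp = counter       # read
        counter = tmp + 1   # write: another thread may have updated counter in between

threads = [threading.Thread(target=increment_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 400000, but lost updates make the printed value vary between runs,
# which is the same non-determinism behind flaky concurrency tests.
print(counter)
```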

In a new research paper: Industrial-Strength Controlled Concurrency Testing for C# Programs with Coyote, researchers from Microsoft describe the design and implementation of the open-source tool Coyote for testing concurrent programs written in the C# language. This research won a 2023 Best Software Science Paper award from The European Association of Software Science and Technology (EASST).

The post Research Focus: Week of June 19, 2023 appeared first on Microsoft Research.

Read More

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi


Episode 141 | June 22, 2023

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

In this episode, Microsoft Principal Researcher Dr. Bichlien Nguyen and Dr. David Kwabi, Assistant Professor of Mechanical Engineering at the University of Michigan, join host Dr. Gretchen Huizinga to talk about how their respective research interests—and those of their larger teams—are converging to develop renewable energy storage systems. They specifically explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables for a not rainy day. The bonus? These new compounds may just help advance carbon capture, too.

Transcript

[MUSIC PLAYS UNDER DIALOGUE]

DAVID KWABI: I’m a mechanical engineer who sort of likes to hang out with chemists.

BICHLIEN NGUYEN: I’m an organic chemist by training, and I dabble in machine learning. Bryan’s a computational chemist who dabbles in flow cell work. Anne is, uh, a purely synthetic chemist who dabbles in almost all of our aspects.

KWABI: There’s really interesting synergies that show up just because there’s people, you know, coming from very different backgrounds.

NGUYEN: Because we have overlap, we, we have lower, I’m going to call it an activation barrier, in terms of the language we speak.

[MUSIC]

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC ENDS]


Today I’m talking to Dr. Bichlien Nguyen, a Principal Researcher at Microsoft Research, and Dr. David Kwabi, an Assistant Professor of Mechanical Engineering at the University of Michigan. Bichlien and David are collaborating on a fascinating project under the umbrella of the Microsoft Climate Research Initiative that brings organic chemistry and machine learning together to discover new forms of renewable energy storage. Before we unpack the “computational design and characterization of organic electrolytes for flow batteries and carbon capture,” let’s meet our collaborators.

Bichlien, I’ll start with you. Give us a bit more detail on what you do at Microsoft Research and the broader scope and mission of the Microsoft Climate Research Initiative.

BICHLIEN NGUYEN: Thanks so much, Gretchen, for the introduction. Um, so I guess I’ll start with my background. I have a background in organic electrochemistry, so it’s quite fitting, and as a researcher at Microsoft, really, it’s my job, uh, to come up with the newest technologies and keep abreast of what is happening around me so that I can actually, uh, fuse those different technology streams together and create something that’s, um, really valuable. And so as part of that, uh, the Microsoft Climate Research Initiative was really where a group of researchers came together and said, “How can we use the resources, the computational resources, and expertise at Microsoft to enable new technologies that will allow us to get to carbon negative by the year 2050? How can we do that?” And that, you know, as part of that, um, I just want to throw out that the Microsoft Climate Research Initiative really is focusing on three pillars, right. The three pillars are being carbon accounting, because if you don’t know how much carbon is in the atmosphere, you can’t really, uh, do much to remedy it, right, if you don’t know what’s there. The other one is climate resilience. So how do people get affected by climate change, and how do we overcome that, and how can we help that with technology? And then the third is materials engineering, where, that’s where I sit in the Microsoft Climate Research Initiative, and that’s more of how do we either develop technologies that, um, are used to capture and store carbon, uh, or are used to enable the green energy transition?

HUIZINGA: So do you find yourself spread across those three? You say the last one is really where your focus is, but do you dip your toe in the other areas, as well?

NGUYEN: I love dipping my toe in all the areas because I think they’re all important, right? They’re all important. We have to really understand what the environmental impacts of all the materials, for example, that we’re making are. I mean, it’s so … so carbon accounting is really important. Environmental accounting is very important. Um, and then people are the ones that form the core, right? Why are we, do … why do we do what we do? It’s because we want to make sure that we can enable people and solve their problems.

HUIZINGA: Yeah, when you talk about carbon accounting and why you’re doing it, it makes me think about when you have to go on a diet and the doctor says, “You have to get really honest about what you’re eating. Don’t, don’t fake it.” [LAUGHS] So, David, you’re a professor at the University of Michigan, and you run the eponymous Kwabi Lab there. Tell us about your work in general. What are your research interests? Who do you work with, and what excites you most about what you do?

DAVID KWABI: Yeah, happy to! Thank you for the introduction and, and for having me on, on here today. So, um, uh, so as you said, I run the Kwabi Lab here at the University of Michigan, and the sort of headline in terms of what we’re interested in doing is that we like to design and study batteries that can store lots of renewable electricity, uh, on the grid, so, so that’s kind of our mission. Um, that’s not quite all of what we do, but it’s, uh, it’s how I like to describe it. Um, and the motivation, of course, is … comes back to what Bichlien just mentioned, is this need for us to transition from carbon-intensive, uh, ways of producing energy to renewables. And the thing about renewables is that they’re intermittent, so solar and wind aren’t there all the time. You need to find a way to store all that energy and store it cheaply for us to really make, make a dent, um, in carbon emissions from energy production. So we work on, on building systems or energy storage systems that can meet that goal, that can accomplish that task.

HUIZINGA: Both of you talked about having larger teams that support the work you’re doing or collaborate with you two as collaborators. Do you want to talk about the size and scope of those teams or, um, you know, this collaboration across collaboration?

KWABI: Yeah, so I could start with that. So my group, like you said, we’re in the mechanical engineering department, so we really are, um, we call ourselves electrochemical engineers, and electrochemistry is the science of batteries, but it’s a science of lots of other things besides that, but the interesting thing about energy storage systems or batteries in general is that you need to build and put these systems together, but they’re made of lots of different materials. And so what we like to do in my group is build and put together these systems and then essentially figure out how they perform, right. Uh, try to explore performance limits as a function of different chemistries and system configurations and so on. But the hope then is that this can inform materials chemists and computationalists in terms of what materials we want to make next, sort of, so, so there’s a lot of need for collaboration and interdisciplinary knowledge to, to make progress here.

HUIZINGA: Yeah. Bichlien, how about you in terms of the umbrella that you’re under at Microsoft Research?

NGUYEN: There are so many different disciplines, um, within Microsoft Research, but also with the team that we’re working, you know, with David. So we have actually two other collaborators from two different, I guess, departments. There’s the chemical engineering department, which Bryan Goldsmith is part of, and Anne McNeil, I believe, is part of the chemistry department. And you know, for this particular project, on flow battery electrolytes for energy storage, um, we do need a multidisciplinary team, right? We, we need to go from the atomistic, you know, simulation level all the way to the full system level. And I think that’s, that’s where, you know, that’s important.

HUIZINGA: Now that we’re on the topic of this collaboration, let’s talk about how it came about. I like to call this “how I met your mother.” Um, what was the initial felt need for the project, and who called who and said, “Let’s do some research on renewable climate-friendly energy solutions?” Bichlien, why don’t you go ahead and take the lead on this?

NGUYEN: Yeah. Um, so I’m pretty sure what happened—and David, correct me if I’m wrong—

[LAUGHTER]

HUIZINGA: Pretty sure … !

NGUYEN: I’m pretty sure, but not 100 percent sure—is that, um, while we were formulating how to … uh, what, what topics we wanted to target for the Microsoft Climate Research Initiative, we began talking to many different universities as a way to learn from them, to see what areas of interest and what areas they think are, uh, really important for the future. And one of those universities was the University of Michigan, and I believe David was one of few PIs on that initial Teams meeting. And David gave, I believe … David, was it like a 10-minute presentation? Very quick, right? Um, but it sparked this moment of, “Wow, I think we could accelerate this.”

HUIZINGA: David, how do you remember it?

KWABI: Yeah, I think I remember it. [LAUGHS] This is so almost like a, like a marriage. Like, how did you guys meet? Um, and then, and then the stories have to align in some way or …

HUIZINGA: Yeah, who liked who first?

KWABI: Yeah, exactly. Um, but yeah, I think, I think that’s what I recall. So basically, I’m part of … here at the university, I’m part of this group called the Global CO2 Initiative, uh, which is basically, uh, an institute here at the university that convenes research related to CO2 capture, CO2 utilization, um, and I believe the Microsoft team set up a meeting with the Global CO2 Initiative, which I joined, uh, in my capacity as a member, and I gave a little 10-minute talk, which apparently was interesting enough for a second look, so, um, that, that’s how the collaboration started. There was a follow-up meeting after that, and here we are today.

HUIZINGA: Well, it sounds like you’re compelling, so let’s get into it, David. Now would be a good time to talk about, uh, more detail on this research. I won’t call it Flow Batteries for Dummies, but assume we’re not experts. So what are flow batteries, what problems do they solve, and how are you going after your big research goals methodologically?

KWABI: OK, so one way to think about flow batteries is to, to think first about pumped hydro storage is, is how I like to introduce it. So, so a flow battery is just like a battery, the, the sort of battery that you have in your cell phone or laptop computer or electric vehicle, but it’s a … it has a very different architecture. Um, and to explain that architecture, I like to talk about pumped hydro. So pumped hydro is I think a technology many of us probably appreciate or know about. You have two reservoirs that, that hold water—so upper and lower reservoirs—and when you have extra electricity, or, or excess electricity, you can pump water up a mountain, if you like, from one reservoir to another. And when you need that electricity, water just flows down, spins some turbines, and produces electricity. You’re turning gravitational potential energy into electrical energy, or electricity, is the idea. And the nice thing about pumped hydro, um, is that if you want to store more energy, you just need a bigger reservoir. So you just need more water essentially, um, in, in the two reservoirs to get to longer and longer durations of storage, um, and so then this is nice because more and more water is actually … is cheap. So the, the marginal cost of your … every stored, um, unit of energy is quite low. This isn’t really the case for the source of batteries we have in our cell phones and laptop computers. So if we’re talking about grid storage, you want something like this, something that decouples energy and power, so we have essentially a low cost per unit of electricity. So, so flow batteries essentially mimic pumped hydro because instead of turning gravitational potential energy into electricity, you’re actually changing or turning chemical potential energy, if you like, into electricity. So essentially what you’re doing is just storing energy in, um, in the form of electrons that are sort of attached to molecules. So you have an electron at a really high energy that is like flowing onto another molecule that has a low energy. That’s essentially the transformation that you’re trying to do in a, in a flow battery. But that’s the analogy I like to, I like to draw. It’s sort of a high- and low-reservation, uh, reservoirs you have. High and low chemical, uh, potential energy.

HUIZINGA: So what do these do better than the other batteries that you mentioned that we’re already using for energy storage?

KWABI: So the other batteries don’t have this decoupling. So in the flow battery, you have the, the energy being stored in these tanks, and the larger the tank, the more the energy. If you want to turn that energy into, uh, chemical energy into electricity, you, you, you run it through a reactor. So the reactor can stay the same size, but the tank gets bigger and bigger and you store more energy. In a laptop battery, you don’t have that. If you want more energy, you just want more battery, and that, the cost of that is the same. So there isn’t this big cost advantage, um, that comes from decoupling the energy capacity from the power capacity.

NGUYEN: David, would you, would you also say that, um, with, you know, redox organic flow batteries, you also kind of decouple the source, right, of the, the material, the battery material, so it’s no longer, for example, a rare earth metal or precious metal.

KWABI: Absolutely. So that’s, that’s then the thing. So when you … so you’ve got … you know, imagine these large systems, these giant tanks with, with molecules that store electricity. The question then is what molecules do you choose? Because if it’s like really expensive, then your electricity is also very expensive. Um, the particular type of battery we’re looking at uses organic molecules to store that electricity, and the hope is that these organic molecules can be made very cheaply from very abundant materials, and in the end, that means that this then translates to a really low cost of electricity.

HUIZINGA: Bichlien, I’m glad you brought that up because that is a great comparison in terms of the rare earth stuff, especially lithium mining right now is a huge cost, or tax, on the environment, and the more electric we have, the more lithium we need, so there’s other solutions that you guys are, are digging around for. Were you going to say something else, Bichlien?

NGUYEN: I was just going to say, I mean, another reason why, um, we thought David’s work was so interesting is because, you know, we’re looking at, um, energy, um, storage for renewables, and so to get to this green energy economy, we’ll need a ton of renewables, and then we’ll need a ton of ways to store the energy because renewables are, you know, they’re intermittent. I mean, sometimes the rain rains all the time, [LAUGHS] and sometimes it doesn’t. It’s really dry. Um, I don’t know why I say rain, but I assume … I probably …

HUIZINGA: Because you’re in Seattle, that’s why!

NGUYEN: That’s true. But like the sun shines; it doesn’t shine. Um, yeah, the wind blows, and sometimes it doesn’t.

HUIZINGA: Or doesn’t. Yeah. … Well, let’s talk about molecules, David, um, and getting a little bit more granular here, or maybe I should say atomic. You’re specifically looking for aqueous-soluble redox-active organic molecules, and you’ve noted that they’re really hard to find, um, these molecules that meet all the performance requirements for real-world applications. In other words, you have to swipe left a lot before you get to a good [LAUGHS] match, continuing with the marriage analogy. … So what are the properties necessary that you’re looking for, and why are they so hard to find?

KWABI: So the “aqueous soluble” just means soluble in water. You want the molecule to be able to dissolve into water at really high concentrations. So that’s one, um, property. You need it to last a really long time because the hope is that these flow battery installations are going to be there for decades. You need it to store electrons at the right energy. So, uh, I mentioned you have two tanks: one tank will store electrons at high energy; the other at low energy. So you need those energy levels to be set just right in a sense, so you want a high-voltage battery, essentially. You also want the molecule to be set that it doesn’t leak from one tank to the other through the reactor that’s in the middle of the two tanks, right. Otherwise, you’re essentially losing material, which is not, uh, desirable. And you want the molecule to be cheap. So that, that’s really important, obviously, because if we’re going to do this at, um, really large scale, and we want this to be low cost, that we want something that’s abundant and cheap. And finding a molecule that satisfies all of these requirements at the same time is really difficult. Um, you can find molecules that satisfy three or four or two, but finding something that hits all the, all the criteria is really hard, as is finding a good partner. [LAUGHS]

HUIZINGA: Well, and even in, in other areas, you hear the phrase cheap, fast, good—pick two, right? So, um, yeah, finding them is hard, but have you identified some or one, or I mean where are you on this search?

KWABI: Right now, the state-of-the-art charge-storing molecule, if you like, is based on a rare earth … rare element called vanadium. So the most developed flow batteries now use vanadium, um, to store electricity. But, uh, vanadium is pretty rare in the Earth’s crusts. It’s unclear if we start to scale this technology, um, to levels that would really make an impact on climate, it’s not clear there’s enough vanadium, uh, to, to do the job. So it fulfills a bunch of, a bunch of the criteria that I just mentioned, but not, not the cheap one, which is pretty important. We’re hoping, you know, with this project, that with organic molecules, we can find examples or particular compounds that really can fulfill all of these requirements, and, um, we’re excited because organic chemistry gives us, uh … there’s a wide design space with organic molecules, and you’re starting from abundant elements, and, you know, the hope is that we can really get to something that, that, you know, we can swipe left on … is it swipe left or right? I’m not sure.

NGUYEN: I have no idea.

HUIZINGA: Swipe left means …

[LAUGHTER]

HUIZINGA: I looked it up. I, I’ve been married for a really long time, so I did look it up, and it is left if you’re not interested and right if you are, apparently on Tinder, but, uh, not to beat that horse …

KWABI: You want to swipe right eventually.

HUIZINGA: Yes. Which leads me to, uh, Bichlien. What does machine learning have to do with natural sciences like organic chemistry? Why does computation play a role here, particularly generative models, for climate change science?

NGUYEN: Yeah so what, you know, the past decade or two, um, in computer science and machine learning have taught us is that ML is really good at pattern recognition, right, being able to, um, take complex datasets and pull out the most type … you know, relevant, uh, trends and information, and, and it’s good at classifying … you know, used as a classification tool. Uh, and what we know about nature is that nature is full of patterns, right. We see repeating patterns all the time in nature, at many different scales. And when we think, for example, of all of the combinations of carbons, carbon organic molecules, that you could make, you see around 10^60; it’s estimated to be 10^60. Um, and those are all connected somehow, you know, in this large, you know, space, this large, um, distribution. And we want to, for example, as David mentioned, we want to check the boxes on all these properties. So what is really powerful, we believe, is that generative models will allow us to sample this, this organic chemistry space and allow us to condition the outputs of these models on these properties that David wants to checkmark. And so in a way, it’s allowing us to do more efficient searching. And I like to think about it as like you’re trying to find a needle, right, in the ocean, and the ocean’s pretty vast; needles are really small. And instead of having the size of the ocean as your search space, maybe you have the size of a bathtub and so that we can narrow down the search space and then be able to test and validate some of the, the molecules that come out.

HUIZINGA: So do these models then eliminate a lot of the options, making the pool smaller? Is that how that works to make it a bathtub instead of an ocean?

NGUYEN: I, I wouldn’t say eliminates, but it definitely tells you where you should be … it helps focus where you’re searching.

HUIZINGA: Right, right, right. Well, David, you and I talked briefly and exchanged some email on “The Elements” song by Tom Lehrer, and it’s, it’s a guy who basically sings the entire periodic chart of the elements, really fast, to the piano. But at the end, he mentions the fact that there’s a lot that haven’t been discovered. There’s, there’s blanks in the chart. And so I wonder if, you know, this, this search for molecules, um, it just feels like is there just so much more out there to be discovered?

KWABI: I don’t know if there’s more elements to be discovered, per se, but certainly there’s ways of combining them in ways that produce …

HUIZINGA: Aaahhhhh …

KWABI: … new compounds or compounds with properties that, that we’re looking for, for example, in this project. So, um, that’s, I think, one of the things that’s really exciting about, uh, about this particular endeavor we’re, we’re, we’re engaged in. So, um, one of the ways that people have traditionally thought about finding new molecules for flow batteries is, you know, you go into the lab or you go online and order a chemical that you think is going to be promising [LAUGHS] … some people I know have done this, uh, myself included … but you, you order a chemical that you think is promising, you throw it in the flow battery, and then you figure out if it works or not, right. And if it doesn’t work, you move on to the next compound, or you, um …

NGUYEN: You tweak it!

KWABI: … if it does work, you publish it. Yeah, exactly—you tweak it, for example. Um, but one of the, one of the questions that we get to ask in this project is, well, rather than think about starting from a molecule and then deciding or figuring out whether it works, we, we actually start from the criteria that we’re looking for and then figure out if we can intelligently design, um, a molecule based on the criteria. Um, so it’s, it’s, uh, I think a more promising way of going about discovering new molecules. And, as Bichlien’s already alluded to, with organic chemistry, the possibilities are endless. We’ve seen this already in, like, the pharmaceutical industry for example, um, and lots of other industries where people think about, uh, this combinatorial problem of, how do I get the right structure, the right compound, that solves the problem of, you know, killing this virus or whatever it is. We’re hoping to do something similar for, uh, for flow batteries.

HUIZINGA: Yeah, in fact, as I mentioned at the very beginning of the show, you titled your proposal “The computational design and characterization of organic electrolytes for flow batteries,” so it’s kind of combining all of that together. David, sometimes research has surprising secondary uses. You start out looking for one thing and it turns out to be useful for something else. Talk about the dual purposes of your work, particularly how flow batteries both store energy and work as a sort of carbon capture version of the Ghostbusters’ Ecto-Containment Unit. [LAUGHS]

KWABI: Sure. Yeah, so this is where I sort of confess and say I wasn’t completely up front in the beginning when I said all we do is energy storage, but, um, another, um, application we’re very interested in is carbon capture in my group. And with regard to flow batteries, it turns out that you, you actually can take the same architecture that you use for a flow battery and actually use it to, to capture carbon, CO2 in particular. So the way this would work is, um, it turns out that some of the molecules that we’ve been talking about, some of the organic molecules, when you push an electron onto them—so you’re storing energy and you push an electron onto them—it turns out that some of these molecules also absorb hydrogen ions from water so those two processes sort of happen together. You push an electron onto the molecule, and then it picks up a hydrogen ion from water. Um, and if you remember anything about something from your chemistry classes in high school, that changes the pH of water. If you remove protons, uh, from water, that makes water more basic, or more alkaline. And alkaline electrolytes or alkaline water actually absorbs or reacts with CO2 to make bicarbonate. So that’s a chemical reaction that can serve as a mode, or a mechanism, for removing CO2 from, from the environment, so it could be air, or it could be, uh, flue gas or, you know, exhaust gas from a power plant. So imagine you, you run this process, you push the electron onto the molecule, you change the pH of the solution, you remove CO2 … that can then … you can actually concentrate that CO2 and then run the opposite reaction. So you pull the electron off the molecule; that then dumps protons back into solution, and then you can release all this pure CO2 all of a sudden. So, so now what you can do is take a flow battery that stores energy, but also, uh, use it to separate CO2, separate and concentrate CO2 from a, from a gaseous source. So this is, um, some work that we’ve been pursuing sort of in parallel with our work on energy storage, and the hope is that we can find molecules that, in principle, maybe could do both—could do the energy storage and also help with, uh, with CO2 separation.

HUIZINGA: Bichlien, is that part of the story that was attractive to Microsoft in terms of both storage for energy and getting rid of CO2 in the environment?

NGUYEN: Yeah, absolutely. Absolutely. Of course, the properties of, of, you know, both CO2 capture and the energy storage components are sometimes somewhat, uh—David, correct me if I’m wrong—kind of divergent. It’s, it’s hard to optimize for one and have the other one optimized, too. So it’s really a balance of, of things, and we’re targeting, just right now, for this project, our joint project, the, the energy storage aspect.

HUIZINGA: Yeah. On that note, and either one of you can take this, what do you do with it? I mean, when I, when I used the Ghostbusters’ Ecto-Containment Unit, I was being direct. I mean, you got to put it somewhere once you capture it, whether it’s storing it for good use or getting rid of it for bad use. So how are you managing that?

KWABI: Great question, so … Bichlien, were you going to … were you going to go?

NGUYEN: Oh, I mean, yeah, I was going to say that there are many ways, um, for I’ll call it CO2 abatement, um, once you have it. Um, there are people who are interested in storing it underground, so, uh, mineralizing it in basalt formations, rock formations. There are folks, um, like me, who are interested in, you know, developing new catalysts so that we can convert CO2 to different renewable feedstocks that can be used in different materials like different plastics, different, um, you know, essentially new fuels, things of that nature. And then there’s, you know, commercial applications for pure streams of CO2, as well. Uh, yeah, so I, I would say there’s a variety of things you can do with CO2.

HUIZINGA: What’s happening now? I mean, where does that generally … David we, I, I want to say, we talked about this issue, um, when we met before on some of the downsides of what’s current.

KWABI: Yeah, so currently, um, so there’s, as Bichlien has mentioned, there’s a number of things you could do with it. But right now, of all the sort of large projects that have been set up, large pilot plants for CO2 capture that have been set up, I think the main one is enhanced oil recovery, which is a little bit controversial, um, because what you’re doing with the CO2 there is you’re pumping it underground into an oil field that has become sort of less productive over time. And the goal there is to try to coax a little bit more oil, um, out of this field. So, so you pump the CO2 underground, it mixes in with the oil, and then you … that, that sort of comes back up to the surface, and you separate the CO2 from the oil, and you can, you can go off and, um, use the oil for whatever you use it for. So, so the economically attractive thing there is, there’s, uh, there’s, there’s going to be some sort of payoff. There’s a reason, a commercial incentive, for separating the CO2, uh, but of course the problem is you’re removing oil from the … you’re, you’re extracting more oil that’s going to end up with … in more CO2 emissions. So, um, there are, in principle, many potential options, but there aren’t very many that have both the sort of commercial … uh, where there’s sort of a commercial impact and there’s also sort of the scale to take care of the, you know, the gigatons of CO2 that we’re going to have to draw down, basically, so … .

NGUYEN: Yeah. And I, I think, I mean, you know, to David’s point, that’s true—that, that is what’s happening, you know, today because it provides value, right? The issue, I think, with CO2 capture and storage is that while there’s global utility, there’s no monetary value to it right now. Um, and so it makes it a challenge in terms of being able to industrialize, you know, industries to take care of the CO2. But I, I, I think, you know, as part of the MCRI initiative, you know, we’re very interested in both the carbon capture and the utilization aspect, um, and utilization would mean utilizing the CO2 in productive ways for long-term storage, so think about maybe using CO2, um, converting it electrochemically, for example, into, uh, different monomers. Those monomers maybe could be used in new plastics for long-term storage. Uh, maybe those are recyclable plastics. Maybe they’re plastics that are easily biodegradable. But, you know, one of the issues with using, or manufacturing, is there’s always going to be energy associated with manufacturing. And so that’s why we care a lot about renewables and, and the green energy transition. And, and that’s why, uh, you know, we’re collaborating with David and his team as, as part of that. It’s really full circle. We have to really think about it on a systems level, and the collaboration with David is, is one part of that system.

HUIZINGA: Well, that leads beautifully, and that’s probably an odd choice of words for this question, but it seems like “solving for X” in climate change is a no-lose proposition. It’s a good thing to do. But I always ask what could possibly go wrong, and in this case, I’m thinking about other solutions, some of which you’ve already mentioned, that had seemed environmentally friendly at first, but turned out to have unforeseen environmental impacts of their own. So even as you’re exploring new solutions to renewable energy sources, how are you making sure, or how are you mitigating, harming the environment while you’re trying to save it?

KWABI: That’s a great question. So it’s, it’s something that I think isn’t traditionally, at least in my field, isn’t traditionally sort of part of the “solve for X” when people are thinking about coming up with a new technology or new way of storing renewable electricity. So, you know, in our particular case, one of the things that’s really exciting about the project we’re working on is we’re looking at molecules that are fairly already quite ubiquitous, so, so they’re already being used in the food and textile industry, for example, derivatives of the molecules we’re using. So, you know, thinking about the materials you’re using and the synthetic routes that are necessary to produce them is sort of a pitfall that one can easily sort of get into if you don’t start thinking about this question at the very beginning, right? You might come up with a technology that’s, um, appealing and that works really well, performance-wise, but might not be very recyclable or might have some difficulties in terms of extraction and so on and so forth. So lithium-ion batteries, for example, come to mind. I think you were alluding to this earlier, that, you know, they’re a great technology for electric vehicles, but mining cobalt, extracting cobalt, comes with a lot of, um, just negative impacts in terms of child labor and so on in the Congo, et cetera. So how, how do we, you know, think about, you know, materials that don’t … that sort of avoid this? And I’ll, I’ll just highlight as one of our team members … so Anne McNeil, who’s in the chemistry department here, thinks quite a lot about this, and that’s appropriate because she’s sort of the synthetic chemist on the team. She’s the one who’s thinking a lot about, you know, given we have this molecule we want to make, what’s the most eco-friendly, sustainable route to making that molecule with materials that don’t require, you know, pillaging and polluting the earth to do it in a sense, right. And also materials … and also making it in a way that, you know, at end of life, it can be potentially recycled, right.

HUIZINGA: Right.

KWABI: So thinking about sustainable routes to making these molecules and potential sort of ways of recycling them are things that, um, we’re, we’re trying to, in some sense, to take into consideration. And by we, I mean Anne, specifically, is thinking quite seriously about …

NGUYEN: David … David, can I put words in your mouth?

KWABI: But, yeah. … Yeah, sure, go ahead.

NGUYEN: Um, you’re, you’re thinking of sustainability as being a first design principle for …

KWABI: Yes! I would take those words! Exactly.

NGUYEN: OK. [LAUGHS] Yeah, I mean, that’s really important. I, I agree and second what David said.

HUIZINGA: Bichlien, when we talked earlier, the term co-optimization came up, and I want to dig in here a little bit because whenever there’s a collaboration, each discipline can learn something from the other, but you can also learn about your own in the process. So what are some of the benefits you’ve experienced working across the sciences here for this project? Um, could you provide any specific insights or learnings from this project?

NGUYEN: I mean, I, I think maybe a naive … something that maybe seems naive is that we definitely have to work together in all three disciplines because what we’re also learning from David and Bryan is that there are different experimental and computational timelines that sometimes don’t agree, and sometimes do agree, and we really have to, uh, you know, work together in order to create a unified, I’m not going to call it a roadmap, but a unified research plan that works for everyone. For example, um, it takes much longer to run an experiment to synthesize a molecule … I, I think it takes much longer to synthesize a molecule than, for example, to run a, uh, flow cell, um, experiment. And then on the computational side, you could probably run it, you know, at night, on a weekend, you know, have it done relatively soon, generate molecules. And one of those that we’re, you know, understanding is that the human feedback and the computational feedback, um, it takes a lot of balancing to make sure that we’re on the same track.

HUIZINGA: What do you think, David?

KWABI: Yeah, I think that’s definitely accurate, um, figuring out how we can work together in a way that sort of acknowledges these timelines is really important. And I think … I’m a big believer in the fact that people from somewhat different backgrounds working together, the diversity of background, actually helps to bring about, you know, really great innovative solutions to things. And there’s various ways that this has sort of shown up in our, in own work, I think, and in our, in our discussions. Like, you know, we’re currently working on a particular sort of molecular structure for, uh, for a compound that we think will be promising at storing electricity, and the way we, we came about with it is that my, my group, you know, we ran a flow cell and we saw some data that seemed to suggest that the molecule was decomposing in a certain way, and then Anne’s group, or one of Anne’s students, proposed a mechanism for what might be happening. And then Jake, who works with Bichlien, also … and then thought about, “Well, what, what about this other structure?” So that sort of … and then that’s now informing some of the calculations that are going on, uh, with Bryan. So there’s really interesting synergies that show up just because there’s people working from, you know, coming from very different backgrounds. Like I’m a mechanical engineer who sort of likes to hang out with chemists and, um, there’s actual chemists and then there’s, you know …

NGUYEN: But, David, I think …

KWABI: … the people who do computation, and so on …

NGUYEN: I think you’re absolutely right here in terms of the overlap, too, right? Because in a, in a way, um, I’m an organic chemist by training, and I dabble in machine learning. You’re a mechanical engineer who dabbles in chemistry. Uh, Bryan’s a computational chemist who dabbles in flow cell works. Uh, Anne is, uh, you know, a purely synthetic chemist who dabbles in, you know, almost all of our aspects. Because we have overlap, we have lower, I’m going to call it an activation barrier, [LAUGHS] in terms of the language we speak. I think that is something that, you know, we have to speak the same language, um, so that we can understand each other. And sometimes that can be really challenging, but oftentimes, it’s, it’s not.

HUIZINGA: Yeah, David, all successful research projects begin in the mind and make their way to the market. Um, where does this research sit on that spectrum from lab to life, and how fast is it moving as far as you’re concerned?

KWABI: Do you mean the research, uh, in general or this project?

HUIZINGA: This project, specifically.

KWABI: OK, so I’d say we’re, we’re still quite early at this stage. So there’s a system of classification called Technology Readiness Level, and I would say we’re probably on the low end of the scale, I don’t know, maybe like a 1 or 2.

NGUYEN: We just started six months ago!

KWABI: We just started six months ago! So …

[LAUGHTER]

HUIZINGA: OK, that’s early. Wait, how many levels are there? If there’s 1 or 2, what’s the high end?

KWABI: I think we go up to an 8 or so, an 8 or a 9. Um, so, so we’re quite early; we just started. But the, the nice thing about this field is that things can move really quickly. So in a year or two, who knows where we’ll be? Maybe four or five, but things are still early. There’s a lot of fundamental research right now that’s happening …

HUIZINGA: Which is so cool.

KWABI: Proof of concept. Which is necessary, I think, before you can get to the, the point where you’re, um, you’re spinning out a company or, or moving up to larger scales.

HUIZINGA: Right. Which lives very comfortably in the academic world. Bichlien, Microsoft Research is sort of a third space where they allow for some horizon on that scale in terms of how long it’s going to take this to be something that could be financially viable for Microsoft. Is that just not a factor right now? It’s just like, let’s go, let’s solve this problem because this is super-important?

NGUYEN: I guess I’ll say that it takes roughly 20 years or so to get a proof of concept into market at an industrial scale. So, I’m … what I’m hoping that with this collaboration, and with others, is that we can shorten the time for discovery so that we understand the fundamentals and we have a good baseline of what we think can be achieved so that we can go to, for example, a pilot scale, like a test scale, outside of the laboratory, not full industrial scale, but just a pilot scale much faster than we would if we had to hand iterate every single molecule.

HUIZINGA: So the generative models play a huge role in that shortening of the time frame …

NGUYEN: Yes, yes. That’s what we …

KWABI: Yeah, I think …

NGUYEN: Go ahead, David.

KWABI: Yeah. I think the idea of having a platform … so, so rather than, you know, you found this wonderful, precious molecule that you’re going to make a lot of, um … you know, having a platform that can generate molecules, right, I think is, you know, proving that this actually works gives you a lot more shots on goal, basically. And I think that, you know, if we’re able to show that, in the next year or two, that there’s, there’s a proof of concept that this can go forward, then um, then, in principle, we have many more chemistries to work with and play with, than the …

NGUYEN: Yeah, and, um, we might also be able to, you know, with, with this platform, discover molecules that have that dual purpose, right, of both energy storage and carbon capture.

HUIZINGA: Well, as we wrap up, I’d love to know in your fantastical ideal preferred future, what does your work look like … now, I’m going to say five to 10 years, but, Bichlien, you just said 20 years, [LAUGHS] so maybe I’m on the short end of it here. In the “future,” um, how have you changed the landscape of eco-friendly, cost-effective energy solutions?

KWABI: That’s a, that’s a big question. I, I tend to think in more two–, three–year timelines sometimes. [LAUGHS] But I think in, in, in, you know, in like five, 10 years, if this research leads to a company that’s sort of thriving and demonstrating that flow batteries can really make an impact in terms of low-cost energy storage, that would have been a great place to land. I mean that and the demonstration that you, you know, with artificial intelligence, you can create this platform that can, uh, custom design molecules that fulfill these criteria. I think that would be, um, that would be a fantastic outcome.

HUIZINGA: Bichlien, what about you?

NGUYEN: So I think in one to two years, but I also think about the 10-to-20-year timeline, and what I’m hoping for is, again, to demonstrate the value of AI in order to enable a carbon negative economy so that we can all benefit from it. It sounds very … a polished answer, but I, I really think there are going to be accelerations in this space that’s enabled by these new technologies that are coming out.

HUIZINGA: Hmm.

NGUYEN: And I hope so! We have to save the planet!

KWABI: There’s a lot more to AI than ChatGPT and, [LAUGHS] you know, language models and so on, I think …

HUIZINGA: That’s a perfect way to close the show. So … Bichlien Nguyen and David Kwabi, thank you so much for coming on. It’s been delightful—and informative!

NGUYEN: Thanks, Gretchen.

KWABI: Thank you very much.

The post Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi appeared first on Microsoft Research.

Improving Subseasonal Forecasting with Machine Learning

This content was previously published by Nature Portfolio and Springer Nature Communities on Nature Portfolio Earth and Environment Community.

Improving our ability to forecast the weather and climate is of interest to all sectors of the economy and to government agencies from the local to the national level. Weather forecasts zero to ten days ahead and climate forecasts seasons to decades ahead are currently used operationally in decision-making, and the accuracy and reliability of these forecasts have improved consistently in recent decades (Troccoli, 2010). However, many critical applications – including water allocation, wildfire management, and drought and flood mitigation – require subseasonal forecasts with lead times in between these two extremes (Merryfield et al., 2020; White et al., 2017).

While short-term forecasting accuracy is largely sustained by physics-based dynamical models, these deterministic methods have limited subseasonal accuracy due to chaos (Lorenz, 1963). Indeed, subseasonal forecasting has long been considered a “predictability desert” due to its complex dependence on both local weather and global climate variables (Vitart et al., 2012). Recent studies, however, have highlighted important sources of predictability on subseasonal timescales, and the focus of several recent large-scale research efforts has been to advance the subseasonal capabilities of operational physics-based models (Vitart et al., 2017; Pegion et al., 2019; Lang et al., 2020). Our team has undertaken a parallel effort to demonstrate the value of machine learning methods in improving subseasonal forecasting.

The Subseasonal Climate Forecast Rodeo

To improve the accuracy of subseasonal forecasts, the U.S. Bureau of Reclamation (USBR) and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a yearlong real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two-to-four weeks and four-to-six weeks in advance. Our team developed a machine learning approach to the Rodeo and a SubseasonalRodeo dataset for training and evaluating subseasonal forecasting systems.

Week 3-4 temperature forecasts and observations for February 5th, 2018. Upper left: Our Rodeo submission. Upper right: Realized temperature anomalies. Bottom left: Forecast of the U.S. operational dynamical model, Climate Forecasting System v2. Bottom right: A standard meteorological forecasting method used as a Rodeo baseline.


Our final Rodeo solution was an ensemble of two nonlinear regression models. The first integrates a diverse collection of meteorological measurements and dynamic model forecasts and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone outperforms the debiased operational U.S. Climate Forecasting System version 2 (CFSv2), and, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40%-50% for temperature and 129%-169% for precipitation. See our write-up Improving Subseasonal Forecasting in the Western U.S. with Machine Learning for more details. While this work demonstrated the promise of machine learning models for subseasonal forecasting, it also highlighted the complementary strengths of physics- and learning-based approaches and the opportunity to combine those strengths to improve forecasting skill.
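
To make the ensembling idea concrete, here is a minimal, illustrative Python sketch (not the competition code): it averages the anomaly forecasts of several models and scores each forecast with a spatial cosine-similarity skill metric of the kind used in the Rodeo. All arrays and model names are synthetic stand-ins.

```python
import numpy as np

def spatial_skill(pred_anom: np.ndarray, obs_anom: np.ndarray) -> float:
    """Cosine similarity between predicted and observed anomaly maps."""
    num = float(np.dot(pred_anom.ravel(), obs_anom.ravel()))
    denom = float(np.linalg.norm(pred_anom) * np.linalg.norm(obs_anom))
    return num / denom if denom else 0.0

def ensemble_forecast(member_forecasts: list) -> np.ndarray:
    """Equal-weight ensemble of anomaly forecasts (e.g., the two regression
    models and debiased CFSv2)."""
    return np.mean(member_forecasts, axis=0)

# Toy example with synthetic anomaly maps on a small grid
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 10))
model_a = obs + rng.normal(scale=1.0, size=obs.shape)        # stand-in for regression model 1
model_b = obs + rng.normal(scale=1.2, size=obs.shape)        # stand-in for regression model 2
cfsv2_debiased = obs + rng.normal(scale=1.5, size=obs.shape)  # stand-in for debiased CFSv2

ens = ensemble_forecast([model_a, model_b, cfsv2_debiased])
for name, f in [("model_a", model_a), ("model_b", model_b),
                ("debiased CFSv2", cfsv2_debiased), ("ensemble", ens)]:
    print(f"{name:>15}: skill = {spatial_skill(f, obs):.3f}")
```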

Adaptive Bias Correction (ABC)

To harness the complementary strengths of physics- and learning-based models, we next developed a hybrid dynamical-learning framework for improved subseasonal forecasting. In particular, we learn to adaptively correct the biases of dynamical models and apply our novel adaptive bias correction (ABC) to improve the skill of subseasonal temperature and precipitation forecasts.

At subseasonal lead times, weeks 3-4 and 5-6, ABC doubles or triples the forecasting skill of leading operational dynamical models from the U.S. (CFSv2) and Europe (ECMWF).

ABC is an ensemble of three new low-cost, high-accuracy machine learning models: Dynamical++, Climatology++, and Persistence++. Each model trains only on past temperature, precipitation, and forecast data and outputs corrections for future forecasts tailored to the site, target date, and dynamical model. Dynamical++ and Climatology++ learn site- and date-specific offsets for dynamical and climatological forecasts by minimizing forecasting error over adaptively-selected training periods. Persistence++ additionally accounts for recent weather trends by combining lagged observations, dynamical forecasts, and climatology to minimize historical forecasting error for each site.
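
As a rough illustration of the offset-learning idea (not the actual ABC implementation), the sketch below learns one additive correction per site from recent forecast errors and applies it to a new dynamical forecast; the window length and data layout are assumptions made for the example.

```python
import numpy as np

def learn_offsets(past_forecasts: np.ndarray,
                  past_observations: np.ndarray) -> np.ndarray:
    """past_forecasts, past_observations: arrays of shape (time, site).
    Returns one additive offset per site that minimizes mean squared
    forecasting error over the training window (i.e., the mean error)."""
    return (past_observations - past_forecasts).mean(axis=0)

def corrected_forecast(new_forecast: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    return new_forecast + offsets

# Toy example: 60 past dates, 5 sites, forecasts with a site-dependent bias
rng = np.random.default_rng(1)
truth = rng.normal(size=(60, 5))
bias = np.array([1.0, -0.5, 0.2, 2.0, -1.5])
raw = truth - bias + rng.normal(scale=0.3, size=truth.shape)

offsets = learn_offsets(raw, truth)
print("learned offsets:", np.round(offsets, 2))   # approximately the injected biases
print("corrected forecast:", corrected_forecast(raw[-1], offsets))
```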

ABC can be applied operationally as a computationally inexpensive enhancement to any dynamical model forecast, and we use this property to substantially reduce the forecasting errors of eight operational dynamical models, including the state-of-the-art ECMWF model.

ABC can be applied operationally as a computationally inexpensive enhancement to any dynamical model forecast.

A practical implication of these improvements for downstream decision-makers is an expanded geographic range for actionable skill, defined here as spatial skill above a given sufficiency threshold. For example, we vary the weeks 5-6 sufficiency threshold from 0 to 0.6 and find that ABC consistently boosts the number of locales with actionable skill over both raw and operationally-debiased CFSv2 and ECMWF.
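
The actionable-skill computation itself is simple; the following sketch, with synthetic per-locale skill values, shows how the count of locales above a sufficiency threshold can be swept from 0 to 0.6.

```python
import numpy as np

rng = np.random.default_rng(2)
skill_abc = rng.uniform(-0.2, 0.9, size=500)                  # per-locale skill, ABC (illustrative)
skill_base = skill_abc - rng.uniform(0.0, 0.3, size=500)      # a weaker baseline (illustrative)

for threshold in np.arange(0.0, 0.61, 0.1):
    n_abc = int((skill_abc > threshold).sum())
    n_base = int((skill_base > threshold).sum())
    print(f"threshold {threshold:.1f}: ABC {n_abc} locales, baseline {n_base}")
```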

ABC consistently boosts the number of locales with forecasting accuracy above a given skill threshold, an important property for operational decision-making in water allocation, wildfire management, and drought and flood mitigation.

We couple these performance improvements with a practical workflow for explaining ABC skill gains using Cohort Shapley (Mase et al., 2019) and identifying higher-skill windows of opportunity (Mariotti et al., 2020) based on relevant climate variables.

Figure panels: (a) impact of hgt_500_pc1 on ABC skill improvement; (b) forecast with largest hgt_500_pc1 impact.
Our “forecast of opportunity” workflow explains ABC skill gains in terms of relevant climate variables observable at forecast time.

To facilitate future deployment and development, we also release our model and workflow code through the subseasonal_toolkit Python package.

The SubseasonalClimateUSA dataset

To train and evaluate our contiguous US models, we developed a SubseasonalClimateUSA dataset housing a diverse collection of ground-truth measurements and model forecasts relevant to subseasonal timescales. The SubseasonalClimateUSA dataset is updated regularly and publicly accessible via the subseasonal_data package. In SubseasonalClimateUSA: A Dataset for Subseasonal Forecasting and Benchmarking, we used this dataset to benchmark ABC against operational dynamical models and seven state-of-the-art deep learning and machine learning methods from the literature. For each subseasonal forecasting task, ABC and its component models provided the best performance.

Percentage improvement in accuracy over operationally-debiased dynamical CFSv2 forecasts. ABC consistently outperforms standard meteorological baselines (Persistence and Climatology) and 7 state-of-the-art machine learning and deep learning methods from the literature.

Online learning with optimism and delay

To provide more flexible and adaptive model ensembling in the operational setting of real-time climate and weather forecasting, we developed three new optimistic online learning algorithms — AdaHedgeD, DORM, and DORM+ — that require no parameter tuning and have optimal regret guarantees under delayed feedback.

Each year, the PoolD online learning algorithms produce ensemble forecasts with accuracy comparable to the best individual model in hindsight despite observing only 26 observations per year.

Our open-source Python implementation, available via the PoolD library, provides simple strategies for combining the forecasts of different subseasonal forecasting models, adapting the weights of each model based on real-time performance. See our write-up Online Learning with Optimism and Delay for more details.
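
For intuition, here is a generic exponential-weights (Hedge-style) ensembler in Python. It is only an illustration of online ensembling, not the PoolD API, and it omits the optimism and delayed-feedback handling that AdaHedgeD, DORM, and DORM+ provide.

```python
import numpy as np

class HedgeEnsembler:
    def __init__(self, n_models: int, learning_rate: float = 0.5):
        self.weights = np.ones(n_models) / n_models
        self.lr = learning_rate

    def combine(self, forecasts: np.ndarray) -> np.ndarray:
        """forecasts: shape (n_models, n_gridpoints). Returns the weighted mean."""
        return self.weights @ forecasts

    def update(self, losses: np.ndarray) -> None:
        """losses: one loss per model for the most recently verified forecast."""
        self.weights *= np.exp(-self.lr * losses)
        self.weights /= self.weights.sum()

# Toy usage: three models, 26 forecast dates per year
rng = np.random.default_rng(3)
ens = HedgeEnsembler(n_models=3)
for date in range(26):
    forecasts = rng.normal(size=(3, 100))
    combined = ens.combine(forecasts)
    losses = rng.uniform(size=3)          # stand-in for per-model forecast error
    ens.update(losses)
print("final model weights:", np.round(ens.weights, 3))
```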

Looking forward

We’re excited to continue exploring machine learning applied to subseasonal forecasting on a global scale, and we hope that our open-source packages will facilitate future subseasonal development and benchmarking. If you have ideas for model or dataset development, please contribute to our open-source Python code or contact us!

The post Improving Subseasonal Forecasting with Machine Learning appeared first on Microsoft Research.

Accounting for past imaging studies: Enhancing radiology AI and reporting

The use of self-supervision from image-text pairs has been a key enabler in the development of scalable and flexible vision-language AI models in not only general domains but also in biomedical domains such as radiology. The goal in the radiology setting is to produce rich training signals without requiring manual labels so the models can learn to accurately recognize and locate findings in the images and relate them to content in radiology reports.

Radiologists use radiology reports to describe imaging findings and offer a clinical diagnosis or a range of possible diagnoses, all of which can be influenced by considering the findings on previous imaging studies. In fact, comparisons with previous images are crucial for radiologists to make informed decisions. These comparisons can provide valuable context for determining whether a condition is a new concern or, for an existing condition, whether it is improving, deteriorating, or stable, and can inform more appropriate treatment recommendations. Despite the importance of comparisons, current AI solutions for radiology often fall short in aligning images with report data because of the lack of access to prior scans. Current AI solutions also typically fail to account for the chronological progression of disease or imaging findings often present in biomedical datasets. This can lead to ambiguity in the model training process and can be risky in downstream applications such as automated report generation, where models may make up temporal content without access to past medical scans. In short, this limits the real-world applicability of such AI models to empower caregivers and augment existing workflows.

In our previous work, we demonstrated that multimodal self-supervised learning of radiology images and reports can yield significant performance improvement in downstream applications of machine learning models, such as detecting the presence of medical conditions and localizing these findings within the images. In our latest study, which is being presented at the 2023 IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), we propose BioViL-T, a self-supervised training framework that further increases the data efficiency of this learning paradigm by leveraging the temporal structure present in biomedical datasets. This approach enables the incorporation of temporal information and has the potential to perform complementary self-supervision without the need for additional data, resulting in improved predictive performance.

Our proposed approach can handle missing or spatially misaligned images and can potentially scale to process a large number of prior images. By leveraging the existing temporal structure available in datasets, BioViL-T achieves state-of-the-art results on several downstream benchmarks. We’ve made both our models and source code open source, allowing for a comprehensive exploration and validation of the results discussed in our study. We’ve also released a new multimodal temporal benchmark dataset, MS-CXR-T, to support further research into longitudinal modeling of medical images and text data.

SPOTLIGHT: AI focus area

AI and Microsoft Research

Learn more about the breadth of AI research at Microsoft

Connecting the data points

Solving for the static case in vision-language processing—that is, learning with pairs of single images and captions—is a natural first step in advancing the field. So it’s not surprising that current biomedical vision-language processing work has largely focused on tasks that are dependent on features or abnormalities present at a single point in time—what is a patient’s current condition, and what is a likely diagnosis?—treating image-text pairs such as x-rays and corresponding reports in today’s datasets as independent data points. When prior imaging findings are referenced in reports, that information is often ignored or removed in the training process. Further, a lack of publicly available datasets containing longitudinal series of imaging examinations and reports has further challenged the incorporation of temporal information into medical imaging benchmarks.

Thanks to our early and close collaboration with practicing radiologists and our long-standing work with Nuance, a leading provider of AI solutions in the radiology space that was acquired by Microsoft in 2022, we’ve been able to better understand clinician workflow in the radiological imaging setting. That includes how radiology data is created, what its different components are, and how routinely radiologists refer to prior studies in the context of interpreting medical images. With these insights, we were able to identify temporal alignment of text across multiple images as a clinically significant research problem. To ground, or associate, report information such as “pleural effusion has improved compared to previous study” with the imaging modality requires access to the prior imaging study. We were able to tackle this challenge without gathering additional data or annotations.

As an innovative solution, we leveraged the metadata from de-identified public datasets like MIMIC-CXR. This metadata preserves the original order and intervals of studies, allowing us to connect various images over time and observe disease progression. Developing more data efficient and smart solutions in the healthcare space, where data sources are scarce, is important if we want to develop meaningful AI solutions.
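
A minimal sketch of this pairing step is shown below: studies are grouped by patient and ordered by acquisition time, and each study is linked to its most recent prior. The column names are illustrative and are not the exact MIMIC-CXR schema.

```python
import pandas as pd

# Toy de-identified study metadata (illustrative columns)
studies = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2", "p2"],
    "study_id":   ["s1", "s2", "s3", "s4", "s5"],
    "study_datetime": pd.to_datetime([
        "2018-01-01", "2018-01-04", "2018-02-10", "2019-03-01", "2019-03-15"]),
})

# Order each patient's studies in time and attach the most recent prior study
studies = studies.sort_values(["patient_id", "study_datetime"])
studies["prior_study_id"] = studies.groupby("patient_id")["study_id"].shift(1)

# Rows with a non-null prior form (current, prior) training pairs; the first
# study of each patient has no prior and is handled in single-image mode.
print(studies)
```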

An animated flowchart of BioViL-T. Arrows direct from a prior chest x-ray and current chest x-ray  through boxes labeled “CNN” to image embeddings, illustrated by a purple cube and a brown cube, respectively, representing relevant spatial and temporal features. An arrow points from these features through a box labeled “Vision Transformer Blocks” to a “difference embedding,” represented by a blue cube. A curly bracket pointing to a brown and blue cube labeled “image features” indicates the aggregation of the current image embedding and the difference embedding. Arrows from the “image features” cube and from an extract from a radiology report point to a text model, represented by box labeled “CXR-BERT.”
Figure 1: The proposed self-supervised training framework BioViL-T leverages pairs of radiology reports and sequences of medical images. The training scheme does not require manual expert labels and can scale to a large amount of radiology data to pretrain image and text models required for downstream clinical applications.

Addressing the challenges of longitudinal analysis

With current and prior images now available for comparison, the question became, how can a model reason about images coming from different time points? Radiological imaging, especially with planar techniques like radiographs, may show noticeable variation. This can be influenced by factors such as the patient’s posture during capture and the positioning of the device. Notably, these variations become more pronounced when images are taken with longer time gaps in between. To manage variations, current approaches to longitudinal analysis, largely used for fully supervised learning of image models only, require extensive preprocessing, such as image registration, a technique that attempts to align multiple images taken at different times from different viewpoints. In addition to better managing image variation, we wanted a framework that could be applied to cases in which prior images weren’t relevant or available and the task involved only one image.

We designed BioViL-T with these challenges in mind. Its main components are a multi-image encoder, consisting of both a vision transformer and a convolutional neural network (CNN), and a text encoder. As illustrated in Figure 1, in the multi-image encoder, each input image is first encoded with the CNN model to independently extract findings, such as opacities, present in each medical scan. Here, the CNN counteracts the large data demands of transformer-based architectures through its efficiency in extracting lower-level semantic features.

At the next stage, the features across time points are matched and compared in the vision transformer block, then aggregated into a single joint representation incorporating both current and historical radiological information. It’s important to note that the transformer architecture can adapt to either single- or multi-image scenarios, thereby better handling situations in which past images are unavailable, such as when there’s no relevant image history. Additionally, a cross-attention mechanism across image regions reduces the need for extensive preprocessing, addressing potential variations across images.
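
The following simplified PyTorch sketch conveys the structure described above: a shared CNN produces patch features for the current and prior images, a small transformer block fuses them, and the result is combined with the current-image features. The dimensions, depths, and handling of a missing prior image are illustrative simplifications, not the BioViL-T implementation.

```python
from typing import Optional

import torch
import torch.nn as nn

class MultiImageEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared CNN backbone (stand-in for the ResNet-style encoder)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=1)

    def _patches(self, image: torch.Tensor) -> torch.Tensor:
        """CNN features flattened to a sequence of patch embeddings."""
        f = self.cnn(image)                       # (B, C, H', W')
        return f.flatten(2).transpose(1, 2)       # (B, H'*W', C)

    def forward(self, current: torch.Tensor,
                prior: Optional[torch.Tensor]) -> torch.Tensor:
        cur = self._patches(current)
        if prior is None:                         # single-image mode
            return cur.mean(dim=1)
        pri = self._patches(prior)
        fused = self.fusion(torch.cat([cur, pri], dim=1))
        # Pool the fused current-image tokens as a rough "difference" embedding
        diff = fused[:, :cur.shape[1]].mean(dim=1)
        return torch.cat([cur.mean(dim=1), diff], dim=-1)

# Toy usage with small single-channel images
enc = MultiImageEncoder()
cur = torch.randn(2, 1, 64, 64)
pri = torch.randn(2, 1, 64, 64)
print(enc(cur, pri).shape)    # torch.Size([2, 256])
print(enc(cur, None).shape)   # torch.Size([2, 128])
```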

In the final stage, the multi-image encoder is jointly trained with the text encoder to match the image representations with their text counterparts using masked modeling and contrastive supervision techniques. To improve text representations and model supervision, we utilize the domain-specific text encoder CXR-BERT-general, which is pretrained on clinical text corpora and built on a clinical vocabulary.
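
As an illustration of the contrastive part of this objective, the sketch below implements a symmetric InfoNCE-style image-text loss; BioViL-T pairs such contrastive supervision with masked language modeling, and the projection dimensions here are arbitrary.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) projections of matched image-report pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(logits.shape[0])
    # Matched (image, report) pairs lie on the diagonal; score both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage
img = torch.randn(8, 128)
txt = torch.randn(8, 128)
print(contrastive_loss(img, txt))
```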

Two chest x-rays side-by-side animated with bounding boxes and attention maps on the affected area of the lung.
Figure 2: Example of current (left) and prior (right) chest x-ray scans. The attention maps computed within the vision transformer show (in purple) how the model interprets disease progression by focusing on these image regions. In this example, the airspace disease seen in the left lung lobe has improved since the prior acquisition.

Grounded model prediction

In our work, we found that linking multiple images during pretraining makes for both better language and vision representations, enabling the AI model to better associate information present in both the text and the images. This means that when given a radiology report of a chest x-ray, for example, with the description “increased opacities in the left lower lung compared with prior examination,” a model can more accurately identify, locate, and compare findings, such as opacities. This improved alignment between data modalities is crucial because it allows the model to provide more accurate and relevant insights, such as identifying abnormalities in medical images, generating more accurate diagnostic reports, or tracking the progression of a disease over time.

Two findings were particularly insightful for us during our experimentation with BioViL-T:

  • Today’s language-generating AI models are often trained by masking portions of text and then prompting them to fill in the blanks as a means of encouraging the models to account for context in outputting a prediction. We extended the traditional masked language modeling (MLM) approach to be guided by multi-image context, essentially making the approach multimodal. This, in turn, helped us better analyze whether BioViL-T was learning a progression based on provided images or making a random prediction of the masked words based solely on the text context. We gave the model radiology images and reports with progression-related language, such as “improving,” masked. An example input would be “pleural effusion has been [MASKED] since yesterday.” We then tasked the model with predicting the missing word(s) based on single and multi-image inputs. When provided with a single image, the model was unsuccessful in completing the task; however, when provided with a current and prior image, performance improved, demonstrating that the model is basing its prediction on the prior image.
  • Additionally, we found that training on prior images decreases instances of the generative AI model producing ungrounded outputs that seem plausible but are factually incorrect, in this case, when there’s a lack of information. Prior work into radiology report generation utilizes single input images, resulting in the model potentially outputting text that describes progression without having access to past scans. This severely limits the potential adoption of AI solutions in a high-stakes domain such as healthcare. A decrease in ungrounded outputs, however, could enable automated report generation or assistive writing in the future, which could potentially help reduce administrative duties and ease burnout in the healthcare community. Note that these models aren’t intended for any clinical use at the moment, but they’re important proof points to assess the capabilities of healthcare AI.

Moving longitudinal analysis forward

Through our relationships with practicing radiologists and Nuance, we were able to identify and concentrate on a clinically important research problem, finding that accounting for patient history matters if we want to develop AI solutions with value. To help the research community advance longitudinal analysis, we’ve released a new benchmark dataset. MS-CXR-T, which was curated by a board-certified radiologist, consists of current-prior image pairs of chest x-rays labeled with a state of progression for the temporal image classification task and pairs of sentences about disease progression that are either contradictory or capture the same assessment but are phrased differently for the sentence similarity task.

We focused on chest x-rays and lung diseases, but we see our work as having the potential to be extended into other medical imaging settings where analyzing images over time plays an important part in clinician decision-making, such as scenarios involving MRI or CT scans. However far the reach, it’s vital to ensure that models such as BioViL-T generalize well across different population groups and under the various conditions in which medical images are captured. This important part of the journey requires extensive benchmarking of models on unseen datasets. These datasets should widely vary in terms of acquisition settings, patient demographics, and disease prevalence. Another aspect of this work we look forward to exploring and monitoring is the potential role of general foundation models like GPT-4 in domain-specific foundation model training and the benefits of pairing larger foundation models with smaller specialized models such as BioViL-T.

To learn more and to access our text and image models and source code, visit the BioViL-T Hugging Face page and GitHub.

Acknowledgments

We’d like to thank our co-authors: Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Pérez-García, Maximilian Ilse, Daniel C. Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, and Aditya Nori. We’d also like to thank Hoifung Poon, Melanie Bernhardt, Melissa Bristow, and Naoto Usuyama for their valuable technical feedback and Hannah Richardson for assisting with compliance reviews.

MEDICAL DEVICE DISCLAIMER

BioViL-T was developed for research purposes and is not designed, intended, or made available as a medical device and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment.

The post Accounting for past imaging studies: Enhancing radiology AI and reporting appeared first on Microsoft Research.

AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma

black and white photos of Emre Kiciman, Senior Principal Researcher at Microsoft Research and Amit Sharma, Principal Researcher at Microsoft Research, next to the Microsoft Research Podcast

Episode 140 | June 8, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

This episode features Senior Principal Researcher Emre Kiciman and Principal Researcher Amit Sharma, whose paper “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality” examines the causal capabilities of large language models (LLMs) and their implications. Kiciman and Sharma break down the study of cause and effect; recount their respective ongoing journeys with GPT-3.5 and GPT-4—from their preconceptions to where they are now—and share their views of a future in which LLMs help bring together different modes of reasoning in the practice of causal inference and make causal methods easier to adopt.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale models like GPT-4 is accelerating the advancement of AI. These models are exhibiting surprising new abilities like reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’ll share conversations with fellow researchers about our impressions of GPT-4, the work we’re doing to understand its capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers

Today we’re talking with Emre Kiciman and Amit Sharma, two Microsoft researchers who have been studying causal reasoning with AI for many years. Determining cause and effect relationships is critically important across many domains such as law, medicine, and the advancement of science itself. Emre and Amit recently published a paper that explores how large language models can advance the research and application of causal reasoning with AI. Emre joins us from our lab in Redmond, Washington, and Amit is on the line from Microsoft Research India, in Bangalore. 


[MUSIC FADES]

Emre, Amit, let’s jump right in. I’m so excited to speak with you both about causal reasoning. And this is such a timely conversation because we’re living through the rise of generative pretrained models, specifically large language models. And when I’ve engaged with GPT-4 in dialogue, depending on what I ask, it can appear to be doing something resembling causal reasoning. And as a machine learning person myself, I have to say this is not something that I’d expected to see from a neural network that works based on analyzing and generating statistical patterns. Um, you know, this is something that before this time last year, I thought of as a uniquely human skill as I think maybe many others have, as well. Now, both of you do this for a living. You study causal reasoning for a living. Um, and so where I’d like to start is with your first reactions to GPT-4, your first contact. What did you find surprising, and how did you feel, uh, as a researcher in this area? I want to go to Emre first on this. 

EMRE KICIMAN: Sure. Well, um, yeah, I think I went through a process. Um, right now, I am surprised how much I’m depending on functionality from GPT-4 and how much I expect it to work. And yet, I also don’t quite believe that it can do the things that it’s doing. It’s really, um, a weird mind space to be in. I think the, the moment when I was a bit astounded by, like, what might be possible was actually before I got my hands on GPT-4 directly. You know, I’ve been hearing that people were very impressed with what it was doing. But the thing that made me reconsider my preconceptions was actually some of the academic research looking at, um, how transformer models and architectures could actually represent Turing machines, Turing-complete computational machines. And once I saw that the transformer architecture could represent that type of program, that type of thing, then I figured, well, all bets are off. We don’t know whether it’s learning this or not, but if it can represent it, now there really is a chance that it could, that it might be learning that. And so we have to really keep an open mind.

The second moment when I changed my mind again about what GPT-4 might be doing … so I’ll give a little background. So once I saw some of the work that we’ll talk about here, uh, coming into play, where we’re seeing GPT do some sorts of, you know, very interesting causal-related tasks, um, I was like, OK, this is great. We have our causal processes; we’re just going to run through them and this fits in. Someone will come with their causal question; we’ll run through and run our, our causal analysis. And I thought that, you know, this all makes sense. We can do things that we want, what we’ve wanted to do for so, for so long. And it was actually reading, uh, some of the vignettes in Peter Lee’s book where he was quizzing, uh, GPT-4 to diagnose a patient based on their electronic health records, explain counterfactual scenarios, um, think through why someone might have made a misdiagnosis. And, and here, all of a sudden, I realized our conceptualizations of causal tasks that we’ve worked on in the academic fields are kind of boxes where we say we’re doing effect inference or we’re doing attribution or we’re doing discovery. These like very well-circumscribed tasks are, are not enough; they’re not flexible enough. Once you have this natural language interface, you can ask so many more things, so many more interesting questions. And we need to make sure that we can formally answer those … correctly answer those questions. And, and this GPT-4 is basically a bridge to expressing and, you know, meeting people where they want to be. That really opened my eyes the second time. 

LLORENS: Thanks, Emre. Amit, first impressions. 

AMIT SHARMA: Yeah, my experience was back in December—I think it was when a lot of people were talking about ChatGPT—and me, thinking that I worked in causality, uh, I was quite smug, right. I knew that causality requires you to have interventional data. Language models are only built on some observations. So I was quite happy to think that I would beat this topic, right. But it was just that every day, I would see, perhaps on Twitter, people expressing new things that ChatGPT can do, so one day, I thought, OK, let me just try it, right. So the first query I thought was an easy query for, uh, GPT models. I just asked it, does smoking cause lung cancer, right? And I was surprised when it gave the right answer. But then I thought maybe, oh, this is just too common. Let me ask the opposite. Does lung cancer cause smoking? Uh, it gave the right answer. No. Uh, and then I was literally struck, and I, and I thought, what else can I test, right? And then I thought of all the causal relationships that we typically talk about in our field, and I started doing them one by one. And what I found was that the accuracy was just astounding. And it was not just the accuracy, but also the explanation that it gives would sort of almost make you believe as if it is a causal agent, as if it is doing, uh, something causal. So, so to me, I think those few days in December with slightly sleepless nights on what exactly is going on with these models and what I might add … what am I going to do as a researcher now? [LAUGHS] I think that was, sort of, my initial foray into this. And, and I think the logical next step was then to study it more deeply. 

LLORENS: And stemming from both of your reactions, you began collaborating on a paper, which you’ve recently released, called “Causal Reasoning [and] Large Language Models,” um, and I’ve had the, you know, the pleasure of spending some time with that over these last few days and, and a week here. And one of the things you do in the paper is you provide what I think of as a helpful overview of the different kinds of causality. And so, Emre, I want to go back to you. What is causality, and how can we think about the space of different, you know, kinds of causal reasoning?

KICIMAN: Causality … it’s the study of cause-and-effect relationships, of the mechanisms that, that drive, you know, what we see happening in the world around us. You know, why do things happen? What made something happen? And this is a study that spread out across so many disciplines—computer science, economics, health, statistics. Like, everyone cares about, about causality, to some degree. And so this means that there’s many different kinds of, you know, tools and languages to talk about causality, um, that are appropriate for different kinds of tasks. So that’s one of the first things that we thought we had to lay out in the paper, was kind of a very broad landscape about what causality is. And so we talk about a couple of different axes. One is data-driven causal analysis, and the other is logic-based causal reasoning. These are two very different ways of, of, of thinking about causality. And then the second major axis is whether we’re talking about causal relationships in general, in the abstract, like, uh, does smoking normally cause … or often cause cancer? Versus causality in a very specific context— that’s called actual causality. And this is something like Bob smoked; Bob got lung cancer. Was Bob’s lung cancer caused by Bob’s smoking? It’s a very specific question in this very, you know, in, in a specific instance. And so those are the two axes: data-driven versus logic and then general causality versus actual causality. 

LLORENS: Amit, I want to go to you now, and I want to dwell on this topic of actual causality. And I actually learned this phrase from your paper. But I think this is a kind of causal reasoning that people do quite often, maybe even it’s the thing they think about when they think about causal reasoning. So, Amit, you know, let’s go deeper into what actual causality is. Maybe you can illustrate with some examples. And then I want to get into experiments you’ve conducted in this area with GPT-4. 

SHARMA: Sure. So interestingly, actual causality in research is sort of the less talked-about side. As Emre was saying, I think most researchers in health sciences, economics often talk about general phenomena. But actual causality talks about events and what might have caused them, right. So think about something that happens in the real world. So let's say … I'll take an example of, let's say, you catch a ball and you prevent it from falling down, right. And I think people would reasonably argue that your catching the ball was the cause of preventing it from falling onto the ground. But very quickly, these kinds of determinations become complex because what could have been happening is that there could be multiple other factors at play, uh, and there could also be questions about how exactly you're even thinking about what is a cause. Should, should you be thinking about necessary causes, or should you be thinking about sufficient causes, and so on. So, so I think actual causality before sort of these language models was kind of a paradox in the sense that the applications were kind of everywhere, going from everyday life to even thinking about computer systems. So if your computer system fails, you want to understand why this failure occurred, right. You're not really interested in why computer systems fail in general; you're just interested in that specific failure's causes. And the paradox is that even though these sort of questions were so common, I think what research had to offer, uh, was not immediately systemizable or deployable, uh, because you would often sort of tie yourself in knots in defining exactly what you mean by the cause and also sort of how do you even get that framing without sort of just having a formal representation, right. Most of these tasks were in English, right, or in the case of computer systems, you would just get a debug log. So I think one of the hardest problems was how do you take something in vague language, human language, and convert it into sort of logical framing or logical systems? 

LLORENS: In the paper, you explore briefly, you know, kind of actual causality that deals with responsibility or faults. And, you know, this connects with things like, you know, reasoning in the, in the legal domain. And so I just want to, I want to explore that with you. And I know I’ve jumped to the back of the paper. I just find these particular set … this particular set of topics pretty fascinating. And so tell me about the experiments that you’ve conducted where you ask, you know, the, the algorithm … the model to do this kind of actual causal reasoning around assigning blame or responsibility for something? 

SHARMA: So one of the important challenges in actual causality is determining what's a necessary cause and what's a sufficient cause for an event, right. Now if you're familiar with logic, you can break this down into sort of simple predicates. What we are asking is if an event happened, was some action necessary? It means that if that action did not happen, then that event would not happen, right. So we have a nice “but for” relationship. Sufficiency, on the other hand, is kind of the complement. So there you're saying if this action happens, the event will always happen, irrespective of whatever else happens in the world, right. And so, so far, in actual causality, people would use logic-based methods to think about what's the right answer for any kind of event. So what we did was we looked at all the sort of vignettes or these examples that causality researchers had collected over the past decade. All of these are very challenging examples of situations in English language. And I think their purpose was to kind of elucidate the different kinds of sort of gotchas you get when you try to sort of just use the simple concept for real-world applications. So let me take you through one example in our dataset that we studied and how we're finding that LLMs are somehow able to take this very vague, ambiguous information in an English-language vignette and directly go from sort of that language to an answer in English, right. So in a sense, they're kind of sidestepping the logical reasoning, but maybe in the future we can also combine logical reasoning and LLMs. 

So let's take an example. Uh, it's like Alice catches a ball. The next part on … the next destination on the ball's trajectory was a brick wall, which would have stopped it, and beyond that there was a window. So as humans, we would immediately think that Alice was not a cause, right, because even if she had not stopped the ball, it would have hit the brick, and so if you're asking if Alice was the cause of the window being safe, an intuitive answer might be no. But when you analyze it through the necessary and sufficient lens, you would find that Alice was obviously not a necessary cause because the brick wall would have stopped it, but Alice was a sufficient cause, meaning that if Alice had stopped the ball, even if the brick wall collapsed, even if other things happened in the world, the window would still be safe, right. So these are the kind of sort of interesting examples that we tried out. And what we found was GPT-3.5, which is ChatGPT, does not do so well. I think it actually fails to correctly identify these causes, but GPT-4 somehow is able to do that. So it gets about 86 percent accuracy on, on this task. And one of the interesting things we were worried about was maybe it's just memorizing. Again, these are very popular examples in textbooks, right? So we did this fun thing. We just created our own dataset. So, so now instead of Alice catching a ball, Alice could be, I don't know, dropping a test tube in a lab, right? So we created this sort of a lab setup—a completely new dataset—and we again found the same results that GPT-4 is able to infer these causes. 
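To make the necessary-versus-sufficient distinction concrete, here is a minimal sketch (not code from the paper) that encodes the Alice vignette as a tiny structural model and checks both conditions by enumerating counterfactual worlds; the function and variable names are illustrative only.

```python
# Toy structural model of the vignette: the window stays intact if either
# Alice catches the ball or the brick wall stops it.
def window_safe(alice_catches: bool, wall_stops_ball: bool) -> bool:
    return alice_catches or wall_stops_ball

# Actual world: Alice catches the ball, and the wall is standing behind her.
actual_world_safe = window_safe(alice_catches=True, wall_stops_ball=True)

# Necessity ("but for"): had Alice NOT caught the ball, holding everything else
# fixed, would the window have been damaged? Here it would still be safe,
# so Alice's catch is not a necessary cause.
necessary = actual_world_safe and not window_safe(alice_catches=False, wall_stops_ball=True)

# Sufficiency: does Alice's catch keep the window safe no matter what else
# happens (even if the wall had collapsed)? Here it does.
sufficient = all(window_safe(alice_catches=True, wall_stops_ball=w) for w in (True, False))

print(f"necessary cause:  {necessary}")   # False
print(f"sufficient cause: {sufficient}")  # True
```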

LLORENS: Now you’re, you’re getting into experimental results, and that’s great because one of the things that I think required some creativity here was how you actually even structure, you know, a rigorous set of experiments. And so, Emre, can you take … take us through the experiment setup and how you had to approach that with this, you know, kind of unique, unique way of assessing causal reasoning? 

KICIMAN: Well, one of the things that we wanted to make sure we had when we were running these experiments is, uh, construct validity to really make sure that the experiments that we were running were testing what we thought they were testing, or at least that we understood what they actually were testing. Um, and so most of these types of, uh, tests over large language models work with benchmark questions, and the biggest issue with the, with many of these benchmark questions is that often the large language models have seen them before. And there’s a concern that rather than thinking through to get the right answer, they’ve really only memorized the specific answers to these, to these specific questions.

And so what we did was, uh, we actually ran a memorization test to see whether the underlying dataset had been memorized by the large language model before. We developed … some of our benchmark datasets we developed, uh, as novel datasets that, you know, had never been written before, so clearly had not been seen or memorized. And then we ran additional tests to help us understand what was triggering the specific answers. Like we would redact words from our question, uh, to see what would lead the LLM to make a mistake. So, for example, if we removed the key word from the question, we would expect the LLM to be confused, right. That's, that's fine. If we removed an unimportant word, maybe, you know, a participle or something, then we would expect that, that should be something that the LLM should recover from. And so this was able to give us a better understanding of what the LLM was, was paying attention to. This led us, for example, to be very clear in our paper that in, for example, our causal discovery experiments—where we are specifically asking the LLM to go back to its learned knowledge and tell us whether it knows something from common sense or domain knowledge, whether it's memorized that, you know, some, uh, some cause, uh, has a particular effect—we are very clear in our experiments that we are not able to tell you what the odds are that the LLM has memorized any particular fact. But what we can say is, given that it's seen that fact, is it able to transform it, you know, and combine it somehow into the correct answer in a particular context. And so it's just, it's really important to, to know what, uh, what these experiments really are testing. So I, I really appreciated the opportunity to go a little bit deeper into these studies.
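The redaction probe Kiciman describes can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual harness, and `query_llm` is a hypothetical stand-in for whatever chat-completion client is used.

```python
# Drop one word at a time from a benchmark question and record how the model's
# answer changes; words whose removal flips the answer are the ones the model is
# actually relying on, while removing filler words should leave the answer intact.
from typing import Callable, Dict

def redaction_probe(question: str, query_llm: Callable[[str], str]) -> Dict[str, str]:
    words = question.split()
    answers = {}
    for i, word in enumerate(words):
        redacted = " ".join(words[:i] + words[i + 1:])
        answers[word] = query_llm(redacted)
    return answers

# Example (hypothetical): compare answers["smoking"] against the unredacted answer
# to check whether the key word, rather than incidental phrasing, drives the response.
```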

LLORENS: I find this concept of construct validity pretty fascinating here, and it’s, you know, you, you stressed the importance of it for doing this kind of black-box testing, where you don’t actually have an explicit model for how the, well, the model is doing what it’s doing. And, you know, you talked about memorization as one important test where you’re, you know, you want to, you want to have a valid construct. But I think even deeper than that, there’s, there’s an aspect of your mental model, your beliefs about, you know, what the algorithm is doing and how relevant the testing you’re doing would be to future performance or performance on future tasks. And so I wonder if we can dwell on this notion of construct validity a little bit, maybe even one level deeper than the memorization, you know, you and your mental model of what’s happening there and why that’s important. 

KICIMAN: My mental model of what the large language model is giving us is that it’s read so much of the text out on the internet that it’s captured the common sense and domain knowledge that we would normally expect only a human to do. And through some process—maybe it’s, maybe it’s probabilistic; maybe it’s some more sophisticated reasoning—it’s able to identify, like Amit said, the most important or relevant relationships for a particular scenario. So it knows that, you know, when we’re talking about a doctor washing his or her hands with soap or not, that infection, uh, in a patient is the next … is something that’s really critical. And maybe if we weren’t talking about a doctor, this would not be, you know, the most important consideration. So it is starting from capturing this knowledge, remembering it somehow in its model, and then recognizing the right moment to recall that fact and put it back out there as part of its answer. Um, that’s, that’s my mental model of what I think it’s doing, and we are able to demonstrate with our, you know, experiments that it is transforming from many different input data formats into, you know, answers to our natural language questions. So we, we have data we think it’s seen that’s in tabular format or in graphical formats. Um, and, you know, it’s, it’s impressive to see that it’s able to generate answers to our questions in various natural language forms. 

LLORENS: I want to go now to a different kind of causality, causal discovery, which you describe in your paper as dealing with variables and their effect on each other. Emre, we’ll stick with you. And I also think that this is a, a kind of causal reasoning that maybe is closer to your day job and closer to the kinds of models maybe that you construct in the problems that you deal with. And so tell me about causal discovery and, you know, what you’re seeing in terms of the capabilities of GPT-4 and your, your experimentation. 

KICIMAN: Yeah. So causal discovery is about looking at data, observational data, where you're not necessarily intervening on the system—you're just watching—and then from that, trying to figure out what relationships … uh, what the causal relationships are among the factors that you're observing. And this is something that usually is done in the context of general causality, so trying to learn general relationships, uh, between factors, and it's usually done in a, in a data-based way—looking at the covariances, statistical covariances, between your observations. And, uh, there's causal discovery algorithms out there. Uh, there are … this is something that's been studied for decades. And they're essentially, uh, testing statistical independence relationships—you know, if something isn't causing something else, then if you hold everything constant, there should be statistical independence between those two factors—or different kinds of statistical independence relationships, depending on what type of causal structures you see in, uh, among the relationships. And what these algorithms are able to do, the classical algorithms, is they can get you down to, um, a set of, a set of plausible relationships, but there's always some point at which they can't solve … uh, they can't distinguish things based on data alone. They can, you know … there's going to be a couple of relationships in your dataset where they might not know whether A is causing B or B is causing A, vice versa. And this is where a human comes in with their domain knowledge and has to make a declaration of what they think the right answer is based on their understanding of system mechanics. So there's always this reliance on a human coming in with domain knowledge. And what, what we're, uh, seeing now, I think, with LLMs is for the first time, we have some sort of programmatic access to this common sense and domain knowledge, just like in the actual causality setting. We have it provided to us again, uh, in the causal discovery setting. And we can push on this further. We don't have … we can, if we want, run our data analysis first, then look at the LLM to, um, to disambiguate the last couple of things that we couldn't get out of data. But we can also start from scratch and just ask, uh, the LLM to orient all of these causal edges and identify the right mechanisms from the beginning, just solely based on common sense and domain knowledge. 
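For readers unfamiliar with the statistical machinery Kiciman is referring to, the toy sketch below illustrates the core idea behind constraint-based discovery: two variables driven by a common cause are correlated, but become (conditionally) independent once that cause is held fixed. It is a simplified illustration, not one of the discovery algorithms used in the paper.

```python
import numpy as np

def partial_corr(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Correlation between a and b after regressing out c from both."""
    design = np.column_stack([np.ones_like(c), c])
    resid_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    resid_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(resid_a, resid_b)[0, 1])

rng = np.random.default_rng(0)
c = rng.normal(size=5000)        # common cause
a = c + rng.normal(size=5000)    # a <- c
b = c + rng.normal(size=5000)    # b <- c

print(np.corrcoef(a, b)[0, 1])   # clearly nonzero: a and b look related...
print(partial_corr(a, b, c))     # ...but near zero once c is held constant
```

Classical discovery algorithms run many such tests to prune edges, and it is exactly the leftover, statistically indistinguishable directions that the LLM is then asked to orient from common sense or domain knowledge.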

And so that’s what we did in our experiments here. We went through, uh, lists of edges and then larger graph structures to see how much we could re-create from, uh, just the common sense or domain knowledge that’s captured inside the LLM. And it did, it did quite well, beating the state of the art of the data-oriented approaches. Now, to be clear, it’s not doing the same task. If you have some data about a phenomenon that’s never been studied before, it’s not well understood, it’s never been named, the large language model is not going to be able to tell you—I don’t think it’s going to be able to tell you—what that causal relationship is. But for the many things that we do already know, it, it beats, you know, looking at the data. It’s, it’s quite impressive that way. So we think this is super exciting because it really removes this burden that we’ve really put on to the human analyst before, and now, now we can run these analyses, these … this whole data-driven process can be, uh, uh, built off of common sense it’s already captured without having to ask a user, a human, to type it all up correctly. 

LLORENS: Amit, one of the things I found fascinating about the set of experiments that you, that you ran here was the prompt engineering and just the effect on the experimental results of different ways of prompting the model. Take us through that experience and, and please do get specific on the particular prompts that you used and their effects on the outcome. 

SHARMA: Sure, yeah, this was an iterative exercise for us, as well. So as I was mentioning [to] you, when I started in December, um, the prompt I used was pretty simple: does changing A cause a change in B, right? So if you're thinking of, let's say, the relationship between altitude and temperature, it would just translate to a single sentence: does changing the altitude change the temperature? As we sort of moved into working on our paper and as we saw many different prompt strategies from other works, we started experimenting, right, and one of the most surprising things—actually shocking for us—was that if you just add … in these GPT-3.5 and 4 class of models, there's a system prompt which sort of you can give some meta instructions to, to the model, and we just added a single line saying that "you are an expert in causal reasoning." And it was quite shocking that just that thing gave us a 5-percentage-point boost in the accuracy on the datasets that we were testing. So there's something there about sort of prompting or kind of conditioning the model to be generating text more attuned with causality, which we found interesting. It also sort of suggests that maybe the language model is not the model here; maybe it's the prompt plus a language model, uh, meaning that GPT-4 with a great prompt could give you great answers, but sort of there's a question of robustness of the prompt, as well. And I think finally, the prompt that we went for was an iteration on this, where instead of asking two questions—because for each pair we can ask, does A cause B or does B cause A—we thought of just making it one prompt and asking it, here are two variables, let's say, altitude and temperature. Which direction is more likely? And so we just gave it two options or three options in the case where no direction exists. And there were two benefits to this. So, one, I think somehow this was, uh, increasing the accuracy even more, perhaps because choosing between options becomes easier now; you can compare which one is more likely. But also we could ask the LLM now to explain its reasoning. So we would ask it literally, explain it step by step, going through the chain-of-thought reasoning. And its answers would be very instructive. So for example, some of the domains we tested, uh, we don't know anything about them, right. So there was one neuropathic pain dataset, which has nodes called radiculopathy, DLS, lumbago. We have no idea, right. But just looking at the responses from the LLM, you can both sort of get a peek into what it's doing at some high level maybe, but also understand the concepts and think for yourself whether those sorts of things, the reasoning, is making sense or not, right. And of course, we are not experts, so we may be fooled. We might think this is doing something. But imagine a doctor using it or imagine some expert using it. I think they can both get some auxiliary insight but also these explanations help them debug it. So if the explanation seems to be off or it doesn't make sense, uh, that's also a nice way of sort of knowing when to trust the model or not. 
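A rough sketch of the final prompt shape Sharma describes is below: a one-line system prompt plus a single multiple-choice question per variable pair, with a request for step-by-step reasoning. The exact wording used in the paper differs, and `query_llm` is a hypothetical helper for whichever chat-completion client is used.

```python
def build_causal_direction_prompt(var_a: str, var_b: str) -> list[dict]:
    # The one-line system prompt that, per the discussion, noticeably boosted accuracy.
    system = "You are an expert in causal reasoning."
    user = (
        "Which cause-and-effect relationship is more likely?\n"
        f"A. Changing {var_a} causes a change in {var_b}.\n"
        f"B. Changing {var_b} causes a change in {var_a}.\n"
        "C. Neither; there is no causal relationship between them.\n"
        "Let's think step by step, then give the final answer as A, B, or C."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_causal_direction_prompt("altitude", "temperature")
# answer = query_llm(messages)  # hypothetical call to a GPT-3.5/GPT-4 chat endpoint
```

Asking for one choice among explicit options, rather than two separate yes/no questions, is what made both the accuracy gain and the step-by-step explanations easy to collect.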

KICIMAN: One of the things that we noticed with these prompts is that, you know, there's more to do in this space, too. Like the kinds of mistakes that it's making right now are things that we think might be resolved at least, you know, in some part with additional prompting or thinking strategies. For example, one of the mistakes was, um, about … when we asked about the relationship between ozone levels and radiation levels, and it answered wrong. It didn't answer what, what was expected in the benchmark. But it turns out it's because there's ambiguity in the question. The relationship between ozone and radiation, uh, is one direction if you're talking about ozone at ground level in a city, and it's the other direction if you're talking about ozone in the stratosphere. And so you can ask it, is there any ambiguity here? Is there any additional information you would need that would change the direction of the causal mechanism that you're, you know, suggesting? And it'll tell you; it'll say, if we're talking about the stratosphere, it's this; if it's on the ground, it's this. And so there's really … I think we're going to see some really fun strategies for improving the performance further by digging into these types of interrogations. 

LLORENS: You know, the model is a kind of generalist in a way that most people are not or—I’m just going to go for it—in a way that no person is. You know, with all this knowledge of law and culture and economics and so many other … code, you know, so many other things, and I could imagine showing up and, yeah, a little bit of a primer on, a briefing on, well, here’s why you’re here and what you’re doing … I mean, that’s helpful for a person. And I imagine … and as we see, it’s helpful for these generalist, you know, general-purpose reasoners. And of course, mechanistically, what we’re doing is through the context, we’re inducing a different probability distribution over the tokens. And so I guess that’s … no, that’s what’s happening here. This is the primer that it gets before it steps into the room and, and does the Q&A or gives the talk, you know, as, as, as we do. But I want to get into a little bit now about where you see this going from here—for the field and for you as a researcher in the field. Let’s, let’s stick with you, Emre. Where do we go from here? What are some of the exciting frontiers? 

KICIMAN: What I'm most excited about is this opportunity I think that's opening up right now to fluidly, flexibly go back and forth between these different modes of causality. Going from logic-based reasoning to data-based reasoning and going beyond the kind of set tasks that we have well-defined for, for us in our field right now. So there's a fun story that I heard when I was visiting a university a couple of months ago. We were talking about actual causality and connections to, to data-based causality, and this person brought up this scenario where they were an expert witness in a case where a hedge fund was suing a newspaper. The newspaper had run an exposé of some kind on the hedge fund, scared off all of their investors, and the hedge fund went belly-up. And the hedge fund was blaming the newspaper and wanted, you know, compensation for this, right. But at the same time, this was in the middle of a financial crisis. And so there's this question of wouldn't the hedge fund have failed anyway? A lot of other hedge funds did. Plus there's the question of, you know, how much of an effect do newspaper stories like this usually have? Could it possibly have killed the hedge fund? And then there's all the, you know, questions of normality and, you know, morality and stuff of maybe this is what the newspaper is supposed to be doing anyway. It's not their fault, um, what the consequences were. So now you can imagine asking this question, starting off in this logical, you know, framing of the problem; then when you get down to this sub-element of what happened to all the other hedge funds—what would have happened to this hedge fund if, um, if the newspaper hadn't written a story?—we can go look at the data of what happened to all the other hedge funds, and we can run the data analysis, and we can come back. We can go back and forth so much. I think that kind of flexibility is something I'm really going to be excited to see us, you know, able to automate in some fashion. 

LLORENS: Amit, what do you think? Where do we go from here? 

SHARMA: Yeah, I think I’m also excited about the practical aspects of how this might transform the causal practice. So, for example, what Emre and I have worked a lot on, this problem of estimating the causal effect, and one of the challenges that has been in the field for a long time is that we have great methods for estimating the causal effect once we have the graph established, but getting that graph often is a really challenging process, and you need to get domain expertise, human involvement, and often that means that a lot of the causal analysis does not get done just because the upfront cost of building a graph is just too much or it’s too complex. And the flipside is also that it’s also hard to verify. So suppose you assume a graph and then you do your analysis; you get some effect like this policy is better, let’s say. It’s very hard to evaluate how good your graph was and how maybe there are some checks you can do, robustness checks, to, to validate that, right.

And so what I feel the opportunity here is that the LLMs are really being complementary to what we are already good at in causal inference, right? So we’re only good at, given a graph, getting you an estimate using statistics. What the LLMs can come in and do is help domain experts build the graph much, much faster. So now instead of sort of thinking about, “Oh, what is my system? What do I need to do?” Maybe there’s a documentation of your system somewhere that you just feed into an LLM, and it provides you a candidate graph to start with. And at the same time, on the backend, once you have estimated something, a hard challenge that researchers like us face is what might be good robustness checks, right. So often these are … one example is a negative control, where you try to think of what is something that would definitely not cause the outcome. I know it from my domain knowledge. Let me run my analysis through assuming if that was the action variable, and then my analysis should always give an answer of zero. But again, like sort of figuring out what such variables are is more of an art than science. And I think in the preliminary experiments that we are doing, the LLMs could also help you there; you could again sort of give your graph and your data … and your sort of data description, and the LLMs can suggest to you, “Hey, these might be the variables that you can use for your robustness check.” So I’m most excited about this possibility of sort of more and more adoption of causal methods because now the LLMs can substitute or at least help people to stand up these analyses much faster. 
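As one concrete illustration of the estimate-then-refute loop Sharma describes, here is a minimal sketch using the open-source DoWhy library; the data, variable names, and graph are made up, and the placebo-treatment refuter stands in for the kind of negative-control check mentioned above.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy data: a policy that raises revenue, confounded by region size.
rng = np.random.default_rng(0)
size = rng.normal(size=1000)
policy = (size + rng.normal(size=1000) > 0).astype(int)
revenue = 2.0 * policy + size + rng.normal(size=1000)
df = pd.DataFrame({"policy": policy, "revenue": revenue, "size": size})

# The graph is the part an LLM could help a domain expert draft much faster.
model = CausalModel(
    data=df,
    treatment="policy",
    outcome="revenue",
    graph="digraph { size -> policy; size -> revenue; policy -> revenue; }",
)
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

# Robustness check: swap in a placebo treatment; the re-estimated effect should be ~0.
refutation = model.refute_estimate(estimand, estimate, method_name="placebo_treatment_refuter")
print(estimate.value)
print(refutation)
```

In the workflow described above, the LLM's role would be to propose the graph on the front end and to suggest plausible negative controls or other refuters on the back end, while the statistical estimation in the middle stays the same.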

LLORENS: Thank you both for this fascinating discussion. Understanding cause-and-effect relationships is such a fundamental part of how we apply human intelligence across so many different domains. I’m really looking forward to tracking your research, and the possibilities for more powerful causal reasoning with AI.

The post AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma appeared first on Microsoft Research.
