SAMMO: A general-purpose framework for prompt optimization

SAMMO optimizer diagram showing progression from starting prompt to optimized prompt.

Large language models (LLMs) have revolutionized a wide range of tasks and applications that previously relied on manually crafted machine learning (ML) solutions, streamlining them through automation. However, despite these advances, a notable challenge persists: the need for extensive prompt engineering to adapt these models to new tasks. New generations of language models like GPT-4 and Mixtral 8x7B can process much longer input texts, which allows prompts to carry richer context and more detailed instructions. A common technique that uses this enhanced capacity is retrieval-augmented generation (RAG), which dynamically incorporates information into the prompt based on the specific input example. This process is illustrated in Figure 1, which shows a RAG prompt designed to translate user queries into a domain-specific language (DSL), a task also known as semantic parsing.

A table showing an example metaprompt for a semantic parsing task. The underlying metaprompt consists of three larger parts, each of which comes with a variety of aspects that can be optimized. For example, the input example can be rendered using different formats, the few shot example included can be retrieved using various similarity functions, or the task description can be paraphrased.
Figure 1: A RAG prompt is used for a semantic parsing task. The underlying prompt consists of three larger parts, each with a variety of aspects that can be optimized.

The example in Figure 1 combines three distinct structures to construct the final prompt. The first structure, the task description, remains static and independent of the input as a result of conventional prompt optimization techniques. However, RAG contains two input-specific structures: the example retriever and the input text itself. These introduce numerous optimization opportunities that surpass the scope of most traditional approaches. Despite previous efforts in prompt optimization, the evolution towards more complex prompt structures has rendered many older strategies ineffective in this new context. 
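As a rough illustration of the three-part structure described above, the following Python sketch assembles a prompt from a static task description, retrieved few-shot examples, and the input text. Everything in it is an assumption made for illustration: the `Example` class, the word-overlap similarity, the `Q:`/`A:` rendering, and the toy DSL expressions are invented here and are not taken from SAMMO or from Figure 1.

```python
from dataclasses import dataclass


@dataclass
class Example:
    query: str
    program: str  # target DSL expression (made-up syntax)


def word_overlap(a: str, b: str) -> int:
    """Toy similarity function: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(query: str, pool: list[Example], k: int) -> list[Example]:
    """Input-specific part: pick the k pool examples most similar to the query."""
    ranked = sorted(pool, key=lambda ex: word_overlap(query, ex.query), reverse=True)
    return ranked[:k]


def build_prompt(task_description: str, pool: list[Example], query: str, k: int = 2) -> str:
    """Combine the static task description with the two input-specific parts."""
    shots = "\n".join(f"Q: {ex.query}\nA: {ex.program}" for ex in retrieve(query, pool, k))
    return f"{task_description}\n\n{shots}\n\nQ: {query}\nA:"


pool = [
    Example("what is the capital of texas", "capital(state('texas'))"),
    Example("how long is the colorado river", "len(river('colorado'))"),
    Example("what is the capital of ohio", "capital(state('ohio'))"),
]
prompt = build_prompt("Translate the question into the DSL.", pool, "what is the capital of utah")
print(prompt)
```

Each of the three parts is a separate knob: the task description can be paraphrased, the similarity function or `k` can be swapped, and the rendering format of the examples and input can be changed.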

SAMMO: A prompt optimization approach 

To address these challenges, we developed the Structure-Aware Multi-objective Metaprompt Optimization (SAMMO) framework. SAMMO is a new open-source tool that streamlines the optimization of prompts, particularly those that combine different types of structural information like in the RAG example above. It can make structural changes, such as removing entire components or replacing them with different ones. These features enable AI practitioners and researchers to efficiently refine their prompts with little manual effort.

Central to SAMMO’s innovation is its approach to treating prompts not just as static text inputs but as dynamic, programmable entities—metaprompts. SAMMO represents these metaprompts as function graphs, where individual components and substructures can be modified to optimize performance, similar to the optimization process that occurs during traditional program compilation.
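To make the idea of a prompt as a modifiable structure more concrete, here is a minimal Python sketch that represents a metaprompt as a small tree of named components and applies two structural edits to it. This is a simplified stand-in, not SAMMO's actual data model or API: the `Node` class and the `drop` and `rewrite` helpers are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """One component of a metaprompt; children are rendered in order."""
    name: str
    text: str = ""
    children: tuple["Node", ...] = ()

    def render(self) -> str:
        parts = [self.text] if self.text else []
        parts += [child.render() for child in self.children]
        return "\n".join(p for p in parts if p)


def drop(node: Node, name: str) -> Node:
    """Structural edit: return a copy of the tree without the named component."""
    kept = tuple(drop(c, name) for c in node.children if c.name != name)
    return replace(node, children=kept)


def rewrite(node: Node, name: str, new_text: str) -> Node:
    """Structural edit: return a copy with the named component's text replaced."""
    text = new_text if node.name == name else node.text
    children = tuple(rewrite(c, name, new_text) for c in node.children)
    return replace(node, text=text, children=children)


metaprompt = Node("root", children=(
    Node("task", "Translate the question into the DSL."),
    Node("guidelines", "Answer with a single expression."),
    Node("input", "Q: what is the capital of utah"),
))

shorter = drop(metaprompt, "guidelines")
reworded = rewrite(metaprompt, "task", "Map the question to the DSL.")
print(shorter.render())
```

Because every edit returns a new tree, a search procedure can generate many candidate prompts from one starting point and evaluate each independently.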

The following key features contribute to SAMMO’s effectiveness:

Structured optimization: Unlike current methods that focus on text-level changes, SAMMO focuses on optimizing the structure of metaprompts. This granular approach facilitates precise modifications and enables the straightforward integration of domain knowledge, for instance, through rewrite operations targeting specific stylistic objectives. 
Multi-objective search: SAMMO’s flexibility enables it to simultaneously address multiple objectives, such as improving accuracy and computational efficiency. Our paper illustrates how SAMMO can be used to compress prompts without compromising their accuracy.

General purpose application: SAMMO has proven to deliver significant performance improvements across a variety of tasks, including instruction tuning, RAG, and prompt compression.
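One common way to handle two objectives at once, such as accuracy and prompt length, is to keep only the candidates that are not dominated on both. The Python sketch below shows that Pareto filter. It is an assumption about how such a trade-off could be handled, not a description of SAMMO's internal search, and the candidate names and numbers are made up.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float  # higher is better
    tokens: int      # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens
    better = a.accuracy > b.accuracy or a.tokens < b.tokens
    return no_worse and better


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


candidates = [
    Candidate("baseline", 0.70, 900),
    Candidate("compressed", 0.70, 400),
    Candidate("more_examples", 0.78, 1200),
    Candidate("verbose", 0.65, 1500),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Here `compressed` replaces `baseline` because it reaches the same accuracy with fewer tokens, which mirrors the goal of compressing prompts without compromising accuracy.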

Exploring SAMMO’s impact through use cases 

Use case 1: RAG optimization 

A common application of LLMs involves translating natural user queries into domain-specific language (DSL) constructions, often to communicate with external APIs. For example, Figure 1 shows how an LLM can be used to map user queries about geography facts to a custom DSL.

In a realistic RAG scenario, SAMMO delivers significant performance improvements. To demonstrate this, we conducted experiments across three semantic parsing datasets of varying complexity: GeoQuery, SMCalFlow, and Overnight. Given the often limited availability of data in practical settings, we trained and tested the model on a subsampled dataset (training and retrieval set n=600, test set n=100). We compared SAMMO against a manually designed competitive baseline, using enumerative search within a search space of 24 configurations. This included variations in data formats, the number of few-shot examples, and DSL specifications.
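Enumerative search over a small configuration space like the one mentioned above can be sketched in a few lines of Python. The text only says that the 24 configurations vary data formats, the number of few-shot examples, and DSL specifications. The concrete split used here (4 formats x 3 shot counts x 2 DSL options), the option names, and the scoring function are hypothetical and chosen only so the example runs.

```python
from itertools import product

# Hypothetical factorization of the search space: 4 x 3 x 2 = 24 configurations.
FORMATS = ["json", "xml", "yaml", "plain"]
NUM_SHOTS = [1, 5, 10]
DSL_SPECS = ["full", "none"]


def enumerate_configs():
    """List every combination of the three configuration dimensions."""
    return [
        {"format": f, "shots": n, "dsl_spec": d}
        for f, n, d in product(FORMATS, NUM_SHOTS, DSL_SPECS)
    ]


def enumerative_search(score):
    """Score every configuration and return the best one with its score."""
    scored = ((cfg, score(cfg)) for cfg in enumerate_configs())
    return max(scored, key=lambda pair: pair[1])


def toy_score(cfg):
    """Stand-in for accuracy on a training set; real use would query an LLM."""
    bonus_format = 0.05 if cfg["format"] == "json" else 0.0
    bonus_spec = 0.03 if cfg["dsl_spec"] == "full" else 0.0
    return 0.5 + 0.02 * cfg["shots"] + bonus_format + bonus_spec


best_cfg, best_score = enumerative_search(toy_score)
print(len(enumerate_configs()), best_cfg)
```

Exhaustive enumeration is only practical because the space is small. With 24 candidates, every configuration can be evaluated once on the training subsample.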


As illustrated in Figure 2, SAMMO improved accuracy across different datasets and backend LLMs in almost all cases, with the most notable gains observed in older generation models. However, even with newer models like GPT-4, SAMMO facilitated accuracy improvements exceeding 100 percent.

A series of four bar charts showing the performance of SAMMO on semantic parsing tasks. SAMMO achieves substantial improvements for most backend models and datasets.
Figure 2: For semantic parsing with RAG, SAMMO achieves substantial improvements across most backend models and datasets. 

Use case 2: Instruction tuning 

Instruction tuning addresses the optimization of static instructions given to LLMs that provide the goal and constraints of a task. To show that SAMMO extends beyond many previous prompt tuning methods, we also applied it in this conventional setting.

To align with previous research, we used eight zero-shot BigBench classification tasks where the baseline prompt for GPT-3.5 achieved an accuracy of less than 0.9. We compared it against Automatic Prompt Optimization (APO) and GrIPS, applying open-source models Mixtral 8x7B and Llama-2 70B, alongside GPT-3.5 as backend LLMs. We did not include GPT-4 due to minimal improvement potential identified in pilot experiments. The results, shown in Figure 3, demonstrate that SAMMO outperformed all baselines regardless of the backend model, proving its effectiveness with even more complex metaprompts.

A series of three bar charts comparing the accuracy of different methods on instruction tuning. SAMMO matches or exceeds the performance of competing methods for instruction tuning on classification tasks.
Figure 3: SAMMO does at least as well as older methods for instruction tuning on simpler tasks.

Implications and looking forward

SAMMO introduces a new and flexible approach to optimize prompts for specific requirements. Its design works with any LLM, and it features versatile components and operators suitable for a broad range of applications.

We are excited to integrate and apply SAMMO to the components and pipelines behind AI-powered assistant technologies. We also hope to establish a user-driven community centered around SAMMO, where people can exchange best practices and patterns, and encourage the expansion of the existing set of search operators.

The post SAMMO: A general-purpose framework for prompt optimization appeared first on Microsoft Research.

Research Focus: Week of April 15, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Appropriate reliance on Generative AI: Research synthesis

Appropriate reliance on AI happens when people accept correct AI outputs and reject incorrect ones. It requires users of AI systems to know when to trust the AI and when to trust themselves. But fostering appropriate reliance comes with new complexities when generative AI (genAI) systems are involved. Though their capabilities are advancing, genAI systems, which use generative models to produce content such as text, music, images, and videos, have limitations as well. Inappropriate reliance – either under-reliance or overreliance – on genAI can have negative consequences, such as poor task performance and even product abandonment.  

In a recent paper: Appropriate reliance on Generative AI: Research synthesis, researchers from Microsoft, who reviewed 50 papers from various disciplines, provide an overview of the factors that affect overreliance on genAI, the effectiveness of different mitigation strategies for overreliance on genAI, and potential design strategies to facilitate appropriate reliance on genAI. 

Characterizing Power Management Opportunities for LLMs in the Cloud

Cloud providers and datacenter operators are grappling with increased demand for graphics processing units (GPUs) due to expanding use of large language models (LLMs). To try to keep up, enterprises are exploring various means to address the challenge, such as power oversubscription and adding more servers. Proper power usage analysis and management could help providers meet demand safely and more efficiently. 

In a recent paper: Characterizing Power Management Opportunities for LLMs in the Cloud, researchers from Microsoft analyze power patterns for several popular, open-source LLMs across commonly used configurations and identify opportunities to improve power management for LLMs in the cloud. They present a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, POLCA simulations demonstrate it could deploy 30% more servers in existing clusters while incurring minimal power throttling events. POLCA improves power efficiency, reduces the need for additional energy sources and datacenters, and helps to promptly meet demand for running additional LLM workloads. 

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Various prompting techniques, such as chain-of-thought (CoT), in-context learning (ICL), and retrieval augmented generation (RAG), can empower large language models (LLMs) to handle complex and varied tasks through rich and informative prompts. However, these prompts are lengthy, sometimes exceeding tens of thousands of tokens, which increases computational and financial overhead and degrades the LLMs’ ability to perceive information. Recent efforts to compress prompts in a task-aware manner, without losing essential information, have resulted in shorter prompts tailored to a specific task or query. This typically enhances performance on downstream tasks, particularly in question answering. However, the task-specific features present challenges in efficiency and generalizability. 

In a recent paper: LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, researchers from Microsoft and Tsinghua University propose a data distillation procedure to derive knowledge from an LLM (GPT-4) and compress the prompts without losing crucial information. They introduce an extractive text compression dataset, containing pairs of original texts from MeetingBank and their compressed versions. Despite its small size, their model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. The new model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x. 

AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Despite recent progress in scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging. Evaluation is often performed using n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics like COMET have a higher correlation; however, challenges such as the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders, have hampered their applicability to African languages. 

In a recent paper: AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages, researchers from University College London, University of Maryland, Unbabel, Microsoft and the Masakhane Community address these challenges, creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. They also develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLMR) to create state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
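For readers unfamiliar with the Spearman-rank correlation used to compare a metric with human judgments, the following Python sketch computes it from scratch as the Pearson correlation of rank-transformed scores. The function names and the five score pairs are made up for illustration and have no connection to the paper's data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


human = [90, 40, 75, 60, 20]             # made-up human quality ratings
metric = [0.81, 0.35, 0.52, 0.66, 0.30]  # made-up metric scores
rho = spearman(human, metric)
print(round(rho, 3))
```

Because only the ordering matters, a metric can score well on this measure even if its raw values are on a different scale from the human ratings.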

Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System

Video communication often lacks the inclusiveness and simultaneity enabled by physical presence in a shared space. This is especially apparent during hybrid meetings, where some attendees meet physically in a room while others join remotely. Remote participants are at a disadvantage, unable to navigate the physical space like in-room participants. 

In a Late Breaking Work paper to be presented at CHI2024: Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System, Microsoft researchers present an experimental system for exploring designs for improving the inclusion of remote attendees in hybrid meetings. In-room users see remote participants on individual displays positioned around a table. Remote participants see video feeds from the room integrated into a digital twin of the meeting room, choosing where they appear in the meeting room and from where they view it. The researchers designed both a 2D and a 3D version of the interface. They found that 3D outperformed 2D in participants’ perceived sense of awareness, sense of agency, and physical presence. A majority of participants also subjectively preferred 3D over 2D. The next step in this research will test the inclusiveness of Hybridge 3D meetings against fully in-room meetings and traditional hybrid meetings.

FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction. This is because models like transformers and convolutional networks aggressively pool information over large areas. 

In a paper that was published at ICLR 2024: FeatUp: A Model-Agnostic Framework for Features at Any Resolution, researchers from Microsoft and external colleagues introduce a task- and model-agnostic framework to restore lost spatial information in deep features. The paper introduces two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multiview consistency loss with deep analogies to neural radiance fields (NeRFs), a deep learning method of building 3D representations of a scene using sparse 2D images. In the new research, features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains, even without re-training. FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation. 

The post Research Focus: Week of April 15, 2024 appeared first on Microsoft Research.

Abstracts: April 16, 2024

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Research Software Engineer Tusher Chakraborty joins host Gretchen Huizinga to discuss “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things,” which was accepted at the 2024 USENIX Symposium on Networked Systems Design and Implementation (NSDI). In the paper, Chakraborty and his coauthors share their efforts to address the challenges of delivering reliable and affordable IoT connectivity via satellite-based networks. They propose a method for leveraging the motion of small satellites to facilitate efficient communication between a large IoT-satellite constellation and devices on Earth within a limited spectrum.



GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.


I’m talking today to Tusher Chakraborty, a senior research software engineer at Microsoft Research. Tusher is coauthor of a paper called “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things.” Tusher, thanks for joining us on Abstracts!

TUSHER CHAKRABORTY: Hi. Thank you for having me here, Gretchen, today. Thank you.

HUIZINGA: So because this show is all about abstracts, in just a few sentences, tell us about the problem your paper addresses and why we should care about it.

CHAKRABORTY: Yeah, so think of, I’m a farmer living in a remote area and bought a sensor to monitor the soil quality of my farm. The big headache for me would be how to connect the sensor so that I can get access to the sensor data from anywhere. We all know that connectivity is a major bottleneck in remote areas. Now, what if, as a farmer, I could just click the power button of the sensor, and it gets connected from anywhere in the world. It’s pretty amazing, right? And that’s what our research is all about. Get your sensor devices connected from anywhere in the world with just the click of power button. We call it one-click connectivity. Now, you might be wondering, what’s the secret sauce? It’s not magic; it’s direct-to-satellite connectivity. So these sensors directly get connected to the satellites overhead from anywhere on Earth. The satellites, which are orbiting around the earth, collect the data from the sensing devices and forward to the ground stations in some other convenient parts of the world where these ground stations are connected to the internet.

HUIZINGA: So, Tusher, tell us what’s been tried before to address these issues and how your approach contributes to the literature and moves the science forward.

CHAKRABORTY: So satellite connectivity is nothing new and has been there for long. However, what sets us apart is our focus on democratizing space connectivity, making it affordable for everyone on the planet. So we are talking about the satellites that are at least 10 to 20 times cheaper and smaller than state-of-the-art satellites. So naturally, this ambitious vision comes with its own set of challenges. So when you try to make something cheaper and smaller, you’ll face lots of challenges that all these big satellites are not facing. So if I just go a bit technical, think of the antenna. So these big satellite antennas, they can actually focus on particular part of the world. So this is something called beamforming. On the other hand, when we try to make the satellites cheaper and smaller, we can’t have that luxury. We can’t have beamforming capability. So what happens, they have omnidirectional antenna. So it seems like … you can’t focus on a particular part of the earth rather than you create a huge footprint on all over the earth. So this is one of the challenges that you don’t face in the state-of-the-art satellites. And we try to solve these challenges because we want to make connectivity affordable with cheaper and smaller satellites.

HUIZINGA: Right. So as you’re describing this, it sounds like this is a universal problem, and people have obviously tried to make things smaller and more affordable in the past. How is yours different? What methodology did you use to resolve the problems, and how did you conduct the research?

CHAKRABORTY: OK, I’m thrilled that you asked this one because the research methodology was the most exciting part for me here. As a part of this research, we launched a satellite in a joint effort with a satellite company. Like, this is very awesome! So it was a hands-on experience with a real-deal satellite system. It was not simulation-based system. The main goal here was to learn the challenge from a real-world experience and come up with innovative solutions; at the same time, evaluate the solutions in real world. So it was all about learning by doing, and let me tell you, it was quite the ride! [LAUGHTER] We didn’t do anything new when we launched the satellites. We just tried to see how industry today does this. We want to learn from them, hey, what’s the industry practice? We launched a satellite. And then we faced a lot of problems that today’s industry is facing. And from there, we learned, hey, like, you know, this problem is industry facing; let’s go after this, and let’s solve this. And then we tried to come up with the solutions based on those problems. And this was our approach. We didn’t want to assume something beforehand. We want to learn from how industry is going today and help them. Like, hey, these are the problems you are facing, and we are here to help you out.

HUIZINGA: All right, so assuming you learned something and wanted to pass it along, what were your major findings?

CHAKRABORTY: OK, that’s a very good question. So I was talking about the challenges towards this democratization earlier, right? So one of the most pressing challenges: shortage of spectrum. So let me try to explain this from the high level. So we need hundreds of these satellites, hundreds of these small satellites, to provide 24-7 connectivity for millions of devices around the earth. Now, I was talking, the footprint of a satellite on Earth can easily cover a massive area, somewhat similar to the size of California. So now with this large footprint, a satellite can talk with thousands of devices on Earth. You can just imagine, right? And at the same time, a device on Earth can talk with multiple satellites because we are talking about hundreds of these satellites. Now, things get tricky here. [LAUGHTER] We need to make sure that when a device and a satellite are talking, another nearby device or a satellite doesn’t interfere. Otherwise, there will be chaos—no one hearing others properly. So when we were talking about this device and satellite chat, right, so what is that all about? This, all about in terms of communication, is packet exchange. So the device sends some packet to the satellite; satellite sends some packet to the device—it’s all about packet exchange. Now, you can think of, if multiple of these devices are talking with a satellite or multiple satellites are talking with a device, there will be a collision in this packet exchange if you try to send the packets at the same time. And if you do that, then your packet will be collided, and you won’t be able to get any packet on the receiver end. So what we do, we try to send this packet on different frequencies. It’s like a different sound or different tone so that they don’t collide with each other. And, like, now, I said that you need different frequencies, but frequency is naturally limited. And the choice of frequency is even limited. This is very expensive. 
But if you have limited frequency and you want to resolve this collision, then you have a problem here. How do you do that? So we solve this problem by smartly looking at an artifact of these satellites. So these satellites are moving really fast around the earth. So when they are moving very fast around the earth, they create a unique signature on the frequency that they are using to talk with the devices on Earth. And we use this unique signature, and in physics, this unique signature is known as Doppler signature. And now you don’t need a separate frequency to sound them different, to have packets on different frequencies. You just need to recognize that unique signature to distinguish between satellites and distinguish between their communications and packets. So in that sense, there won’t be any packet collision. And this is all about our findings. So with this, now multiple devices and satellites can talk with each other at the same time without interference but using the same frequency.

HUIZINGA: It sounds, like, very similar to a big room filled with a lot of people. Each person has their own voice, but in the mix, you, kind of, lose track of who’s talking and then you want to, kind of, tune in to that specific voice and say, that’s the one I’m listening to.

CHAKRABORTY: Yeah, I think you picked up the correct metaphor here! This is the scenario you can try to explain here. So, yeah, like what we are essentially doing, like, if you just, in a room full of people and they are trying to talk with each other, and then if they’re using the same tone, no one will be distinguished one person from another.


CHAKRABORTY: Everyone will sound same and that will be colliding. So you need to make sure that, how you can differentiate the tones …


CHAKRABORTY: … and the satellites differentiate their tones due to their fast movement. And we use our methodology to recognize that tone, which satellite is sending that tone.

HUIZINGA: So you sent up the experimental satellite to figure out what’s happening. Have you since tested it to see if it works?

CHAKRABORTY: Yeah, yeah, so we have tried it out, because this is a software solution, to be honest.


CHAKRABORTY: As I was talking about, there is no hardware modification required at this point. So what we did, we just implemented this software in the ground stations, and then we tried to recognize which satellite is creating which sort of signature. That’s it!

HUIZINGA: Well, it seems like this research would have some solid real-world impact. So who would you say it helps most and how?

CHAKRABORTY: OK, that’s a very good one. So the majority of the earth still doesn’t have affordable connectivity. The lack of connectivity throws a big challenge to critical industries such as agriculture—the example that I gave—energy, and supply chain, so hindering their ability to thrive and innovate. So our vision is clear: to bring 24-7 connectivity for devices anywhere on Earth with just a click of power button. Moreover, affordability [is] at the heart of our mission, ensuring that this connectivity is accessible to all. So in core, our efforts are geared towards empowering individuals and industries to unlock their full potential in an increasingly connected world.

HUIZINGA: If there was one thing you want our listeners to take away from this research, what would it be?

CHAKRABORTY: OK, if there is one thing I want you to take away from our work, it’s this: connectivity shouldn’t be a luxury; it’s a necessity. Whether you are a farmer in a remote village or a business owner in a city, access to reliable, affordable connectivity can transform your life and empower your endeavors. So our mission is to bring 24-7 connectivity to every corner of the globe with just a click of a button.

HUIZINGA: I like also how you say every corner of the globe, and I’m picturing a square! [LAUGHTER] OK, last question. Tusher, what’s next for research on satellite networks and Internet of Things? What big unanswered questions or unsolved problems remain in the field, and what are you planning to do about it?

CHAKRABORTY: Uh … where do I even begin? [LAUGHTER] Like, there are countless unanswered questions and unsolved problems in this field. But let me highlight one that we talked here: limited spectrum. So as our space network expands, so does our need for spectrum. But what’s the tricky part here? Just throw more and more spectrum. The problem is the chunk of spectrum that’s perfect for satellite communication is often already in use by the terrestrial networks. Now, a hard research question would be how we can make sure that the terrestrial and the satellite networks coexist in the same spectrum without interfering [with] each other. It’s a tough nut to crack, but it’s a challenge we are excited to tackle head-on as we continue to push the boundaries of research in this exciting field.


HUIZINGA: Tusher Chakraborty, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at (opens in new tab). You can also read it on the Networked Systems Design and Implementation, or NSDI, website, and you can hear more about it at the NSDI conference this week. See you next time on Abstracts!


The post Abstracts: April 16, 2024 appeared first on Microsoft Research.

Ideas: Language technologies for everyone with Kalika Bali

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Principal Researcher Kalika Bali. Inspired by an early vision of “talking computers” and a subsequent career in linguistics, Bali has spent the last two decades bringing the two together. Aided by recent advances in large language models and motivated by her belief that everyone should have access to AI in their own language, Bali and her teams are building language technology applications that they hope will bring the benefits of generative AI to under-resourced and underserved language communities around the world.




KALIKA BALI: I do think, in some sense, the pushback that I got for my idea makes me think it was outrageous. I didn’t think it was outrageous at all at that time! I thought it was a very reasonable idea! But there was a very solid pushback and not just from your colleagues. You know, for researchers, publishing papers is important! No one would publish a paper which focused only on, say, Indian languages or low-resource languages. We’ve come a very long way even in the research community on that, right. We kept pushing, pushing, pushing! And now there are tracks, there are workshops, there are conferences which are devoted to multilingual and low-resource languages. 


GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward. 


I’m excited to be live in the booth today with Kalika Bali, a principal researcher at Microsoft Research India. Kalika is working on language technologies that she hopes will bring the benefits of generative AI to under-resourced and underserved language communities around the world. Kalika, it’s a pleasure to speak with you today. Welcome to Ideas!

KALIKA BALI: Thank you. Thank you, Gretchen. Thank you for having me. 

HUIZINGA: So before we dive in on the big ideas behind Kalika Bali’s research, let’s talk about you for a second. Tell us about your “origin story,” as it were, and if there is one, what “big idea” or animating “what if?” captured your imagination and inspired you to do what you’re doing today? 

BALI: So, you know, I’m a great reader. I started reading well before I was taught in school how to read, and I loved science fiction. I come from a family where reading was very much a part of our everyday lives. My dad was a journalist, and I had read a lot of science fiction growing up, and I also saw a lot of science fiction, you know, movies … Star Trek … everything that I could get hold of in India. And I remember watching 2001: A Space Odyssey. And there was this HAL that spoke. He actually communicated that he was a computer. And I was just so struck by it. I was like, this is so cool! You know, here are computers that can talk! Now, how cool would that be if it would happen in real life? I was not at all aware of what was happening in speech technology, whether it was possible or not possible, but that’s something that really got me into it. I’ve always, like, kind of, been very curious about languages and how they work and, you know, how people use different things in languages to express not just meaning, not just communicating, but, you know, expressing themselves, really. And so I think it’s a combination of HAL and this curiosity I had about the various ways in which people use languages that got me into what I’m doing now.

HUIZINGA: OK. So that’s an interesting path, and I want to go into that just a little bit, but let me anchor this: how old were you when you saw this talking computer? 

BALI: Oh, I was in my early teens. 

HUIZINGA: OK. And so at that time, did you have any conception that … ? 

BALI: No. You know, there weren’t computers around me when I was growing up. We saw, you know, some at school, you know, people coded in BASIC … 


BALI: And we heard about them a lot, but I hadn’t seen one since I was in high school. 

HUIZINGA: OK. So there’s this inception moment, an aha moment, of that little spark and then you kind of drifted away from the computer side of it, and what … tell us about how you went from there to that! 

BALI: So that, that’s actually a very funny story because I actually wanted to study chemistry. I was really fascinated by how these, you know, molecular parts rotate around each other and, you know, we can’t even tell where an electron is, etc. It sounded, like, really fun and cool. So I actually studied chemistry, but then I was actually going to pick up the admission form for my sister, who wanted to study in this university, and … or, no, she wanted to take an exam for her master’s. And I went there. I picked up the form, and I said, this is a cool place. I would love to study here! And then I started looking at everything like, you know, what can I apply for here? And something called linguistics came up, and I had no idea what linguistics was. So I went to the British Library, got like a thin book on introduction to linguistics, and it sounded fun! And I took the exam. And then, as they say, that was history. Then I just got into it. 

HUIZINGA: OK. I mean, so much has happened in between then and now, and I think we’ll kind of get there in … but I do want you to connect the larger dot from how you got from linguistics to Microsoft Research [LAUGHTER] as a computer scientist.

BALI: So I actually started teaching at the University of the South Pacific as a linguistics faculty in Fiji. And I was very interested in acoustics of speech sounds, etc., etc. That’s what I was teaching. And then there was a speech company in Belgium that was looking to start some work in Indian languages, and they contacted me, and at that time, you needed people who knew about languages to build language technology, especially people who knew about phonetics, acoustics, for speech technology. And that’s how I got into it. And then, you know, I just went from startups to companies and then Microsoft Research, 18 years ago, almost 18 years ago.

HUIZINGA: Wow. OK. I would love to actually talk to you about all that time. But we don’t have time because I have a lot more things to talk to you about, technology-wise. But I do want to know, you know, how would you describe the ideas behind your overarching research philosophy, and who are your influences, as they say in the rock-and-roll world? [LAUGHTER] Who inspired you? Real-life person, scientist or not, besides HAL 9000, who’s fictional, and any seminal papers that, sort of, got you interested in that along the way?

BALI: So since I was really into speech, Ken Stevens—who was a professor, who sadly is no longer with us anymore, at MIT—was a big influence. He, kind of, had this whole idea of how speech is produced. And, you know, the first time I was exposed to the whole idea of the mathematics behind the speech, and I think he influenced me a lot on the speech side of things. For the language side of things, you know, my professor in India Professor Anvita Abbi—you know, she’s a Padma Shri, like, she’s been awarded by the Indian government for her work in, you know, very obscure, endangered languages—you know, she kind of gave me a feel for what languages are, and why they are important, and why it’s important to save them and not let them die away. 


BALI: So I think I would say both of them. But what really got me into wanting to work with Indian language technology in a big way was I was working in Belgium, I was working in London, and I saw the beginning of how technology is, kind of, you know, making things easier, exciting; there’s cool technology available for English, for French, for German … But in a country like India, it was more about giving access to people who have no access, right? It actually mattered, because here are people who may not be very literate and therefore not be able to use technology in the way we know it, but they can talk. 


BALI: And they can speak, and they should be able to access technology by doing that. 

HUIZINGA: Right. OK. So just real quickly, that was then. What have you seen change in that time, and how profoundly have the ideas evolved? 

BALI: So just from pure methodology and what’s possible, you know, I have seen it all. When I started working in language technology, mainly for Indian languages, but even for other languages, it was all a rule-based system. So everybody had to create all these rules that then were, you know, responsible for building or like making that technology work. But then, just at that time, you know, all the statistical systems and methodologies came into being. So we had hidden Markov models, you know, doing their thing in speech, and it was all about a lot of data. But that data still had to be procured in a certain way, labeled, annotated. It was still a very long and resource-intensive process. Now, with generative AI, the thing that I am excited about is, we have a very powerful tool, right? 

HUIZINGA: Mm-hmm. 

BALI: And, yes, it requires a lot of data, but it can learn also; you know, we can fine-tune stuff on smaller datasets … 


BALI: … to work for, you know, relevant things. So it’s not going to take me years and years and years to first procure the data, then have it tagged for part of speech … then, you know, have it tagged for sentiment, have it tagged for this, have it tagged for that, and then, only can I think of building anything. 


BALI: So it just shortens that timeline so much, and it’s very exciting. 

HUIZINGA: Right. As an ex-English teacher—which I don’t think there is such a thing as an ex-English teacher; you’re always silently correcting someone’s grammar! [LAUGHTER]—just what you said about tagging parts of speech as what they are, right? And that, I used to teach that. And then you start to think, how would you translate that for a machine? So fascinating. So, Kalika, you have said that your choice of career was accidental—and you’ve alluded to the, sort of, the fortuitous things that happened along the way—but that linguistics is one subject that goes from absolute science to absolute philosophy. Can you unpack that a little bit more and how this idea impacted your work in language technology? 

BALI: Yeah. So, so if you think about it, you know, language has a physical aspect, right. We move our various speech organs in a certain way. Our ears are constructed in a certain way. There is a physics of it where, when I speak, there are sound waves, right, which are going into your ear, and that’s being interpreted. So, you know, if you think about that, that’s like an absolute science behind it, right? But then, when you come to the structure of language, you know, the syntax, like you’re an English teacher, so you know this really well, that you know, there’s semantics; there’s, you know, morphology, how our words form, how our sentences form. And that’s like a very abstract kind of method that allows us to put, you know, meaningful sentences out there, right? 

HUIZINGA: Right … 

BALI: But then there’s this other part of how language works in society, right. The way I talk to my mother would be probably very different to the way I’m talking to you, would be very different from the way I talk to my friends, at a very basic level, right? The way, in India, I would greet someone older to me would be very different from the way I would greet somebody here, because here it’s like much less formal and that, you know, age hierarchy is probably less? If I did the same thing in India, I would be considered the rudest creature ever. [LAUGHS] So … and then, you know, you go into the whole philosophy—psycholinguistics part. What happens in our brains, you know, when we are speaking? Because language is controlled by various parts of our brain, right. And then, you go to the pure philosophy part, like why? How does language even occur? Why do we name things the way we name things? You know, why do we have a language of thought? You know, what language are we thinking in? [LAUGHTER] 


BALI: So, so it really does cover the entire gamut of language … 

HUIZINGA: Yeah, yeah, yeah … 

BALI: … like from science to philosophy. 

HUIZINGA: Yeah, as I said before, when we were talking out there, my mother-in-law was from Holland, and every time she did math or adding, she would do it in Dutch, which—she’d be speaking in English and then she’d go over here and count in Dutch out loud. And it’s like, yeah, your brain switches back and forth. This is so exciting to me. I had no idea how much I would love this podcast! So, much of your research is centered on this big idea called “design thinking,” and it’s got a whole discipline in universities around the world. And you’ve talked about using something you call the 4D process for your work. Could you explain that process, and how it plays out in the research you do with the communities you serve?

BALI: Yeah, so we’ve kind of adapted this. My ex-colleague Monojit Choudhury and I, kind of, came up with this whole thing about 4D thinking, which is essentially discover, design, develop and deploy, right. And when we are working with, especially with, marginalized or low-resource-language communities, the very basic thing we have to do is discover, because we cannot go with, you know, our own ideas and perceptions about what is required. And I can give you a very good example of this, right. You know, most of us, as researchers and technologists, when we think of language technology, we are thinking about machine translation; we’re thinking about speech recognition; we are thinking about state-of-the-art technology. And here we were talking to a community that spoke the language Idu Mishmi, which is a very small community in the northeast of India. And we were talking about, you know, we can do this, we can do that. And they just turned to us and said, what we really want is a mobile digital dictionary! [LAUGHS]

HUIZINGA: Wow. Yeah … 

BALI: Right? And, you know, if you don’t talk, if you don’t observe, if you are not open to what the community’s needs might be, then you’ll miss that, right. You’ll miss the real thing that will make a difference to that community. So that’s the discover part. The design part, again, you have to design with the community. You cannot go and design a system that they are unable to use properly, right. And again, another very good example, one of the people I know, you know, he gave me this very good example of why you have to think, even at the architecture level when you’re designing such things, is like a lot of applications in India and around the world require your telephone number for verification. Now, for women, it might be a safety issue. They might not want to give their telephone number. Or in India, many women might not even have a telephone, like a mobile number, right. So how do you think of other ways in which they can verify, right? And so that’s the design part. The develop and the deploy part, kind of, go hand in hand, because I think it’s a very iterative process. You develop quickly, you put it out there, allow it to fail and, you know … 

HUIZINGA: Mm-hmm. Iterate … 

BALI: Iterate. So that’s like the, kind of, design thinking that we have. 

HUIZINGA: Yeah, I see that happening in accessibility technology areas, too, as well as language … 

BALI: Yeah, and, you know, working with the communities, very quickly, you become really humble.


BALI: There’s a lot of humility in me now. Though I have progressed in my career and, you know, supposedly become wiser, I am much more humble about what I know and what I can do than I was when I started off, you know. 

HUIZINGA: I love that. Well, one thing I want to talk to you about that has intrigued me, there’s a thing that happens in India where you mix languages … 

BALI: Yes!

HUIZINGA: You speak both Hindi and English at the same time, and you think, oh, you speak English, but it’s like, no, there’s words I don’t understand in that. What do you call that, and how did that drive your interest? I mean, that was kind of an early-on kind of thing in your work, right? Talk about that. 

BALI: So that’s called code-mixing or code-switching. The only linguistic difference is code-mixing happens within a sentence, and code-switching means one sentence in one language and the next in another.

HUIZINGA: Oh, really? 

BALI: Yeah. So … but this is, like, not just India. This is a very, very common feature of multilingual societies all over the world. So it’s not multilingual individuals, but at the societal level, when you have multilingualism, then, you know, this is a marker of multilingualism. But code-mixing particularly means that you have to be fluent in both languages to actually code-mix, right. You have to have a certain amount of fluency in both languages. And there are various reasons why people do this. You know, it’s been studied by psychologists and linguists for a long time. And for most people like me, multilingual people, that’s the language we dream in, we think about. [LAUGHTER] That’s the language we talk to our siblings and friends in, right. And for us, it’s, like, just natural. We just keep … 

HUIZINGA: Mixing … 

BALI: … flipping between the two languages for a variety of reasons. We might do it for emphasis; we might do it for humor. We might just decide, OK, I’m going to pick this from this … the brain decides I’m going to pick this from this language … 


BALI: … and this … So the reason we got interested in, like, looking into code-mixing was that when we are saying that we want humans to be able to interact with machines in their most natural language, then by some estimates, half the world speaks like this! 


BALI: So we have to be able to understand exactly how they speak and, you know, be able to process and understand their language, which is code-mixed … 

HUIZINGA: Sure. Well, it seems like the human brain can pick this up and process it fairly quickly and easily, especially if it knows many languages. For a machine, it would be much more difficult? 

BALI: It is. So initially, it was really difficult because, you know, the way we created systems was one language at a time … 


BALI: … right. And it’s not about having an English engine and a Hindi engine available. It doesn’t work that way. 


BALI: So you’d really need something that, you know, is able to tackle the languages together. And in some theories, this is almost considered a language of its own because it’s not like you’re randomly mixing. There is a structure to … 

HUIZINGA: Oh, is there? 

BALI: Yeah. Where you can, where you can’t … 

HUIZINGA: Gotcha. 

BALI: You know, so there is a structure or grammar, you can say, of code-mixing. So we went after that. We, kind of, created tools which could generate grammatically viable code-mixed sentences given parallel data, etc. 

HUIZINGA: That’s awesome. Amazing.

BALI: So, yeah, it takes effort to do it. But again, right now, because the generative AI models have at their disposal, you know, so many languages and at least, like, theoretically can work in many, many, many languages, you know, code-mixing might be an easier problem to solve right now. 

HUIZINGA: Right. OK. So we’re talking mostly about widely used languages, and you’re very concerned right now on this idea of low-resource languages. So unpack what you mean by low-resource, and what’s missing from the communities that speak those languages? 

BALI: Yeah. So when we say low-resource languages, we typically mean that languages do not have, say, digital resources, linguistic resources, language resources, that would enable technology building. It doesn’t mean that the communities themselves are impoverished in culture or linguistic richness, etc., right. But the reason why these communities do not have a lot of language resources, linguistic resources, digital resources, most of the time, it is because they are also marginalized in other ways … social and economic marginalization. 


BALI: And these are … if you look at them, they’re not ti—I mean, of course, some of them are tiny, but when we say low-resource communities, we are talking about really big numbers. 

HUIZINGA: Oh, really? 

BALI: Yeah. So one of the languages that I have worked with—language communities that I’ve worked with—speak a language called Gondi, which is like a Dravidian language that is spoken in … like a South Indian language that is spoken in north, central-north area. It’s a tribal language, and it’s got around three million speakers.

HUIZINGA: Oh, wow! 

BALI: Yeah. That’s like more than Welsh, … 


BALI: … right? But because socio-politically, they have been—or economically, they have been marginalized, they do not have the resources to build technologies. And, you know, when we say empower everyone and we only empower the top tier, I don’t think we fulfill our ambition to empower everyone. And like I said earlier, for these communities, all the technology that we have, digital tools that we have access to, they really matter for them. So, for example, you know, a lot of government schemes or the forest reserve laws are provided, say, in Hindi. If they are provided in Gondi, these people have a real idea of what they can do. 

HUIZINGA: Yeah. … Sure. 

BALI: Similarly, for education, you know, there are books and books and books in Hindi. There’s no book available for Gondi. So how is the next generation even going to learn the language? 


BALI: And there are many, many languages which are low resource. In fact, you know, we did a study sometime in 2020, I think, we published this paper on linguistic diversity, and there we saw that, you know, we divided languages into five categories, and the topmost, which has all the resources to build every possible technology, has only five languages, right. And more than half of the world’s languages are at the bottom. So it is a big problem.

HUIZINGA: Yeah. Let’s talk about some of the specific technologies you’re working on. And I want to go from platform to project because you’ve got a big idea in a platform you call VeLLM. Talk about that. 

BALI: So VeLLM, which actually means jaggery—the sweet, sugary jaggery—in Tamil, one of the languages in India … 

HUIZINGA: Let me, let me interject that it’s not vellum like the paper, or what you’re talking about. It’s capital V, little e, and then LLM, which stands for large language model? 

BALI: So universal, the “V” comes from there. Empowerment, “e” comes from there. Through large language models … 

HUIZINGA: Got it. OK. But you shortened it to VeLLM. 

BALI: Yeah. 


BALI: So, so the thing with VeLLM is that a bunch of us got together just when this whole GPT was released, etc. We have a very strong group that works on technologies for empowerment in the India lab, Microsoft Research India. And we got together to see what it is that we can do now that we have access to such a strong and powerful tool. And we started thinking of the work that we’ve been doing, which is to, you know, build these technologies for specific areas and specific languages, specific demographies. So we, kind of, put all that knowledge and all that experience we had and thought of like, how can we scale that, really, across everything that we do? So VeLLM, at its base, you know, takes a GPT-like LLM, you know, as a horizontal across everything. On top of it, we have again, horizontals of machine learning, of multilingual tools and processes, which allow us to take the outputs from, say, GPT-like things and adapt it to different languages or, you know, some different kind of domain, etc. And then we have verticals on top of it, which allow people to build specific applications. 

HUIZINGA: Let me just go back and say GPT … I think most of our audience will know that that stands for generative pretrained transformer models. But just so we have that for anyone who doesn’t know, let’s anchor that. So VeLLM basically was an enabling platform … 

BALI: Yes. 

HUIZINGA: … on which to build specific technologies that would solve problems in a vertical application. 

BALI: Yes. Yes. And because it’s a platform, we’re also working on tools that are needed across domains … 

HUIZINGA: Oh, interesting. 

BALI: … as well as tools that are needed for specific domains. 

HUIZINGA: OK, so let’s talk about some of the specifics because we could get into the weeds on the tools that everybody needs, but I like the ideas that you’re working on and the specific needs that you’re meeting, the felt-need thing that gets an idea going. So talk about this project that you’ve worked on called Kahani. Could you explain what that is, and how it works? It’s really interesting to me. 

BALI: So Kahani, actually, is about storytelling, culturally appropriate storytelling, with spectacular images, as well as like textual story. 

HUIZINGA: So visual storytelling? 

BALI: Visual storytelling with the text. So this actually started when my colleague Sameer Segal, he was trying to use generative AI to create stories for his daughter, and he discovered that, you know, things are not very culturally appropriate! So I’ll give an example that, you know, if you want to take Frozen and take it to, like, the south Indian state of Kerala, you’ll have the beaches of Kerala, you’ll even have the coconut trees, but then you will have this blond princess in a princess gown …


BALI: … who’s there, right? So that’s where we started discussing this, and we, kind of, started talking about, how can we create visuals that are anchored on text of a story that’s culturally appropriate? So when we’re talking about, say, Little Red Riding Hood, if we ask the generative AI model, OK, that I want the story of Little Red Riding Hood but in an Indian context, it does a fantastic job. It actually gives you a very nice story, which, you know, just reflects the Red Riding Hood story into an Indian context. But the images don’t really … 


BALI: … match at all. So that’s where the whole Kahani thing started. And we did a hackathon project on it. And then a lot of people got interested. It’s an ongoing project, so I won’t say that it’s out there yet, but we are very excited about it, because, think of it, we can actually create stories for children, you know, which is what we started with, but we can create so much more media, so much more culturally appropriate storytelling, which is not necessarily targeted at children.

HUIZINGA: Yeah, yeah. 

BALI: So that’s what Kahani is about. 

HUIZINGA: OK. And I saw a demo of it that your colleague did for Research Forum here, and there was an image of a girl—it was beautiful—and then there was a mask of some kind or a … what was that? 

BALI: So the mask is called Nazar Battu, which is actually, you have these masks which are supposed to drive away the evil eye. So that’s what the mask was about. It’s a very Indian thing. You know, when you build a nice house, you put one on top of it so that the envious glances are, like, kept at bay. So, yeah, so that’s what it was. 

HUIZINGA: And was there some issue of the generative AI not really understanding what that was? 

BALI: No, it didn’t understand what it was. 

HUIZINGA: So then can you fix that and make it more culturally aware? 

BALI: So that’s what we are trying to do for the image thing. So we have another project on culture awareness where we are looking at understanding how much generative AI knows about other cultures. 

HUIZINGA: Interesting. 

BALI: So that’s a simultaneous project that’s happening. But in Kahani, a lot of it is, like, trying to get reference images, you know … 

HUIZINGA: Yeah. … Into the system? 

BALI: Into the system … 

HUIZINGA: Gotcha … 

BALI: … and trying to anchor on that. 

HUIZINGA: Mmmm. So—and we’re not going to talk about that project, I don’t think—but … how do you assess whether an AI knows? By just asking it? By prompting and seeing what happens? 

BALI: Yeah, yeah, yeah. So in another project, what we did was, we asked humans to play a game to get cultural artifacts from them. The problem with asking humans what cultural artifacts are important to them is we don’t think of like things as culture, right. [LAUGHS] This is food! 

HUIZINGA: It’s just who we are! 

BALI: This is my food. Like, you know, it’s not a culturally important artifact. This is how I greet my parents. It’s not like culturally … 

HUIZINGA: So it’s just like fish swimming in water. You don’t see the water. 

BALI: Exactly. So we gamified this thing, and we were able to get certain cultural artifacts, and we tried to get generative AI models to tell us about the same artifacts. And it didn’t do too well … [LAUGHS] 

HUIZINGA: But that’s why it’s research! 

BALI: Yes! 

HUIZINGA: You try, you iterate, you try again … cool. As I mentioned earlier, I was a high school English teacher and an English major. I’m not correcting your grammar because it’s fantastic.

BALI: Thank you. 

HUIZINGA: But as a former educator, one of the projects I felt was really compelling that you’re working on is called Shiksha. It’s a copilot in education. Tell our audience about this.

BALI: So this is actually our proof of concept for the VeLLM platform. Since almost all of us were interested in education, we decided to go for education as the first use case that we’re going to work on. And actually, it was a considered decision to go target teachers instead of students. I mean, you must have seen a lot of work being done on taking generative AI to students, right. But we feel that, you know, teachers are necessary to teach because they’re not just giving you information about the subject. They’re giving you skills to learn, which hopefully will stay with you for a lifetime, right. And if we enable teachers, they will enable so many hundreds of students. One teacher can enable thousands of students, right, over her career. So instead of, like, going and targeting students, if we make it possible for teachers to do their jobs more effectively or, like, you know, help them get over the problems they have, then we are actually creating an ecosystem where things will scale really fast, really quickly. And in India, you know, this is especially true because the government has actually come up with some digital resources for teachers to use, but there’s a lot more that can be done. So we interviewed about a hundred-plus teachers across different parts of the country. And this is the, you know, discover part. 


BALI: And we found out that lesson plans are a big headache! [LAUGHS] 

HUIZINGA: Yes, they are! Can confirm! 

BALI: Yeah. And they spend a lot of time doing lesson plans because they’re required to create a lesson plan for every class they teach … 

HUIZINGA: Sure. With learning outcomes … 

BALI: Exactly. 

HUIZINGA: All of it. 

BALI: All of it. So that’s where we, you know, zeroed in on—how to make it easier for teachers to create lesson plans. And that’s what the Shiksha project is about. You know, there is an enrollment process where the teachers say what subject they’re teaching, what classes they’re teaching, what boards, because there are different boards of education … 

HUIZINGA: Right … 

BALI: … which have different syllabus. So all that. But after that, it takes less than seven minutes for a teacher to create an entire lesson plan for a particular topic. You know, class assignments, class activities, home assignments, homework—everything! Like the whole thing in seven minutes! And these teachers have the ability to go and correct it. Like, it’s an interactive thing. So, you know, they might say, I think this activity is too difficult for my students. 


BALI: Can I have, like, an easier one? Or, can I change this to this? So it allows them to interactively personalize, modify the plan that’s put out. And I find that really exciting. And we’ve tested this with the Sikshana Foundation, which works with teachers in India. We’ve tested this with them. The teachers are very excited and now Sikshana wants to scale it to other schools. 

HUIZINGA: Right … well, my first question is, where were you when I was teaching, Kalika? 

BALI: There was no generative AI! 

HUIZINGA: No. In fact, we just discovered the fax machine when I started teaching. Oh, that dates me! You know, back to what you said about teachers being instrumental in the lives of their students. You know, we can remember our favorite teachers, our best teachers. We don’t remember a machine. 


HUIZINGA: And what you’ve done with this is to embody the absolute sort of pinnacle of what AI can do, which is to be the collaborator, the assistant, the augmenter, and the helper so that the teacher can do that inspirational, connective-tissue job with the students without having to, like, sacrifice the rest of their life making lesson plans and grading papers. Oh, my gosh. OK. On the positive side, we’ve just talked about what this work proposes and how it’s good, but I always like to dig a little bit into the potential unintended consequences and what could possibly go wrong if, in fact, you got everything right. So I’ll anchor this in another example. When GPT models first came out, the first reaction came from educators. It feels like we’re in a bit of a paradigm shift like we were when the calculator and the internet even came out. [It’s] like, how do we process this? So I want to go philosophically here and talk about how you foresee us adopting and moving forward with generative AI in education, writ large. 

BALI: Yeah, I think this is a question that troubles a lot of us and not just in education, but in all spheres that generative AI is … 


BALI: … art … 

HUIZINGA: … writing … 

BALI: … writing … 

HUIZINGA: … journalism … 

BALI: Absolutely. And I think the way I, kind of, think about it in my head is it’s a tool. At the end of it, it is a tool. It’s a very powerful tool, but it is a tool, and humans must always have the agency over it. And we need to come up, as a society, you know, we need to come up with the norms of using the tool. And if you think about it, you know, internet, taking internet as an example, there is a lot of harm that internet has propagated, right. The darknet and all the other stuff that happens, right. But on the whole, there are regulations, but there are also an actual consensus around what constitutes the positive use of internet, right. 

HUIZINGA: Sure, yeah. 

BALI: Nobody says that, for example, deepfakes are … 

HUIZINGA: Mm-hmm. Good … 

BALI: … good, right. So we have to come from there and think about what kind of regulations we need to have in place, what kind of consensus we need to have in place, what’s missing. 

HUIZINGA: Right. Another project that has been around, and it isn’t necessarily on top of VeLLM, but it’s called Karya, and you call it a social impact organization that serves not just one purpose, but three. Talk about that. 

BALI: Oh, Karya is my favorite! [LAUGHS] So Karya started as a research project within Microsoft Research India, and this was the brainchild again of my colleague—I have like some of the most amazing colleagues, too, that I work with!—called Vivek Seshadri. And Vivek wanted to create, you know, digital work for people who do not have access to such work. So he wanted to go to the rural communities, to people who belong to slightly lower socioeconomic demographies, and provide work, like microtasks kind of work, gig work, to them. And he was doing this, and then we started talking, and I said, you know, we need so much data for all these languages and all these different tasks, and that could be, like, a really cool thing to try on Karya, and that’s where it all started, my involvement with Karya, which is still pretty strong. And Karya then became such a stable project that Microsoft Research India spun it out. So it’s now its own standalone startup right now like a social enterprise, and they work on providing digital work. They work on providing skills, like upskilling. They work on awareness, like, you know, making people aware of certain social, financial, other such trainings. So what’s been most amazing is that Karya has been able to essentially collect data for AI in the most ethical way possible. They pay their workers a little over the minimal wage. They also have something called data ownership practice, where the data that is created by, say, me, I have some sort of ownership on it. So what that means is that every time Karya sells a dataset, a royalty comes back … 


BALI: Yeah! To the workers. 

HUIZINGA: OK, we need to scale this out! [LAUGHS] OK. So to give a concrete example, the three purposes would be educational, financial—on their end—and data collection, which would ultimately support a low-resource language by having digital assets.

BALI: Absolutely! 

HUIZINGA: So you could give somebody something to read in their language … 

BALI: Yeah. 

HUIZINGA: … that would educate them in the process. They would get paid to do it, and then you would have this data. 

BALI: Yes! 

HUIZINGA: OK. So cool. So simple. 

BALI: Like I said, it’s my favorite project. 

HUIZINGA: I get that. I totally get that. 

BALI: And they … they’ve been, you know, they have been winning awards and things all over for the work that they’re doing right now. And I am very involved in one project with them, which is to do with gender-intentional AI, or gender-intentional datasets for AI, for Indian languages. And that’s really crucial because, you know, we talk about gender bias in datasets, etc., but all that understanding comes from a very Western perspective and for languages like English, etc. They do not translate very well to Indian languages. 


BALI: And in this particular project, we’re looking at first, how to define gender bias. How do we even get data around gender bias? What does it even mean to say that technology is gender intentional? 

HUIZINGA: Right. All right, well, let’s talk a little bit about what I like to call outrageous ideas. And these are the ones that, you know, on the research spectrum from sort of really practical applied research to blue sky get dismissed or viewed as unrealistic or unattainable. So years ago—here’s a little story about you—when you told your tech colleagues that you wanted to work with the world’s most marginalized languages, they told you you’d only marginalize yourself. 

BALI: Yes! 

HUIZINGA: But you didn’t say no. You didn’t say no. Um, two questions. Did you feel like your own idea was outrageous back then? And do you still have anything outrageous yet to accomplish in this plan? 

BALI: Oh, yeah! I hope so! Yeah. No, I do think, in some sense, the pushback that I got for my idea makes me think it was outrageous. I didn’t think it was outrageous at all at that time! [LAUGHS] I thought it was a very reasonable idea! But there was a very solid pushback and not just from your colleagues. You know, for researchers, publishing papers is important! No one would publish a paper which focused only on, say, Indian languages or low-resource languages. We’ve come a very long way even in the research community on that, right. We kept pushing, pushing, pushing! And now, there are tracks, there are workshops, there are conferences which are devoted to multilingual and low-resource languages. When I said I wanted to work on Hindi, and Hindi is the biggest language in India, right. And even for that, I was told, why don’t you work on German instead? And I’m like, there are lots of people working on German who will solve the problems with German! Nobody is looking at Hindi! I mean, people should work on all the languages. People should work on German, but I don’t want to work on German! So there was a lot of pushback back then, and I see a little bit of that with the very low-resource languages even now. And I think some people think it’s a “feel-good” thing, whereas I think it’s not. I think it’s a very economically viable, necessary thing to build technology for these communities, for these languages. No one thought Hindi was economically viable 15 years ago, for whatever reason … 

HUIZINGA: That … that floors me … 

BALI: Yeah, but, you know, we’re not talking about tens of thousands of people in some of these languages; we’re talking about millions. 


BALI: I still think that is a job that I need to continue, you know, pushing back on. 

HUIZINGA: Do you think that any of that sort of outrageous reaction was due to the fact that the technology wasn’t as advanced as it is now and that it might have changed in terms of what we can do? 

BALI: There was definitely the aspect of technology there that it was just quite difficult and very, very resource-intensive to build it for languages which did not have resources. You know, there was a time when we were talking about how to go about doing this, and because people in various big tech companies, people did not really remember a time when, for English, they had to start data collection from scratch because everyone who was working on, say, English at that time was building on what people had done years and years ago. So they could not even conceptualize that you had to start from scratch for anything, right. But now with the technology as well, I’m quite optimistic and trying to think of how cool it would be to do, you know, smaller data collections and fine-tuned models specifically and things like that, so I think that the technology is definitely one big thing, but economics is a big factor, too. 

HUIZINGA: Mmm-hmm. Well, I’m glad that you said it isn’t just the feel good, but it actually would make economic sense because that’s some of the driver behind what technologies get “greenlit,” as it were. Is there anything outrageous now that you could think of that, even to you, sounds like, oh, we could never do that … 

BALI: Well … I didn’t think HAL was outrageous, so I’m not … [LAUGHS] 

HUIZINGA: Back to HAL 9000! [LAUGHS] 

BALI: Yeah, so I don’t think of things as outrageous or not. I just think of things as things that need to get done, if that makes any sense? 

HUIZINGA: Totally. Maybe it’s, how do we override “Open the pod bay door, HAL”—“No, I’m sorry, Dave. I can’t do that”? [LAUGHS] 

BALI: Yes. [LAUGHS] Yeah… 

HUIZINGA: Well, as we close—and I’m sad to close because you are so much fun—I want to do a little vision casting, but in reverse. So let’s fast-forward 20 years and look back. How have the big ideas behind your life’s work impacted the world, and how are people better off or different now because of you and the teams that you’ve worked with? 

BALI: So the way I see it is that people across the board, irrespective of the language they speak, the communities they belong to, the demographies they represent, can use technology to make their lives, their work, better. I know it sounds like really a very big and almost too good to be true, but that’s what I’m aiming for. 

HUIZINGA: Well, Kalika Bali, I’m so grateful I got to talk to you in person. And thanks for taking time out from your busy trip from India to sit down with me and our audience and share your amazing ideas. 


BALI: Thank you so much, Gretchen.


The post Ideas: Language technologies for everyone with Kalika Bali appeared first on Microsoft Research.

Research Focus: Week of April 1, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

In the same way that tools can help people complete tasks beyond their innate abilities, tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a surprisingly understudied question is how accurately an LLM uses tools for which it has been trained.

In a recent paper: LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, researchers from Microsoft find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate of 30% to 60%, which is too unreliable for practical use. They propose a biologically inspired method for tool-augmented LLMs – simulated trial and error (STE) – that orchestrates three key mechanisms: trial and error, imagination, and memory. STE simulates plausible scenarios for using a tool, then the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration. Experiments on ToolBench show STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings.

Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

The latest LLMs have surpassed the performance of older language models on several tasks and benchmarks, sometimes approaching or even exceeding human performance. Yet, it is not always clear whether this is due to the increased capabilities of these models, or other effects, such as artifacts in datasets, test dataset contamination, and the lack of datasets that measure the true capabilities of these models.

As a result, research to comprehend LLM capabilities and limitations has surged of late. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. In a recent paper: MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, researchers from Microsoft aim to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, Gemini, Gemma and Llama2) by comparing them on the same set of multilingual datasets. Their benchmark comprises 22 datasets covering 81 languages including several low-resource African languages. They also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Experiments show that GPT-4 and PaLM2 outperform the Llama and Mistral models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.

Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is a process that creates text descriptions for audio recordings. Unlike Closed Captioning, which transcribes speech, AAC aims to describe all sounds in the audio (e.g., a muffled rumble with people talking in the background while a siren blares in the distance). Typical AAC systems require expensive curated data of audio-text pairs, which often results in a shortage of suitable data, impeding model training.

In this paper: Training Audio Captioning Models without Audio, researchers from Microsoft and Carnegie Mellon University propose a new paradigm for training AAC systems, using text descriptions alone, thereby eliminating the requirement for paired audio and text descriptions. Their approach leverages CLAP, a contrastive learning model that uses audio and text encoders to create a shared vector representation between audio and text. For instance, the text “siren blaring” and its corresponding audio recording would share the same vector. The model is trained on text captions: a GPT language decoder generates captions conditioned on the pretrained CLAP text encoder and a mapping network. During inference, audio input is first converted to its vector using the pretrained CLAP audio encoder and then a text caption is generated.

The researchers find that the proposed text-only framework competes well with top-tier models trained on both text and audio, indicating that efficient text-to-audio transfer is possible. They also demonstrate the ability to incorporate various writing styles, such as a humorous one, which is beneficial for tailoring caption generation to specific fields. Finally, they highlight that enriching training with LLM-generated text leads to improved performance and has the potential to increase vocabulary diversity.

The post Research Focus: Week of April 1, 2024 appeared first on Microsoft Research.

AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come. 

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity. 

This episode features Principal Researcher Ida Momennejad. Momennejad is applying her expertise in cognitive neuroscience and computer science to better understand—and extend—AI capabilities, particularly when it comes to multistep reasoning and short- and long-term planning. Llorens and Momennejad discuss the notion of general intelligence in both humans and machines; how Momennejad and colleagues leveraged prior research into the cognition of people and rats to create prompts for evaluating large language models; and the case for the development of a “prefrontal cortex” for AI.



ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. In this podcast series, I share conversations with fellow researchers about the latest developments in AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Ida Momennejad. Ida works at Microsoft Research in New York City at the intersection of machine learning and human cognition and behavior. Her current work focuses on building and evaluating multi-agent AI architectures, drawing from her background in both computer science and cognitive neuroscience. Over the past decade, she has focused on studying how humans and AI agents build and use models of their environment.


Let’s dive right in. We are undergoing a paradigm shift where AI models and systems are starting to exhibit characteristics that I and, of course, many others have described as more general intelligence. When I say general in this context, I think I mean systems with abilities like reasoning and problem-solving that can be applied to many different tasks, even tasks they were not explicitly trained to perform. Despite all of this, I think it’s also important to admit that we—and by we here, I mean humanity—are not very good at measuring general intelligence, especially in machines. So I’m excited to dig further into this topic with you today, especially given your background and insights into both human and machine intelligence. And so I just want to start here: for you, Ida, what is general intelligence?

IDA MOMENNEJAD: Thank you for asking that. We could look at general intelligence from the perspective of history of cognitive science and neuroscience. And in doing so, I’d like to mention its discontents, as well. There was a time where general intelligence was introduced as the idea of a kind of intelligence that was separate from what you knew or the knowledge that you had on a particular topic. It was this general capacity to acquire different types of knowledge and reason over different things. And this was at some point known as g, and it’s still known as g. There have been many different kinds of critiques of this concept because some people said that it’s very much focused on the idea of logic and a particular kind of reasoning. Some people made cultural critiques of it. They said it’s very Western oriented. Others said it’s very individualistic. It doesn’t consider collective or interpersonal intelligence or physical intelligence. There are many critiques of it. But at the core of it, there might be something useful and helpful. And I think the useful part is that there could be some general ability in humans, at least the way that g was intended initially, where they can learn many different things and reason over many different domains, and they can transfer ability to reason over a particular domain to another. And then in the AGI, or artificial general intelligence, notion of it, people took this idea of many different abilities or skills for cognitive and reasoning and logic problem-solving at once. There have been different iterations of what this means in different times. In principle, the concept in itself does not provide the criteria on its own. Different people at different times provide different criteria for what would be the artificial general intelligence notion. Some people say that they have achieved it. Some people say we are on the brink of achieving it. Some people say we will never achieve it. 
However, there is this idea, if you look at it from an evolutionary and neuroscience and cognitive neuroscience lens, that in evolution, intelligence has evolved multiple times in a way that is adaptive to the environment. So there were organisms that needed to be adaptive to the environment where they were, that intelligence has evolved in multiple different species, so there’s not one solution to it, and it depends on the ecological niche that that particular species needed to adapt to and survive in. And it’s very much related to the idea of being adaptive of certain kinds of, different kinds of problem-solving that are specific to that particular ecology. There is also this other idea that there is no free lunch and the no-free-lunch theorem, that you cannot have one particular machine learning solution that can solve everything. So the idea of general artificial intelligence in terms of an approach that can solve everything and there is one end-to-end training that can be useful to solve every possible problem that it has never seen before seems a little bit untenable to me, at least at this point. What does seem tenable to me in terms of general intelligence is if we understand and study, the same way that we can do it in nature, the foundational components of reasoning, of intelligence, of different particular types of intelligence, of different particular skills—whether it has to do with cultural accumulation of written reasoning and intelligence skills, whether it has to do with logic, whether it has to do with planning—and then working on the particular types of artificial agents that are capable of putting these particular foundational building blocks together in order to solve problems they’ve never seen before. A little bit like putting Lego pieces together. 
So to wrap it up, to sum up what I just said, the idea of general intelligence had a more limited meaning in cognitive science, referring to human ability to have multiple different types of skills for problem-solving and reasoning. Later on, it was also, of course, criticized in terms of the specificity of it and ignoring different kinds of intelligence. In AI, this notion has been having many different kinds of meanings. If we just mean it’s a kind of a toolbox of general kinds of intelligence for something that can be akin to an assistant to a human, that could make sense. But if we go too far and use it in the kind of absolute notion of general intelligence, as it has to encompass all kinds of intelligence possible, that might be untenable. And also perhaps we shouldn’t think about it in terms of a lump of one end-to-end system that can get all of it down. Perhaps we can think about it in terms of understanding the different components that we have also seen emerge in evolution in different species. Some of them are robust across many different species. Some of them are more specific to some species with a specific ecological niche or specific problems to solve. But I think perhaps it could be more helpful to find those cognitive and other interpersonal, cultural, different notions of intelligence; break them down into their foundational building blocks; and then see how a particular artificial intelligence agent can bring together different skills from this kind of a library of intelligence skills in order to solve problems it’s never seen before.

LLORENS: There are two concepts that jump out at me based on what you said. One is artificial general intelligence and the other is humanlike intelligence or human-level intelligence. And you’ve referenced the fact that, you know, oftentimes, we equate the two or at least it’s not clear sometimes how the two relate to each other. Certainly, human intelligence has been an important inspiration for what we’ve done—a lot of what we’ve done—in AI and, in many cases, a kind of evaluation target in terms of how we measure progress or performance. But I wonder if we could just back up a minute. Artificial general intelligence and humanlike, human-level intelligence—how do these two concepts relate to you?

MOMENNEJAD: Great question. I like that you asked to me because I think it would be different for different people. I’ve written about this, in fact. I think humanlike intelligence or human-level intelligence would require performance that is similar to humans, at least behaviorally, not just in terms of what the agent gets right, but also in terms of the kinds of mistakes and biases that the agent might have. It should look like human intelligence. For instance, humans show primacy bias, recency bias, variety of biases. And this seems like it’s unhelpful in a lot of situations. But in some situations, it helps to come with fast and frugal solutions on the go. It helps to summarize certain things or make inferences really fast that can help in human intelligence, for instance. There is analogical reasoning. That is, there are different types of intelligence that humans do. Now, if you look at what are tasks that are difficult and what are tasks that are easier for humans and compare that to a, for instance, let’s say just a large language model like GPT-4, you will see whether they find similar things simple and similar things difficult or not. When they don’t find similar things easy or difficult, I think that we should not say that this is humanlike per se, unless we mean for a specific task. Perhaps on specific sets of tasks, an agent can be, can have human-level or humanlike intelligent behavior; however, if we look overall, as long as there are particular skills that are more or less difficult for one or the other, it might be not reasonable to compare them. That being said, there are many things that some AI agent and even a [programming] language would be better [than] humans at. Does that mean that they are generally more intelligent? No, it doesn’t because there are also many things that humans are far better than AI at. The second component of this is the mechanisms by which humans do the intelligent things that we do. We are very energy efficient. 
With very little amount of energy consumption, we can solve very complicated problems. If you put some of us next to each other or at least give a pen and paper to one of us, this can be even a lot more effective; however, the amount of energy consumption that it takes in order for any machine to solve similar problems is a lot higher. So another difference between humanlike intelligence or biologically inspired intelligence and the kind of intelligence that is in silico is efficiency, energy efficiency in general. And finally, the amount of data that goes into current state-of-[the-art] AI versus perhaps the amount of data that a human might need to learn new tasks or acquire new skills seem to be also different. So it seems like there are a number of different approaches to comparing human and machine intelligence and deriving what are the criteria for a machine intelligence to be more humanlike. But other than the conceptual aspect of it, it’s not clear that we necessarily want something that’s entirely humanlike. Perhaps we want in some tasks and in some particular use cases for the agent to be humanlike but not in everything.

LLORENS: You mentioned some of the ways in which human intelligence is inferior or has weaknesses. You mentioned some of the weaknesses of human intelligence, like recency bias. What are some of the weaknesses of artificial intelligence, especially frontier systems today? You’ve recently published some works that have gotten into new paradigms for evaluation, and you’ve explored some of these weaknesses. And so can you tell us more about that work and about your view on this?

MOMENNEJAD: Certainly. So inspired by a very long-standing tradition of evaluating cognitive capacities—those Lego pieces that bring together intelligence that I was mentioning in humans and animals—I have conducted a number of experiments, first in humans, and built reinforcement learning models over the past more than a decade on the idea of multistep reasoning and planning. It is in the general domain of reasoning, planning, and decision making. And I particularly focused on what kind of memory representations allow brains and reinforcement learning models inspired by human brain and behavior to be able to predict the future and plan the future and reason over the past and the future seamlessly using the same representations. Inspired by the same research that goes back in tradition to Edward Tolman’s idea of cognitive maps and latent learning in the early 20th century, culminating in his very influential 1948 paper, “Cognitive maps in rats and men,” I sat down with a couple of colleagues last year—exactly this time, probably—and we worked on figuring out if we can devise similar experiments to that in order to test cognitive maps and planning and multistep reasoning abilities in large language models. So I first turned some of the experiments that I had conducted in humans and some of the experiments that were done by Edward Tolman on the topic in rodents and turned them into prompts for ChatGPT. That’s where I started, with GPT-4. The reason I did that was that I wanted to make sure that I will create some prompts that have not been in the training set. My experiments, although the papers have been published, the stimuli of the experiments were not linguistic. They were visual sequences that the human would see, and they would have to have some reinforcement learning and learn from the sequences to make inferences about relationships between different states and find what is the path that would give them optimal rewards. 
Very simple human reinforcement learning paradigms, however, with different kind of structures. The inspirations that I had drawn from the cognitive maps works by Edward Tolman and others was in this idea that in order for a creature, whether it’s a rodent, a human, or a machine, to be able to reason in [multiple] steps, plan, and have cognitive maps, which is simply a representation of the relational structure of the environment, in order for a creature to have these abilities or these capacities, it means that the creature needs to be sensitive and adaptive to local changes in the environment. So I designed the, sort of, the initial prompts and recruited a number of very smart and generous-with-their-time colleagues, who … we sat together and created these prompts in different domains. For instance, we also created social prompts. We also created the same kind of graph structures but for reasoning over social structures. For instance, I say, Ashley’s friends with Matt. Matt is friends with Michael. If I want to pass a message to Michael, what is the path that I can choose? Which would be, I have to tell Ashley. Ashley will tell Matt. Matt will tell Michael. This is very similar to another paradigm which was more like a maze, which would be similar to saying, there is a castle; it has 16 rooms. You enter Room 1. You open the door. It opens to Room 2. In Room 2, you open the door, and so on and so forth. So you describe, using language, the structure of a social environment or the structure of a spatial environment, and then you ask certain questions that have to do with getting from A to B in this social or spatial environment from the LLM, or you say, oh, you know, Matt and Michael don’t talk to each other anymore. So now in order to pass a message, what should I do? So I need to find a detour. Or, for instance, I say, you know, Ashley has become close to Michael now. 
So now I have a shortcut, so I can directly give the message to Ashley, and Ashley can directly give the message to Michael. My path to Michael is shorter now. So finding things like detours, shortcuts, or if the reward location changes, these are the kinds of changes that, inspired by my own past work and inspired by the work of Tolman and others, we implemented in all of our experiments. This led to 15 different tasks for every single graph, and we have six graphs total of different complexity levels with different graph theoretic features, and [for] each of them, we had three domains. We had a spatial domain that was with rooms that had orders like Room 1, Room 2, Room 3; a spatial domain that there was no number, there was no ordinal order to the rooms; and a social environment where it was the names of different people and so the reasoning was over social, sort of, spaces. So you can see this is a very large number of tasks. It’s 6 times 15 times 3, and each of the prompts we ran 30 times for different temperatures. Three temperatures: 0, 0.5, and 1. And for those who are not familiar with this, a temperature of a large language model determines how random it will be or how much it will stick to the first or the best option that comes to it at the last layer. And so when there are some problems that may be the first obvious answer that it finds are not good, perhaps increasing the temperature could help, or perhaps a problem that needs precision, increasing the temperature would make it worse. So based on these ideas, we also tried it for different temperatures. And we tested eight different language models like this in order to systematically evaluate their ability for this multistep reasoning and planning, and the framework that we use—we call it CogEval—and CogEval is a framework that’s not just for reasoning and multistep planning. Other tasks can be used in this framework in order to be tested, as well. 
And the first step of it is always to operationalize the cognitive capacity in terms of many different tasks like I just mentioned. And then the second task is designing the specific experiments with different domains like spatial and social; with different structures, like the graphs that I told you; and with different kind of repetitions and with different tasks, like the detour, shortcut, and the reward revaluation, transition revaluation, and just traversal, all the different tasks that I mentioned. And then the third step is to generate many prompts and then test them with many repetitions using different temperatures. Why is that? I think something that Sam Altman had said is relevant here, which is sometimes with some problems, you ask GPT-4 a hundred times, and one out of those hundred, it would give the correct answer. Sometimes 30 out of a hundred, it will give the correct answer. You obviously want it to give hundred out of hundred the correct answer. But we didn’t want to rely on just one try and miss the opportunity to see whether it could give the answer if you probed it again[1]. And in all of the eight large language models, we saw that none of the large language models was robust to the graph structure. Meaning, its performance got really worse as soon as the graph structure, [which] didn’t even have many nodes but just had a tree structure that was six or seven nodes, or a six- or seven-node tree was much more difficult for it to solve than a graph that had 15 nodes but had a simpler structure that was just two lines. We noted that sometimes, counterintuitively, some graph structures that you think should be easy to solve were more difficult for them. On the other hand, they were not robust to the task set. So the specific task that we tried, whether it was detour, shortcut, or it was reward revaluation or traversal, it mattered. For instance, shortcut and detour were very difficult for all of them. 
Another thing that we noticed was that all of them, including GPT-4, hallucinated paths that didn’t exist. For instance, there was no door between Room 12 and Room 16. They would hallucinate that there is a door, and they would give a response that includes that door. Another kind of failure mode that we observed was that they would fail to even find a one-step path. Let’s say between Room 7 and 8, there is a direct door. We would say, what is the path from 7 to 8? And they would take a longer path to go from one to the other. And a final mode that we observed was that they would sometimes fall into loops. Even though we would directly ask them to find the shortest path, they would sometimes fall into a loop on the way to getting to their destination, which obviously you shouldn’t do if you are trying to find the shortest path. That said, there are two differing notions of accuracy here. You can have satisficing, which means you get there; you just take a longer path. And there is this notion that you cannot get there because you used some imaginary path or you did something that didn’t make sense and you, sort of, gave a nonsensical response. We had both of those kinds of issues, so we had a lot of issues with giving nonsensical answers, repeating the question that we were asking, producing gibberish. So there were numerous kinds of challenges. What we did observe was that GPT-4 was far better than the other LLMs in this regard, at least at the time that we tested it; however, this is obviously on the basis of the particular kinds of tasks that we tried. In another study, we tried Tower of Hanoi, which is also a classic cognitive science approach to [testing] planning abilities and hierarchical planning abilities. And we found that GPT-4 scores between zero and 10 percent in the three-disk problem and zero percent for the four-disk problem. And that is when we started to think about having more brain-inspired solutions to improve that approach. But I’m going to leave that for next.
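
[Editor's sketch] The social-graph prompts and the failure modes described above (hallucinated connections, loops, valid-but-longer paths) can be made concrete with a small checker. This is a minimal illustration, not the CogEval code: the function names, the edge-set representation, and the "Me" node are assumptions introduced here.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph given as a set of edges."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route exists

def check_path(edges, path, start, goal):
    """Classify a proposed path using the failure modes from the discussion."""
    undirected = {frozenset(edge) for edge in edges}
    if not path or path[0] != start or path[-1] != goal:
        return "does not connect start and goal"
    for a, b in zip(path, path[1:]):
        if frozenset((a, b)) not in undirected:
            return "hallucinated edge"
    if len(set(path)) != len(path):
        return "loop"
    best = shortest_path(edges, start, goal)
    return "optimal" if len(path) == len(best) else "valid but longer (satisficing)"

# Hypothetical encoding of the example: I tell Ashley, Ashley tells Matt, Matt tells Michael.
friends = {("Me", "Ashley"), ("Ashley", "Matt"), ("Matt", "Michael")}
print(shortest_path(friends, "Me", "Michael"))
# Shortcut task: Ashley has become close to Michael.
print(shortest_path(friends | {("Ashley", "Michael")}, "Me", "Michael"))
# A response that skips Ashley relies on a connection that was never stated.
print(check_path(friends, ["Me", "Matt", "Michael"], "Me", "Michael"))
```

A detour task would remove an edge instead of adding one; with only the three friendships above, removing the Matt and Michael link leaves no route at all, so `shortest_path` returns `None`.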

LLORENS: So it sounds like a very extensive set of experiments across many different tasks and with many different leading AI models, and you’ve uncovered a lack of robustness across some of these different tasks. One curiosity that I have here is how would you assess the relative difficulty of these particular tasks for human beings? Would all of these be relatively easy for a person to do or not so much?

MOMENNEJAD: Great question. So I have conducted some of these experiments already and have published them before. Humans do not perform symmetrically on all these tasks, for sure; however, for instance, Tower of Hanoi is a problem that we know humans can solve. People might have seen this. It’s three little rods that are … usually, it’s a wooden structure, so you have a physical version of it, or you can have a virtual version of it, and there are different disks with different colors and sizes. There are some rules. You cannot put certain disks on top of others. So there is a particular order in which you can stack the disks. Usually what happens is that all the disks are on one side, and when I say a three-disk problem, it means you have three total disks. And there is usually a target solution that you are shown, and you’re told to get there in a particular number of moves or in a minimum number of moves without violating the rules. So in this case, the rules would be that you wouldn’t put certain disks on top of others. And based on that, you’re expected to solve the problem. And the performance of GPT-4 on Tower of Hanoi three disks is between 0 and 10 percent, and on Tower of Hanoi four disks it is zero percent, zero shot. With help, it can get better. With some support, it gets better. So in this regard, it seems like Tower of Hanoi is extremely difficult for GPT-4. It doesn’t seem to be as difficult for humans as it is for GPT-4. It seems, for some reason, that it couldn’t even improve itself when we explained the problem even further to it and explained to it what it did wrong. Sometimes, and if people want to try it out, they should, it would argue back and say, “No, you’re wrong. I did this right.” Which was a very interesting moment for us with ChatGPT. That was the experience that we had for trying it out first without giving it, sort of, more support than that, but I can tell you what we did next, but I want to make sure that we cover your other questions. 
But just to wrap this part up, inspired by tasks that have been used for evaluation of cognitive capacities such as multistep reasoning and planning in humans, it is possible to evaluate cognitive capacities and skills such as multistep reasoning and planning also in large language models. And I think that’s the takeaway from this particular study and from this general cognitive science–inspired approach. And I would like to say also it is not just human tasks that are useful. Tolman’s tasks were done in rodents. A lot of people have done experiments in fruit flies, in C. elegans, in worms, in various kinds of other species that are very relevant to testing, as well. So I think there is a general possibility of testing particular intelligence skills, evaluating it, inspired by experiments and evaluation methods for humans and other biological species.
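
[Editor's sketch] For readers who have not seen Tower of Hanoi, the rules described above can be written down in a few lines. The sketch assumes the classic variant (all disks start on one peg, only the top disk of a peg may move, and a larger disk never sits on a smaller one); the exact start and target configurations used in the study may differ. The peg names and function names are illustrative.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move list (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

def apply_moves(n, moves):
    """Replay moves from the start position, raising ValueError on a rule violation."""
    # Disks are numbered by size; the top of a peg is the end of its list.
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            raise ValueError(f"no disk to move from peg {src}")
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            raise ValueError(
                f"cannot place disk {pegs[src][-1]} on smaller disk {pegs[dst][-1]}"
            )
        pegs[dst].append(pegs[src].pop())
    return pegs

print(len(hanoi_moves(3)))               # 7 moves for the three-disk problem
print(len(hanoi_moves(4)))               # 15 moves for the four-disk problem
print(apply_moves(3, hanoi_moves(3)))    # all disks end up on peg C
```

A checker like `apply_moves` is also one way to detect the kind of impossible move that the conversation later calls a hallucination in this task.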

LLORENS: Let’s explore the way forward for AI from your perspective. You know, as you’ve described your recent works, it’s clear that you have, that your work is deeply informed by insights from cognitive science, insights from neuroscience, and recent works—your recent works—have called for the development, for example, of a prefrontal cortex for AI, and I understand this to be the part of the brain that facilitates executive function. How does, how does this relate to the, you know, extending the capabilities of AI, a prefrontal cortex for AI?

MOMENNEJAD: Thank you for that question. So let me start by reiterating something I said earlier, which is the brain didn’t evolve in a lump. There were different components of brains and nervous systems and neurons that evolved at different evolutionary scales. There are some parts of the brain that appear in many different species, so they’re robust across many species. And there are some parts of the brain that appear in some species that had some particular needs, some particular problems they were facing, or some ecological niche. What is, however, in common in many of them is that there seems to be some kind of a modular or multicomponent aspect to what we call higher cognitive function or what we call executive function. And so the kinds of animals that we ascribe some form of executive function of sorts to seem to have brains that have parts or modules that do different things. It doesn’t mean that they only do that. It’s not a very extreme Fodorian view of modularity. But it is the view that, broadly speaking, when, for instance, we observe patients that have damage to a particular part of their prefrontal cortex, it could be that they perform the same on an IQ test, but they have problems holding their relationship or their jobs. So there are different parts of the brain that selective damage to those areas, because of accidents or coma or such, it seems to impair specific cognitive capacities. So this is what very much inspired me. I have been investigating the prefrontal cortex for, I guess, 17 years now, [LAUGHS] which is a scary number to say. But been … basically since I started my PhD and even during my master’s thesis, I have been focused on the role of the prefrontal cortex in our ability for long-term reasoning and planning in not just this moment—long-term, open-ended reasoning and planning. 
Inspired by this work, I thought, OK, if I want to improve GPT-4’s performance on, let’s say, Tower of Hanoi, can we get inspired by this kind of multiple roles that different parts of the brain play in executive function, specifically different parts of the neocortex and specifically different parts of the prefrontal cortex, part of the neocortex, in humans? Can we get inspired by some of these main roles that I have studied before and ask GPT-4 to play the role of those different parts and solve different parts of the planning and reasoning problem—the multistep planning and reasoning problem—using these roles and particular rules of how to iterate over them. For instance, there is a part of the brain called anterior cingulate cortex. Among other things, it seems to be involved in monitoring for errors and signaling when there is a need to exercise more control or move from what people like to call a faster way of thinking to a slower way of thinking to solve a particular problem. And there is … so let’s call this the cognitive function of this part. Let’s call it the monitor. This is a part of the brain that monitors for when there is a need for exercising more control or changing something because there is an error maybe. There is another part of the brain and the frontal lobe that is the, for instance, dorsolateral prefrontal cortex; that one is involved in working memory and coming up with, like, simpler plans to execute. Then there is the ventromedial prefrontal cortex that is involved in the value of states and predicting what is the next state and integrating it with information from other parts of the brain to figure out what is the value. So you put all of these things together, you can basically write different algorithms that have these different components talking to each other. And we have in that paper also, written in a pseudocode style, the different algorithms that are basically akin to a tree search, in fact. 
So there is a part of the role … they’re part of the multicomponent or multi-agent realization of a prefrontal cortex-like GPT-4 solution. One part of it would propose a plan. The monitor would say, thanks for that; let me pass it on to the part that is evaluating what is the outcome of this and what’s the value of that, and get back to you. It evaluates there and comes back and says, you know, this is not a good plan; give me another one. And in this iteration, sometimes it takes 10 iterations; sometimes it takes 20 iterations. This kind of council of different types of roles, they come up with a solution that is solving the Tower of Hanoi problem. And we managed to bring the performance from 0 to 10 [percent] in GPT-4 to, I think, about 70—70 percent—in Tower of Hanoi three disks, and OOD, or out-of-distribution generalization, without giving any examples of a four disk, it could generalize to above 20 percent in four-disk problems. Another impressive thing that happened here—and we tested it on the CogEval and the planning tasks from the other experiment, too—was that it brought all of the, sort of, hallucinations from about 20 to 30 percent—in some cases, much higher percentages—to zero percent. So we had slow thinking; we had 30 iterations, so it took a lot longer. And if this is, you know, fast and slow thinking, this is very slow thinking. However, we had no hallucinations anymore. And hallucination in Tower of Hanoi would be making a move that is impossible. For instance, putting in a, kind of, a disk on top of another that you cannot do because you violate a rule or taking out a middle disk that you cannot pull out actually. So those would be the kinds of hallucinations in Tower of Hanoi. All of those also went to zero. And so that is one thing that we have done already, which I have been very excited about.

LLORENS: So you painted a pretty interesting—fascinating, really—picture of a multi-agent framework where different instances of an advanced model like GPT-4 would be prompted to play the roles of different parts of the brain and, kind of, work together. And so my question is a pragmatic one. How do you prompt GPT-4 to play the role of a specific part of the human brain? What does that prompt look like?

MOMENNEJAD: Great question. I can actually, well, we have all of that at the end of our paper, so I can even read some of them if that was of interest. But just a quick response to that is you can basically describe the function that you want the LLM—in this case GPT-4—to play. You can write that in simple language. You don’t have to tell it that this is inspired by the brain. It is completely sufficient to just basically provide certain sets of rules in order for it, in order to be able to do that.[2] For instance, after you provide the problem, sort of, description … let me see if I can actually read some part of this for you. For instance, you give it a problem, and you say, consider this problem. Rule 1: you can only move a number if it’s at this and that. You clarify the rules. Here are examples. Here are proposed moves. And then you say, for instance, your role is to find whether this particular number generated as a solution is accurate. In order to do that, you can call on this other function, which is the predictor and evaluator that says, OK, if I do this, what state do I end up in, and what is the value of that state? And you get that information, and then based on that information, you decide whether the proposed move for this problem is a good move or not. If it is, then you pass a message that says, all right, give me the next step of the plan. If it’s not, then you say, OK, this is not a good plan; propose another plan. And then the part of, the part that plays the role of, hey, here is the problem. Here are the rules. Propose the first towards the subgoal or find the subgoal towards this and propose the next step. And that one receives this feedback from the monitor. And monitor has asked the predictor and evaluator, hey, what happens if I do these things and what would be the value of that in order to say, hey, this is not a great idea. So in a way this becomes a very simple prefrontal cortex–inspired multi-agent system. 
All of them are within the same … sort of, different calls to GPT-4 but the same instance. Just, like, because we were calling it in a code, it’s just, you just call, it’s called multiple times and each time with this kind of a very simple in-context learning text that, in text, it describes, hey, here’s the kind of problem you’re going to see. Here’s the role I want you to play. And here is what other kind of rules you need to call in order to play your role here. And then it’s up to the LLM to decide how many times it’s going to call which components in order to solve the problem. We don’t decide. We can only decide, hey, cap it at 10 times, for instance, or cap it at 30 iterations and then see how it performs.
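
[Editor's sketch] The propose, monitor, and predict loop described above can be outlined without any model calls. This is a toy stand-in under stated assumptions, not the architecture from the paper: `scripted_actor` replaces the GPT-4 call that proposes moves, the evaluator is reduced to a single "do not revisit a state" check, and all function names are hypothetical. Pegs are indexed 0 to 2 and disks are numbered by size.

```python
def is_legal(state, move):
    """Rule check: only a top disk moves, and never onto a smaller disk."""
    src, dst = move
    if src == dst or not state[src]:
        return False
    return not state[dst] or state[dst][-1] > state[src][-1]

def predictor(state, move):
    """Predict the state that results from applying a legal move."""
    pegs = [list(peg) for peg in state]
    pegs[move[1]].append(pegs[move[0]].pop())
    return tuple(tuple(peg) for peg in pegs)

def monitor(state, move, visited):
    """Accept a proposal only if it is legal and does not revisit a state."""
    if not is_legal(state, move):
        return False, "illegal move"
    if predictor(state, move) in visited:
        return False, "revisits an earlier state"
    return True, "ok"

def run(start, goal, actor, max_iterations=30):
    """Ask the actor for moves until the goal is reached or the cap is hit."""
    state, plan, visited = start, [], {start}
    for _ in range(max_iterations):
        if state == goal:
            break
        move = actor(state)
        accepted, _reason = monitor(state, move, visited)
        if accepted:  # rejected proposals are dropped and the actor is asked again
            state = predictor(state, move)
            visited.add(state)
            plan.append(move)
    return plan, state == goal

def scripted_actor(proposals):
    """Stand-in for an LLM call: replays a fixed list of proposed moves."""
    remaining = iter(proposals)
    return lambda state: next(remaining)

start = ((2, 1), (), ())   # two disks on peg 0, smaller disk on top
goal = ((), (), (2, 1))
proposals = [(0, 1), (0, 1), (1, 0), (0, 2), (1, 2)]  # includes one illegal move and one backtrack
print(run(start, goal, scripted_actor(proposals)))
# ([(0, 1), (0, 2), (1, 2)], True)
```

The cap on iterations plays the role of the "cap it at 10 times" or "cap it at 30 iterations" limit mentioned above; everything else about how the real components are prompted is in the paper rather than in this sketch.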

LLORENS: So, Ida, what’s next for you and your research?

MOMENNEJAD: Thank you for that. I have always been interested in understanding minds and making minds, and this has been something that I’ve wanted to do since I was a teenager. And I think that my approaches in cognitive neuroscience have really helped me to understand minds to the extent that is possible. And my understanding of how to make minds comes from basically the work that I’ve done in AI and computer science since my undergrad. What I would be interested in is—and I have learned over the years that you cannot think about the mind in general when you are trying to isolate some components and building them—is that my interest is very much in reasoning and multistep planning, especially in complex problems and very long-term problems and how they relate to memory, how the past and the future relate to one another. And so something that I would be very interested in is making more efficient types of multi-agent brain-inspired AI but also to train smaller large language models, perhaps using the process of reasoning in order to improve their reasoning abilities. Because it’s one thing to train on outcome and outcome can be inputs and outputs, and that’s the most of the training data that LLMs receive. But it’s an entirely different approach to teach the process and probe them on different parts of the process as opposed to just the input and output. So I wonder whether with that kind of an approach, which would require generating a lot of synthetic data that relates to different types of reasoning skills, whether it’s possible to teach LLMs reasoning skills, and by reasoning skills, I mean very clearly operationalized—similar to the CogEval approach—operationalized, very well-researched, specific cognitive constructs that have construct validity and then operationalizing them in terms of many tasks. 
And something that’s important to me is a very important idea and a part of intelligence that maybe I didn’t highlight enough in the first part is being able to transfer to tasks that they have never seen before, and they can piece together different intelligence skills or reasoning skills in order to solve them. Another thing that I have done and I will continue to do is collective intelligence. So we talked about multi-agent systems, that they are playing the roles of different parts inside one brain. But I’ve also done experiments with multiple humans and how different structures of human communication leads to better memory or problem-solving. Humans, also, we invent things; we innovate things in cultural accumulation, which requires [building] on a lot of … some people do something, I take that outcome, take another outcome, put them together, make something. Someone takes my approach and adds something to it; makes something else. So this kind of cultural accumulation, we have done some work on that with deep reinforcement learning models that share their replay buffer as a way of sharing skill with each other; however, as humans become a lot more accustomed to using LLMs and other generative AI, basically generative AI would start participating in this kind of cultural accumulation. So the notion of collective cognition, collective intelligence, and collective memory will now have to incorporate the idea of generative AI being a part of it. And so I’m also interested in different approaches to modeling that, understanding that, optimizing that, identifying in what ways it’s better.[3] We have found both in humans and in deep reinforcement learning agents, for instance, that particular structures of communication that are actually not the most energy-consuming one; it’s not all-to-all communication, but particular partially connected structures are better for innovation than others. 
And some other structures might be better for memory or collective memory converging with each other.[4] So I think it would be very interesting—the same way that we are looking at what kind of components talk to each other in one brain to solve certain problems—to think about what kind of structures or roles can interact with each other, in what shape and in what frequency of communication, in order to solve larger, sort of, cultural accumulation problems.


LLORENS: Well, that’s a compelling vision. I really look forward to seeing how far you and the team can take it. And thanks for a fascinating discussion.

MOMENNEJAD: Thank you so much.


[1] Momennejad notes that repetitive probing allowed her and her colleagues to report the mean and standard deviation of the accuracy over all the responses with corresponding statistics rather than merely reporting the first or the best response.
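
[Editor's sketch] The size of the experiment grid and the mean and standard deviation reporting mentioned in this note can be illustrated as follows. The counts come from the conversation (six graphs, 15 tasks, three domains, three temperatures, 30 repetitions); reading the 30 repetitions as applying to each temperature is an assumption, and the scores passed to `summarize` are made-up placeholders, not results.

```python
from itertools import product
from statistics import mean, stdev

GRAPHS = range(6)
TASKS = range(15)
DOMAINS = ("spatial_ordered", "spatial_unordered", "social")  # labels are illustrative
TEMPERATURES = (0.0, 0.5, 1.0)
REPETITIONS = 30  # assumed to apply to each temperature

prompts = list(product(GRAPHS, TASKS, DOMAINS))
print(len(prompts))                                     # 270 distinct prompts
print(len(prompts) * len(TEMPERATURES) * REPETITIONS)   # 24300 calls per model under this reading

def summarize(scores):
    """Mean and sample standard deviation of per-response accuracy scores."""
    return mean(scores), stdev(scores)

# Placeholder scores for one prompt at one temperature (1 = correct, 0 = incorrect).
print(summarize([1, 0, 0, 1]))
```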

[2] Momennejad notes that a “convenient and interesting fact about these modules or components or roles is that they’re very similar to some components in reinforcement learning, like actor and critic and tree search. And people have made prefrontal cortex–inspired models in deep learning in the past. This affinity to RL makes it easier to extend this framework to realize various RL algorithms and the sorts of problems one could solve with them using LLMs. Another feature is that they don’t all solve the big problem. There’s an orchestrator that assigns subgoals and passes it on, then the actor’s input and output or the monitor or evaluator’s input and output are parts of the problem, not all of it. This makes the many calls to GPT-4 efficient and is comparable to the local view or access of heterogeneous agents, echoing the classic features of a multi-agent framework.”

[3] Momennejad notes that one task she and her colleagues have used is similar to the game Little Alchemy: the players need to find elements, combine them, and create new components. There are multiple levels of hierarchy of innovation that are possible in the game; some of them combine components from different trajectories.

[4] Momennejad notes that this relates to some work she and her colleagues have done building and evaluating AI agents in multi-agent Xbox games like Bleeding Edge, as well.

The post AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad appeared first on Microsoft Research.


Abstracts: March 21, 2024



Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Researcher Chang Liu joins host Gretchen Huizinga to discuss “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” In the paper, Liu and his coauthors present M-OFDFT, a variation of orbital-free density functional theory (OFDFT). M-OFDFT leverages deep learning to help identify molecular properties in a way that minimizes the tradeoff between accuracy and efficiency, work with the potential to benefit areas such as drug discovery and materials discovery.



GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 


Today, I’m talking to Dr. Chang Liu, a senior researcher from Microsoft Research AI4Science. Dr. Liu is coauthor of a paper called “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” Chang Liu, thanks for joining us on Abstracts.

CHANG LIU: Thank you. Thank you for this opportunity to share our work. 

HUIZINGA: So in a few sentences, tell us about the issue or problem your paper addresses and why people should care about this research. 

LIU: Sure. Since this is an AI4Science work, let’s start from this perspective. About science, people always want to understand the properties of matter, such as why some substances can cure disease and why some materials are heavy or conductive. For a very long period of time, these properties could only be studied by observation and experiments, and the outcome would just look like magic to us. If we can understand the underlying mechanism and calculate these properties on our computer, then we can do the magic ourselves, and it can, hence, accelerate industries like medicine development and material discovery. Our work aims to develop a method that handles the most fundamental part of such property calculation and with better accuracy and efficiency. If you zoom into the problem, properties of matter are determined by the properties of the molecules that constitute the matter. For example, the energy of a molecule is an important property. It determines which structure it mostly takes, and the structure indicates whether it can bind to a disease-related biomolecule. You may know that molecules consist of atoms, and atoms consist of nuclei and electrons, so properties of a molecule are the result of the interaction among the nuclei and the electrons in the molecule. The nuclei can be treated as classical particles, but electrons exhibit significant quantum effects. You can imagine this like electrons move so fast that they appear like a cloud or mist spreading over the space. To calculate the properties of the molecule, you need to first solve the electronic structure, that is, how the electrons spread over this space. This is governed by an equation that is hard to solve. The target of our research is hence to develop a method that solves the electronic structure more accurately and more efficiently so that properties of molecules can be calculated at a higher level of accuracy and efficiency that leads to better ways to solve the industrial problems. 

HUIZINGA: Well, most research owes a debt to work that went before but also moves the science forward. So how does your approach build on and/or differ from related research in this field? 

LIU: Yes, there are indeed quite a few methods that can solve the electronic structure, but they show a harsh tradeoff between accuracy and efficiency. Currently, density functional theory, often called DFT, achieves a preferred balance for most cases and is perhaps the most popular choice. But DFT still requires a considerable cost for large molecular systems. It has a cubic cost scaling. We hope to develop a method that scales with a milder cost increase. We noted an alternative type of method called orbital-free DFT, or OFDFT, which has a lower order of cost scaling. But existing OFDFT methods cannot achieve satisfying accuracy on molecules. So our work leverages deep learning to achieve an accurate OFDFT method. The method can achieve the same level of accuracy as conventional DFT; meanwhile, it inherits the cost scaling of OFDFT and hence is more efficient than conventional DFT. 

HUIZINGA: OK, so we’re moving acronyms from DFT to OFDFT, and you’ve got an acronym that goes M-OFDFT. What does that stand for? 

LIU: The M represents molecules, since it is especially hard for classical or existing OFDFT to achieve a good accuracy on molecules. So our development tackles that challenge. 

HUIZINGA: Great. And I’m eager to hear about your methodology and your findings. So let’s go there. Tell us a bit about how you conducted this research and what your methodology was. 

LIU: Yeah. Regarding methodology, let me delve a bit into some details. We follow the formulation of OFDFT, which solves the electronic structure by optimizing the electron density, where the optimization objective is to minimize the electronic energy. The challenge in OFDFT is that part of the electronic energy, specifically the kinetic energy, is hard to calculate accurately, especially for molecular systems. Existing computation formulas are based on approximate physical models, but the approximation accuracy is not satisfying. Our method uses a deep learning model to calculate the kinetic energy. We train the model on labeled data, and thanks to its powerful learning ability, the model can give a more accurate result. This is the general idea, but there are many technical challenges. For example, since the model is used as an optimization objective, it needs to capture the overall landscape of the function. The model cannot recover the landscape if only one labeled data point is provided. For this, we made a theoretical analysis on the data generation method and found a way to generate multiple labeled data points for each molecular structure. Moreover, we can also calculate a gradient label for each data point, which provides the slope information on the landscape. Another challenge is that the kinetic energy has a strong non-local effect, meaning that the model needs to account for the interaction between any pair of spots in space. This incurs a significant cost if using the conventional way to represent density, that is, using a grid. For this challenge, we choose to expand the density function on a set of basis functions and use the expansion coefficients to represent the density. The benefit is that it greatly reduces the representation dimension, which in turn reduces the cost for non-local calculation. These two examples are also the differences from other deep learning OFDFT works. There are more technical designs, and you may check them in the paper. 
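
[Editor's sketch] The formulation described above can be written compactly. The notation is an assumption introduced for illustration, following common OFDFT conventions rather than anything stated in the conversation: $\omega_\mu$ are the basis functions, $p_\mu$ the expansion coefficients, $T_\theta$ the learned kinetic energy model, $E_{\text{rest}}$ the remaining energy terms that existing formulas already handle, and $N$ the number of electrons.

```latex
% Electron density expanded on M basis functions with coefficients p_mu
\rho(\mathbf{r}) = \sum_{\mu=1}^{M} p_\mu \, \omega_\mu(\mathbf{r})

% Electronic structure: minimize the electronic energy over the coefficients,
% keeping the number of electrons fixed
\min_{\mathbf{p}} \; E(\mathbf{p}) = T_\theta(\mathbf{p}) + E_{\text{rest}}(\mathbf{p})
\quad \text{subject to} \quad \int \rho(\mathbf{r}) \, \mathrm{d}\mathbf{r} = N

% Training labels per data point: an energy value and a gradient (slope) label
\bigl( \mathbf{p}, \; T(\mathbf{p}), \; \nabla_{\mathbf{p}} T(\mathbf{p}) \bigr)
```

Representing the density by the $M$ coefficients instead of its values on a spatial grid is what reduces the representation dimension mentioned above.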

HUIZINGA: So talk about your findings. After you completed and analyzed what you did, what were your major takeaways or findings? 

LIU: Yeah, let’s dive into the details, into the empirical findings. We find that our deep learning OFDFT, abbreviated as M-OFDFT, is much more accurate than existing OFDFT methods, with tens to hundreds of times lower error, and achieves the same level of accuracy as conventional DFT. 


LIU: On the other hand, the speed is indeed improved over conventional DFT. For example, on a protein molecule with more than 700 atoms, our method achieves nearly 30 times speedup. The empirical cost scaling is lower than quadratic and is one order less than that of conventional DFT. So the speed advantage would be more significant on larger molecules. I’d also like to mention an interesting observation. Since our method is based on deep learning, a natural question is, how accurate would the method be if applied to much larger molecules than those used for training the deep learning model? This is the generalization challenge and is one of the major challenges of deep learning methods for molecular science applications. We investigated this question in our method and found that the error increases slower than linearly with molecular size. Although this is not perfect, since the error is still increasing, it is better than using the same model to predict the property directly, which shows an error that increases faster than linearly. This somehow shows the benefits of leveraging the OFDFT framework for using a deep learning method to solve molecular tasks. 

HUIZINGA: Well, let’s talk about real-world impact for a second. You’ve got this research going on in the lab, so to speak. How does it impact real-life situations? Who does this work help the most and how? 

LIU: Since our method achieves the same level of accuracy as conventional DFT but runs faster, it could accelerate molecular property calculation and molecular dynamics simulation, especially for large molecules; hence, it has the potential to accelerate solving problems such as medicine development and material discovery. Our method also shows that AI techniques can create new opportunities for other electronic structure formulations, which could inspire more methods to break the long-standing tradeoff between accuracy and efficiency in this field. 

HUIZINGA: So if there was one thing you wanted our listeners to take away, just one little nugget from your research, what would that be? 

LIU: If only for one thing, that would be that we developed a method that solves molecular properties more accurately and efficiently than the current portfolio of available methods. 

HUIZINGA: So finally, Chang, what are the big unanswered questions and unsolved problems that remain in this field, and what’s next on your research agenda? 

LIU: Yeah, sure. There indeed remain problems and challenges. One remaining challenge mentioned above is the generalization to molecules much larger than those in training. Although the OFDFT method is better than directly predicting properties, there is still room to improve. One possibility is to consider the success of large language models by including more abundant data and more diverse data in training and using a large model to digest all the data. This can be costly, but it may give us a surprise. And another way we may consider is to incorporate mathematical structures of the learning target functional into the model, such as convexity, lower and upper bounds, and some invariance. And such structures could regularize the model when applied to larger systems than it has seen during training. So we have actually incorporated some such structures into the model, for example, the geometric invariance, but other mathematical properties are nontrivial to incorporate. We made some discussions in the paper, and we’ll keep working in that direction in the future. The ultimate goal underlying this technical development is to build a computational method that is fast and accurate universally so that we can simulate the molecular world of any kind. 


HUIZINGA: Well, Chang Liu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at You can also read it on arXiv, or you can check out the March 2024 issue of Nature Computational Science. See you next time on Abstracts.


The post Abstracts: March 21, 2024 appeared first on Microsoft Research.


Research Focus: Week of March 18, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus March 20, 2024

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

Large language models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. In a recent paper: Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning, researchers from Microsoft propose CoT-Influx, a novel approach that pushes the boundary of few-shot chain-of-thought (CoT) learning to improve LLM mathematical reasoning.

Given that adding more concise CoT examples in the prompt can improve LLM reasoning performance, CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples. The pruner first selects as many crucial CoT examples as possible and then prunes unimportant tokens to fit the context window. A math reasoning dataset with diverse difficulty levels and reasoning steps is used to train the pruner, along with a math-specialized reinforcement learning approach. As a result, by enabling more CoT examples with double the context window size in tokens, CoT-Influx significantly outperforms a range of prompting baselines across several LLMs (LLaMA2-7B, 13B, 70B) and five math datasets, achieving up to 4.55% absolute improvements. Remarkably, without any fine-tuning, LLaMA2-70B with CoT-Influx surpasses GPT-3.5 and a wide range of larger LLMs (PaLM, Minerva 540B, etc.) on GSM8K. CoT-Influx serves as a plug-and-play module for LLMs and is compatible with most existing reasoning prompting techniques, such as self-consistency and self-verification.
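
The coarse-to-fine idea can be illustrated with a minimal Python sketch. This is not the paper's learned pruner: the importance scores, the pre-split tokens, and the function name `coarse_to_fine_prune` are all hypothetical, and a simple greedy rule stands in for the policy that CoT-Influx trains with reinforcement learning.

```python
from typing import List, Tuple

# A CoT example is a list of (token, importance) pairs. The importance
# scores here are made up for illustration; CoT-Influx learns its policy.
Example = List[Tuple[str, float]]


def coarse_to_fine_prune(examples: List[Example],
                         example_scores: List[float],
                         budget: int,
                         keep_ratio: float = 0.5) -> List[List[str]]:
    """Greedy stand-in for a coarse-to-fine pruner (hypothetical).

    Coarse step: visit examples from most to least useful.
    Fine step: keep only the most important tokens of each example,
    then add the example if it still fits in the token budget.
    """
    order = sorted(range(len(examples)), key=lambda i: -example_scores[i])
    selected: List[List[str]] = []
    used = 0
    for i in order:
        tokens = examples[i]
        n_keep = max(1, int(len(tokens) * keep_ratio))
        # Positions of the n_keep highest-importance tokens, restored to
        # their original order so the pruned text stays readable.
        ranked = sorted(range(len(tokens)), key=lambda j: -tokens[j][1])
        kept_positions = sorted(ranked[:n_keep])
        pruned = [tokens[j][0] for j in kept_positions]
        if used + len(pruned) <= budget:
            selected.append(pruned)
            used += len(pruned)
    return selected


shots = [
    [("First", 0.2), ("add", 0.9), ("3", 0.8), ("and", 0.1), ("4", 0.8), ("=7", 1.0)],
    [("Well", 0.1), ("multiply", 0.9), ("2", 0.8), ("by", 0.3), ("5", 0.8), ("=10", 1.0)],
    [("So", 0.1), ("so", 0.1), ("hmm", 0.1), ("maybe", 0.2)],
]
scores = [0.9, 0.8, 0.1]

pruned_shots = coarse_to_fine_prune(shots, scores, budget=8, keep_ratio=0.67)
unpruned_shots = coarse_to_fine_prune(shots, scores, budget=8, keep_ratio=1.0)
```

With an 8-token budget, the unpruned run fits only one example, while token pruning lets two examples fit. That is the effect the paragraph describes: more examples in the same context window.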

From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions

Organizations and individuals continuously strive to enhance their efficiency, improve time management, and optimize their work processes. Rapid advancements in AI, natural language processing, and machine learning technologies create new opportunities to develop tools that boost productivity. 

In a recent paper: From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions, researchers from Microsoft present a comprehensive, user-centric approach to understand preferences in AI-based productivity agents and develop personalized solutions. The research began with a survey of 363 participants, seeking to reveal users’ specific needs and preferences for productivity agents such as relevant productivity challenges of information workers, preferred communication style and approach towards solving problems, and privacy expectations. With the survey insights, the researchers then developed a GPT-4 powered personalized productivity agent that uses telemetry data gathered from information workers via Viva Insights to provide tailored assistance. The agent’s performance was compared with alternative productivity-assistive tools, such as the traditional dashboard and AI-enabled summaries, in a study involving 40 participants. The findings highlight the importance of user-centric design, adaptability, and the balance between personalization and privacy in AI-assisted productivity tools. The insights distilled from this study could support future research to further enhance productivity solutions, ultimately leading to optimized efficiency and user experiences for information workers.

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

The size of the context window of a large language model (LLM) determines the amount of text that can be entered for processing to generate responses. The window size is measured in tokens; larger windows are more desirable. However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. 

In a recent paper: LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, researchers from Microsoft introduce a new method that extends the context window of pre-trained LLMs to an impressive 2.048 million tokens, without requiring direct fine-tuning on texts with extremely long lengths, which are scarce, while maintaining performance at the level of the original short context window. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of this method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding and can reuse most pre-existing optimizations.

Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants

Conversational interactions with large language models (LLMs) enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate responses. Conversations between developers and LLMs are primarily structured as question-answer pairs, where the developer is responsible for asking the right questions and sustaining conversations across multiple turns.  

In a recent paper: Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants, researchers from Microsoft draw inspiration from interaction patterns and conversation analysis to design Robin, an enhanced conversational AI-assistant for debugging. Robin works with the developer collaboratively, creating hypotheses about the bug’s root cause, testing them using IDE debugging features such as breakpoints and watches, and then proposing fixes. A user study with 12 industry professionals shows that equipping the LLM-driven debugging assistant to (1) leverage the insert expansion interaction pattern; (2) facilitate turn-taking; and (3) utilize debugging workflows, leads to lowered conversation barriers, effective fault localization, and 5x improvement in bug resolution rates.

Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions

Generative AI (GenAI) systems, which can produce new content based on input like code, images, speech, video, and more, offer opportunities to increase user productivity in many tasks, such as programming and writing. However, while they boost productivity in some studies, many others show that users are working ineffectively with GenAI systems and actually losing productivity. These ‘ironies of automation’ have been observed for over three decades in human factors research on automation in aviation, automated driving, and intelligence.  

In a recent paper: Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions, researchers from Microsoft draw on this extensive research alongside recent GenAI user studies to outline four key reasons why productivity loss can occur with GenAI systems: 1) a shift in users’ roles from production to evaluation; 2) unhelpful restructuring of workflows; 3) interruptions; and 4) a tendency for automation to make easy tasks easier and hard tasks harder. They then suggest how human factors research can also inform GenAI system design to mitigate productivity loss by using approaches such as continuous feedback, system personalization, ecological interface design, task stabilization, and clear task allocation. Grounding developments in GenAI system usability in decades of research aims to ensure that the design of human-AI interactions in this rapidly moving field learns from history instead of repeating it. 

The post Research Focus: Week of March 18, 2024 appeared first on Microsoft Research.


Exploring how context, culture, and character matter in avatar research

This research paper was presented at the IEEE VR Workshop Series on Animation in Virtual and Augmented Environments (ANIVAE 2024), the premier series on 3D content creation for simulated training in extended reality.

Face-to-face communication is changing, moving beyond physical interaction to include video conferencing and AR/VR platforms, where the participants are represented by avatars. Sophisticated avatars, animated through motion tracking, can realistically portray their human counterparts, but they can also suffer from noise, such as jitter and distortion, reducing their realism. Advances in motion-capture technology aim to reduce such issues, but they come with higher development costs and require additional time due to the need for advanced components. While some noise is inevitable, it’s important to determine acceptable types and levels to efficiently develop and introduce AR/VR devices and avatars to the market. Additionally, understanding how noise impacts avatar-based communication is essential for creating more inclusive avatars that accurately represent diverse cultures and abilities, enhancing the user experience.

In our paper, “Ecological Validity and the Evaluation of Avatar Facial Animation Noise,” presented at ANIVAE 2024, we explore the challenge of evaluating avatar noise without a standardized approach. Traditional methods, which present participants with isolated facial animation noise to gauge perception thresholds, fall short of reflecting real-life avatar interactions. Our approach emphasizes ecological validity—the extent to which experiments mimic real-world conditions—as central in assessing avatar noise. We discovered this significantly influences participants’ response to avatars, highlighting the impact of context on noise perception. Our goal is to improve avatar acceptance, inclusivity, and communication by developing noise evaluation methods that better represent actual experiences. 

Seeing the big picture  

To set up our study, we animated two avatars using motion capture, as depicted in Figure 1 (A). We recorded the performance of two professional actors enacting a scene between an architect and a client discussing home renovations and examining a 3D model of the proposed design. We used two proprietary characters for the avatars, whose faces were animated with 91 expression blendshapes. This allowed for a broad range of facial expressions and subtle variations in emotions, contributing to a more realistic animation. To examine different dynamics, we created six variations of the scene, changing the characters’ gender, role, and whether they agreed on the renovation plan.

Figure 1: A. Motion capture of a social interaction scenario for the experiment. B. The motion capture was remapped to stylized avatars. C. Participants experienced the scene wearing a HoloLens 2 and responded to questions on a tablet app. D. The avatars’ facial features were degraded with different types of animation noises of varying severity.

Fifty-six participants engaged in two experiments to evaluate the impact of noise on avatar facial animation. The first experiment had low ecological validity. Participants viewed fragmented clips of dialogue through a Microsoft HoloLens 2 device and used a slider to adjust any noise to an acceptable level. The second experiment featured high ecological validity, showing the scene in its full social context. Here, participants used a HoloLens 2 to judge the noise in facial expressions as either “appropriate” or “inappropriate” for the conversation. In contrast to the first experiment, this method considered the social aspects of context, culture, and character. 

Results indicate that noise was less distracting when participants viewed the scene in its entirety, revealing a greater tolerance for noise in high ecological validity scenarios. Isolated clips, on the other hand, led to greater annoyance with facial animation noise, suggesting the importance of social context over hyper-realistic animation. 

Cultural observations showed that noise perception was influenced by implicit cultural norms, particularly around gender roles and agreement levels. For example, in the second experiment, where participants viewed the conversation within its greater social context (high ecological validity), noise was deemed “appropriate” when the female architect agreed with the male client and “inappropriate” when she disagreed, revealing potential gender biases not observed in reversed gender roles. These findings emphasize the importance of applying high ecological validity in experiments to uncover socio-cultural influences on avatar perception. They also underscore the need to carefully consider context and cultural dynamics in avatar design. 

Finally, we explored the character trait of empathy. Participants with lower empathy scores were more critical of noise in context-rich scenarios. This indicates that experiments focusing solely on low ecological validity might overlook important insights on how empathy influences responses to avatar facial animation noise.


Avatars need to be studied in realistic situations 

When people communicate, they engage in a complex process influenced by environment, cultural background, and the nonverbal cues they perceive and interpret. By prioritizing high ecological validity in studies on avatar perception, researchers can uncover these socio-cultural influences and trust that their findings are relevant and applicable to real-life interactions within digital spaces. 

Our research examines how different combinations of demographic characteristics change the way people react to avatars, and we hope to encourage more inclusivity in avatar design. It’s essential to have an established set of guidelines to achieve this goal, and this work is one step in that direction. While our study’s scope is limited, its methodology can be applied broadly across different devices and settings.


We would like to thank Ken Jakubzak, James Clemoes, Cornelia Treptow, Michaela Porubanova, Kerry Read, Daniel McDuff, Marina Kuznetsova and Mathew Lamb for their research collaboration. We would also like to thank Shawn Bruner for providing the characters for the study and Panagiotis Giannakopoulos for leading the animation and motion capture pipelines.

The post Exploring how context, culture, and character matter in avatar research appeared first on Microsoft Research.


Scaling early detection of esophageal cancer with AI

Microsoft Research and Cyted have collaborated to build novel AI models to scale the early detection of esophageal cancer. The AI-supported methods demonstrated the same diagnostic performance as the existing manual workflow, potentially reducing the pathologist’s workload by up to 63%.

Esophageal cancer is the sixth most common cause of cancer deaths worldwide, in part because this disease is typically diagnosed late, making treatment difficult. Fewer than 1 in 5 patients survive five years after diagnosis, making early detection of this disease critical to improving a patient’s chances. One opportunity for early detection is to identify patients with a condition called Barrett’s esophagus (BE). Patients with BE are at an increased risk of developing cancer, though most never will. Chronic heartburn is a risk factor and a possible cause of Barrett’s.

Detecting BE dramatically improves a patient’s chances. Earlier detection of cancer and earlier start of treatment mean that more than 9 in 10 patients survive five years after diagnosis. However, early detection of BE has typically involved an endoscopic biopsy, a procedure that many people find uncomfortable and invasive. It often requires sedation, is resource intensive, and increases the risk of complications.

A major step toward enabling large-scale screening for BE has been spearheaded by Cyted, a start-up company at the forefront of medical innovation. Cyted has developed a capsule sponge device called EndoSign®, a dissolvable capsule on a string that expands into a small medical sponge once in the stomach. When pulled back out, it collects cells from the lining of the esophagus, which are then processed, placed on slides, stained, and scanned for digital analysis. 

The capsule sponge is easier to administer and less costly than endoscopy. But a pathologist still needs to review the digitized slides to determine the presence of any goblet cells, a type of cell normally found in the intestinal lining, which would indicate BE if found in the esophagus. These images are huge (up to 100,000 by 100,000 pixels, the size of a squash court if printed at the typical photo resolution of 300 dpi), yet they may contain only a few goblet cells per image, each cell just a few pixels large. To identify BE, pathologists need to use slides from two stains, H&E (a routine stain for observing cell structure) and TFF3 (a special stain just to find goblet cells). Since most patients with heartburn will not have BE, pathologists spend most of their time examining negative cases. Without more sophisticated approaches to analysis, that is time they cannot spend prioritizing high-risk cases.
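
The squash-court comparison can be checked with a little arithmetic. The short sketch below converts pixels to printed size; the court dimensions (9.75 m by 6.4 m for a standard singles court) are an assumption brought in for comparison and are not stated in the text.

```python
INCH_IN_METERS = 0.0254  # exact definition of the inch


def printed_side_meters(pixels: int, dpi: int) -> float:
    """Length in meters of one image edge when printed at the given dpi."""
    return pixels / dpi * INCH_IN_METERS


side_m = printed_side_meters(100_000, 300)   # roughly 8.47 m per side
print_area_m2 = side_m * side_m              # roughly 71.7 square meters
court_area_m2 = 9.75 * 6.4                   # assumed singles court: 62.4 square meters
total_pixels = 100_000 * 100_000             # 10 billion pixels per slide
```

Under these assumptions the printed slide and the court have areas of the same order, which is why locating a handful of goblet cells, each only a few pixels across, is so laborious by eye.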

Microsoft Research and Cyted have collaborated to build novel AI models that can efficiently check the slides for goblet cells, using either the H&E or TFF3 stains. This joint effort has led to a Nature Communications paper titled “Enabling large-scale screening of Barrett’s esophagus using weakly supervised deep learning in histopathology.” Our study uses the strength of transformer-based multiple instance learning to assist in the screening of BE. In the paper, we introduce two major innovations. First, we show that the AI models can be built solely from the pathologists’ findings on whether BE is present, eliminating the need for expensive pixel-level annotations. This means that existing large capsule sponge screening datasets can be used to further improve the performance of the model. Second, we demonstrate that goblet cells can be detected with high accuracy using only the H&E slides. This is the most common routine stain in pathology, and it suggests that the more time-consuming and costly specialized staining, TFF3, could be skipped (see Figure 1 below).

Figure 1: The top-left contains a thumbnail image of an H&E slide with goblet cells. In the bottom left, the attention maps of the AI model show which image regions the model uses to make its final prediction. Zooming in to those areas (bottom right), we see that image parts that receive high attention contain goblet cells. We validate that these are indeed goblet cells by looking at the corresponding TFF3 slide (top right), where goblet cells are shown as brown.
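
To make the link between slide-level labels and attention maps concrete, here is a minimal sketch of attention pooling over image tiles, a simplified form of multiple instance learning. It is not the paper's transformer-based architecture: the function name `attention_mil_forward`, the random untrained weights, and the tiny dimensions are all assumptions chosen for illustration, and the tile embeddings are assumed to come from some image encoder.

```python
import math
import random


def attention_mil_forward(tiles, v, w, clf):
    """Slide-level prediction from tile embeddings via attention pooling.

    tiles: list of d-dimensional tile embeddings (one per image tile)
    v: h x d attention projection (list of rows); w: h attention weights
    clf: d weights of a linear slide-level classifier
    Returns (probability the slide is positive, attention weight per tile).
    """
    scores = []
    for x in tiles:
        hidden = [math.tanh(sum(vi * xi for vi, xi in zip(row, x))) for row in v]
        scores.append(sum(wk * hk for wk, hk in zip(w, hidden)))
    # Softmax over tiles; subtracting the max keeps exp() numerically stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # The slide embedding is the attention-weighted average of its tiles.
    slide = [sum(a * x[k] for a, x in zip(attention, tiles)) for k in range(len(clf))]
    logit = sum(c * s for c, s in zip(clf, slide))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, attention


random.seed(0)
n_tiles, d, h = 200, 8, 4
tiles = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tiles)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
w = [random.gauss(0, 1) for _ in range(h)]
clf = [random.gauss(0, 1) for _ in range(d)]

prob, attention = attention_mil_forward(tiles, v, w, clf)
```

Only the slide-level output `prob` needs a label during training, which is what allows a model of this kind to learn from the pathologists' overall findings. The per-tile `attention` values are the sort of quantity an attention map visualizes.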

In the paper, we further discuss different AI-assisted workflows designed to optimize the screening process. The first workflow necessitates a pathologist’s review only if either the H&E or TFF3 models predict a sample as positive. This method can achieve the same diagnostic performance as the existing manual workflow in terms of sensitivity and specificity, potentially reducing the pathologist’s workload by 52% (see Figure 3 below).

The second proposed workflow reduces the need for pathologist review by 63% of the original load, by restricting reviews to positive predictions from the H&E model only. However, this comes at slightly reduced sensitivity, since goblet cells are more clearly visible in the TFF3 stain.

Figure 2: Proposed AI-assisted workflows. a) Workflow “Pathologist reviews any positives” b) Workflow “Pathologist reviews H&E model positives”
| Proposed AI-assisted workflow | Pathologist review (percent of all cases) | TFF3 staining required (percent of all cases) | Sensitivity @ Specificity 1.00 |
| --- | --- | --- | --- |
| Pathologist reviews any positives | 48% | 100% | 1.00 |
| Pathologist reviews H&E model positives | 37% | 37% | 0.91 |
Figure 3: Quantitative comparison of the proposed workflows. For the two workflows described in Figure 2, we compare the pathologist workload as a fraction of the total number of cases, the number of images for which a costly TFF3 stain is required, and the resulting accuracy numbers.
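
The two triage rules and the workload figures quoted above can be written out as a small sketch. The function names are hypothetical; the percentages are the ones reported in Figure 3.

```python
def needs_review_any_positive(he_positive: bool, tff3_positive: bool) -> bool:
    """Workflow a: a pathologist reviews the case if either model flags it."""
    return he_positive or tff3_positive


def needs_review_he_only(he_positive: bool) -> bool:
    """Workflow b: a pathologist reviews only cases the H&E model flags."""
    return he_positive


def workload_reduction(review_percent: int) -> int:
    """Percent of cases that no longer need pathologist review."""
    return 100 - review_percent


reductions = {
    "any positives": workload_reduction(48),        # 52% fewer reviews
    "H&E model positives": workload_reduction(37),  # 63% fewer reviews
}
```

A case flagged only by the TFF3 model is reviewed under workflow a but not under workflow b, which is consistent with the lower sensitivity reported for workflow b.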

Our collaboration with Cyted demonstrates the transformative potential of integrating advanced AI models into clinical workflows, saving valuable time for pathologists. As we move forward, the scalability of this technology holds the promise for widespread adoption of early detection in the fight against esophageal cancer.

“This represents a significant step in our fight against esophageal cancer, offering the potential to save countless lives through early detection with our minimally-invasive capsule sponge technology,” said Cyted CEO Marcel Gehrung. “Our collaboration with Microsoft Research has been instrumental in pushing the boundaries of what’s possible in medical imaging and screening technologies, creating optimal efficiencies from start to finish of the testing process.”

We have open sourced code to build these models, which is designed to be scalable to very large datasets, using Azure Machine Learning. This flexibility allows other researchers and institutions to adapt and enhance our code according to their specific needs. Importantly, our code represents a significant advancement over previous work in the field. Unlike earlier approaches that focused solely on training the multiple instance and attention layers, our code allows for end-to-end fine-tuning, including the image encoder. This comprehensive approach to training ensures optimal performance and accuracy, setting a new standard for AI models in histopathology. 

“The open sourcing of this code has helped us to advance our research in the field of early cancer detection,” said Florian Markowetz, Professor of Computational Oncology at the University of Cambridge, and Senior Group Leader at Cancer Research UK Cambridge Institute. “Several key features will soon be integrated into ongoing clinical trials, where we aim to improve the detection of Barrett’s esophagus in patients and ultimately treat more cancers through early intervention. Furthermore, these features will help improve the workflow of pathologists and identify key regions quicker, enabling clinicians to tackle more cases with greater reliability.”

By sharing our work, we aim not only to enhance the detection of BE and esophageal cancer, but also to empower researchers and clinicians around the world to leverage this technology in their fight against cancer[1]. Because our code can be used as a building block to develop AI models for histopathology slides, it may also potentially be applied to other cancer types. It is our hope that this open-source initiative will foster innovation and collaboration, and ultimately lead to breakthroughs that save lives.

As researchers, it has been exciting to work closely with Cyted and be part of the long path towards early detection of esophageal cancer. Cross-discipline collaborations like this are excellent opportunities to solve complex clinical problems. With AI models built using the principles of responsible AI like fairness, privacy and security, and reliability and safety, we can ultimately make a tangible difference to patient outcomes.


Thank you to the team: Kenza Bouzid, Harshita Sharma, Sarah Killcoyne, Daniel C. Castro, Anton Schwaighofer, Max Ilse, Valentina Salvatelli, Ozan Oktay, Sumanth Murthy, Lucas Bordeaux, Luiza Moore, Maria O’Donovan, Anja Thieme, Hannah Richardson, Aditya Nori, Marcel Gehrung, Javier Alvarez-Valle

[1] Code released for research use only. Full disclaimer here:

The post Scaling early detection of esophageal cancer with AI appeared first on Microsoft Research.
