LLMLingua: Innovating LLM efficiency with prompt compression

This research paper was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.

As large language models (LLMs) advance and their potential becomes increasingly apparent, an understanding is emerging that the quality of their output is directly related to the nature of the prompt given to them. This has given rise to prompting technologies, such as chain-of-thought (CoT) and in-context learning (ICL), which tend to increase prompt length. In some instances, prompts now extend to tens of thousands of tokens, or units of text, and beyond. While longer prompts hold considerable potential, they also introduce a host of issues, such as exceeding the chat window’s maximum length limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.

To address these challenges, we introduce a prompt-compression method in our paper, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models,” presented at EMNLP 2023. Using a well-trained small language model, such as GPT-2 small or LLaMA-7B, LLMLingua identifies and removes unimportant tokens from prompts. This compression technique enables closed LLMs to run inference on the compressed prompt. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs. This is illustrated in Figure 1.

Figure 1. The LLMLingua framework, which uses a small language model to estimate the importance of each token in a prompt. It consists of three modules: a budget controller, iterative token-level prompt compression, and distribution alignment. In the example shown, it compresses a complex prompt of 2,366 tokens down to 117 tokens, a roughly 20x compression, with almost unchanged performance.

LLMLingua’s method and evaluation

To develop LLMLingua’s framework, we employed a budget controller to balance the sensitivities of different modules in the prompt, preserving the language’s integrity. Our two-stage process began with coarse-grained prompt compression: we first streamlined the prompt by eliminating certain sentences and then compressed the remaining tokens individually. To preserve coherence, we employed an iterative token-level compression approach, better capturing the dependencies between tokens. Additionally, we fine-tuned the smaller model to capture the distribution information of different closed LLMs by aligning it with the patterns in the LLMs’ generated data, which we did through instruction tuning.
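To make the intuition concrete, the sketch below shows perplexity-based token pruning with a small causal language model, assuming GPT-2 and a fixed keep ratio. It is only an illustration of the core idea, not the LLMLingua implementation, which adds the budget controller, sentence-level filtering, iterative compression, and distribution alignment described above.

```python
# Minimal sketch of perplexity-based prompt compression with a small causal LM.
# Illustrative assumptions: GPT-2 as the small model, a single scoring pass,
# and a fixed keep ratio. This is not the LLMLingua implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_self_information(text: str) -> list[tuple[str, float]]:
    """Score each token by its negative log-probability under the small LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Token t is predicted from tokens < t; the first token gets no score.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())
    return list(zip(tokens[1:], surprisal.tolist()))

def compress(text: str, keep_ratio: float = 0.5) -> str:
    """Keep only the most informative (highest-surprisal) tokens."""
    scored = token_self_information(text)
    cutoff = sorted(s for _, s in scored)[int(len(scored) * (1 - keep_ratio))]
    kept = [tok for tok, s in scored if s >= cutoff]
    return tokenizer.convert_tokens_to_string(kept)

print(compress("Question: Olivia has 23 dollars and buys five bagels for 3 dollars "
               "each. How much money does she have left? Let's think step by step."))
```

In the actual method, compression proceeds sentence-first under a token budget and then iteratively at the token level, so the dependencies among the kept tokens are respected rather than scored in a single pass.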

To assess LLMLingua’s performance, we tested compressed prompts on four datasets: GSM8K, BBH, ShareGPT, and Arxiv-March23, covering ICL, reasoning, summarization, and conversation. Our approach achieved impressive results, reaching up to 20x compression while preserving the original prompt’s capabilities, particularly in ICL and reasoning. LLMLingua also significantly reduced system latency.

During our test, we used LLaMA-7B as the small language model and GPT-3.5-Turbo-0301, one of OpenAI’s LLMs, as the closed LLM. The results show that LLMLingua maintains the original reasoning, summarization, and dialogue capabilities of the prompt, even at a maximum compression ratio of 20x, as reflected in the evaluation metric (EM) columns in Tables 1 and 2. At the same time, other compression methods failed to retain key semantic information in prompts, especially in logical reasoning details. For a more in-depth discussion of these results, refer to section 5.2 of the paper.

Table 1. Performance of different methods at various target compression ratios on the GSM8K and BBH datasets (in-context learning and reasoning) with GPT-3.5-Turbo. LLMLingua achieves up to a 20x compression rate with only a 1.5-point performance loss.
Table 2. Performance of different methods at various target compression ratios on ShareGPT (conversation) and Arxiv-March23 (summarization) with GPT-3.5-Turbo. LLMLingua effectively retains the semantic information of the original prompts at compression rates of 3x-9x.

LLMLingua is robust, cost-effective, efficient, and recoverable

LLMLingua also showed impressive results across various small language models and different closed LLMs. Using GPT-2 small, LLMLingua achieved a strong performance score of 76.27 under the ¼-shot constraint, close to LLaMA-7B’s result of 77.33 and surpassing the standard prompt result of 74.9. Similarly, even without alignment, LLMLingua’s score with Claude-v1.3, one of the most powerful LLMs, was 82.61 under the ½-shot constraint, outperforming the standard prompt result of 81.8.

LLMLingua also proved effective in shortening responses, leading to significant reductions in latency in the LLM’s generation process, with reductions ranging from 20 to 30 percent, as shown in Figure 2.

Figure 2. The distribution of token lengths generated at varying compression ratios. Across tasks, response length decreases as the compression ratio increases, by as much as 20 to 30 percent.

What makes LLMLingua even more impressive is its recoverability. When we used GPT-4 to restore compressed prompts, it successfully recovered all key reasoning information from the full nine-step chain-of-thought (CoT) prompting, which enables LLMs to address problems through sequential intermediate steps. The recovered prompt was almost identical to the original, and its meaning was retained. This is shown in Table 4.
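To illustrate what such a recovery step might look like, here is a minimal sketch that asks a strong LLM to reconstruct the original prompt from its compressed form. The instruction wording and model name are assumptions for illustration, not the exact setup used in the paper.

```python
# Minimal sketch: asking a strong LLM to reconstruct a compressed prompt.
# The instruction wording and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def recover(compressed_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Reconstruct the original, fully fluent prompt from the "
                        "compressed prompt below, preserving every reasoning step."},
            {"role": "user", "content": compressed_prompt},
        ],
    )
    return response.choices[0].message.content
```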

Table 3. Latency comparison on GSM8K. LLMLingua can accelerate LLMs’ end-to-end inference by a factor of 1.7–5.7x; as the compression ratio increases, both the compression latency and the end-to-end latency decrease, reaching up to a 5.7x speedup at a 10x token compression rate.
Table 4. Recovering a compressed prompt from GSM8K using GPT-4. The original prompt contains a nine-step chain of thought, and the compressed prompt is difficult for humans to read, yet the text recovered by GPT-4 includes all nine steps.

Enhancing the user experience and looking ahead

LLMLingua is already proving its value through practical application. It has been integrated into LlamaIndex, a widely adopted retrieval-augmented generation (RAG) framework. Currently, we are collaborating with product teams to reduce the number of tokens required in LLM calls, particularly for tasks like multi-document question answering, with the goal of significantly improving the user experience with LLMs.

For the longer term, we have proposed LongLLMLingua, a prompt-compression technique designed for long-context scenarios, such as retrieval-augmented question-answering tasks in applications like chatbots, where information evolves dynamically over time. It is also geared toward tasks like summarizing online meetings. LongLLMLingua’s primary objective is to enhance LLMs’ ability to perceive key information, making it suitable for numerous real-world applications, notably information-based chatbots. We’re hopeful that this innovation paves the way for more sophisticated and user-friendly interactions with LLMs.

Learn more about our work on the LLMLingua page.

Abstracts: December 6, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Xing Xie, a Senior Principal Research Manager at Microsoft Research, joins host Gretchen Huizinga to discuss “Evaluating General-Purpose AI with Psychometrics.” As AI capabilities move from task specific to more general purpose, the paper explores psychometrics, a subfield of psychology, as an alternative to traditional methods for evaluating model performance and for supporting consistent and reliable systems.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Xing Xie, a Senior Principal Research Manager at Microsoft Research. Dr. Xie is coauthor of a vision paper on large language models called “Evaluating General-Purpose AI with Psychometrics,” and you can find a preprint of this paper now on arXiv. Xing Xie, thank you for joining us on Abstracts!

XING XIE: Yes, thank you. It’s my pleasure to be here. 

HUIZINGA: So in a couple sentences, tell us what issue or problem your research addresses and why people should care about it. 


XIE: Yeah, in a sense, actually, we are exploring the potential of psychometrics to revolutionize how we evaluate general-purpose AI. Because AI is advancing at a very rapid pace, traditional evaluation methods face significant challenges, especially when it comes to predicting a model’s performance in unfamiliar scenarios. And these methods also lack a robust mechanism to assess their own quality. Additionally, in this paper, we delve into the complexity of directly applying psychometrics to this domain and underscore several promising directions for future research. We believe that this research is of great importance. As AI continues to be integrated into novel application scenarios, it could have significant implications for both individuals and society at large. It’s crucial that we ensure their performance is both consistent and reliable.

HUIZINGA: OK, so I’m going to drill in a little bit in case there’s people in our audience that don’t understand what psychometrics is. Could you explain that a little bit for the audience? 

XIE: Yeah, psychometrics could be considered as a subdomain of psychology. Basically, psychology just studies everything about humans, but psychometrics is specifically developed to study how we can better evaluate, we could also call this general-purpose intelligence, but it’s human intelligence. So there are, actually, a lot of methodologies and approaches in how we develop this kind of test and what tasks we need to carry out. The previous AI is designed for specific tasks like machine translation, like summarization. But now I think people are already aware of a lot of progress in big models, in large language models. AI, actually, currently can be considered as some kind of solving general-purpose tasks. Sometimes we call it few-shot learning, or sometimes we call it like zero-shot learning. We don’t need to train a model before we bring new tasks to them. So this brings a question in how we evaluate this kind of general-purpose AI, because traditionally, we evaluate AI usually using some specific benchmark, specific dataset, and specific tasks. This seems to be unsuitable to this new general-purpose AI.

HUIZINGA: So how does your approach build on and/or differ from what’s been done previously in this field? 

XIE: Yeah, we actually see a lot of efforts have been investigated into evaluating the performance of these new large language models. But we see a significant portion of these evaluations are task specific. They’re still task specific. And also, frankly speaking, they are easily affected by changes. That means even slight alterations to a test could lead to substantial drops in performance. So our methodology differs from these approaches in that rather than solely testing how AI performs on those predetermined tasks, we actually are evaluating those latent constructs because we believe that pinpointing these latent constructs is very important.

HUIZINGA: Yeah. 

XIE: It’s important in forecasting AI’s performance in evolving and unfamiliar contexts. We can use an example like game design. With humans, even if an individual has never worked on game design—it’s just a whole new task for her—we might still confidently infer their potential if we know they possess the essential latent constructs, or abilities, which are important for game design. For example, creativity, critical thinking, and communication. 

HUIZINGA: So this is a vision paper and you’re making a case for using psychometrics as opposed to regular traditional benchmarks for assessing AI. So would you say there was a methodology involved in this as a research paper, and if so, how did you conduct the research for this? What was the overview of it? 

XIE: As you said, this is a vision paper. So instead of describing a specific methodology, we are collaborating with several experienced psychometrics researchers. Collectively, we explore the feasibility of integrating psychometrics into AI evaluation and discerning which concepts are viable and which are not. In February this year, we hosted a workshop on this topic. Over the past months, we have engaged in, in numerous discussions, and the outcome of these discussions is articulated in this paper. And additionally, actually, we are also in the middle of drafting another paper; that paper will apply insights from this paper to devise a rigorous methodology for assessing the latent capability of the most cutting-edge language models. 

HUIZINGA: When you do a regular research paper, you have findings. And when you did this paper and you workshopped it, what did you come away with in terms of the possibilities for what you might do on assessing AI with psychometrics? What were your major findings? 

XIE: Yeah, our major findings can be divided into two areas. First, we underscore the significant potential of psychometrics. This includes exploring how these metrics can be utilized to enhance predictive accuracy and guarantee test quality. Second, we also draw attention to the new challenges that arise when directly applying these principles to AI. For instance, test results could be misinterpreted, as assumptions verified for human tests might not necessarily apply to AI. Furthermore, capabilities that are essential for humans may not hold the same importance for AI.

HUIZINGA: Hmm …  

XIE: Another notable challenge is the lack of a consistent and defined population of AI, especially considering their rapid evolution. But this population is essential for traditional psychometrics, and we need to have a population of humans for verifying either the reliability or the validity of a test. But for AI, this becomes a challenge. 

HUIZINGA: Based on those findings, how do you think your work is significant in terms of real-world impact at this point? 

XIE: We believe that our approach will signal the start of a new era in the evaluation of general-purpose AI, shifting from earlier, task-specific methodologies to a more rigorous scientific method. Fundamentally, there’s an urgent demand to establish a dedicated research domain focusing solely on AI evaluation. We believe psychometrics will be at the heart of this domain. Given AI’s expanding role in society and its growing significance as an indispensable assistant, this evolution will be crucial. I think one missing part of current AI evaluation is how we can make sure the test, the benchmark, or these evaluation methods of AI themselves, is scientific. Actually, previously, I used the example of game design. Suppose in the future, I think there are a lot of people discussing language model agents, AI agents … they could be used to not only write in code but also develop software by collaborating among different agents. Then what kind of capabilities, or we call them latent constructs, of these AI models they should have before they make success in game design or any other software development. For example, like creativity, critical thinking, communication. Because this could be important when there are multiple AI models—they communicate with each other, they check the result of the output of other models. 

HUIZINGA: Are there other areas that you could say, hey, this would be a relevant application of having AI evaluated with psychometrics instead of the regular benchmarks because of the generality of intelligence?

XIE: We are mostly interested in maybe doing research, because a lot of researchers have started to leverage AI for their own research. For example, not only for writing papers, not only for generating some ideas, but maybe they could use AI models for more tasks in the whole pipeline of research. So this may require AI to have some underlying capabilities, like, as we have said, like critical thinking—how AI should define the new ideas and how they check whether these ideas are feasible and how they propose creative solutions and how they work together on research. This could be another domain. 

HUIZINGA: So if there was one thing that you want our listeners to take away from this work, what would it be? 

XIE: Yeah, I think the one takeaway I want to say is we should be aware of the vital importance of AI evaluation. We are still far from achieving a truly scientific standard, so we need to still work hard to get that done. 

HUIZINGA: Finally, what unanswered questions or unsolved problems remain in this area? What’s next on your research agenda that you’re working on? 

XIE: Yeah, actually, there are a lot of unanswered questions as highlighted at the later part of this paper. Ultimately, our goal is to adapt psychometric theories and the techniques to fit AI contexts. So we have discussed with our collaborators in both AI and psychometrics … some examples would be, how can we develop guidelines, extended theories, and techniques to ensure a rigorous evaluation that prevents misinterpretation? And how can we best evaluate assistant AI and the dynamics of AI-human teaming? This actually is particularly proposed by one of our collaborators in the psychometrics domain. And how do we evaluate the value of general-purpose AI and ensure their alignment with human objectives? And then how can we employ semiautomatic methods to develop psychometric tests, theories, and techniques with the help of general-purpose AI? That means we use AI to solve these problems by themselves. This is also important because, you know, psychometrics or psychology have developed for hundreds, or maybe thousands, of years to come to all the techniques today. But can we shorten that period? Can we leverage AI to speed up this development? 

HUIZINGA: Would you say there’s wide agreement in the AI community that this is a necessary direction to head?

XIE: This is only starting. I think there are several papers discussing how we can apply some part of psychology or some part of psychometrics to AI. But there is no systematic discussion or thinking along this line. So I, I don’t think there is agreement, but there’s already initial thoughts and initial perspectives showing in the academic community. 

[MUSIC PLAYS]

HUIZINGA: Well, Xing Xie, thanks for joining us today, and to our listeners, thank you for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts, or you can find a preprint of the paper on arXiv. See you next time on Abstracts!

Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow

These research papers were presented at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023), a premier conference in the field of software engineering.

The practice of software development inevitably involves the challenge of handling bugs and various coding irregularities. These issues can become pronounced when developers engage in the common practice of copying and pasting code snippets from the web or other peer projects. While this approach might offer a quick solution, it can introduce a host of potential complications, including compilation issues, bugs, and even security vulnerabilities into the developer’s codebase.

To address this, researchers at Microsoft have been working to advance different aspects of the software development lifecycle, from code adaptation to automated bug detection and repair. At ESEC/FSE 2023, we introduced two techniques aimed at enhancing coding efficiency. AdaptivePaste uses a learning-based approach to adapt and refine pasted code snippets in an integrated development environment (IDE). InferFix is an end-to-end program repair framework designed to automate bug detection and resolution. This blog post outlines these technologies.

AdaptivePaste: Intelligent copy-paste in IDE

A widespread practice among developers involves adapting pasted code snippets to specific use cases. However, current code analysis and completion techniques, such as masked language modeling and CodeT5, do not achieve an acceptable level of accuracy in identifying and adapting variable identifiers within these snippets to align them with the surrounding code. In the paper “AdaptivePaste: Intelligent Copy-Paste in IDE,” we propose a learning-based approach to source code adaptation that aims to capture meaningful representations of variable usage patterns. First, we introduce a specialized dataflow-aware de-obfuscation pretraining objective for pasted code snippet adaptation. Next, we introduce a transformer-based model in two variants: a traditional unidecoder and a parallel decoder with tied weights.

Figure 1. AdaptivePaste architecture. Given a program with a pasted code snippet, AdaptivePaste extracts and prioritizes the syntax hierarchies most relevant to the learning task, analyzes the data flow, and anonymizes variable identifiers in the pasted snippet. The resulting program serves as input to the neural model, and the output is serialized as a sequence of tokens.

The unidecoder follows a standard autoregressive decoder formulation, mapping each variable in the pasted snippet to a unique symbol in the context or declaring a new variable. The parallel decoder duplicates the decoder for each anonymized symbol in the pasted snippet, predicting names independently and factorizing the output distribution per symbol. This enables selective code snippet adaptation: the model surfaces predictions above a specified confidence threshold and outputs “holes” where uncertainty remains.
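The sketch below illustrates the selective-adaptation idea in simplified form: each anonymized symbol carries an independent top prediction, and low-confidence predictions become explicit holes for the developer to fill. The data structures, threshold value, and string replacement are illustrative assumptions, not the AdaptivePaste model or its decoding code.

```python
# Minimal sketch of selective code adaptation: each anonymized symbol has an
# independent name prediction; predictions below a confidence threshold are
# surfaced as "holes" for the developer. Names and the threshold are
# illustrative assumptions, not the AdaptivePaste implementation.
from dataclasses import dataclass

@dataclass
class SymbolPrediction:
    symbol: str       # e.g. "VAR_0" in the anonymized pasted snippet
    candidate: str    # top-ranked identifier from the surrounding context
    probability: float  # model confidence for that candidate

def adapt_snippet(snippet: str,
                  predictions: list[SymbolPrediction],
                  threshold: float = 0.8) -> str:
    """Rename symbols whose top prediction clears the threshold; leave the
    low-confidence ones as explicit holes."""
    for pred in predictions:
        replacement = pred.candidate if pred.probability >= threshold else "<HOLE>"
        snippet = snippet.replace(pred.symbol, replacement)
    return snippet

pasted = "VAR_0 = open(VAR_1)\nprint(VAR_0.read())"
preds = [SymbolPrediction("VAR_0", "log_file", 0.93),
         SymbolPrediction("VAR_1", "path", 0.41)]
print(adapt_snippet(pasted, preds))
# log_file = open(<HOLE>)
# print(log_file.read())
```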

To establish the dataflow-aware de-obfuscation pretraining objective for pasted code snippet adaptation, we assigned mask symbols to variable identifiers at the granularity of whole code tokens. The pre-existing code context was left unanonymized, allowing the model to attend to the existing identifier names defined in scope.
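For intuition, here is a toy, regex-based illustration of that setup: identifiers assigned in the pasted snippet are replaced with mask symbols, while the surrounding context keeps its real names so a model could attend to them. The real objective operates on whole code tokens with dataflow analysis, not regular expressions.

```python
# Toy illustration of the de-obfuscation setup: variables defined in the pasted
# snippet are masked, while the surrounding context keeps its real identifier
# names. Regex-based and simplified; not the paper's tokenizer-level pipeline.
import re

def anonymize_paste(context: str, pasted: str) -> tuple[str, dict[str, str]]:
    """Mask variables assigned in the pasted snippet; leave the context intact."""
    mapping: dict[str, str] = {}

    def mask(name: str) -> str:
        if name not in mapping:
            mapping[name] = f"VAR_{len(mapping)}"
        return mapping[name]

    # Identifiers assigned within the pasted snippet are the ones to anonymize.
    assigned = set(re.findall(r"^\s*([A-Za-z_]\w*)\s*=", pasted, flags=re.M))
    masked = re.sub(r"\b[A-Za-z_]\w*\b",
                    lambda m: mask(m.group(0)) if m.group(0) in assigned else m.group(0),
                    pasted)
    return context + "\n" + masked, {v: k for k, v in mapping.items()}

context = "log_path = 'run.log'"
pasted = "f = open(p)\ndata = f.read()"
program, ground_truth = anonymize_paste(context, pasted)
print(program)       # context unchanged; f and data become VAR_0 and VAR_1
print(ground_truth)  # {'VAR_0': 'f', 'VAR_1': 'data'}
```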

Our evaluation of AdaptivePaste showed promising results. It successfully adapted Python source code snippets with 67.8 percent exact match accuracy. When we analyzed the impact of confidence thresholds on model predictions, we observed that the parallel decoder transformer model improves precision to 85.9 percent in a selective code adaptation setting.

InferFix: End-to-end program repair with LLMs

Addressing software defects accounts for a significant portion of development costs. To tackle this, the paper, “InferFix: End-to-End Program Repair with LLMs over Retrieval-Augmented Prompts,” introduces a program repair framework that combines the capabilities of a state-of-the-art static analyzer called Infer, a semantic retriever model called Retriever, and a transformer-based model called Generator to address crucial security and performance bugs in Java and C#.

The Infer static analyzer is used to reliably detect, classify, and locate critical bugs within complex systems through formal verification. The Retriever uses a transformer encoder model to search for semantically equivalent bugs and corresponding fixes in large datasets of known bugs. It’s trained using a contrastive learning objective to excel at finding relevant examples of the same bug type.

The Generator employs a 12-billion-parameter Codex model fine-tuned on supervised bug-fix data. To enhance its performance, the prompts provided to the Generator are augmented with bug type annotations, bug contextual information, and semantically similar fixes retrieved from an external nonparametric memory by the Retriever. The Generator then produces a candidate fix for the bug.
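The sketch below shows one plausible way such a retrieval-augmented repair prompt could be assembled from the analyzer’s bug report, the surrounding context, and retrieved fixes. The field names and template wording are assumptions for illustration, not the prompt format used by InferFix.

```python
# Minimal sketch of an InferFix-style retrieval-augmented repair prompt: it is
# assembled from the analyzer's bug type, the buggy code and its context, and
# similar fixes returned by a retriever. Field names and template wording are
# illustrative assumptions, not the exact prompt format used by InferFix.
from dataclasses import dataclass

@dataclass
class RetrievedFix:
    buggy: str
    fixed: str

def build_repair_prompt(bug_type: str, buggy_code: str, context: str,
                        similar_fixes: list[RetrievedFix]) -> str:
    examples = "\n\n".join(
        f"### Similar bug\n{ex.buggy}\n### Its fix\n{ex.fixed}"
        for ex in similar_fixes
    )
    return (f"Bug type: {bug_type}\n\n"
            f"Relevant context:\n{context}\n\n"
            f"{examples}\n\n"
            f"### Buggy code\n{buggy_code}\n"
            f"### Fixed code\n")

prompt = build_repair_prompt(
    bug_type="NULL_DEREFERENCE",
    buggy_code="var user = FindUser(id);\nreturn user.Name;",
    context="FindUser may return null when the id is unknown.",
    similar_fixes=[RetrievedFix("return order.Total;",
                                "return order?.Total ?? 0;")],
)
print(prompt)
```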

Figure 2. The InferFix workflow. An error-prone code modification in a pull request is detected, classified, and localized by the Infer static analyzer; context extraction gathers pertinent details about the bug and its surroundings; and the Retriever identifies semantically similar bugs and fixes. These elements are used to craft a prompt with the bug type annotation, location information, relevant syntax hierarchies, and similar fixes, from which the large language model (LLM) Generator proposes a candidate fix to the developer.

To test InferFix, we curated a dataset called InferredBugs, which is rich in metadata and comprises bugs identified through executing the Infer static analyzer on thousands of Java and C# repositories. The results are noteworthy. InferFix outperforms strong LLM baselines, achieving a top-1 accuracy of 65.6 percent in C# and an impressive 76.8 percent in Java on the InferredBugs dataset.

Looking ahead

With AdaptivePaste and InferFix, we hope to significantly streamline the coding process, minimizing errors and enhancing efficiency. This includes reducing the introduction of bugs when code snippets are added and providing automated bug detection, classification, and patch validation. We believe that these tools hold promise for an enhanced software development workflow, leading to reduced costs and an overall boost in project efficiency.

Looking ahead, the rapid advancement of LLMs like GPT-3.5 and GPT-4 has sparked our interest in exploring ways to harness their potential in bug management through prompt engineering and other methods. Our goal is to empower developers by streamlining the bug detection and repair process, facilitating a more robust and efficient development environment.

Research Focus: Week of December 4, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Leveraging Large Language Models for Automated Proof Synthesis in Rust

Formal verification can provably guarantee the correctness of critical system software, but the high proof burden has long hindered its wide adoption. Recently, large language models (LLMs) have shown success in code analysis and synthesis. In a recent paper, Leveraging Large Language Models for Automated Proof Synthesis in Rust, researchers from Microsoft present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus.

In a few-shot setting, GPT-4 demonstrates impressive logical ability in generating postconditions and loop invariants, especially when analyzing short code snippets. However, GPT-4 does not consistently retain and propagate the full context information needed by Verus, a task that can be straightforwardly accomplished through static analysis. Based on these observations, the researchers developed a prototype based on OpenAI’s GPT-4 model. This prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis. Evaluating the prototype with a developer in the automation loop on 20 vector-manipulating programs showed that it significantly reduces human effort in writing entry-level proof code.
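The sketch below illustrates the general shape of such a loop: ask an LLM to add invariants and assertions, re-run the verifier, and repeat. The verifier invocation, model name, and prompt wording are assumptions for illustration, not the researchers’ prototype.

```python
# Sketch of an iterative proof-synthesis loop: ask an LLM to add loop
# invariants and assertions to a Rust/Verus file, then re-run the verifier
# until it accepts the proof or a retry budget runs out. The `verus` command,
# model name, and prompt wording are assumptions, not the actual prototype.
import subprocess
from openai import OpenAI

client = OpenAI()

def propose_revision(code: str, verifier_output: str) -> str:
    """Ask the LLM for a full revised file with added invariants/assertions."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Add Verus loop invariants and assertions so this file "
                       "verifies. Return only the complete revised file.\n\n"
                       f"File:\n{code}\n\nVerifier output:\n{verifier_output}",
        }],
    )
    return response.choices[0].message.content

def verify_with_llm_help(path: str, max_rounds: int = 3) -> bool:
    for _ in range(max_rounds):
        result = subprocess.run(["verus", path], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the verifier accepted the proof
        with open(path) as f:
            code = f.read()
        with open(path, "w") as f:
            f.write(propose_revision(code, result.stderr + result.stdout))
    return False
```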

Don’t Forget the User: It’s Time to Rethink Network Measurements

The goal of network measurement is to characterize how and how well a network is performing. This has traditionally meant a focus on the bits and bytes — low-level network metrics such as latency and throughput, which have the advantage of being objective but are limited in representativeness and reach. In a recent paper: Don’t Forget the User: It’s Time to Rethink Network Measurements, researchers from Microsoft argue that users also provide a rich and largely untapped source of implicit and explicit signals that could complement and expand the coverage of traditional measurement methods. Implicit feedback leverages user actions to indirectly infer network performance and the resulting quality of user experience. Explicit feedback leverages user input, typically provided offline, to expand the reach of network measurement, especially for newer ones.

The researchers analyze example scenarios, including capturing implicit feedback through user actions, such as muting or unmuting the mic or turning the camera on or off in a large-scale conferencing service. These techniques complement existing measurement methods and open a broad set of research directions, ranging from rethinking measurement tools to designing user-centric networked systems and applications.


Ghana 3D international telemedicine proof of concept study

A real-time 3D telemedicine system, leveraging Holoportation™ communication technology, was used to facilitate consultations with complex reconstructive patients prior to, during, and after an overseas surgical collaboration. The system was used in a proof-of-concept clinic in November 2022 between the Canniesburn Plastic Surgery Unit, UK, and the National Reconstructive Plastic Surgery and Burns Centre, Korle Bu Teaching Hospital, Ghana.

Four patients in Ghana were followed through their patient journey (mandibular ameloblastoma, sarcoma of the thigh, maxillary tumor, sarcoma of the back). A new report, Ghana 3D Telemedicine International MDT: A Proof-of-concept study, details the responses of 13 participants (4 patients, 4 Ghana clinicians, 5 UK clinicians) who completed feedback on the 3D multidisciplinary team (MDT). Outcome measures were rated highly, with satisfaction at 84.31/100, perceived benefit at 4.54/5, overall quality at 127.3/147, and usability at 83.2/100. These results align closely with those previously published in high-income countries.

This novel technology has the potential to enhance overseas surgical visits in low-to-middle-income countries through improved planning, informed discussion with patients, expert consensus on complex cases, and engagement with professionals who may be thousands of miles away.


Exploring LLMs’ potential to help facilitators enhance online healthcare communities

This research paper was presented at the Fourth African Human Computer Interaction Conference (AfriCHI 2023), the pan-African conference on interactive digital technology design.

Online health communities can be a lifeline for people seeking healthcare support, enabling them to share experiences, ask questions, and receive help. They are particularly vital in low- and middle-income countries (LMICs), where access to quality healthcare can be limited and online health communities function as a doorway to expert advice and trustworthy content. One platform widely used for this purpose is WhatsApp, owing to its popularity and its ability to host facilitated communities for specific groups, like patients affiliated with a particular clinic.

For all their benefits, online health communities also pose challenges for facilitators, who shoulder myriad responsibilities with little support: they must answer questions, respond to ongoing discussions, and review reports. Facilitation requires staying abreast of ongoing chat threads, verifying facts, and generally just being available. Given that most healthcare professionals already have a full day of in-person healthcare work, facilitation occurs during lunch breaks, evenings, and even mornings before the workday begins.

Our paper, “Can Large Language Models Support Medical Facilitation Work? A Speculative Analysis,” presented at AfriCHI 2023, discusses research conducted in collaboration with the University of Washington, where we examined facilitated WhatsApp groups created for young people living with HIV in informal settlements in Kenya. Facilitation involved moderating chats, providing emotional support, conducting administrative tasks, sharing information, and resolving conflicts. Because many discussions occurred at night, facilitators struggled to keep up with the chats, often missing important questions or responding days after they were posted. Facilitators also found it difficult to defuse tensions, which arose from time to time.

LLMs’ potential in supporting online health facilitators

To help resolve these challenges, we explored ways large language models (LLMs) could potentially support facilitators, for example, by flagging important messages and helping with content authoring. LLMs’ language translation capabilities and their capacity to answer questions and summarize information made them great candidates for online health communities, with the understanding that facilitators should always verify the content that LLMs create. To explore their potential, we tested their application on chat log data. We concluded that an LLM-enabled copilot could help facilitators in several ways (a minimal sketch of the summarization idea appears after this list), such as:

  • Coproducing compelling content: LLMs could help facilitators create educational and informative content for group members. They can summarize frequently asked questions, patient stories, and best practices for managing chronic conditions.
  • Summarizing messages: LLMs could summarize long discussions in the chat, making it easier for facilitators to get up to date and identify important issues. Summarization can also help participants who need to be offline and might otherwise miss important information.
  • Providing recommendations: LLMs could help facilitators conduct research when answering questions. However, facilitators must exercise due diligence and verify any suggestions the LLM makes.
  • Performing sentiment analysis: LLMs could flag potential trouble spots in messages, such as declines in mental health, tension among participants, harmful advice, and misinformation.
  • Assigning badges: LLMs could assign badges to group members in recognition for participating in discussions, completing tasks, or achieving milestones. This could help to motivate and engage members.
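As a concrete illustration of the summarization idea above, here is a minimal sketch that asks an LLM to summarize a day’s messages and flag items that may need attention. The prompt wording and model name are assumptions for illustration, and any output would still require the facilitator’s review.

```python
# Minimal sketch of the message-summarization idea: summarize a day's chat for
# the facilitator and flag messages that may need urgent attention. The prompt
# wording and model name are illustrative assumptions, and the output would
# still need review by a human facilitator.
from openai import OpenAI

client = OpenAI()

def summarize_for_facilitator(messages: list[str]) -> str:
    chat_log = "\n".join(messages)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You assist a health-group facilitator. Summarize the "
                        "discussion, list unanswered questions, and flag any "
                        "messages suggesting distress or medical misinformation."},
            {"role": "user", "content": chat_log},
        ],
    )
    return response.choices[0].message.content
```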

Importance of human facilitation

While LLMs offer numerous potential benefits for healthcare facilitation, it’s important to consider their challenges and limitations. We strongly believe that LLMs should be used to augment, not replace, human facilitation. One crucial reason is that this technology cannot provide the emotional support essential in these groups. Another challenge involves the potential for bias and harm. LLMs are trained on massive datasets of text and code, which might contain harmful biases and stereotypes. Additionally, LLMs can produce errors when dealing with content from outside the training data, such as cultural backgrounds that are underrepresented in this data.

Our research shows that the benefits these groups provide lie beyond merely providing information. Their success, gauged by participation levels, perceived value by members, and adherence to medical protocols, is attributed not only to the facilitators’ expertise but also to their empathy, humor, and care. These are human qualities that LLMs cannot replace.

Looking forward

When used to augment and support existing medical professionals, LLMs show promise in healthcare solutions, such as those for patients with chronic diseases in LMICs. We recommend that future research and practice in this area prioritize the following:

  • Developing and testing LLM-enabled copilot systems that are tailored to specific patient populations and online health communities.
  • Ensuring that design supports medical professionals, taking special care to preserve their agency.
  • Designing copilot systems so that users can easily evaluate output as well as identify and correct erroneous content.
  • Developing guidelines and regulations to ensure quality and safety when using LLMs for healthcare purposes.

Overall, the use of LLMs to support online health community facilitation is an exciting new area of research. By making facilitators’ tasks easier, LLMs can pave the way for groups to support more patients, improve adherence to medical protocols, and enhance well-being. While our research focused on a specific type of WhatsApp group, the potential of LLMs reaches far beyond it. These models could support facilitators of online health communities across a diverse range of platforms.

Collaborators: Teachable AI with Cecily Morrison and Karolina Pakėnaitė

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

In this episode, Gretchen Huizinga speaks with Cecily Morrison (opens in new tab), MBE, a Senior Principal Research Manager at Microsoft Research, and Karolina Pakėnaitė (opens in new tab), who also goes by Caroline, a PhD student and member of the citizen design team working with Morrison on the research project Find My Things. An AI phone application designed to help people who are blind or have low vision locate their personal items, Find My Things is an example of a broader research approach known as Teachable AI. Morrison and Pakėnaitė explore the Teachable AI goal of empowering people to make an AI experience work for them. They also discuss how “designing for one” when it comes to inclusive design leads to innovative solutions and what they learned about optimizing these types of systems for real-world use (spoiler: it’s not necessarily more or higher-quality data).

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

CECILY MORRISON: One of the things about Teachable AI is that it’s not about the AI system. It’s about the relationship between the user and the AI system. And the key to that relationship is the mental model of the user. They need to make good judgments about how to give good teaching examples if we want that whole cycle between user and AI system to go well.

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC FADES]

Today I’m talking to Dr. Cecily Morrison, MBE, a Senior Principal Research Manager at Microsoft Research, and Karolina Pakėnaitė, a PhD student and a participant on the citizen design team for the Teachable AI research project Find My Things. Cecily and Karolina are part of a growing movement to bring accessible technologies to people with different abilities by closely collaborating with those communities during research and development. Cecily, Karolina, welcome to Collaborators!


CECILY MORRISON: Thank you, Gretchen.

KAROLINA PAKĖNAITĖ: Yeah, thank you.

HUIZINGA: Before we hear more about Find My Things, let’s get to know the both of you. And, Cecily, I’ll start with you. Give us a brief overview of your background, including your training and expertise, and what you’re up to in general right now. We’ll get specific shortly, but I just want to have sort of the umbrella of your raison d’être, or your reason for research being, as it were.

MORRISON: Sure, I’m a researcher in human-computer interaction with a very specific focus on AI and inclusion. Now this for me brings together an undergraduate degree in anthropology—understanding people—a PhD in computer science—understanding computers and technology—as well as a life role as a parent of a disabled child. And I’m currently leading a team that’s really trying to push the boundaries of what’s possible in human-AI interaction and motivated by creating technologies that lead us to a more inclusive world.

HUIZINGA: As a quick follow-up, Cecily, for our non-UK listeners, tell us what MBE stands for and why you were awarded this honor.

MORRISON: Yes, MBE. I also had to look it up when I first received the, uh, the award. [LAUGHTER] It stands for Member of the British Empire, and it’s part of the UK honor system. My MBE was awarded in 2020 for services to inclusive design. Now much of my career at Microsoft Research has been dedicated to innovating inclusive technology and then ensuring that it gets into the hands for those whom we made it for.

HUIZINGA: Right. Was there a big ceremony?

MORRISON: Things were a little bit different during the, the COVID times, but I did have the honor of going to Buckingham Palace to receive the award. And it was a wonderful time bringing my mother and my manager, uh, the important women around me, who’ve made it possible for me to do this work.

HUIZINGA: That’s wonderful. Well, Karolina, let’s talk to you for a minute here. You’re one of the most unique guests we’ve ever had on this podcast. Tell us a bit about yourself. Obviously, we’d like to know where you’re studying and what you’re studying, but this would be a great opportunity to share a little bit about your life story, including the rare condition that brought you to this collaboration.

PAKĖNAITĖ: Thank you so much again for having me. What an amazing opportunity to be here on the podcast. So I’m a PhD student at the University of Bath looking into making visual photographs accessible through text. Maybe you can tell from my speech that I am deaf-blind. So I got diagnosed with Usher syndrome type 2A at the age of 19, which means that I was born hard of hearing but then started to lose sight just around my early 20s. It has been a journey accepting this condition, but it’s also brought me some opportunities like becoming part of this collaboration for Microsoft Research project.

HUIZINGA: Karolina, a quick follow-up for you. Because of the nature of your condition, you’ve encountered some unique challenges, um, one of which made the news a couple of years ago. Can you talk a little bit about how perceptions about people with varying degrees of disability can cause skepticism, both from others and in fact, as you’ve pointed out, yourself? What can we learn about this here?

PAKĖNAITĖ: Yeah, so I have experienced many misunderstandings, and I know I’m not alone. So I have tunnel vision, a progressive condition at the stage where my specialists have registered me as blind instead of partially sighted. My central sight is still excellent, so that means I can still make eye contact, read books, do photography. Some people even tell me that I don’t look blind, but what does that even mean? [LAUGHTER] So since my early 20s, I became very, very clumsy. I stepped over children, walked into elderly, stepped on cat tails, experienced too many near-miss car accidents. So my brain no longer processes the world in the same way as before. But, yeah, for the longest time in my sight-loss journey, I felt like I had imposter syndrome, being completely skeptical about my own diagnosis despite the clumsy experiences, extensive eye tests, and genetic confirmation. I think the major reason is because of a lack of representation of the blind community in the media. Blindness is not black and white. Statistically, most of us have some remaining vision. Disability is not about having a certain look. This also applies to people with some form of visual impairment. I love it, how I can … how there’s so many more new Instagrammers and YouTubers who are just like me, but I still think there is a long way to go before having disability representation becoming a norm for greater understanding and inclusivity.

HUIZINGA: You know, I have to say, this is a great reminder that there is a kind of a spectrum of ability, and that we should be gracious to people as opposed to critical of them. So, um, thank you so much for that understanding that you bring to this, Karolina. Before we get into specifics of this collaboration—and that’s what we’re here for on this podcast—I think the idea of Teachable AI warrants some explication. So, Cecily, what is Teachable AI, and why is it an important line of research, including its applications in things like Find My Things?

MORRISON: Gretchen, that’s a great question. Teachable AI enables users to provide examples or higher-level constraints to an AI model in order to personalize that AI system to meet their own needs. Now most people are familiar with personalization. Our favorite shopping site or entertainment service offers us personalized suggestions. But we don’t always have a way to shape those suggestions. So you can imagine it’s pretty annoying, for example, if you keep being offered nappies by your favorite shopping service because you’ve been buying them for a friend, but actually, you don’t have or even plan to have a baby. So now Teachable AI gives, us—the user—agency in personalizing that AI system to make a choice about what are the things you want to be reflected in yourself, your identity, when you work or interact with that AI system? Now this is really important for AI systems that enable inclusion. So if we consider disability to be a mismatch between a person’s capabilities and their environment, then AI has a really significant role to play in reducing that mismatch. However, as we were working on this, we soon discovered that the number of potential mismatches between a person and their environment is incredibly large. I mean, it’s like the number of stars, right.

HUIZINGA: Right, right.

MORRISON: Because disability is a really heterogeneous group. But then we say, oh, well, let’s just consider people who are blind. Well, as Karolina has just shown us, um, even people who are blind are very, very diverse. So there are people with different amounts of vision or different types of vision. People who have different … experience the world with vision or without. People can lose their vision later in life. They can be born blind. People have different personalities. Some people are happy to go with whatever. Some people not so much.

HUIZINGA: Right.

MORRISON: People are from different cultures. Maybe they, they are used to being in an interdependent context. Other people might have intersecting disabilities like deaf-blindness and have, again, its own set of needs. So as we got into building AI for accessibility and AI for inclusion more generally, we realized that we needed to figure out how can we make AI systems work for individuals, not quote-unquote “people with disabilities”? So we focused on Teachable AI so that each user could shape the AI system to work for their own needs as an individual in a way that they choose, not somebody else. So Find My Things is a simple but working example of a Teachable AI system. And in this example, people can personalize a object finder or object detector for the personal items that matter to them. And they can do this by taking four videos of that personal item that’s important to them and then training, on their phone, a little model that will then recognize those items and guide them to those items. So you might say, well, recognizing objects with phone, we can do that now for a couple of years. And that’s very true. But much of what’s been recognizable wasn’t necessarily very helpful for people who are blind and low vision. Now it’s great if you can recognize doors, chairs, but carnivores and sombrero hats? [LAUGHTER] You know, perhaps this is less handy on a day-to-day basis. But your own keys, your friend’s front door, your guide cane, maybe even the TV remote that somebody’s always putting somewhere else. I mean these are the things that people want to keep track of. And each person has their own set of things that they want. So the Find My Things research prototype allows people to choose what they want to train or to teach to their phone and then be able to teach it and to find those things.

HUIZINGA: OK, so just to clarify, I have my phone. I’ve trained it to find certain objects that I want to find. What’s the mechanism that I use to say, what, you know … do you just say, “Find my keys,” and your phone leads you there through beeps or, you know, Marco Polo? Closer? Warmer?

MORRISON: Sure, how, how does it work?

HUIZINGA: Yeah!

MORRISON: Well, that’s a great question. So you then have a list of things that you can find. So for most people, there’s five or 10 things that are pretty important to them. And then you would find that … then you would scan your phone around the room. And you need to be within sort of 4 to 6 meters of something that you want to find. So if, if it’s in your back studio in the garden, it’s not going to find it. It’s not telepathic in that regard. It’s a computer vision system using vision. If it’s underneath your sofa, you probably won’t find it either. But we found that with all things human-AI interaction, we, we rely on the interaction between the person and the AI to make things work. So most people know where things might be. So if you’re looking for a TV remote, it’s probably not in the bathtub, right? It’s probably going to be somewhere in the living room, but, you know, your, your daughter or your brother or your housemate might have dropped it on the floor; they might have accidentally taken it into the kitchen. But you probably have some good ideas of where that thing might be. So this is then going to help you find it a little bit faster so you don’t need to get on your hands and knees and feel around to where it is.

HUIZINGA: Gotcha. The only downside of this is “find my phone,” which would help me find my things! [LAUGHTER] Anyway, that’s all …

MORRISON: Well, well, I think Apple has solved that one.

HUIZINGA: They do! They have, they have an app. Find My phone. I don’t know how that works. Well, listen, let’s talk about the collaboration a bit and, and talk about the meetup, as I say, on how you started working together. I like to call this bit “how I met your mother” because I’m always interested to hear each side of the collaboration story. So, Karolina, why don’t you take the lead here and then Cecily can fill in the blanks from her side on how you got together.

PAKĖNAITĖ: Um, yeah, so I found this opportunity to join this collaboration for Microsoft Research project as a citizen designer through an email newsletter from a charity, VICTA. From the newsletter, it looked like it was organized in a way where you were way more than just a participant for another research project. It looked like an amazing opportunity to actually get some experiences and skills. So gaining just as much as giving. So, yeah, I thought that I shouldn’t miss out.

HUIZINGA: So you responded to the email, “Yeah, I’m in.”

PAKĖNAITĖ: Yeah.

HUIZINGA: Cecily, what, what was going on from your side? How did you put this out there with this charity and bring this thing together?

MORRISON: So VICTA is a fantastic charity in the UK that works with, uh, blind and low vision young people up to the age of 30. And they’re constantly trying to bring educational and meaningful experiences to the people that they serve. And we thought this would be a great moment of collaboration where we could bring an educational experience about learning how to do design and they could help us reach out to the people who might want to learn about design and might want to be part of this collaboration.

HUIZINGA: So Karolina was one of many? How many other citizen designers on this project did you end up with?

MORRISON: Oh, that’s a great question. We had a lot of interest, I do have to say, and from there, we selected eight citizen designers from around the UK who were willing to make the journey to Cambridge and work with us over a period of almost six months. People came up to us about monthly, although we did some virtual ones, as well.

HUIZINGA: Well, Cecily, let’s talk about this idea of citizen designers. I, I like that term very much. Inclusive design isn’t new in computer-human interaction circles—or human-computer interaction circles—and you already operate on the principle of “nothing about us without us,” so tell us how the concept of citizen designer is different and why you think citizen designers take user input to another level.

MORRISON: Sure, I think citizen designer is a really interesting concept and one that we, we need more of. But let me first start with inclusive design and how that brings us to think about citizen designers. So inclusive design has been a really productive innovation tool because it brings us unusual constraints to the design problem. Within the Microsoft Inclusive Design toolkit, we refer to this as “designing for one.” And once you’ve got this very novel design that emerges, we then optimize it to work for everyone, or we extend it to many. So this approach really jogs the mind to radical solutions. So let me give you just one example. In years past, we developed a physical coding language to support blind and sighted children to learn to code together. So we thought, ah, OK, sighted children have blocks on a screen, so we’re going to make blocks on a table. Well, our young design team lined up the blocks on the table, put their hands in their lap, and I looked at them and I thought, we failed! [LAUGHTER] So we started again, and we said, OK, show us. And we worked with them to show us what excites the hands. You know, here are kids who live through their hands. You know, what are the shapes? What are the interactions? What are the kinds of things they want to do with their hands? And through this, we developed a completely different base idea and design, and we found that it didn’t just excite the hands of children who are blind or low vision, but it excited the hands of all children. They had brought us their expertise in thinking about the world in a different way. And so now we have this product Code Jumper, which kids just can’t put down.

HUIZINGA: Right.

MORRISON: So that’s great. So we, we know that inclusive design is going to generate great ideas. We also know that diverse teams generate the best ideas because diverse life experience can prompt us to think out of the box. But how do we get diverse teams when it can be hard for people with disabilities to enter the field of design and technology? So design assumes often good visual skills; it assumes the ability to draw. And that can knock out a lot of people who might be great at designing technology experiences without those skills. So with our citizen design team, we wanted to open up the opportunity to young people who are blind and low vision to really set the stage for them to think about what would a career in technology design be like? Could I be part of this? Can I be that generation who’s going to design the next cohort of accessible, inclusive technologies? So we did this through teaching key design skills like the design process itself, prototyping, as well as having, uh, this team act as full members of our own R&D team, so in an apprenticeship style. So our citizen designers weren’t just giving feedback as, as participants might, but they were creating prototypes, running A/B tests, and it was our hope and I think we succeeded in making it a give-give situation. We were giving them a set of skills, and they were giving us their design knowledge that was really valuable to our innovation process.

HUIZINGA: That is so awesome. I’m, you know, just thinking of, of the sense of belonging that you might get instead of being, as Karolina kind of referred to, it’s not just another user-research study where you’ll go and be part of a project that someone else is doing. You’re actually integrally connected to the project. And on that note, Karolina, talk a little bit about what it’s like to be a citizen designer. What were some of your aha moments on the project, maybe the items that you wanted to be able to find and what surprises you encountered in the process of developing a technique to teach a personal item?

PAKĖNAITĖ: Yeah, so it was, uh, incredibly fascinating to play the role of a citizen designer and testing a Teachable AI for use and providing further comments. It took me a bit of time to really understand how this tool is different from existing ones, but then I realized it’s literally in the name, a Teachable AI. [LAUGHTER] So it’s a tool designed for teaching it about your very own personal items. Yeah, your items may, may not look like a typical standard item; maybe you personalized them with engravings or stickers, or maybe it’s a unique gadget or maybe, say, a medical device. So it’s not about teaching every single item that you own, but rather a tool, a tool that lets us identify what matters most to you. So, yeah, I have about five to 10 small personal items that I always carry with me, and most of them are like very, very, very important to me. Like losing a bus pass means I can’t get anywhere. Losing a key means I can’t get home. Because these items are small and I use them daily, that means they are also, uh, being lost most commonly. So now I have a tool that is able to locate my personal items if they happen to be lost.

HUIZINGA: Right. And as you said earlier, you do have some sight. It’s, it’s tunnel vision at this point, so the peripheral part, um, is more challenging for you. But having this tool helps you to focus in a broader spectrum of, of visual sight. Cecily, this would be a great time to get a bit more specific about your Teachable AI discovery process. Tell us some research stories. How did you go about optimizing this AI system, and what things did you learn from both your successes and your failures?

MORRISON: Ah, yes, lots of research stories with this system, I’m afraid, but I think the very first thing we did was, OK, a user wants to teach this system, so we need to tell the user what makes a good teaching example. Well, we don’t know. Actually, we assumed we did know because in machine learning, the idea is more data, better quote-unquote “quality data,” and the system will work better. So the first thing that really surprised us when we actually ran some experimental analysis was that more data was not better and higher-quality data, or data that has less blur or is perfectly framed, was also not better. So what we realized is that it wasn’t our aim to kind of squeeze as much data as we could from the users but really to get the data that was the right kind of data. So we did need the object in the image. It’s, it’s really hard to train a system to recognize an object that’s not there at all. But what we needed was data that looked exactly like what the user was going to use when they were finding the objects. So if the user moves the camera really fast and the image becomes blurry, then we need those teaching examples to have blur, too.

HUIZINGA: Right.

MORRISON: So it was in understanding this relationship between the teaching examples and the user that really helped us craft a process that was going to help the user get the best result from the system. One of the things about Teachable AI is that it’s not about the AI system. It’s about the relationship between the user and the AI system. And the key to that relationship is the mental model of the user. They need to make good judgments about how to give good teaching examples if we want that whole cycle between user and AI system to go well. So I remember watching Karolina taking her teaching frames, and she was moving very far away. And I was thinking, hmm, I don’t think that data is going to work very well because there’s just not going to be enough pixels of the object to make a good representation for the system. So I asked Karolina about her strategy, and she said, well, if I want it to work from far away, then I should take teaching examples from far away. And I thought, ah, that’s a very logical mental model.

HUIZINGA: Right.

MORRISON: But unfortunately, we’ve broken the user’s mental model because that’s not actually how the system works because we were cropping frames and taking pixels out and doing all kinds of fancy image manipulation to, actually, to improve the performance under the hood. So I think this was an experience where we thought, ah, we want the user to develop a good mental model, but to do that, we need to actually structure this teaching process so they don’t need to think so hard and we’re guiding them into the, the kinds of things that make the system work well as opposed to not, and then they don’t need to guess. So the other thing that we found was that teaching should be fast and easy. Otherwise, it’s just too much work. No matter how personalized something is, if you have to work too hard, it’s a no-go. So we thought, ah, we want this to be really fast. We want it to take as few frames as possible. And we want the users to be really confident that they’ve got the object in the frame because that’s the one thing we really need. So we’re going to tell them all the time if the object’s in the frame: it’s in frame; it’s in frame; it’s in frame; it’s in frame; it’s in frame; it’s in frame. Well, there’s … citizen designers [LAUGHTER], including Karolina, came back to us and said, you know, this is really stressful. You know, I’m constantly worrying, “Is it in frame? Is it in frame? Is it in frame?” And actually, the cognitive load of that, even though we were trying to make the process really, really easy, um, was, was really overwhelming. And one of them said to us, well, why don’t I just assume that I’m doing a good job unless you tell me otherwise? [LAUGHTER] And that really helped shift our mindset to say, well, OK, we can help the user by giving them a gentle nudge back on track, but we don’t need to grab all their cognitive attention to make the perfect video!

HUIZINGA: [LAUGHS] That’s, that’s so hilarious. Well, Cecily, I want to stay with you for a minute and discuss the broader benefits of what you call “designing outside the mean.” And despite the challenges of developing technologies, we’ve seen specialized research deliver the so-called curb-cut effect over and over. Now you’ve already alluded to this a bit earlier. But clearly people with blindness and low vision aren’t the only ones who can’t find their things. So might this research help other people? Could it, could it be something I could incorporate into my phone?

MORRISON: That’s a great question. And I think an important question when we do any research is how do we broaden this out to meet the, the widest need possible? So I’m going to think about rather than Find My Things specifically, I’m going to think about Teachable AI. And Teachable AI should benefit everybody who needs something specific to themselves. And who of us don’t think that we need things to be specific to ourselves in this day and age?

HUIZINGA: Right … [LAUGHS]

MORRISON: But it’s going to be particularly useful for people on the margins of technology design for many reasons. So it doesn’t matter—it could be where your home is different or the way you go about your daily lives or perhaps the intersection of your identities. By having Teachable AI, we make systems that are going to work for individuals. Regardless of the labels that you might have or the life experience you might have, we want an AI system that works for you. And this is an approach that’s moving us in that direction.

HUIZINGA: You know, I love … I, I remembered what you said earlier, and it was for individuals, not people with disabilities. And I just love that framing anyway because we’re all individuals, and everyone has some kind of a disability, whether you call it that or not. So I just love this work so much. Karolina, back to you for a minute. You have said you’re a very tactile person. What role does haptics, which is the touch/feel part of computer science, play for you in this research, and how do physical cues work for you in this technology?

PAKĖNAITĖ: Yeah, so because I’m deaf-blind, I think my brain naturally craves information through senses which I have full access to. For me, it’s touch. So I find it very stimulating when the tools are tactile, whether that’s vibrations or textures. Tactile feedback not only enhances the experiences, but I think it’s also a good accessibility cue, as well. For example, one big instance that happened as a citizen designer was when I was pointing my camera at an object and, being hard of hearing, I couldn’t hear what it was saying, so I had to bring it close to my, my ear, and that meant that the object was lost in the camera view. [LAUGHS]

HUIZINGA: Right … [LAUGHS]

PAKĖNAITĖ: So … yeah, yeah, I think having tactile cues could be very beneficial for people like me who are deaf-blind but also others. Like, for example, you don’t always want your phone to be on sound all the time. Maybe in a quiet train, in a quiet tube, you don’t want your phone to start talking; you might be feeling self-conscious. So, yeah, I think …

HUIZINGA: Right …

PAKĖNAITĖ: … always adding those tactile cues will benefit me and everyone else.

HUIZINGA: Yeah, so to clarify, is haptics or touch involved in any of this particular Teachable AI technology, Cecily? I know that Karolina has that as a, you know, a “want to have” kind of thing. Where does it stand here?

MORRISON: Yeah, no, I, I think Karolina’s participation, um, was actually fairly critical in us adding, um, vibration cues to the experience.

HUIZINGA: Yeah, so it does use the, the haptic …

MORRISON: Yeah, we use auditory, visual, and, and vibration as a means of interaction. And I think in general, we should be designing all of our experiences with technology to be multisensory because, as Karolina pointed out, in certain circumstances, you don’t really want your computer talking at you. In other circumstances, you need something else. And in our different individual needs, we might need something else. So this allows people to be as flexible as possible for their context and for their own needs to make an experience work for them.

HUIZINGA: Right. Yeah, and I feel like this is already kind of part of our lives when our phones buzz or, or, you know, vibrate or when you wear the watch that gives you a little tip on your wrist that you’ve got a notification or you need to turn left or [LAUGHTER] whatever you’re using it for. Cecily, I always like to know where a project is on the spectrum from lab to life, as we say on this show. What’s the status of Teachable AI in general and Find My Things in particular, and how close is it to being able to be used in real life by a broader audience than your citizen designers and your team?

MORRISON: So it’s really important for us that the technologies we research become available to the communities to whom they are valuable. And in the past, we’ve had a whole set of partners, including Seeing AI, American Printing House for the Blind, to help us take ideas, research prototypes, and make them into products that people can have. Now Teachable AI is a grand vision. I think we are … showed with this work in Find My Things that the machine learning is there. We can do this, and it’s coming. And as we move into this new era of machine learning with these very large models, we’re going to need it there, too, because the larger the model, the more personalized we’re probably going to need the experience. In terms of Find My Things, we are also on that journey to finding the right opportunity to bring it out to the blind community.

HUIZINGA: So this has been fascinating. I’m … there’s so many more questions I want to ask, but we don’t have a lot of time to ask them all. I’m sure that we’re going to be watching as this unfolds and probably becomes part of all of our lives at some point thanks to the wonderful people doing the research. I like to end the podcast with a little future casting from each of my guests, and, Karolina, I’d like you to go first. I have a very specific question for you. Aside from your studies and your research work, you’ve said you’re on a mission. What’s that mission, and what does Mount Everest have to do with it?

PAKĖNAITĖ: So firstly, I’m hoping to complete my PhD this year. That’s my big priority for, for this year. And then, uh, I will be on a mission, an ambitious one that I feel a little bit nervous to share but also very excited. As an adventurer at heart, my dream is to summit Mount Everest. So before it’s always seemed like a fantasy, but I recently came back from an Everest base camp trek just a few months ago, and I met some mountaineers who were on their way to the top, and I found myself quietly saying, what if? And then, as I was thinking how I’m slowly losing my sight, I realized that if I do want to summit Everest, I would want to go there while I still can see with my remaining vision, so I realized that it would have to be now or never.

HUIZINGA: Right!

PAKĖNAITĖ: So when I came back, I decided … I just made some actions. So I reached out to different organizations and surprisingly a film production team is eager to document this journey and … yeah, it seems like something might be happening. So this mission isn’t just about me potentially becoming the first deaf-blind person to summit Everest but also a commitment to raising awareness and providing representation for the blind and deaf-blind community. I hope to stay in the research field, and I believe this mission has some potential for research. So I think that, for example, I’m looking for accessibility tools for, for me to climb Everest so that I can be the best climber I can be as a deaf-blind person, being independent but part of the team, or maybe make a documentary film a multisensory experience, accessible to a wider community, including deaf-blind. So, yeah, I’m actively looking for collaborators and would love to be contacted by anyone.

HUIZINGA: I love the fact that you are bringing awareness to the fact, first of all, that the deaf-blind community or even the blind community isn’t a one-size-fits-all. So, um, yeah, I hope you get to summit Everest to be able to see the world from the tallest peak in the world before anything progresses that way. Well, Cecily, I’d like to close with you. Go with me on a little forward-thinking, backward-thinking journey. You’re at the end of your career looking back. What have you accomplished as a researcher, and how has your work disrupted the field of accessible technology and made the world a better place?

MORRISON: Where would I like to be? I would say more like where would we like to be. So in collaboration with colleagues, I hope we have brought a sense of individual’s agency in their experience with AI systems, which allow people to shape them for their own unique experience, whoever they might be and wherever they might be in the world. And I think this idea is no less important, or perhaps it’s even more important, as we move into a world of large foundation models that underpin many or perhaps all of our experiences as we, as we go forward. And I think particularly large foundation models will bring really significant change to accessibility, and I hope the approach of teachability will be a significantly positive influence in making those experiences just what we need them to be. And I have to say, in my life role, I’m personally really very hopeful for my own blind child’s opportunities in the world of work in 10 years’ time. At the moment, only 25 percent of people who are blind or low vision work. I think technology can play a huge role in getting rid of this mismatch between the environment and a person and allowing many more people with disabilities to enjoy being in the workplace.

HUIZINGA: This is exciting research and really a wonderful collaboration. I’m so grateful, Cecily Morrison and Karolina Pakėnaitė, for coming on the show and talking about it with us today. Thank you so much.

MORRISON: Thank you, Gretchen, and thank you, Karolina.

PAKĖNAITĖ: Thank you.

The post Collaborators: Teachable AI with Cecily Morrison and Karolina Pakėnaitė appeared first on Microsoft Research.

Read More

PwR: Using representations for AI-powered software development

PwR: Using representations for AI-powered software development

This research is being presented at the Agami Summit 2023 (opens in new tab), an annual forum in Maharashtra, India, for innovation in the field of law and justice.

In one scenario of the future, such as the one imagined by Matt Welsh for the Association for Computing Machinery (ACM) (opens in new tab), AI will take the lead in coding while humans oversee the process. This shift will require people to take a supervisory role, focusing on high-level tasks while leaving the code details to AI. As we envision this transformation, we face a critical question: How can we reimagine software development to not just improve developer productivity but also to ensure software safety, reliability, and maintainability while keeping it personalized to developer preferences?

Realizing this outcome relies on AI and developers establishing a common understanding. While natural language can facilitate AI-developer interaction, it also introduces the potential for misinterpreting tasks. Existing solutions address this gap by prompting the AI to communicate its understanding in a structured natural-language document. This document can then be inspected, edited, and approved by the developer. While this is effective, the developer still needs to vet the resulting AI-generated code for safety and reliability, which requires both domain and coding expertise. Our goal is to decouple this requirement, paving the way for numerous organizations and individuals, including those without coding expertise, to develop software.

PwR approach

Programming with Representations (PwR, pronounced “power”), which we are presenting at the Agami Summit 2023 (opens in new tab), is a software development approach that relies on a domain-specific language (DSL), or representation, defined by a developer specializing in a specific domain. This representation includes built-in guardrails that are automatically implemented throughout the software development process. Once a representation is defined for a domain, PwR enables any developer interested in that domain to translate their intentions, expressed in natural language, into a program in that representation. This process is illustrated in Figure 1.

Flowchart showing natural language is transformed into a program in domain specific language using an LLM. This step is called Intent formalization. The user is able to modify, repair and query. The Program in DSL is then converted into natural language representation that can be in text or visual formats. The Program in DSL is also separatedly converted into Code via the Code Generation pipeline. This step is called Robust Code Generation.
Figure 1. The PwR approach converts an ambiguous conversation in natural language into a program in a custom DSL. The DSL program is then transformed into executable code. Not only can the developer provide instructions and requirements, they can also inquire into the current state of the program, receive feedback, and update their instructions accordingly.

PwR uses large language models (LLMs) to interpret user conversations and transform them into DSL programs. This process involves traversing a code-generation pipeline to ultimately derive executable code. However, despite advancements in LLMs’ code generation, these models still grapple with limitations like hallucinations and limited context windows. Using a DSL reduces the amount of code that LLMs need to generate, as most of the code can be generated from the DSL itself, increasing accuracy throughout the process.
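
To make the flow concrete, here is a minimal sketch of that two-stage pipeline, with a hypothetical `llm_complete` callable standing in for the model client and a YAML rendering of the DSL. It is illustrative only and not the actual PwR implementation.

```python
import yaml  # assumes PyYAML; the LLM client is passed in as `llm_complete`

def conversation_to_dsl(conversation: str, llm_complete) -> dict:
    """Intent formalization: ask the LLM to express the requirements as a DSL program."""
    prompt = (
        "Convert the following requirements into a workflow DSL with "
        "'start', 'states', 'transitions', and 'on_error' keys. Respond with YAML only.\n\n"
        + conversation
    )
    return yaml.safe_load(llm_complete(prompt))

def dsl_to_code(dsl_program: dict) -> str:
    """Robust code generation: emit code deterministically from the DSL,
    so the LLM never has to produce the final executable code directly."""
    lines = [f"# workflow entry point: {dsl_program['start']}"]
    for state, action in dsl_program["states"].items():
        lines.append(f"def state_{state}():")
        lines.append(f"    return run_action({action!r})  # placeholder runtime hook")
    return "\n".join(lines)
```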

The DSL incorporates guardrails, ensuring that the essential components are there, such as the starting state of a workflow, clearly defined transitions, and error-handling protocols. These guardrails can be automatically examined and communicated back to the developer in natural language, allowing for necessary corrections. While these guardrails enhance safety, developers must still confirm that the intended functionality was implemented. PwR simply acts as a facilitator within the cycle involving the developer, the LLM, and the DSL checker.
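
As a rough illustration of what such guardrails might look like in practice, the following sketch checks a hypothetical workflow DSL (the same 'start'/'states'/'transitions'/'on_error' shape used above) for a valid starting state, well-defined transitions, and an error-handling protocol. The real PwR checks are defined per domain by the DSL author.

```python
def check_guardrails(dsl_program: dict) -> list[str]:
    """Return a list of issues that can be reported back to the developer in natural language."""
    issues = []
    states = dsl_program.get("states", {})
    if dsl_program.get("start") not in states:
        issues.append("Workflow has no valid starting state.")
    for name, targets in dsl_program.get("transitions", {}).items():
        for target in targets:
            if target not in states:
                issues.append(f"State '{name}' transitions to undefined state '{target}'.")
    if "on_error" not in dsl_program:
        issues.append("No error-handling protocol is defined.")
    return issues
```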

PwR does not require developers to learn a custom DSL. Instead, it generates a natural-language representation (NLR) of the DSL. Developers can inspect this NLR, essentially programming in a natural language representation while the underlying DSL remains concealed. This approach grants developers the flexibility and ease of interacting with a natural-language representation while preserving the precision of their intent within the DSL. Additionally, developers can access a live test environment where their code can be hosted and tested for functionality. These capabilities are integrated into the PwR Studio tool, making it easy to get started.

PwR lowers the programming barrier, empowering nontechnical domain experts like teachers to create software tailored to their specific needs. Additionally, it can improve productivity for complex, multidisciplinary software engineering teams, enabling them to efficiently handle large volumes of changes.

Creating a welfare scheme application with PwR

Let’s take an example of how PwR can be applied. In a scenario where a nongovernmental organization (NGO) aims to develop an application facilitating citizen access to government welfare schemes—enabling search, identification, and application processes involving authentication and deposits—the orchestration of multiple components is crucial. It is vital to set up these components accurately before deploying them at scale, given the program’s involvement with user data and monetary transactions.

Reliable orchestration of these types of components can significantly enhance all types of applications. We initiate this process by building a custom DSL, encoding interconnected workflows comprising various tasks. Each task represents a singular action that might involve calling an external API or another workflow. This DSL seamlessly interacts with external APIs through plugins available in the PwR Studio store.
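
As an illustration, a workflow for this scenario might be encoded in the DSL roughly as follows (a hypothetical, simplified rendering in Python dictionary form; the actual DSL and plugin names would be defined by the domain expert). Passed through the guardrail check sketched earlier, a program like this with a missing transition or an absent error handler would be flagged before any code is generated.

```python
# Hypothetical welfare-scheme workflow, expressed in the simplified DSL shape used above.
welfare_app = {
    "start": "search_schemes",
    "states": {
        "search_schemes": "plugin.scheme_search",        # find matching welfare schemes
        "check_eligibility": "plugin.eligibility_rules",  # identify schemes the citizen qualifies for
        "authenticate": "plugin.national_id_auth",        # external authentication API
        "collect_deposit": "plugin.payment_gateway",      # monetary transaction
        "submit_application": "plugin.scheme_portal",
    },
    "transitions": {
        "search_schemes": ["check_eligibility"],
        "check_eligibility": ["authenticate"],
        "authenticate": ["collect_deposit"],
        "collect_deposit": ["submit_application"],
    },
    "on_error": "notify_support",
}
```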

The following video demonstrates how PwR Studio, configured with the DSL workflow, constructs the NGO application. A developer augments the initial version of the application by incorporating the payment feature. This is accomplished by conversing with PwR Studio, understanding the specific requirements, and implementing necessary modifications. Additionally, the developer gains access to a test environment where they can launch and interact with the application in a controlled setting. 

Video: Step-by-step workflow of a developer building a bot in PwR Studio.

Looking forward

We intend to provide PwR Studio as an open-source integrated development environment (IDE) for creating software through conversations. Our initial aim is to facilitate workflow-based applications for NGOs and social enterprises that have little access to technical expertise. However, our ambitions stretch far beyond this scope.

With the recent success of GitHub Copilot for conversational code generation and recent announcements surrounding OpenAI’s GPTs framework for programming ChatGPT-like bots, it’s evident that AI is poised to democratize software development, granting everyone the ability to create software. With that, it’s imperative to prioritize safety and reliability. PwR is an approach that incorporates these priorities, where the insights of a few technical experts guide a large community of developers through the power of representation. We encourage the software development community to experiment with PwR, and similar ideas, to build safe and reliable AI-powered software.

Learn more on the PwR project page.

Acknowledgements

PwR is the result of a joint collaboration with several of our colleagues, including Sriram Rajamani, B. Ashok, Mohit Jain, Vageesh D C, Dinesh KA, and Sanoop Menon. We would also like to thank Vyshak Jain, Drishti Goel, Hamna, and Sanoop Menon for their help in creating the video.

The post PwR: Using representations for AI-powered software development appeared first on Microsoft Research.

Read More

The Power of Prompting

The Power of Prompting

Illustrated icons of a medical bag, hexagon with circles at its points, and a chat bubble on a blue and purple gradient background.

Today, we published an exploration of the power of prompting strategies that demonstrates how the generalist GPT-4 model can perform as a specialist on medical challenge problem benchmarks. The study shows GPT-4’s ability to outperform a leading model that was fine-tuned specifically for medical applications, on the same benchmarks and by a significant margin. These results join other recent studies showing how prompting strategies alone can be effective in evoking this kind of domain-specific expertise from generalist foundation models.

A visual illustration of Medprompt performance on the MedQA benchmark. Moving from left to right on a horizontal line, the illustration shows how different Medprompt components and additive contributions improve accuracy starting with zero-shot at 81.7 accuracy, to random few-shot at 83.9 accuracy, to random few-shot, chain-of-thought at 87.3 accuracy, to kNN, few-shot, chain-of-thought at 88.4 accuracy, to ensemble with choice shuffle at 90.2 accuracy.
Figure 1: Visual illustration of Medprompt components and additive contributions to performance on the MedQA benchmark. Prompting strategy combines kNN-based few-shot example selection, GPT-4–generated chain-of-thought prompting, and answer-choice shuffled ensembling.

During early evaluations of the capabilities of GPT-4, we were excited to see glimmers of general problem-solving skills, with surprising polymathic capabilities of abstraction, generalization, and composition—including the ability to weave together concepts across disciplines. Beyond these general reasoning powers, we discovered that GPT-4 could be steered via prompting to serve as a domain-specific specialist in numerous areas. Previously, eliciting these capabilities required fine-tuning the language models with specially curated data to achieve top performance in specific domains. This poses the question of whether more extensive training of generalist foundation models might reduce the need for fine-tuning.

In a study shared in March, we demonstrated how very simple prompting strategies revealed GPT-4’s strengths in medical knowledge without special fine-tuning. The results showed how the “out-of-the-box” model could ace a battery of medical challenge problems with basic prompts. In our more recent study, we show how the composition of several prompting strategies into a method that we refer to as “Medprompt” can efficiently steer GPT-4 to achieve top performance. In particular, we find that GPT-4 with Medprompt: 

  • Surpasses 90% on the MedQA dataset for the first time
  • Achieves top reported results on all nine benchmark datasets in the MultiMedQA suite
  • Reduces error rate on MedQA by 27% over that reported by MedPaLM 2 

Many AI practitioners assume that specialty-centric fine-tuning is required to extend generalist foundation models to perform well on specific domains. While fine-tuning can boost performance, the process can be expensive. Fine-tuning often requires experts or professionally labeled datasets (e.g., via top clinicians in the MedPaLM project) and then computing model parameter updates. The process can be resource-intensive and cost-prohibitive, making the approach a difficult challenge for many small and medium-sized organizations. The Medprompt study shows the value of more deeply exploring prompting possibilities for transforming generalist models into specialists and extending the benefits of these models to new domains and applications. In an intriguing finding, the prompting methods we present appear to be valuable, without any domain-specific updates to the prompting strategy, across professional competency exams in a diversity of domains, including electrical engineering, machine learning, philosophy, accounting, law, and psychology. 
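
For readers who want a feel for how the three Medprompt components fit together, here is a simplified sketch combining kNN-based few-shot selection, chain-of-thought examples, and choice-shuffle ensembling. The helper callables (`embed`, `llm_complete`) and the data layout are assumptions for illustration; see the paper for the exact procedure and prompts.

```python
import random
from collections import Counter

def distance(u, v):
    # Euclidean distance between two embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def parse_answer(text, options):
    # Hypothetical parser: return the first option mentioned in the model output.
    for opt in options:
        if opt.lower() in text.lower():
            return opt
    return options[0]

def medprompt_answer(question, choices, train_bank, embed, llm_complete, k=5, n_ensembles=5):
    # 1. kNN-based few-shot selection: pick the k training questions whose
    #    embeddings are closest to the test question's embedding.
    q_vec = embed(question)
    neighbors = sorted(train_bank, key=lambda ex: distance(q_vec, ex["embedding"]))[:k]

    votes = Counter()
    for _ in range(n_ensembles):
        # 3. Choice-shuffle ensembling: permute answer options on each pass
        #    to reduce position bias, then take a majority vote at the end.
        shuffled = random.sample(choices, len(choices))
        # 2. Chain of thought: each few-shot example carries model-generated reasoning.
        prompt = "".join(
            f"Q: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}\n\n"
            for ex in neighbors
        )
        prompt += f"Q: {question}\nOptions: {', '.join(shuffled)}\nReasoning:"
        votes[parse_answer(llm_complete(prompt), shuffled)] += 1
    return votes.most_common(1)[0][0]
```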

At Microsoft, we’ve been working on the best ways to harness the latest advances in large language models across our products and services while keeping a careful focus on understanding and addressing potential issues with the reliability, safety, and usability of applications. It’s been inspirational to see all the creativity, and the careful integration and testing of prototypes, as we continue the journey to share new AI developments with our partners and customers.

A chart shows GPT-4 performance using three different prompting strategies on out-of-domain datasets. GPT-4 with Medprompt outperforms the zero-shot and five-shot baselines across MMLU Machine Learning, MMLU Professional Psychology, MMLU Electrical Engineering, MMLU Philosophy, MMLU Professional Law, MMLU Accounting, NCLEX RegisteredNursing.com, and NCLEX Nurselabs.
Figure 3: GPT-4 performance with three different prompting strategies on out-of-domain datasets. Zero-shot and five-shot approaches represent baselines.

The post The Power of Prompting appeared first on Microsoft Research.

Read More

GPT-4’s potential in shaping the future of radiology

GPT-4’s potential in shaping the future of radiology

This research paper is being presented at the 2023 Conference on Empirical Methods in Natural Language Processing (opens in new tab) (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.

EMNLP 2023 blog hero - female radiologist analyzing an MRI image of the head

In recent years, AI has been increasingly integrated into healthcare, bringing about new areas of focus and priority, such as diagnostics, treatment planning, and patient engagement. While AI’s contribution in certain fields like image analysis and drug interaction is widely recognized, its potential in natural language tasks within these newer areas presents an intriguing research opportunity.

One notable advancement in this area involves GPT-4’s impressive performance (opens in new tab) on medical competency exams and benchmark datasets. GPT-4 has also demonstrated potential utility (opens in new tab) in medical consultations, providing a promising outlook for healthcare innovation.

Progressing radiology AI for real problems

Our paper, “Exploring the Boundaries of GPT-4 in Radiology (opens in new tab),” which we are presenting at EMNLP 2023 (opens in new tab), further explores GPT-4’s potential in healthcare, focusing on its abilities and limitations in radiology—a field that is crucial in disease diagnosis and treatment through imaging technologies like x-rays, computed tomography (CT), and magnetic resonance imaging (MRI). We collaborated with our colleagues at Nuance (opens in new tab), a Microsoft company, whose solution, PowerScribe, is used by more than 80 percent of US radiologists. Together, we aimed to better understand the technology’s impact on radiologists’ workflow.

Our research included a comprehensive evaluation and error analysis framework to rigorously assess GPT-4’s ability to process radiology reports, including common language understanding and generation tasks in radiology, such as disease classification and findings summarization. This framework was developed in collaboration with a board-certified radiologist to tackle more intricate and challenging real-world scenarios in radiology and move beyond mere metric scores.

We also explored various effective zero-, few-shot, and chain-of-thought (CoT) prompting techniques for GPT-4 across different radiology tasks and experimented with approaches to improve the reliability of GPT-4 outputs. For each task, GPT-4 performance was benchmarked against prior GPT-3.5 models and respective state-of-the-art radiology models. 
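
To give a sense of how such prompting variants differ, here is an illustrative zero-shot versus chain-of-thought prompt pair for a findings-summarization task; the wording is hypothetical and not taken from the paper.

```python
# Illustrative prompts only; the study's actual prompt templates differ.
FINDINGS = "Heart size is mildly enlarged. No focal consolidation, effusion, or pneumothorax."

zero_shot_prompt = (
    "Summarize the key clinical impression from the following chest x-ray findings.\n"
    f"Findings: {FINDINGS}\nImpression:"
)

cot_prompt = (
    "Read the chest x-ray findings, reason step by step about which observations "
    "are clinically significant, then give a one-sentence impression.\n"
    f"Findings: {FINDINGS}\nReasoning:"
)
```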

We found that GPT-4 demonstrates new state-of-the-art performance in some tasks, achieving about a 10-percent absolute improvement over existing models, as shown in Table 1. Surprisingly, we found radiology report summaries generated by GPT-4 to be comparable and, in some cases, even preferred over those written by experienced radiologists, with one example illustrated in Table 2.

Table 1: Table showing GPT-4 either outperforms or is on par with previous state-of-the-art multimodal LLMs.
Table 1: Results overview. GPT-4 either outperforms or is on par with previous state-of-the-art (SOTA) multimodal LLMs.
Table 2. Table showing examples where GPT-4 impressions, or findings summaries, are favored over existing manually written impressions on the Open-i dataset. In both examples, GPT-4 outputs are more faithful and provide more complete details on the findings.
Table 2. Examples where GPT-4 findings summaries are favored over existing manually written ones on the Open-i dataset. In both examples, GPT-4 outputs are more faithful and provide more complete details on the findings.

Another encouraging prospect for GPT-4 is its ability to automatically structure radiology reports, as schematically illustrated in Figure 1. These reports, which are based on a radiologist’s interpretation of medical images like x-rays and include the patient’s clinical history, are often complex and unstructured, making them difficult to interpret. Research shows that structuring these reports can improve standardization and consistency in disease descriptions, making them easier to interpret by other healthcare providers and more easily searchable for research and quality improvement initiatives. Additionally, using GPT-4 to structure and standardize radiology reports can further support efforts to augment real-world data (RWD) and its use for real-world evidence (RWE). This can complement more robust and comprehensive clinical trials and, in turn, accelerate the application of research findings into clinical practice.
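
A minimal sketch of what such structuring could look like in code is shown below, assuming a hypothetical `llm_complete` client and a simplified JSON schema; the actual MAIRA pipeline and schema differ.

```python
import json

# Hypothetical structuring prompt and schema, for illustration only.
STRUCTURING_PROMPT = """Extract a structured summary from the radiology findings below.
Return a JSON list of objects with keys: "anatomy", "observation", "severity", "change_from_prior".
Findings: {findings}
JSON:"""

def structure_report(findings: str, llm_complete) -> list[dict]:
    raw = llm_complete(STRUCTURING_PROMPT.format(findings=findings))
    # Downstream tasks (disease classification, progression, impression generation) read this.
    return json.loads(raw)
```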

Figure 1. Radiology report findings are input into GPT-4, which structures the findings into a knowledge graph and performs tasks such as disease classification, disease progression classification, or impression generation.

Beyond radiology, GPT-4’s potential extends to translating medical reports into more empathetic (opens in new tab) and understandable formats for patients and other health professionals. This innovation could revolutionize patient engagement and education, making it easier for them and their carers to actively participate in their healthcare.

A promising path toward advancing radiology and beyond

When used with human oversight, GPT-4 also has the potential to transform radiology by assisting professionals in their day-to-day tasks. As we continue to explore this cutting-edge technology, there is great promise in extending our evaluation of GPT-4 by investigating how its outputs can be verified more thoroughly and by finding ways to improve its accuracy and reliability.

Our research highlights GPT-4’s potential in advancing radiology and other medical specialties, and while our results are encouraging, they require further validation through extensive research and clinical trials. Nonetheless, the emergence of GPT-4 heralds an exciting future for radiology. It will take the entire medical community working alongside other stakeholders in technology and policy to determine the appropriate use of these tools and responsibly realize the opportunity to transform healthcare. We eagerly anticipate its transformative impact towards improving patient care and safety.

Learn more about this work by visiting the Project MAIRA (opens in new tab) (Multimodal AI for Radiology Applications) page.

Acknowledgements 

We’d like to thank our coauthors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Maria Teodora Wetscherek, Robert Tinn, Harshita Sharma, Fernando Perez-Garcia, Anton Schwaighofer, Pranav Rajpurkar, Sameer Tajdin Khanna, Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya V. Nori, Ozan Oktay 

The post GPT-4’s potential in shaping the future of radiology appeared first on Microsoft Research.

Read More

Research Focus: Week of November 22, 2023

Research Focus: Week of November 22, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus: November 22, 2023 on a gradient patterned background

NEW RESEARCH

PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation

Dynamic sparsity is a technique used in machine learning to reduce computational and memory requirements while maintaining or improving performance. This can be particularly useful when computational resources are limited, such as on embedded devices or mobile platforms. However, efficiently supporting dynamic sparse computation is challenging, since the concrete sparsity of tensors is known only at runtime. As a result, state-of-the-art sparsity-aware deep learning solutions are restricted to pre-defined, static sparsity patterns due to significant overheads associated with preprocessing.

In a new paper: PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation, researchers from Microsoft propose a deep-learning compiler for dynamic sparsity. Permutation Invariant Transformation (PIT) uses a novel tiling mechanism to transform multiple sparsely located micro-tiles into a GPU-efficient dense tile without changing the computation results, thus achieving both high GPU utilization and low coverage waste. Given a model, PIT first finds feasible PIT rules for all its operators and generates efficient GPU kernels accordingly. At runtime, with the novel SRead and SWrite primitives, PIT rules can be executed rapidly to support dynamic sparsity in an online manner. Extensive evaluation on diverse models shows that PIT can accelerate dynamic sparsity computation by up to 5.9x (average 2.43x) over state-of-the-art compilers.
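
The following toy NumPy example illustrates the core idea at a conceptual level: sparsely located micro-tiles are gathered into a dense tile, processed by a dense kernel, and scattered back. It is only an analogy for intuition; PIT implements this on the GPU with its SRead and SWrite primitives rather than NumPy indexing.

```python
import numpy as np

matrix = np.zeros((8, 8), dtype=np.float32)
tile_rows = [0, 3, 6]                               # rows holding nonzero micro-tiles, known only at runtime
matrix[tile_rows, :] = np.random.rand(len(tile_rows), 8)

dense_tile = matrix[tile_rows, :]                   # "SRead": permute sparse rows into one dense tile
result_tile = dense_tile @ np.random.rand(8, 4)     # dense kernel runs at full utilization

result = np.zeros((8, 4), dtype=np.float32)
result[tile_rows, :] = result_tile                  # "SWrite": scatter results back into place
```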


NEW RESEARCH

TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices

Mouth-based interfaces are a promising new approach enabling silent, hands-free and eyes-free interaction with wearable devices. However, interfaces sensing mouth movements are traditionally custom-designed and placed near or within the mouth.

TongueTap synchronizes multimodal electroencephalogram (EEG), photoplethysmogram (PPG), inertial measurement unit (IMU), eye tracking and head tracking data from two commercial headsets to facilitate tongue gesture recognition using only off-the-shelf devices on the upper face. In a new paper: TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices, researchers from Microsoft classify eight closed-mouth tongue gestures with 94% accuracy, offering an invisible and inaudible method for discreet control of head-worn devices. Moreover, the research showed that the IMU alone differentiates eight gestures with 80% accuracy and a subset of four gestures with 92% accuracy. The researchers built a dataset of 48,000 gesture trials across 16 participants, allowing TongueTap to perform user-independent classification. The findings suggest tongue gestures can be a viable interaction technique for VR/AR headsets and wearables without requiring novel hardware.
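
As a rough sketch of the user-independent evaluation setup described above, the snippet below trains a classifier on fused per-trial features with whole participants held out by group; the data here are synthetic placeholders, and the real study uses synchronized EEG, PPG, IMU, eye-tracking, and head-tracking features.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier

n_trials, n_features = 480, 64
X = np.random.randn(n_trials, n_features)            # placeholder per-trial features from fused sensors
y = np.random.randint(0, 8, n_trials)                 # eight tongue gestures
participants = np.repeat(np.arange(16), n_trials // 16)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=participants):
    clf.fit(X[train_idx], y[train_idx])               # leave whole participants out of training
    print("held-out accuracy:", clf.score(X[test_idx], y[test_idx]))
```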


NEW RESEARCH

Ranking LLM-Generated Loop Invariants for Program Verification

Synthesizing inductive loop invariants is fundamental to automating program verification. In a new paper: Ranking LLM-Generated Loop Invariants for Program Verification, researchers from Microsoft demonstrate that large language models (LLMs), such as GPT-3.5 or GPT-4, are capable of synthesizing loop invariants for a class of programs in a zero-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier or present multiple incorrect suggestions to a user of an interactive verification tool when establishing an invariant.

To address this issue, the researchers propose a re-ranking approach for the generated results of LLMs, including a newly designed ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.
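
A schematic of the generate-rank-verify loop is sketched below, with hypothetical interfaces for the LLM sampler, the ranker, and the verifier; the paper's ranker is trained contrastively, which is not shown here.

```python
def find_invariant(problem, sample_invariant, rank_score, verify, n_samples=20):
    """Sample candidate invariants, rank them, and try the verifier in ranked order."""
    candidates = {sample_invariant(problem) for _ in range(n_samples)}   # deduplicate LLM samples
    ranked = sorted(candidates, key=lambda inv: rank_score(problem, inv), reverse=True)
    for calls, inv in enumerate(ranked, start=1):
        if verify(problem, inv):        # expensive call; good ranking keeps this count low
            return inv, calls
    return None, len(ranked)
```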


NEW RESEARCH

Assessing the limits of zero-shot foundation models in single-cell biology

The success of foundation models such as GPT has sparked growing interest in their application to single-cell biology. Models like Geneformer (opens in new tab) and scGPT (opens in new tab) have emerged with the promise of serving as versatile tools for this specialized field. However, the efficacy of these models, particularly in zero-shot settings where models are not fine-tuned but used without any further training, remains an open question, especially as practical constraints require useful models to function in settings that preclude fine-tuning. For example, many biological problems are inherently exploratory, and intended to discover hypotheses for further experimentation. In such settings, labels that can serve as targets for downstream fine-tuning may not be known or may be biased. In other computational biology domains (including microscopy images and protein sequences), zero-shot evaluation is routine practice for this reason. However, this is not yet an established standard for single-cell foundation model work, where evaluation practices are still emerging.

In a new paper: Assessing the limits of zero-shot foundation models in single-cell biology, researchers from Microsoft present a rigorous evaluation of the zero-shot performance of these proposed single-cell foundation models. They assess their utility in tasks such as cell type clustering and batch effect correction, and evaluate the generality of their pretraining objectives. Research results indicate that both Geneformer and scGPT exhibit limited reliability in zero-shot settings and often underperform compared to simpler methods. These findings serve as a cautionary note for the deployment of proposed single-cell foundation models and highlight the need for more focused research to realize their potential.
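
For intuition, a zero-shot evaluation of this kind can be sketched as follows: embed cells, cluster, and score agreement with annotated cell types, comparing a foundation-model embedding against a simple baseline such as PCA on log-normalized counts. The data and the foundation-model embedding call are placeholders, not the benchmarks used in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

counts = np.random.poisson(1.0, size=(1000, 2000)).astype(float)   # synthetic cells x genes matrix
cell_types = np.random.randint(0, 8, size=1000)                     # annotated cell-type labels

baseline = PCA(n_components=50).fit_transform(np.log1p(counts))     # simple baseline embedding
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(baseline)
print("baseline ARI:", adjusted_rand_score(cell_types, clusters))
# Repeat with the foundation model's zero-shot embedding in place of `baseline`
# (e.g., a hypothetical model.embed(counts)) and compare the adjusted Rand index.
```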


NEW RESEARCH

Confidential Consortium Framework: Secure Multiparty Applications with Confidentiality, Integrity, and High Availability

Confidentiality, integrity protection, and high availability – abbreviated to CIA – are essential properties for trustworthy data systems. However, the rise of cloud computing and the growing demand for multiparty applications make building modern CIA systems more challenging than ever.

In response, researchers from Microsoft present: Confidential Consortium Framework: Secure Multiparty Applications with Confidentiality, Integrity, and High Availability (opens in new tab), a general-purpose foundation for developing secure stateful CIA applications. Confidential Consortium Framework (CCF) combines centralized compute with decentralized trust, supporting deployment on untrusted cloud infrastructure and transparent governance by mutually untrusted parties. CCF leverages hardware-based trusted execution environments for remotely verifiable confidentiality and code integrity. This is coupled with state machine replication backed by an auditable immutable ledger for data integrity and high availability. CCF enables each service to bring its own application logic, custom multiparty governance model, and deployment scenario, decoupling the operators of nodes from the consortium that governs them. CCF is open-source and available now at https://github.com/microsoft/CCF (opens in new tab).

The post Research Focus: Week of November 22, 2023 appeared first on Microsoft Research.

Read More