The Power of Prompting

Today, we published an exploration of the power of prompting strategies that demonstrates how the generalist GPT-4 model can perform as a specialist on medical challenge problem benchmarks. The study shows GPT-4’s ability to outperform a leading model that was fine-tuned specifically for medical applications, on the same benchmarks and by a significant margin. These results join other recent studies showing that prompting strategies alone can be effective in evoking this kind of domain-specific expertise from generalist foundation models.

A visual illustration of Medprompt performance on the MedQA benchmark. Moving from left to right on a horizontal line, the illustration shows how different Medprompt components and additive contributions improve accuracy starting with zero-shot at 81.7 accuracy, to random few-shot at 83.9 accuracy, to random few-shot, chain-of-thought at 87.3 accuracy, to kNN, few-shot, chain-of-thought at 88.4 accuracy, to ensemble with choice shuffle at 90.2 accuracy.
Figure 1: Visual illustration of Medprompt components and additive contributions to performance on the MedQA benchmark. Prompting strategy combines kNN-based few-shot example selection, GPT-4–generated chain-of-thought prompting, and answer-choice shuffled ensembling.
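
Two of the components named in the caption, kNN-based few-shot selection and answer-choice shuffled ensembling, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: the embedding vectors, the `ask_model` callable, and the vote count are all hypothetical stand-ins, and the chain-of-thought step is omitted.

```python
import random
from collections import Counter

import numpy as np


def knn_select(query_vec, example_vecs, k):
    """Indices of the k stored examples most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q
    return np.argsort(-sims)[:k].tolist()


def shuffled_ensemble(question, choices, ask_model, n_votes=5, seed=0):
    """Query the model several times with the answer order shuffled, then majority-vote.

    ask_model is a hypothetical callable: it receives the question and the
    shuffled choices and returns the index of the choice it picks.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_votes):
        order = list(choices)
        rng.shuffle(order)
        picked = ask_model(question, order)
        votes[order[picked]] += 1  # vote on the choice text, not its position
    return votes.most_common(1)[0][0]


# Toy data: three stored examples with 2-D "embeddings" (assumed, for illustration).
example_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
nearest = knn_select(np.array([1.0, 0.0]), example_vecs, k=2)

# A stand-in model that always recognizes the same answer text.
answer = shuffled_ensemble("Which option?", ["A", "B", "C", "D"],
                           lambda q, order: order.index("B"))
```

Voting on the choice text rather than its position is what lets the shuffle cancel out any preference a model might have for a particular answer slot.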

During early evaluations of the capabilities of GPT-4, we were excited to see glimmers of general problem-solving skills, with surprising polymathic capabilities of abstraction, generalization, and composition—including the ability to weave together concepts across disciplines. Beyond these general reasoning powers, we discovered that GPT-4 could be steered via prompting to serve as a domain-specific specialist in numerous areas. Previously, eliciting these capabilities required fine-tuning the language models with specially curated data to achieve top performance in specific domains. This poses the question of whether more extensive training of generalist foundation models might reduce the need for fine-tuning.

In a study shared in March, we demonstrated how very simple prompting strategies revealed GPT-4’s strengths in medical knowledge without special fine-tuning. The results showed how the “out-of-the-box” model could ace a battery of medical challenge problems with basic prompts. In our more recent study, we show how the composition of several prompting strategies into a method that we refer to as “Medprompt” can efficiently steer GPT-4 to achieve top performance. In particular, we find that GPT-4 with Medprompt: 

  • Surpasses 90% on the MedQA dataset for the first time
  • Achieves top reported results on all nine benchmark datasets in the MultiMedQA suite
  • Reduces error rate on MedQA by 27% over that reported by MedPaLM 2 
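
To make the "error rate reduction" arithmetic concrete, the sketch below computes a relative reduction from two accuracies and, in the other direction, the baseline accuracy implied by a stated reduction. The 81.7 and 90.2 values are the MedQA accuracies quoted for Figure 1; the helper names are illustrative, and because the 27% figure is rounded, the implied baseline is only approximate.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed; accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc, reduction):
    """Baseline accuracy (percent) consistent with a stated relative error reduction."""
    return 100.0 - (100.0 - new_acc) / (1.0 - reduction)


# Zero-shot (81.7) to full Medprompt (90.2) on MedQA, per Figure 1.
zero_shot_to_medprompt = relative_error_reduction(81.7, 90.2)

# Baseline accuracy consistent with a 27% reduction ending at 90.2.
implied_baseline = implied_baseline_accuracy(90.2, 0.27)

print(f"{zero_shot_to_medprompt:.1%} of zero-shot errors removed")
print(f"implied baseline accuracy: about {implied_baseline:.1f}")
```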

Many AI practitioners assume that specialty-centric fine-tuning is required to extend generalist foundation models to perform well on specific domains. While fine-tuning can boost performance, the process can be expensive. Fine-tuning often requires experts or professionally labeled datasets (e.g., via top clinicians in the MedPaLM project) and then computing model parameter updates. The process can be resource-intensive and cost-prohibitive, making the approach a difficult challenge for many small and medium-sized organizations. The Medprompt study shows the value of more deeply exploring prompting possibilities for transforming generalist models into specialists and extending the benefits of these models to new domains and applications. In an intriguing finding, the prompting methods we present appear to be valuable, without any domain-specific updates to the prompting strategy, across professional competency exams in a diversity of domains, including electrical engineering, machine learning, philosophy, accounting, law, and psychology. 

At Microsoft, we’ve been working on the best ways to harness the latest advances in large language models across our products and services while keeping a careful focus on understanding and addressing potential issues with the reliability, safety, and usability of applications. It’s been inspirational to see all the creativity, and the careful integration and testing of prototypes, as we continue the journey to share new AI developments with our partners and customers.

A chart shows GPT-4 performance using three different prompting strategies on out-of-domain datasets. GPT-4 with Medprompt outperforms the zero-shot and five-shot approaches across MMLU Machine Learning, MMLU Professional Psychology, MMLU Electrical Engineering, MMLU Philosophy, MMLU Professional Law, MMLU Accounting, NCLEX, and NCLEX Nurselabs.
Figure 3: GPT-4 performance with three different prompting strategies on out-of-domain datasets. Zero-shot and five-shot approaches represent baselines.

The post The Power of Prompting appeared first on Microsoft Research.

GPT-4’s potential in shaping the future of radiology

This research paper is being presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.

In recent years, AI has been increasingly integrated into healthcare, bringing about new areas of focus and priority, such as diagnostics, treatment planning, and patient engagement. While AI’s contribution in certain fields like image analysis and drug interaction is widely recognized, its potential in natural language tasks within these newer areas presents an intriguing research opportunity.

One notable advancement in this area involves GPT-4’s impressive performance on medical competency exams and benchmark datasets. GPT-4 has also demonstrated potential utility in medical consultations, providing a promising outlook for healthcare innovation.

Progressing radiology AI for real problems

Our paper, “Exploring the Boundaries of GPT-4 in Radiology,” which we are presenting at EMNLP 2023, further explores GPT-4’s potential in healthcare, focusing on its abilities and limitations in radiology, a field that is crucial in disease diagnosis and treatment through imaging technologies like x-rays, computed tomography (CT), and magnetic resonance imaging (MRI). We collaborated with our colleagues at Nuance, a Microsoft company, whose solution, PowerScribe, is used by more than 80 percent of US radiologists. Together, we aimed to better understand technology’s impact on radiologists’ workflow.

Our research included a comprehensive evaluation and error analysis framework to rigorously assess GPT-4’s ability to process radiology reports, including common language understanding and generation tasks in radiology, such as disease classification and findings summarization. This framework was developed in collaboration with a board-certified radiologist to tackle more intricate and challenging real-world scenarios in radiology and move beyond mere metric scores.

We also explored various effective zero-, few-shot, and chain-of-thought (CoT) prompting techniques for GPT-4 across different radiology tasks and experimented with approaches to improve the reliability of GPT-4 outputs. For each task, GPT-4 performance was benchmarked against prior GPT-3.5 models and respective state-of-the-art radiology models. 

We found that GPT-4 demonstrates new state-of-the-art performance in some tasks, achieving about a 10-percent absolute improvement over existing models, as shown in Table 1. Surprisingly, we found radiology report summaries generated by GPT-4 to be comparable to and, in some cases, even preferred over those written by experienced radiologists, with one example illustrated in Table 2.

Table 1: Results overview. GPT-4 either outperforms or is on par with previous state-of-the-art (SOTA) multimodal LLMs.
Table 2. Examples where GPT-4 impressions, or findings summaries, are favored over existing manually written ones on the Open-i dataset. In both examples, GPT-4 outputs are more faithful and provide more complete details on the findings.

Another encouraging prospect for GPT-4 is its ability to automatically structure radiology reports, as schematically illustrated in Figure 1. These reports, which are based on a radiologist’s interpretation of medical images like x-rays and include patients’ clinical history, are often complex and unstructured, making them difficult to interpret. Research shows that structuring these reports can improve standardization and consistency in disease descriptions, making them easier for other healthcare providers to interpret and more easily searchable for research and quality improvement initiatives. Additionally, using GPT-4 to structure and standardize radiology reports can further support efforts to augment real-world data (RWD) and its use for real-world evidence (RWE). This can complement more robust and comprehensive clinical trials and, in turn, accelerate the application of research findings into clinical practice.

Figure 1. Radiology report findings are input into GPT-4, which structures the findings into a knowledge graph and performs tasks such as disease classification, disease progression classification, or impression generation.

Beyond radiology, GPT-4’s potential extends to translating medical reports into more empathetic and understandable formats for patients and other health professionals. This innovation could revolutionize patient engagement and education, making it easier for patients and their carers to actively participate in their healthcare.

A promising path toward advancing radiology and beyond

When used with human oversight, GPT-4 also has the potential to transform radiology by assisting professionals in their day-to-day tasks. As we continue to explore this cutting-edge technology, there is great promise in improving our evaluation results of GPT-4 by investigating how it can be verified more thoroughly and finding ways to improve its accuracy and reliability. 

Our research highlights GPT-4’s potential in advancing radiology and other medical specialties, and while our results are encouraging, they require further validation through extensive research and clinical trials. Nonetheless, the emergence of GPT-4 heralds an exciting future for radiology. It will take the entire medical community working alongside other stakeholders in technology and policy to determine the appropriate use of these tools and responsibly realize the opportunity to transform healthcare. We eagerly anticipate its transformative impact towards improving patient care and safety.

Learn more about this work by visiting the Project MAIRA (Multimodal AI for Radiology Applications) page.


We’d like to thank our coauthors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Maria Teodora Wetscherek, Robert Tinn, Harshita Sharma, Fernando Perez-Garcia, Anton Schwaighofer, Pranav Rajpurkar, Sameer Tajdin Khanna, Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya V. Nori, Ozan Oktay 

The post GPT-4’s potential in shaping the future of radiology appeared first on Microsoft Research.

Research Focus: Week of November 22, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation

Dynamic sparsity is a technique used in machine learning to reduce computational and memory requirements while maintaining or improving performance. This can be particularly useful when computational resources are limited, such as on embedded devices or mobile platforms. However, efficiently supporting dynamic sparse computation is challenging, since the concrete sparsity of tensors is known only at runtime. As a result, state-of-the-art sparsity-aware deep learning solutions are restricted to pre-defined, static sparsity patterns due to significant overheads associated with preprocessing.

In a new paper: PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation, researchers from Microsoft propose a deep-learning compiler for dynamic sparsity. Permutation Invariant Transformation (PIT) uses a novel tiling mechanism to transform multiple sparsely located micro-tiles into a GPU-efficient dense tile without changing the computation results, thus achieving both high GPU utilization and low coverage waste. Given a model, PIT first finds feasible PIT rules for all its operators and generates efficient GPU kernels accordingly. At runtime, with the novel SRead and SWrite primitives, PIT rules can be executed rapidly to support dynamic sparsity in an online manner. Extensive evaluation on diverse models shows that PIT can accelerate dynamic sparsity computation by up to 5.9x (average 2.43x) over state-of-the-art compilers.
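
The core idea, gathering scattered non-zero pieces into one dense tile, computing on it, and scattering the results back without changing the answer, can be illustrated on the CPU with NumPy. This is only a conceptual sketch under simplifying assumptions: it works at whole-row granularity rather than on micro-tiles, it does not generate GPU kernels, and the `sread` and `swrite` functions merely borrow the names of the paper's primitives as labels.

```python
import numpy as np


def sread(x, active_rows):
    """Gather the sparsely located non-zero rows into one dense tile."""
    return x[active_rows]


def swrite(out, dense_result, active_rows):
    """Scatter dense results back to the rows they came from."""
    out[active_rows] = dense_result
    return out


def sparse_matmul(x, w):
    """Compute x @ w while doing dense work only on the rows that are non-zero."""
    # The sparsity pattern is discovered at runtime, not fixed ahead of time.
    active_rows = np.flatnonzero(np.any(x != 0, axis=1))
    dense_tile = sread(x, active_rows)
    dense_result = dense_tile @ w
    out = np.zeros((x.shape[0], w.shape[1]), dtype=dense_result.dtype)
    return swrite(out, dense_result, active_rows)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
x[[1, 2, 5, 6]] = 0.0          # half of the rows are zero for this input
w = rng.normal(size=(4, 3))
result = sparse_matmul(x, w)
```

Because row order does not affect each row's product with `w`, the gathered computation matches the full `x @ w`, which is the permutation-invariance property the transformation relies on.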


TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices

Mouth-based interfaces are a promising new approach enabling silent, hands-free and eyes-free interaction with wearable devices. However, interfaces sensing mouth movements are traditionally custom-designed and placed near or within the mouth.

TongueTap synchronizes multimodal electroencephalogram (EEG), photoplethysmogram (PPG), inertial measurement unit (IMU), eye tracking and head tracking data from two commercial headsets to facilitate tongue gesture recognition using only off-the-shelf devices on the upper face. In a new paper: TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices, researchers from Microsoft classify eight closed-mouth tongue gestures with 94% accuracy, offering an invisible and inaudible method for discreet control of head-worn devices. Moreover, the research showed that the IMU alone differentiates eight gestures with 80% accuracy and a subset of four gestures with 92% accuracy. The researchers built a dataset of 48,000 gesture trials across 16 participants, allowing TongueTap to perform user-independent classification. The findings suggest tongue gestures can be a viable interaction technique for VR/AR headsets and wearables without requiring novel hardware.


Ranking LLM-Generated Loop Invariants for Program Verification

Synthesizing inductive loop invariants is fundamental to automating program verification. In a new paper: Ranking LLM-Generated Loop Invariants for Program Verification, researchers from Microsoft demonstrate that large language models (LLMs), such as GPT-3.5 or GPT-4, are capable of synthesizing loop invariants for a class of programs in a zero-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier or provide multiple incorrect suggestions to an interactive verification user in establishing an invariant.

To address this issue, the researchers propose a re-ranking approach for the generated results of LLMs, including a newly designed ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.
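
The effect of re-ranking on verifier calls can be shown with a toy example. Everything here is an illustrative assumption: the loop being verified, the four candidate invariants, the hard-coded ranker scores (the paper's ranker is learned), and the `verify` function, which is a bounded brute-force check standing in for a real program verifier.

```python
from itertools import product

# Toy loop:  i = 0; s = 0; while i < n: s += i; i += 1
# Postcondition on exit:  s == n * (n - 1) // 2


def verify(inv, bound=6):
    """Bounded check of initiation, consecution, and the postcondition."""
    for n in range(bound):
        if not inv(0, 0, n):                      # holds on loop entry
            return False
    for i, s, n in product(range(bound), range(bound * bound), range(bound)):
        if not inv(i, s, n):
            continue
        if i < n and not inv(i + 1, s + i, n):    # preserved by one iteration
            return False
        if i >= n and s != n * (n - 1) // 2:      # strong enough on exit
            return False
    return True


def calls_until_verified(candidates):
    """Number of verifier calls made before the first correct invariant is found."""
    for calls, (_, inv) in enumerate(candidates, start=1):
        if verify(inv):
            return calls
    return None


# (hypothetical ranker score, candidate invariant), in the order they were sampled.
candidates = [
    (0.1, lambda i, s, n: s >= 0),
    (0.3, lambda i, s, n: i <= n),
    (0.6, lambda i, s, n: s == i * (i - 1) // 2),
    (0.9, lambda i, s, n: s == i * (i - 1) // 2 and i <= n),
]

unranked_calls = calls_until_verified(candidates)
reranked = sorted(candidates, key=lambda c: c[0], reverse=True)
reranked_calls = calls_until_verified(reranked)
```

In this toy setup only the last candidate is both inductive and strong enough, so trying candidates in sampled order costs four verifier calls, while trying them in score order costs one.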


Assessing the limits of zero-shot foundation models in single-cell biology

The success of foundation models such as GPT has sparked growing interest in their application to single-cell biology. Models like Geneformer and scGPT have emerged with the promise of serving as versatile tools for this specialized field. However, the efficacy of these models, particularly in zero-shot settings where models are not fine-tuned but used without any further training, remains an open question, especially as practical constraints require useful models to function in settings that preclude fine-tuning. For example, many biological problems are inherently exploratory, and intended to discover hypotheses for further experimentation. In such settings, labels that can serve as targets for downstream fine-tuning may not be known or may be biased. In other computational biology domains (including microscopy images and protein sequences), zero-shot evaluation is routine practice for this reason. However, this is not yet an established standard for single-cell foundation model work, where evaluation practices are still emerging.

In a new paper: Assessing the limits of zero-shot foundation models in single-cell biology, researchers from Microsoft present a rigorous evaluation of the zero-shot performance of these proposed single-cell foundation models. They assess their utility in tasks such as cell type clustering and batch effect correction, and evaluate the generality of their pretraining objectives. Research results indicate that both Geneformer and scGPT exhibit limited reliability in zero-shot settings and often underperform compared to simpler methods. These findings serve as a cautionary note for the deployment of proposed single-cell foundation models and highlight the need for more focused research to realize their potential.


Confidential Consortium Framework: Secure Multiparty Applications with Confidentiality, Integrity, and High Availability

Confidentiality, integrity protection, and high availability – abbreviated to CIA – are essential properties for trustworthy data systems. However, the rise of cloud computing and the growing demand for multiparty applications make building modern CIA systems more challenging than ever.

In response, researchers from Microsoft present: Confidential Consortium Framework: Secure Multiparty Applications with Confidentiality, Integrity, and High Availability, a general-purpose foundation for developing secure stateful CIA applications. Confidential Consortium Framework (CCF) combines centralized compute with decentralized trust, supporting deployment on untrusted cloud infrastructure and transparent governance by mutually untrusted parties. CCF leverages hardware-based trusted execution environments for remotely verifiable confidentiality and code integrity. This is coupled with state machine replication backed by an auditable immutable ledger for data integrity and high availability. CCF enables each service to bring its own application logic, custom multiparty governance model, and deployment scenario, decoupling the operators of nodes from the consortium that governs them. CCF is open source and available now.

The post Research Focus: Week of November 22, 2023 appeared first on Microsoft Research.

Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting

Large language models (LLMs) are profoundly useful for a vast array of difficult tasks. But they sometimes make unpredictable mistakes or perpetuate biased language. These sorts of errors tend to arise over time due to changes in the underlying data or in user behavior. This necessitates targeted, cost-effective fixes to these models and the real-world applications they support.

Repeated pretraining or finetuning might be used to achieve these fixes. However, these solutions are often too computationally expensive. For example, LLAMA 1 was trained for 21 days on 2,048 A100 GPUs, costing over $2.4 million. Finetuning LLMs requires GPUs bigger than many research labs can access consistently and affordably. Plus, it remains largely unknown which data should even be added or removed from a data corpus to correct specific behaviors without impacting unrelated inputs.

To keep LLMs up to date without expensive training, model editing has recently been proposed as a paradigm for making targeted updates to big models. Most model editors update a model once, injecting a batch of corrections. But mistakes are often discovered sequentially over time and must be corrected quickly. In other words, lifelong model editing, in which a stream of mistakes is encountered and must be addressed immediately, is essential when the models are deployed. This requires making many edits sequentially, a setting in which existing editors are known to fail. Success here means correcting all errors in sequence, without forgetting old fixes and without decaying performance on unrelated inputs. But what exactly is an edit? In Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors, three types of edits are considered:

  1. Updating factual knowledge. Let’s say we have a pre-trained question-answering model: We pass questions in, and the model returns answers. But as the world changes, these answers become outdated. For example, the answer to “Who is the president of the U.S.?” should change after an election. Therefore, an edit is a tuple – or an ordered sequence of values – containing a question (e.g., “Who is the president of the U.S.?”) and the correct answer (e.g., “Biden”) for the question.
  2. Keeping up with flipping labels. Ground truth in classification tasks can change over time. For example, when U.S. courts use new language to describe existing topics, a document’s correct label can change. In such a case, a model trained on the old labels must be corrected. Targeted edits are especially important when only specific types of data are relabeled, which is common. In this case, an edit is a paired input (e.g., court document) and a new label (e.g., topic).
  3. Mitigating fabrication and incoherence in LLMs. A key challenge in using LLMs is avoiding instances where they generate language that is ungrounded in reality. But this might happen more in some models than others. Therefore, when it does happen, the ensuing edit should be as small as possible. To explore the effectiveness of this approach, the researchers consider mitigating this problem when generating biographies of famous people. Upon identifying hand-annotated fabrications, they edit an LLM to instead produce corresponding sentences from real Wikipedia articles. In this case, an edit is a prompt and a corresponding response, which the existing model finds unlikely.
This figure shows an overview of the proposed approach. On the left is a question (what was the latest pandemic?) and the model’s existing answer to it (Swine Flu), which is wrong; the editing method needs to update it to the correct answer (COVID). In the middle, the architecture is shown: the language model is frozen, and embeddings are extracted to retrieve appropriate values (new embeddings) from the codebook. On the right, the codebook is shown, which includes a set of trainable embeddings.
Figure 1. Overview of lifelong model editing with GRACE. Models make important errors that must be corrected. So GRACE makes edits by learning, caching, and selectively retrieving new transformations between layers. Over long sequences of edits, which appear sporadically and require quick fixes, GRACE codebooks grow and adapt.

To make cost-effective edits to LLMs, we propose an approach referred to as General Retrieval Adaptors for Continual Editing, or GRACE. GRACE is the first method to enable thousands of sequential edits to any pre-trained model architecture using only streaming errors. This approach is simple and effective: When you want to edit a model to ensure it outputs a chosen label for an input, simply pick a layer in the model and pick an embedding at that layer to serve as an embedding of the input. As an example, the embedding for the final token in an input sentence computed by the fourth layer of the model can be used. Then, this embedding is cached and a new embedding is learned such that if the new embedding is substituted for the old one, the model produces the desired response. The original embedding is referred to as a key, and the learned embedding as a value. Learning the value is straightforward via gradient descent. The key and value are then stored in a codebook, which acts as a dictionary. When a new input is then passed to the model, its embedding, referred to as a query, is computed and compared to the existing keys. If a query matches a key, one can look up the value and apply the edit. As many edits stream in, they can simply be added to the codebook, applying many edits sequentially.
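
The key, value, and query mechanics can be sketched with a tiny frozen "model" and a codebook that intercepts one layer's output. This is a toy illustration, not the GRACE implementation: the two-layer identity model, the fixed `match_radius`, and the hand-picked value (which the approach above would learn by gradient descent) are all assumptions made for brevity.

```python
import numpy as np


class Codebook:
    """Toy key -> value store consulted at one chosen layer."""

    def __init__(self, match_radius):
        self.keys, self.values = [], []
        self.match_radius = match_radius  # assumption: one fixed matching threshold

    def add(self, key, value):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(np.asarray(value, dtype=float))

    def __call__(self, query):
        """Return the cached value if the query is close to a key, else the query itself."""
        if not self.keys:
            return query
        dists = [np.linalg.norm(query - k) for k in self.keys]
        nearest = int(np.argmin(dists))
        if dists[nearest] <= self.match_radius:
            return self.values[nearest]
        return query


# Frozen toy model: two linear layers followed by an argmax over two labels.
W1 = np.eye(2)
W2 = np.eye(2)


def model(x, codebook=None):
    h = W1 @ np.asarray(x, dtype=float)   # embedding at the edited layer
    if codebook is not None:
        h = codebook(h)                   # substitute the value when a key matches
    return int(np.argmax(W2 @ h))


x_edit = [1.0, 0.2]                       # input whose label should change from 0 to 1
codebook = Codebook(match_radius=0.5)
codebook.add(key=W1 @ np.array(x_edit), value=[0.0, 1.0])  # value picked by hand here

before = model(x_edit)
after = model(x_edit, codebook)
unrelated = model([3.0, 0.0], codebook)   # far from the key, so left untouched
```

The model weights `W1` and `W2` are never modified; only the codebook changes, which is why an edit can be removed again by deleting its entry.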

Table 1. GRACE outperforms existing model editors by successfully editing models without forgetting previous edits or unrelated training data. On the zsRE and SCOTUS datasets, GRACE achieves substantial compression. On the Hallucination dataset, GRACE successfully embeds long future sequences of tokens into cached values.

But isn’t this just memorization? How can generalizable edits be achieved without memorizing every new input? Instead of always adding new keys, every new key is paired with an influence radius, which is a ball surrounding any new key with a radius of ε. Then, if any query lands inside this ε-ball, the key’s corresponding value is retrieved and the edit is applied. Thus, inputs that are similar to any cached edits will also be updated. Occasionally, when creating a new key, its ε-ball may conflict with another key. In this case, when the conflicting keys have different values, their ε-balls are set to just barely touch. If they have the same values, the existing key’s ε is increased to include the new input. Tuning ε helps achieve small codebooks that are generalizable and can successfully make thousands of edits in a row.
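
The ε-ball bookkeeping described above can be sketched as a single update function. This is a minimal sketch under stated assumptions rather than the paper's algorithm: a conflict is taken to mean that the new ball would overlap the nearest existing ball, and "just barely touch" is implemented by giving each of the two balls half the distance between their keys. The function name and the returned action strings are illustrative.

```python
import numpy as np


def add_edit(keys, radii, values, new_key, new_value, init_eps):
    """Insert one edit, resolving a conflict with the nearest existing key.

    Mutates the three lists in place and returns the action taken.
    """
    new_key = np.asarray(new_key, dtype=float)
    if keys:
        dists = [float(np.linalg.norm(new_key - k)) for k in keys]
        j = int(np.argmin(dists))
        if dists[j] <= radii[j] + init_eps:          # the two balls would overlap
            if values[j] == new_value:
                # Same value: grow the existing ball to cover the new input.
                radii[j] = max(radii[j], dists[j])
                return "expand"
            # Different values: shrink so the balls just touch (assumed: half each).
            radii[j] = dists[j] / 2
            keys.append(new_key)
            radii.append(dists[j] / 2)
            values.append(new_value)
            return "split"
    keys.append(new_key)
    radii.append(init_eps)
    values.append(new_value)
    return "add"


keys, radii, values = [], [], []
actions = [
    add_edit(keys, radii, values, [0.0, 0.0], "A", init_eps=1.0),
    add_edit(keys, radii, values, [1.5, 0.0], "A", init_eps=1.0),   # same value nearby
    add_edit(keys, radii, values, [2.0, 0.0], "B", init_eps=1.0),   # different value nearby
    add_edit(keys, radii, values, [10.0, 0.0], "C", init_eps=1.0),  # far from everything
]
```

Four edits end up stored under three keys in this example, which is the compression effect that a larger ε is meant to encourage.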

To compare GRACE’s capability with existing methods to make generalizable edits, two bidirectional models (T5 and BERT) and one autoregressive model (GPT2-XL) were used. For question-answering (QA), T5 was used along with a QA dataset that includes questions targeted for relation extraction. Twenty rephrased versions of each question were extracted; 10 were used during editing and the other 10 served as unseen holdouts. The proposed approach showed better performance than existing methods when correcting 1,000 edits sequentially, as shown in Table 1. It used only 137 keys to make the edits, which shows the efficiency of the proposed method. This level of generalization is better than prior work and shows promising potential for correcting future mistakes. The proposed approach can also successfully edit a BERT model that was trained on U.S. Supreme Court documents from before 1992 and tested on documents after 1992, for which the label distribution shifted.

An experiment was also conducted using GRACE with an autoregressive model, GPT2-XL, to edit mistakes related to fabrication, with promising results over long sequences of edits. For example, when asked to generate a biography of Brian Hughes, GRACE successfully encouraged GPT2-XL to respond: “Brian Hughes (born 1955) is a Canadian guitarist whose work draws from both the smooth jazz and world music genres,” which exactly matches the requested biography using only one cached value.

Another interesting observation was that GRACE edits were robust to the choice of edited layer, though later layers were harder to edit. Further, a clear balance was observed between memorization and generalization when choosing ε, as shown in Figure 2. Finally, a key feature of GRACE is that the codebook is detached from the pre-trained model, leaving its weights untouched. This makes it possible to undo any edit at any time and to inspect the behavior of the edits without high computational costs.

A figure containing eight subfigures displayed as two rows and four columns. Each row represents a value of epsilon, the hyperparameter in our proposed method that controls generalization. The first row shows an epsilon of 0.1, the second row shows an epsilon of 3.0. Each column shows a line graph for a different metric. Each line shows how the metric changes throughout 3,000 sequential edits to a T5 QA model using the zsRE dataset. Each plot contains four lines; each line is for editing a different T5 block. We compare edits made to blocks 0, 2, 4, and 6. Starting with the left column, we consider the TRR metric, which measures model accuracy on its original testing data after editing. For epsilon of 0.1, the TRR metric remains at 0.72 the entire time, with no difference per block. For epsilon of 3.0, the TRR metric remains at 0.72 only for Block 6 and is lowest for Block 0, dropping to below 0.7 by the end of editing. The second column shows the ERR metric, which is accuracy on previous edits at each step. Here we see that for epsilon of 0.1, Blocks 2, 4, and 6 remain high at nearly 1.0. For epsilon of 3.0, Block 6 remains high, while the other blocks drop to around 0.9. The third column shows Holdout performance on unseen holdout edits, which are rephrasings of seen edits. After each edit, we run all holdout edits through the edited model and record its accuracy on the whole set. Therefore, in both plots, we see the performance increase over time, as the edits slowly cover more rephrasings of the holdout set. This way, we measure GRACE’s generalization. We see that for epsilon of 0.1, Block 6 generalizes slightly better than other blocks. But for epsilon of 3.0, Block 6 underperforms the other blocks significantly. Block 0 is slightly better and Blocks 2 and 4 are much better. In the final column, we report the number of keys used by GRACE to make all 3,000 edits. Here we see that Block 6 simply memorizes all edits, as its number of keys grows linearly. After 3,000 edits, there are 3,000 keys. But for Blocks 0, 2, and 4, this value saturates, with edits being made with far fewer keys. When epsilon is 0.1, these blocks use about 2,000 keys. When epsilon is 3.0, Block 0 uses about 1,000 keys while Blocks 2 and 4 use around 800 keys. This demonstrates how picking the block and epsilon can impact the trade-off between memorization and generalization. Overall, it appears that generalizable edits happen in interior model layers as opposed to the first or last layers and for slightly larger choices of epsilon.
Figure 2. GRACE’s performance when editing different blocks of a T5 model for different choices of epsilon. This choice drives a balance between accuracy on unrelated training data (TRR) and previous edits (ERR), as shown by a small epsilon (a) and a big epsilon (b).


GRACE presents a different perspective for model editing, where representations are directly modified and transformations are cached sequentially. Edits can be made thousands of times in sequence while only a small set of codebooks is maintained throughout the editing. This narrows the gap to the deployment needs of real-world applications, where edits are discovered over time and should be addressed in a cost-effective manner. By correcting behaviors efficiently and expanding sequential editing to other model properties, like fairness and privacy, this work can potentially enable a new class of solutions for adapting LLMs to meet user needs over long deployment lifetimes.
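
To make the memorization-versus-generalization trade-off described for Figure 2 more concrete, here is a minimal toy sketch in Python. It is not the GRACE implementation: the `ToyCodebook` class, its fixed-radius rule, and the sample edits are all assumptions made for illustration, and the published method may handle radius updates and conflicting edits differently. The sketch only shows why a larger epsilon can let one cached key cover several nearby edits, so that fewer keys are stored.

```python
import math


class ToyCodebook:
    """Hypothetical fixed-radius key-value cache (illustrative, not GRACE itself)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon  # radius within which a key answers a query
        self.keys = []          # cached representations (tuples of floats)
        self.values = []        # the corrected output stored for each key

    def _nearest(self, h):
        # Return (index, distance) of the closest key, or (None, inf) if empty.
        if not self.keys:
            return None, math.inf
        dists = [math.dist(h, k) for k in self.keys]
        i = min(range(len(dists)), key=dists.__getitem__)
        return i, dists[i]

    def edit(self, h, value):
        # Skip the edit if a nearby key already stores the same value;
        # otherwise memorize it under a new key.
        i, d = self._nearest(h)
        if i is not None and d <= self.epsilon and self.values[i] == value:
            return
        self.keys.append(tuple(h))
        self.values.append(value)

    def lookup(self, h, default=None):
        # Use the cached value only when the query falls inside a key's radius;
        # otherwise fall back to the unedited behavior (here, `default`).
        i, d = self._nearest(h)
        if i is not None and d <= self.epsilon:
            return self.values[i]
        return default


def count_keys(epsilon, edits):
    """Apply edits in sequence and report how many keys were stored."""
    book = ToyCodebook(epsilon)
    for h, value in edits:
        book.edit(h, value)
    return len(book.keys)


# Three nearby edits that share a label, plus one distant edit (made-up data).
edits = [((0.0, 0.0), "A"), ((0.5, 0.0), "A"), ((1.0, 0.0), "A"), ((5.0, 5.0), "B")]
keys_small_eps = count_keys(0.1, edits)  # every edit gets its own key
keys_large_eps = count_keys(3.0, edits)  # nearby edits share one key
```

In this toy setting, the small radius stores one key per edit, while the large radius lets the three nearby edits share a single key. That mirrors, in spirit only, the key-count saturation the figure description reports for larger epsilon.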

The post Lifelong model editing in large language models: Balancing low-cost targeted edits and catastrophic forgetting appeared first on Microsoft Research.

Abstracts: November 20, 2023

Microsoft Research Podcast: Abstracts, November 20, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Shrey Jain, a Technical Project Manager at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows, discuss their work on contextual confidence, which presents a framework to understand and more meaningfully address the increasingly sophisticated challenges generative AI poses to communication.



GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  


Today I’m talking to Shrey Jain, an applied scientist at Microsoft Research, and Dr. Zoë Hitzig, a junior fellow at the Harvard Society of Fellows. Shrey and Zoë are coauthors of a paper called Contextual Confidence and Generative AI, and you can read a preprint of this paper now on arXiv. Shrey Jain, Zoë Hitzig. Thanks for joining us on Abstracts.

SHREY JAIN: Thank you.

ZOË HITZIG: Great to be here. 

HUIZINGA: Shrey, let’s start out with you. What problem does this research address, what made you care about it, and why should we care about it, too? 

JAIN: Yeah, so right now, there’s a lot of discussion as towards what the impacts of generative AI is on communication, and there’s been a lot of different terms being thrown around amongst AI policy researchers or news organizations, such as disinformation, misinformation, copyright, fair use, social engineering, deception, persuasion, and it makes it really hard to understand the precise new problem that this new technology, generative AI, brings towards our understanding of how we communicate with one another. And so what we wanted to do in this research is try to present a framework to sort of guide both policymakers, AI labs, and other people working in this field to have a better understanding of the challenges that generative AI presents and accordingly be able to meet those challenges with a set of strategies that are precise to the way we understand it and also try to uncover new strategies that might remain hidden in other frameworks that are traditionally being used to address these challenges. 

HUIZINGA: So expand on that a little bit in terms of, you know, what made you care about it? What was the prompt—no pun intended—for generative AI that got you concerned about this? And what kinds of things ought we to be thinking about in terms of why we should care about it, too? 

JAIN: Yeah, there’s a lot of different areas under which generative AI presents new challenges to our ability to communicate, one of which was literally the ability to communicate with close family members. I think we’ve seen a lot of these deception attacks kind of happening on the elderly, who have been susceptible to these attacks pre-generative AI in the past, and only thought that that might become more concerning. I no longer live in a city where my family lives, and so the only way to communicate with them is through a digital form now, and if we don’t have confidence in that interaction, I’m scared of the repercussions that has more broadly. And, you know, being at Microsoft Research, having worked on initiatives related to election integrity, was also starting to think through the impacts that this could have at a much wider scale. And so that’s kind of what prompted us to start thinking through how we can meet that challenge and try to make a contribution to mitigate that risk. 

HUIZINGA: Zoë, almost all research builds on existing foundations, so what body of work does your research draw from, and how does this paper add to the literature?

HITZIG: I’d say this research paper draws on a few different strands of literature. First, there has been a lot of social theorizing and philosophizing about what exactly constitutes privacy, for example, in the digital age. And in particular, there’s a theory of privacy that we find very compelling and we draw a lot from in the paper, which is a theory called contextual integrity, which was put forward by Helen Nissenbaum, a researcher at Cornell Tech. And what contextual integrity says is that rather than viewing privacy as a problem that’s fundamentally about control over one’s personal information or a problem about secrecy, contextual integrity says that an information flow is private when it respects the norms that have been laid down by the sender and the receiver. And so there’s a violation of privacy, according to Nissenbaum’s theory, when there’s a violation of contextual integrity. So we really take this idea from Nissenbaum and extend it to think about situations that, first of all, didn’t come up before because they’re unusual and generative AI poses new kinds of challenges. But second of all, we extend Nissenbaum’s theory into thinking not just about privacy but also authenticity. So what is authenticity? Well, in some sense, we say it’s a violation of a norm of truthfulness. What we really add to this theorizing on privacy is that we offer a perspective that shows that privacy questions and questions about authenticity or authentication can’t really be separated. And so on the theory side, we are extending the work of media scholars and internet scholars like Helen Nissenbaum but also like danah boyd and Nancy Baym, who are Microsoft Researchers, as well, to say, look, privacy and authenticity online can no longer be separated. We have to see them as two sides of the same coin. 
They’re both fundamentally about contextual confidence, the confidence we have in our ability to identify the context of a communication and to protect the context of that communication. So that’s sort of the theory side. And then, of course, our other big contribution is all the practical stuff that takes up the bulk of the paper. 

HUIZINGA: Right. Shrey, let’s talk about methodology for a minute. And this is a unique paper in terms of methodology. How would you describe your research approach for this work, and where does it fit on the spectrum of methodology for research? 

JAIN: Yeah, this paper is definitely a bit different from the conventional empirical research that might be done in the space. But it’s more of a policy or, I guess, framework paper where we try to provide both, as Zoë just commented on, the theory for contextual confidence but then also try to illustrate how we might apply contextual confidence as a framework to the existing challenges that generative AI presents. And so in order to make this framework and the theory that we present useful, we wanted to try to understand both what are the set of challenges that fall into these categories of identifying context and protecting context. So, specifically, how does generative AI threaten our ability to identify and protect? And trying to take a bird’s eye view in understanding those challenges. And then also kind of doing what might look similar to like a literature review but different in a way that we collect all of the different strategies that are typically talked about in the conversation but then in using contextual confidence as a framework realizing that new strategies that aren’t as well discussed in the conversation might be useful to meet these different challenges. And so from a methodology perspective, it’s almost like we’re applying the theory to uncover new … both new strategies that might be useful in this moment and then finding ways to give concrete examples of us applying that framework to existing technological questions that both people in the industry, as well as in policy, are thinking through when it comes to these questions about generative AI.

HUIZINGA: Zoë, for me, the most interesting part of research papers is that little part that comes after the phrase “and what we found was …” So, um, how would you describe what your takeaways were here, and how did you present them in the paper? 

HITZIG: That’s a great question. That’s also my favorite question to ask myself when I’ve completed a project. I think the biggest thing that I learned through writing this paper and collaborating with Shrey was really, for the first time, I forced myself to interrogate the foundations of effective communication and to understand what it is that we rely on when, you know, we pass a stranger on the street and look at them in a certain way and somehow know what it means. Or what we rely on to understand, you know, how our partner is feeling when they speak to us over coffee in the morning. I was really forced to step back and think about the foundations of effective communication. And in doing so, what we realized was that an ability to both identify and protect context is what allows us to communicate effectively. And in some sense, this very basic fact made me see how sort of shockingly robust our communication systems have been in the past and yet at the same time how fragile they could be in the face of this alarming new technology that has the power to fundamentally upset these two foundational processes of identifying and protecting context in communication. I would also say, on the question of what we found, you know, my first answer was about these sort of fundamental insights that had never occurred to me before about what makes communication effective and how it’s threatened. But also, I was able to understand and sort of make sense of so many of the strategies and tools that are in discussion today. And, for example, I was able to see, in a totally new light, the importance of, for example, something as simple as having some form of digital identification or the simplicity of, you know, what makes a good password and what can we do to strengthen passwords in the future. 
So there was this strong theoretical insight, but also that theoretical insight was enormously powerful in helping us organize the very concrete discussions around particular tools and technologies. 

HUIZINGA: Hmm. It’s a beautiful segue into the question I have for Shrey, which is talking about the real-world impact of this work. You know, coming down to the practical side from the theoretical, who does this work help and how? 

JAIN: Yeah, I want to also add a disclaimer in that, in this podcast, we kind of present generative AI almost as this like villain to communication. [LAUGHTER] I think that there’s also a possibility that generative AI improves communication, and I want to make sure that we acknowledge the optimism that we do see here. I think part of the real-world impact is that we want to mitigate the cost that generative AI brings to communications without hurting the utility at the same time. When applying contextual confidence in contrast to, say, views of traditional privacy, which may view privacy in terms of secrecy or information integrity, we hopefully will find a way in ensuring that the utility of these models is not significantly lost. And so in terms of the real-world impact, I think when it comes to both policies that are being set right now, norms around how we interact with these models, or any startup founder or person who’s deploying these tools, when they think about the reviews that they’re doing from a privacy point of view or a compliance point of view, we hope that contextual confidence can guide, as a framework, a way that protects users of these tools along with not hindering model capabilities in that form. 

HUIZINGA: Zoë, if there was one takeaway that you want our listeners to get from this work on contextual confidence, what would it be?

HITZIG: What I hope that readers will take away is, on the one hand, the key conceptual insight of the paper, which is that in today’s digital communication and in the face of generative AI, privacy questions and authenticity questions cannot be separated. And in addition, I hope that we’ve communicated the full force of that insight and shown how this framework can be useful in evaluating the deployment of new tools and new technologies. 

HUIZINGA: Finally, Shrey, what outstanding questions or challenges remain here, and how do you hope to help answer them? 

JAIN: In the paper, we have presented a theoretical understanding of contextual confidence and present various different strategies that might be able to help meet the challenges that generative AI presents to our ability to both identify and protect context, but we don’t know how those strategies themselves may or may not undermine the goals that we’re presenting because we haven’t done empirical research to know how a given strategy might work across different types of people. In fact, the strategies could undermine the initial goals that we intend. A verification stamp for some might enhance credibility, but for those who may not trust the institution verifying, it may actually reduce credibility. And I think there’s a lot of empirical research both on the tool development, usability, and then back to guiding the theoretical framework that we present that we want to continue to refine and work on as this framework hopefully becomes more widely used. 

HUIZINGA: Well, Shrey Jain, Zoë Hitzig, thank you for joining us today, and to our listeners, thanks for tuning in.  


If you’re interested in learning more about contextual confidence and generative AI, you can find a link to the preprint of this paper at, or you can read it on arXiv. See you next time on Abstracts.


The post Abstracts: November 20, 2023 appeared first on Microsoft Research.

What’s Your Story: Desney Tan

MSR Podcast

In this new Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today.

Across his time at Microsoft, Desney Tan, Managing Director of Microsoft Research Redmond, has had the experience of shepherding research ideas into products multiple times, and much like the trajectory of research, his life journey has been far from linear. In this episode, Tan shares how he moved to the United States from Singapore as a teenager, how his self-described “brashness” as a Microsoft intern helped shift the course of his career, and how human impact has been a guiding force in his work.

photos of Desney Tan throughout his life



DESNEY TAN: Early in the career, I always looked at successful people and it always felt like they had a goal, and it was a very nice straight line to get there, and they did all the right things, and I don’t know anyone today that I deem to be successful that had a straight-line path and did all the right things.


JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.


In this episode, I’m talking with Desney Tan, a longtime Microsoft executive whose experience with the company spans computational neuroscience, human-computer interaction, and health and the life sciences. His research contributions have impacted a wide range of Microsoft products. Desney was previously Vice President and Managing Director of Microsoft Health Futures and is now Managing Director of Microsoft Research Redmond.

Much like the trajectory of research, Desney’s life journey has been far from linear. He left Singapore to attend school in the United States as a teenager, then worked in autonomous navigation for NASA and in VR for Disney before landing here at Microsoft. Here’s my conversation with Desney, beginning with his childhood.

DESNEY TAN: Born and raised in Singapore. Dad was an architect. Mom did everything, um, to run the family. When I turned 13, Mom and Dad came to me and they said, “Hey, would you like to try something new?” I said sure. You know, I had no idea what they, they were thinking. Two weeks later, they sent me to the US to study. Um, looking back, sometimes I flippantly claim I was just eating too much at home and so they had to send me away. [LAUGHTER] But actually, it was, you know, I think it was prescient on their part. They sort of looked at my path. They looked at the education system. They looked at the way I learned and the way I created and the way I, I acted, and they somewhat realized, I think, very early on that the US was a great … would, would be a great place for me to sort of flourish and, and sort of experiment and explore and, and grow.

GEHRKE: And so how did it work? You just went by yourself?

TAN: So I had an aunt and an uncle in Louisiana. Spent a couple of years in high school there. Um, sort of … fun, fun side story. They looked at … the high school looked at my math curriculum in Singapore, and they said, “Oh, he’s at least a year ahead.” So they skipped me a year ahead. And then through some weird miscalculations on their part, they actually ended up skipping me nearly two years ahead.

GEHRKE: Oh, wow.

TAN: And by the time we realized, I had already integrated into school, the courses were just fine, and so I ended up skipping a lot of years.

GEHRKE: So you ended up graduating then high school what …

TAN: Pretty early. I was 15.

GEHRKE: 15 …

TAN: Graduated from high school. Got to college. Had no idea what I wanted to do. What 15-year-old does? Um, ended up in liberal arts college, so University of Notre Dame. So, so I don’t know how Mom let me do this, but, you know, I got all my acceptance letters together. I said I don’t know anything about college. I don’t know where I want to go. I don’t know what I want to do. I’m going to toss all the letters up in the air, and the one that lands on top is the school I’m going to.

GEHRKE: And that’s what you did?

TAN: Yeah, that’s exactly what I did. Um, divine intervention, let’s call it. Notre Dame landed on top. You know, switched majors a bunch of times. I started off in aerospace, did chemical engineering, civil engineering. I was on the steps of becoming a priest until they sent me away. They said, “Hey, if it’s not a mission and a calling, go away and come back later.” And ended up with a computer engineering degree. You know, I had great mentors, you know, who looked out for me. I had a couple of guardian angels out there, you know, guided me along, and that, you know, that was just a wonderful breadth of education. Went back to the military for a couple of years. Uh, served there for a couple of years. Did a bunch of growing up.

GEHRKE: That, that’s quite a change, right, from being like in college and then going back to the military.

TAN: Yeah, yeah, it was a mandatory service in Singapore, and so I went back. Had a ton of fun. Learned a bunch of stuff about the world, about myself. I claim the military is one of the few organizations in the world that takes an 18-year-old and teaches them leadership, um, and teaches them about themselves and teaches them about how to push themselves and where the boundaries are. And so fairly accidentally, I, I got to benefit from all of that. At the end of that, I realized my computer engineering degree was, you know … I realized two things. One, my computer engineering degree was a little outdated by the time I got out of the military, and two, that I didn’t love being told what to do. [LAUGHS]


TAN: So I came back. Uh, did grad school. I was at Carnegie Mellon. Ended up getting hooked up with a wonderful professor, Randy Pausch.

GEHRKE: “The Last Lecture,” right?

TAN: Who gave “The Last Lecture” in his last days. You know, learned a ton from him not only about academics and scholarship, but also about life and, um, and leadership.

GEHRKE: And so he was at the intersection of graphics and HCI, right, if I remember correctly?

TAN: That’s correct, yeah.

GEHRKE: So what is your PhD in?

TAN: My PhD was actually looking at, um, distributed displays in virtual reality. So how, how the human brain absorbs information and uses the world around us to be able to, um, interact with digital data and analog data.

GEHRKE: Early on in a really important field already.

TAN: Yeah, no, it was great. Spent a couple of years with NASA in the Jet Propulsion Lab doing autonomous navigation. This was the early days of, um, you know, AI and, and planning.

GEHRKE: So those aerospace engineering classes, were they actually useful?

TAN: They, you know, all the classes I took ended up coming back to be useful in a number of ways. And actually, um, you know, the diversity of viewpoints and the diversity of perspectives is something that sat very deeply in me. So anyways, you know, spent some time at NASA. Um, spent some time at Disney with the Imagineers building virtual reality theme parks. This was the late ’90s, early 2000s. So Disney at the time had all the destination theme parks: Disneyland, Disney World, places you would fly for a week and, and spend a week at. Their goal was really to build a theme park in a box that they could drop down into the urban centers, and the only way to get a theme park into a building was digital experiences. And so this was the very early days of VR. We were using, you know, million-dollar military-grade headsets. They were, you know, 18, 19 pounds.


TAN: Disney was one of the companies—and, you know, it’s sat with me for a very long time—that designs experiences for every single person on earth, right. So these headsets had to work on your 2-year-old. They had to work on your 102-year-old. They had to work on, you know, a person who spoke English, who read, who didn’t read, who didn’t speak English. You know, tall, short, large, small, all of it. And they did a wonderful job finding the core of what it means to be human and designing compelling experiences for all of us, um, and that was a ton of fun. We ended up deploying these facilities called DisneyQuest. There was one in Chicago; one in Orlando. They just closed them down a couple of years ago because actually all the VR rides have now migrated into the theme parks themselves.

GEHRKE: And it was actually a VR experience? You would go and sit …

TAN: It was a VR experience. They dropped them down. They had basically buildings. There were, you know, floors full of classic and new-age arcade games. And then there were VR experiences that you could run around in and, um, interact with.

GEHRKE: Interesting. I’ve never, I mean, I lived in Madison for four years, but I’ve never heard of that Quest experience. It seems to be a fun way to experience Disney … by not going to any of the, the theme parks.

TAN: It was super fun. Um, yeah, we, I personally got to work on a couple of rides. There was Pirates of the Caribbean.

GEHRKE: Oh, wow.

TAN: So you put on … a family would put on headsets and kind of run around, shooting pirates and what have you. And then the Aladdin ride was I thought one of the better rides.

GEHRKE: Oh, wow, yeah …

TAN: Where you sit on a magic carpet as you can imagine.

GEHRKE: Oh yeah. That sounds fun.

TAN: It was perfectly scripted for it. Um, anyways, ended up at Microsoft largely because entertainment technology while a lot of fun and while I learned a ton was, uh, strangely unsatisfying, and there was something in me and, you know, that was seeking human impact at scale in a much deeper and much more direct way. And so I thought I’d be here for three or four years largely to learn about the tech industry and how, you know, large pieces of software were deployed before going off and doing the impact work. And I’ve now been here for nearly 20 years.

GEHRKE: And where do you start? Did you start out right away at Microsoft Research, or were you first in a product group?

TAN: My career here has been a cycle of starting in Microsoft Research, incubating, failing, trying again. Failing again. You know, at some point, screaming “Eureka!” [LAUGHTER] and then doing my tours of duty through the product groups, commercializing … productizing, commercializing, you know, seeing it to at least robustness and sustainability if not impact and then coming back and doing it again. Um, and the thing that’s kept me here for so long is every time I’ve completed one of those cycles and thought I was done here, um, the company or the world in some cases would throw, you know, a bigger, thornier, juicier thing in front of me, and Microsoft has always been extremely encouraging, um, and supportive of, you know, taking on those challenges and really innovating and opening up all new, whole new opportunities.

GEHRKE: I mean this whole cycle that you’re talking about, right, of sort of starting out small at MSR (Microsoft Research), you know, having sort of the seed of an, of an idea and then growing it to a bigger project and at some point in time transitioning, transitioning it into, into the product group and actually really making it a business. So tell me about … you said you have done this, you know, a few times and, you know, once you were even highly successful. I’d love to learn more about this because I think it’s so inspiring for everybody to learn more about this.

TAN: Yeah. No, it’s been magical. I have to say before going into any of these stories that none of these paths were architected. As, as you well know, they never are. So actually my, my first experience was as an intern here, and, you know, I was a sort of brash, perhaps rash, intern. I was working on virtual reality, and in the evenings, I would meet with folks around the company to learn more, and I met with a team that was building out multi-monitor functionality in Windows NT. Prior to Windows NT, Windows computers had one and only one monitor, and they started to build the functionality to build multiple. As the brash grad student, you know, I, I had different thoughts about how this should be implemented and, you know, couldn’t convince anyone of it. And so in the evenings, I ended up starting just to build it. At the end of the internship, in addition to all the stuff I was doing, I said, “Hey, by the way, I’ve built this thing. You know, take it or leave it. Here you go.” And it ended up being the thing that was implemented in NT for a variety of reasons. That really got me hooked. Prior to that, I had imagined myself an academic, going back and, you know, being a professor somewhere in academia. And as soon as I saw, you know, the thing I did and that, you know, Microsoft actually polished up and made good in the real world…

GEHRKE: And shipping in millions and millions of desktops, right?

TAN: That’s right. There was no getting away from that.

GEHRKE: OK, right.

TAN: When I first got here, MSR had actually hired me thinking I’d work on virtual reality. And I got here and I said, hey, VR … I’ve just done a ton of VR. VR is probably 15 or 20 years out from being democratized and consumerized. I’m going to do something for a couple of years, and then I’ll come back to this. Um, so I got into computational neuroscience, looking at, um, sensors that scanned or sensed the brain and trying to figure out mental state of people. I had the imagination that this would be useful both for direct interaction but also for understanding human behavior and human actions a little bit better. We won’t go into that work, but, um, what happened with the productization of that was I went … this was at the time when Bill Gates was actually pushing very hard on tablet PCs and the stylus and the pen as an interesting input modality. The realization we had was, hey, we’ve got spatial temporal signal coming off the brain we’re trying to make sense of; the tablet guys had spatial temporal signal coming off a pen they were trying to make sense of in handwriting recognition. And so we went over and we said, hey, what interesting technological assets do you have that we can steal and use on the brain. Turns out they were more convincing than us. And, and so they said, hey, actually you’re right. The problems do look similar. What do you have that you could bring over? And so if you look at the handwriting recognition system even that stands today, it’s a big mess of a neural network, um, largely because that came out of interpreting neural signal that got transferred into the handwriting recognizer.

GEHRKE: I see.

TAN: And so I ended up spending two, maybe 2 1/2 years, working not only on the core recognition engine itself but also the entire interface that ran around the tablet PC and, you know, the tablet input panel.

GEHRKE: But that’s sort of an interesting realization, right. You came because you thought you would land Technology X for Application Y, but actually you land it for a very different application.

TAN: That’s right. And, and each cycle has had a little bit of that surprise and that serendipity, which we’ve now built into the way we do research. And, um, you sort of head down a path because it moves you forward as quickly as possible. But you keep your eyes peeled for the serendipitous detours and the, the discovery that comes out of that. Um, and I think that’s what makes Microsoft Research as an organization, um, so compelling and, and so productive, right, as … we, we do run very fast, but we have the freedoms and, you know, the flexibility really to take these windy paths and to take these detours and, and to go flip over, you know, rocks, some of which end up being, you know, dead ends.

GEHRKE: Right.

TAN: Others of which end up being extremely productive.

GEHRKE: Right. And so if you think about, let’s say, a junior person in the lab, right. They’re sort of looking at you and your career and saying, “Wow, what steps should I take to, you know, become as successful as Desney?” What, what advice would you give them, right? Because it seems like you have always had sort of MSR as sort of your rock, right. But then you jumped over the river a few times, but then came back and jumped over again. Came back.

TAN: First off, I, I don’t know that Desney has been so successful so much as, you know, the people around Desney have been extremely successful and Desney’s gotten to ride the wave. But, yeah, no, I mean every, everyone’s … you know, as I look around the table and the board, you know, everyone has a slightly different journey, and everyone has slightly different work styles and mindsets and personalities and risk tolerance and what have you. Um, so the first thing really is, is not to try to fully emulate anyone else. I always claim we’re, we’re kind of like machine learning models, right. We, we should be taking input data, positive and negative, and building our models of ourselves and our models of the world and really operating within that context. I think having a North Star, whether it’s implicit or explicit, has been extremely useful for myself and the people around me.

GEHRKE: By North Star, you mean like a philosophical North Star or technical North Star or North Star in what you want to be? What, what do you mean?

TAN: Yes, yes, yes. All of it.

GEHRKE: So tell me more about your personal North Star.

TAN: For, for, for us … for myself, it’s really been about human impact, right. Everything we do is centered on human impact. We do research because it’s part … it’s, it’s one of the steps towards achieving human impact. We productize because it’s one of the steps towards human impact. Our jobs are not ever done until we hit the point of human impact, and then they’re not quite done because there’s always more to be had. Um, so I think having that, you know, perhaps a value system, um, at least, you know, sort of grounds you really nicely and, and creates, I think, or can create a courage and a bravery to pursue, which I think is important. You know, different people do this differently, but I have been very lucky in my career to be surrounded by people that have been way, way, way better than myself, um, and, and extremely generous of their passions and their skills and their expertise and their time. You know, ask it and just about any successful person by whatever definition and I think they’ll tell you the same thing, that it’s the people around. And then being tolerant, maybe even seeking of, this windy path. You know, when I was early in the career, I always looked at successful people, or people I deemed to be successful, and it always felt like they had a goal, and it was a very nice straight line to get there, and they did all the right things and, and took all the right steps, and, um, and I don’t know anyone today that I deem to be successful that had a straight-line path and did all the right things.

GEHRKE: Yeah, and it’s often these setbacks in, you know, one’s career that actually give you often some of the best learnings because either of some things that you’ve sort of done structurally wrong or some things that, you know, you really need more experience and, and, you know, that setback gave you that experience. So, so one other question around this is also just around change, right. Because especially right now, we’re living in this time where maybe the rate of change especially in AI is kind of unprecedented. I mean, benchmarks are falling in like a quarter of the time than they would have thought to be lasting. You know, we all have played with ChatGPT. Just extrapolate that out a few more months, if not years, right. OpenAI is here talking about AGI. So how do you think about change for yourself and evolution and learning, and do you have any, any routines? How, how do you keep up with everything that’s going on?

TAN: Yeah, it’s, uh … good question. I guess the overarching philosophy, the approach that I’ve taken with my career, is that everything’s constantly in change. You know, the rate of change may vary, and the type of change and the, the mode of change might vary, but everything’s constantly changing, and so our jobs at any given point are to understand the context in the world, in the organization, with the people around you, and really be doing the best that you can at any given moment. And as that context changes, you kind of have to dynamically morph with it. I subscribe pretty fully to the Lean Startup model. So, you know, formulate hypotheses … and this is the research process really, right. Formulate hypotheses, test them as quickly as you can, learn from that, and then do it again, and rinse and repeat. And then … and, you know, you could sort of plot your path and steer your path through based on that. Um, and so we operate very much on that. As, as the world changes, we change. As, you know, the org changes, we change. And there’s a certain robustness that comes along with that. It’s not all roses, and obviously change is and uncertainty is, is a difficult context to operate in.

GEHRKE: And super interesting because it also speaks to some of the things that one should, um, sort of look out for when doing research, right. If you’re saying, well, I have these hypotheses and I want to quickly test them, right, if I’m in a field or if I work with data that I, you know, cannot really use, where the testing of an hypothesis will take months if not years to bring out, this might not be the best research direction. So how should I think about sort of research, the choice of research problems …

TAN: It’s a good question, yeah.

GEHRKE: … sort of with this, with this change in mind, right?

TAN: Yeah, yeah. Um, I don’t know. I, I’m, I guess … again, I’m brash on this. There are, there are very few problems and spaces that can’t be navigated, um, and so things that seem impossible at first glance are often navigable, you know, with a little bit or maybe sometimes a lot of creativity. Um, you know, if our jobs are to take Microsoft and the rest of the world to places that Microsoft and the rest of the world might not get itself to—hopefully positive places—then we’re going to have to do things in a way that is probably unnatural for Microsoft and the rest of the world, um, to get there. And the company and the organization, MSR, has been extremely supportive of that level of creativity.

GEHRKE: Can you give an example of that for …?

TAN: We had Cortana, which is our speech recognition and conversational engine. We didn’t really have a platform to deploy that on. At the same time, we saw a bunch of physicians, clinicians, struggling with burnout because they were seeing patients for less than half the time. They were spending more of their time sitting in front of the computer, documenting stuff, than they were seeing patients and treating patients. We said, hey, what if you put the two together? What if you sat in the room, listened to the doctor and the patient, and started to automatically generate the documentation? And in fact, if you did that, you could structure the data, which leads for better downstream analytics. Um, and if you did that, you could start to put machine learning and AI and smarts into the system, as well. That project, which was called EmpowerMD, led eventually—after a bunch of missteps and a bunch of learnings and a bunch of creativity—to a very deep partnership with Nuance, um, and creation of Dragon Ambient eXperience and the eventual acquisition thereof of that company. And, um, it’s just a wonderful product line. It’s, you know, kind of a neat way to think about data and intelligence and human augmentation and integration into otherwise messy, noisy human processes. Um, but yeah, you know, I think with enough creativity, um, you know, we’ve, we’ve bumped into very, very few brick walls.

GEHRKE: And what I love about the story is that it’s not about a specific technology choice, but it’s more about a really important problem, right.

TAN: That’s right. Yeah. If your problem is right and if your conviction is right about the value of the solution …


TAN: …you build teams around it. You build processes around it. You’re creative in the way you execute. And, um, I’d say more times than not, we end up getting there.

GEHRKE: Yeah, well, I love that insight because it’s often much more valuable to solve an important problem than to land some deep technology on a problem that very few people care about …

TAN: I think that’s right.

GEHRKE: …and it seems like that’s what you have done here.

TAN: Yeah.

GEHRKE: Well, it was really great and inspiring to hear from you, Desney. Thanks so much for the conversation.


TAN: Yeah, thanks for having me, Johannes.

GEHRKE: To learn more about Desney’s work or to see photos of Desney during his winding journey to Microsoft, visit (opens in new tab).

The post What’s Your Story: Desney Tan appeared first on Microsoft Research.


Research Focus: Week of November 8, 2023


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.



HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations

Generating both plausible and accurate full body avatar motion is essential for creating high quality immersive experiences in mixed reality scenarios. Head-mounted devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF—or the six degrees of freedom of movement by a rigid body in a three-dimensional space. Recent approaches have achieved impressive performance in generating full body motion given only head and hands signal. However, all known existing approaches rely on full hand visibility. While this is the case when using motion controllers, for example, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility, owing to the restricted field of view of the HMD.

In a recent paper: HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations, researchers from Microsoft propose HMD-NeMo, the first unified approach that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts full body motion in an online and real-time fashion. At the heart of HMD-NeMo is a spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. The researchers perform extensive analysis of the impact of different components in HMD-NeMo and, through their evaluation, introduce a new state-of-the-art on AMASS, a large database of human motion unifying different optical marker-based motion capture datasets.



Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?

The research field of end-user programming has largely been concerned with helping non-experts learn to code well enough to achieve their own tasks. Generative AI stands to obviate this entirely by allowing users to generate code from naturalistic language prompts.

In a recent essay: Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?, researchers from Microsoft explore the relevance of “traditional” programming languages for non-expert end-user programmers in a world with generative AI. They posit the “generative shift hypothesis”: that generative AI will create qualitative and quantitative expansions in the traditional scope of end-user programming. They outline some reasons that traditional programming languages may still be relevant and useful for end-user programmers, and speculate whether each of these reasons might endure or disappear with further improvements and innovations in generative AI. And finally, they articulate a set of implications for end-user programming research, including the possibility of needing to revisit many well-established core concepts, such as Ko’s learning barriers and Blackwell’s attention investment model.


LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup

On-device deep neural network (DNN) inference, widely used in mobile devices such as smartphones and smartwatches, offers unparalleled intelligent services, but also stresses the limited hardware resources on those devices.

In a recent paper: LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup, researchers at Microsoft propose a system that reduces latency, memory, disk, and power consumption for more efficient DNN inference. LUT-NN learns the typical features for each operator, known as centroids, and precomputes the results for these centroids to save in lookup tables. During inference, the results for the centroids closest to the inputs can be read directly from the table and used as the approximate outputs, without further computation.

LUT-NN integrates two major novel techniques: (1) differentiable centroid learning through backpropagation, which adapts three levels of approximation to minimize the accuracy impact by centroids; (2) table lookup inference execution, which comprehensively considers different levels of parallelism, memory access reduction, and dedicated hardware units for optimal performance.
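The table lookup idea can be illustrated with a small, generic sketch. This is not the LUT-NN implementation: the centroids here are picked by hand rather than learned through backpropagation, and the names `build_tables`, `lookup_matvec`, and `sub_dim` are hypothetical. The sketch assumes the input vector is split into fixed-size sub-vectors, each with its own codebook of centroids, and approximates a matrix-vector product by summing precomputed partial results.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tables(weights, codebooks, sub_dim):
    """Precompute partial outputs for every centroid of every codebook.

    tables[c][k][o] is centroid k of codebook c dotted with the matching
    slice of weight row o.
    """
    tables = []
    for c, centroids in enumerate(codebooks):
        lo, hi = c * sub_dim, (c + 1) * sub_dim
        tables.append([[dot(cent, row[lo:hi]) for row in weights]
                       for cent in centroids])
    return tables

def lookup_matvec(x, codebooks, tables, sub_dim):
    """Approximate weights @ x by table lookups instead of multiplications."""
    out = [0.0] * len(tables[0][0])
    for c, centroids in enumerate(codebooks):
        sub = x[c * sub_dim:(c + 1) * sub_dim]
        # Find the nearest centroid for this sub-vector.
        k = min(range(len(centroids)), key=lambda j: sq_dist(sub, centroids[j]))
        # Accumulate its precomputed partial result.
        for o, partial in enumerate(tables[c][k]):
            out[o] += partial
    return out

# Hand-picked toy data (assumptions for illustration only).
WEIGHTS = [[1, 0, 2, 0], [0, 1, 0, 3]]           # 2 outputs, 4 inputs
CODEBOOKS = [[[0, 0], [1, 1]], [[1, 0], [0, 1]]]  # 2 codebooks, 2 centroids each
TABLES = build_tables(WEIGHTS, CODEBOOKS, sub_dim=2)

exact_input = [1, 1, 1, 0]            # lies exactly on centroids
noisy_input = [0.9, 1.1, 0.8, 0.1]    # close to the same centroids
```

When the input lies exactly on centroids, the lookup reproduces the true product; for nearby inputs it returns the same table entries, which is where the approximation error comes from.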

The post Research Focus: Week of November 8, 2023 appeared first on Microsoft Research.


Toward developing faster algorithms for minimizing submodular functions


This research paper was presented at the 64th IEEE Symposium on Foundations of Computer Science (FOCS) 2023, a premier forum for the latest research in theoretical computer science.

FOCS 2023 paper: Toward developing faster algorithms for minimizing submodular functions

Submodular functions are versatile mathematical tools, finding diverse applications in real-world scenarios and guiding solutions across complex domains. From dissecting the intricate networks of graphs to deciphering the complexities of economic landscapes through utility functions, and even navigating the enigmatic world of random variables via entropy functions, they offer valuable insights into challenging problems. Their wide-ranging applicability has made them pivotal tools for modeling and optimization in various theoretical computer science domains, including operations research and game theory. In recent years, submodular functions have gained prominence in solving optimization problems within machine learning (ML) applications. These tasks encompass vital areas such as feature selection and clustering, as illustrated in Figure 1. Additionally, submodular functions are instrumental in applications like sensor placement and graphical models. For further exploration, comprehensive resources are available in Bilmes’ insightful survey and Bach’s standard textbook on this subject.

Two graphics. The left graphic depicts the process of feature selection, beginning with all the features on the top, then the unselected features crossed in the middle, and finally the selected features remain at the bottom. The right graphic shows the process of clustering, where a set of points in 2D are assigned different colors so that points with the same color are physically close to each other to form a cluster.
Figure 1. Application of submodular function optimization to feature selection, on the left, and clustering on the right.

Algorithm design for submodular function minimization

In a joint paper with researchers from Stanford University, “Sparse Submodular Function Minimization,” presented at FOCS 2023, we investigate the problem of minimizing a submodular function in the standard model. Here, we assume that the submodular function can be accessed through an evaluation oracle that returns the value \( f(S) \) in response to a query with a set \( S \). This is the most classical and well-studied model for studying algorithm design for minimizing submodular functions.

Before we discuss our study, it’s important to bear in mind that a submodular function \( f \) is a function defined on subsets of a finite set of elements \( V \) that satisfies a diminishing marginal difference property. That is, for any two subsets \( S \subseteq T \) and any element \( e \in V \setminus T \), the marginal value of \( e \) when added to the smaller set, \( f(S \cup \{e\}) - f(S) \), is at least the marginal value of \( e \) when added to the bigger set, \( f(T \cup \{e\}) - f(T) \).
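To make the diminishing marginal difference property concrete, here is a minimal brute-force check, intended only for tiny ground sets. The helper names (`cut_value`, `is_submodular`) and the example graph are assumptions made for illustration and do not come from the paper. The sketch uses a graph cut function, a standard example of a submodular function, and contrasts it with \( |S|^2 \), which violates the property.

```python
from itertools import combinations

def cut_value(edges, subset):
    """Graph cut function: number of edges with exactly one endpoint in subset."""
    s = set(subset)
    return sum(1 for u, v in edges if (u in s) != (v in s))

def powerset(items):
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground_set):
    """Check f(S | {e}) - f(S) >= f(T | {e}) - f(T) for all S <= T and e not in T."""
    ground = frozenset(ground_set)
    subsets = list(powerset(ground))
    for t in subsets:
        for s in subsets:
            if not s <= t:
                continue
            for e in ground - t:
                gain_small = f(s | {e}) - f(s)
                gain_big = f(t | {e}) - f(t)
                if gain_small < gain_big:
                    return False
    return True

# A 4-cycle on vertices 0..3 (hypothetical example graph).
EDGES = [(0, 1), (1, 2), (2, 3), (0, 3)]
V = range(4)

def f_cut(s):
    return cut_value(EDGES, s)

def f_square(s):
    # Marginal gains grow with |S|, so this function is not submodular.
    return len(s) ** 2
```

The check enumerates every pair of nested subsets, so it evaluates the function exponentially many times. That is fine for a four-element ground set, but it is exactly the kind of cost that query-efficient algorithms are designed to avoid.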

In the 1980s, foundational work revealed that submodular functions could be minimized in polynomial time, marking a significant breakthrough. Since then, researchers have made substantial progress in the quest for faster algorithms for submodular function minimization (SFM). Despite these efforts, fundamental questions persist, such as determining the minimum number of queries required to minimize any given submodular function—a concept referred to as the problem’s query complexity.

Currently, the most advanced algorithm needs to make \( \widetilde{O}(n^2) \) queries for any given submodular function, while the best lower bound is only \( \widetilde{\Omega}(n) \), where \( n \) is the size of the ground set on which the submodular function is defined. This disparity results in a substantial gap, leaving an \( n \)-fold difference between the existing upper and lower bounds.

Given this considerable difference, a natural question arises: What additional structural assumptions could potentially pave the way for faster algorithms in submodular function minimization (SFM)? One prevalent assumption is sparsity, which posits that the size of the set minimizing the submodular function is small. This holds particular relevance in diverse applications, including signal processing, feature selection, and compressed sensing. In these scenarios, solutions are expected to exhibit sparse non-zero entries, making it important to understand how algorithmic complexity depends on sparsity, as it provides insights into the intricate combinatorial and geometric structures of the problems.

Interestingly, existing algorithmic techniques developed over the past four decades for SFM do not yield improved runtimes even when the solution is sparse. Therefore, it is imperative to develop innovative techniques that can drive advancements in sparse SFM and bridge the existing gap between upper and lower bounds.


Parallel algorithms for submodular function minimization

Exploring beyond SFM’s query complexity, recent research has shed light on the importance of sparse SFM, particularly in understanding the inherent adaptivity of parallel algorithms (known as parallel complexity) designed to solve the problem. Research has shown that any parallel algorithm for SFM requires a minimum adaptivity that is a polynomial in the size of the ground set.

Our results improve both parallel and sequential algorithms for SFM. For example, consider a scenario where the minimizer of the given submodular function is \( \widetilde{O}(1) \)-sparse. In this context, our parallel algorithm runs in a nearly constant number of rounds, while our sequential algorithm makes a nearly linear number of queries. This achievement stands in stark contrast with the previous best parallel upper bound of \( \widetilde{O}(n) \) and the best query complexity upper bound of \( \widetilde{O}(n^2) \).

Fast first-order methods for exact submodular function minimization

Current fast algorithms for SFM rely on cutting-plane methods, a standard class of convex optimization techniques applied to the Lovász extension—a natural continuous extension of the given submodular function. However, restricting the optimization domain to sparse solutions doesn’t significantly expedite cutting-plane methods beyond a logarithmic factor. To address this, we shifted our approach and employed first-order methods, including stochastic mirror descent, to minimize the Lovász extension. These methods, non-Euclidean generalizations of stochastic gradient descent, are more attuned to problem geometry. Unlike cutting-plane methods, first-order methods exhibit a polynomial convergence rate, rather than a polylogarithmic dependency on the additive error concerning the optimal solution. 
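For readers unfamiliar with the Lovász extension, the following minimal sketch evaluates it with the standard sorting-based formula and returns the subgradient that first-order methods would use. It is an illustration only, not the algorithm from the paper; the function name `lovasz_extension` and the example cut function are assumptions, and the sketch assumes \( f(\emptyset) = 0 \).

```python
def lovasz_extension(f, x):
    """Return (value, subgradient) of the Lovász extension of f at x.

    f maps a frozenset of indices to a number and is assumed to satisfy
    f(frozenset()) == 0. x is a list of floats, one per ground-set element.
    """
    # Visit elements from the largest coordinate to the smallest.
    order = sorted(range(len(x)), key=lambda i: -x[i])
    subgradient = [0.0] * len(x)
    value = 0.0
    prefix = frozenset()
    prev = f(prefix)
    for i in order:
        prefix = prefix | {i}
        cur = f(prefix)
        # Marginal value of element i given the elements ranked before it.
        subgradient[i] = cur - prev
        value += x[i] * subgradient[i]
        prev = cur
    return value, subgradient

# Cut function of the path 0 - 1 - 2 (hypothetical example).
PATH_EDGES = [(0, 1), (1, 2)]

def path_cut(s):
    return sum(1 for u, v in PATH_EDGES if (u in s) != (v in s))

value_at_indicator, _ = lovasz_extension(path_cut, [1.0, 0.0, 1.0])
value_at_center, grad_at_center = lovasz_extension(path_cut, [0.5, 0.5, 0.5])
```

On the indicator vector of a set, the extension agrees with the original function, which is why minimizing the continuous extension is relevant to minimizing \( f \). Each evaluation in this sketch makes \( n + 1 \) calls to \( f \).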

This rate of convergence indicates that first-order methods are better suited for approximate submodular function minimization, while our goal is to solve it exactly. Using the sparsity assumption, we developed a new algorithmic framework for SFM based on a new concept of duality. We used this framework to demonstrate how first-order methods, with substantially reduced accuracy requirements, can be applied to solve SFM exactly.

Toward faster algorithms for SFM and its applications

These techniques not only promise advancements for sparse SFM but also provide a foundation for tackling other fundamental problems in SFM theory. Our algorithms for sparse SFM serve as valuable starting points for designing improved algorithms for related problems. They offer potential insights into developing polynomial-time algorithms for SFM with lower query and parallel complexity, opening avenues for future research.

Traditionally, research on submodular function minimization has focused on the global properties of the problem over the past four decades. Sparse SFM, in contrast, enables us to explore local and more refined structures of submodular functions. Our work introduces new algorithmic tools that better use these structural properties, a vital aspect for applications in ML and operations research, because these areas often have special structures. Beyond advancing sparse SFM, our paradigm paves the way for the development of enhanced algorithms for SFM and its diverse applications.

The post Toward developing faster algorithms for minimizing submodular functions appeared first on Microsoft Research.


Teachers in India help Microsoft Research design AI tool for creating great classroom content



Teachers are the backbone of any educational system. They are not just educators; they are indispensable navigators, mentors, and leaders. Teachers around the world face many challenges, which vary from country to country or even within a city or town. But some challenges are universal, including time management, classroom organization, and creating effective lesson plans.

Advances in AI present new opportunities to enhance teachers’ abilities and empower students to learn more effectively. That’s the goal of a new project from Microsoft Research, which uses generative AI to help teachers quickly develop personalized learning experiences, design assignments, create hands-on activities, and more, while giving them back hours of time that they spend on daily planning today.

Shiksha copilot is a research project and an interdisciplinary collaboration between Microsoft Research India and teams across Microsoft. Shiksha (Sanskrit: शिक्षा, IAST and ISO: śikṣā) is a Sanskrit word meaning “instruction, lesson, learning, study of skill.” The project aims to improve learning outcomes and empower teachers to create comprehensive, age-appropriate lesson plans combining the best available online resources, including textbooks, videos, classroom activities, and student assessment tools. To help curate these resources, the project team built a copilot—an AI-powered digital assistant—centered around teachers’ specific needs, which were identified right at the start through multiple interviews and workshops.

Working with Sikshana Foundation, a local non-governmental organization focused on improving public education, the researchers are piloting this program at several public schools in and around Bengaluru, India, to build and improve the underlying tools. This post gives an overview of the project, including interviews with three teachers who have used Shiksha copilot in their own classrooms.


A road map for teachers

A lesson plan is like a road map charting what students need to learn and how to efficiently cover the material during class time. It includes three key components:

  • Objectives for student learning, based on grade level and subject
  • Teaching and learning tactics, including tutorials and activities to help students understand the topic
  • Strategies to assess student understanding, both in class and through homework

Parimala H V teaches science in grades 6-8 at Government Higher Primary School, Santhe Beedhi in Bengaluru. She teaches in the local language, Kannada, and in English. For each class she teaches, she spends an hour or more each day scanning textbooks and printed materials to put together an effective lesson plan. She also searches the internet for ideas, but sifting through the growing body of online content could take just as long. Often she would work till midnight planning the next day’s activities, which left her feeling tired and stressed.

“Lesson planning can be a struggle, but it’s very important,” Parimala said. “If the planning goes well, everything goes well.”

With Shiksha copilot, Parimala was able to develop a complete lesson plan in 60 to 90 seconds, instead of 60 to 90 minutes. The simple interface asks basic questions about the curriculum, language of delivery, grade level, and subject. It then compiles engaging learning materials to achieve the teacher’s classroom objectives. Parimala finds better ideas and hands-on activities using Shiksha copilot than through other online tools. She feels well rested and better prepared for her day, which also makes her happier in the classroom. And with the time she saves, she can focus more on coaching her students and improving her teaching practices.

Ms. Parimala standing in front of a school

“I was thrilled to have the opportunity to use Shiksha copilot,” Parimala said. “It could be very useful for new teachers just learning their profession. I think it could revolutionize the way teachers teach.” 

Parimala H V, Teacher, Government Higher Primary School, Santhe Beedhi

At Parimala’s school and others in the Bengaluru area, teachers face some significant challenges. Classrooms can have up to 70 students of varying abilities. Teachers often need to prepare lessons and give instruction in both English and Kannada. As the Covid pandemic brought about remote learning on a large scale, technology began to rapidly change how teachers and students interact. Most students now have computers or smartphones, expanding teachers’ options. But it also makes it harder to keep students focused on a traditional classroom blackboard.

“These children are addicted to their mobile phones and social media. If I use the ‘chalk and talk’ method in class, they may get bored,” said Gireesh K S, who relies heavily on his blackboard to teach math and physics at Government High School, Jalige. Gireesh has used web search tools to find digital resources like interactive PowerPoint slides that will hold his students’ attention longer. With Shiksha copilot, he can zero in more quickly on videos or classroom activities that help him connect better with all 40+ students in his class.

“Here lies the teacher’s job. The teacher has to select whichever activity, whichever video, or whichever questions to use,” Gireesh said. “There are so many questions and videos (to choose from), but as a teacher for my class, I know my students. So, I have to select the suitable ones.”

Other learning platforms were less flexible and less dynamic, returning static content options that were not always useful for a diverse group of learners. Shiksha copilot, on the other hand, does a much better job of customizing and adapting its recommendations based on teacher input, Gireesh said.

“Shiksha copilot is very easy to use when compared to other AI we have tried, because it is mapped with our own syllabus and our own curriculum.”

Gireesh K S, Teacher, Government High School, Jalige

Mr. Gireesh K S posing for the camera

Behind the technology

Designing and building Shiksha copilot requires various technological innovations. Educational content is mainly multimodal, including text, images, tables, videos, charts, and interactive materials. Therefore, for developing engaging learning experiences, it is essential to build generative AI models which have unified multimodal capabilities. Also, these experiences are most impactful when delivered in native languages, which requires improving the multilingual capabilities of generative AI models.

Shiksha copilot includes a range of powerful features that address those challenges and enhance the educational experience. It’s grounded in specific curricula and learning objectives, to ensure that all generated content aligns with desired educational outcomes, according to Akshay Nambi, principal researcher at Microsoft Research. “This grounding is enabled by ingesting relevant data with the help of state-of-the-art optical character recognition (OCR), computer vision (CV) and generative AI models. It was also important to use natural language and support voice-based interactions while including options for English and Kannada speakers,” Nambi said.

Shiksha copilot supports connectivity to both public and private resource content, enabling educators to tap into a vast array of materials and tailor them to their unique teaching requirements. Shiksha copilot can be accessed through different modalities, such as WhatsApp, Telegram, and web applications, enabling seamless integration with teachers’ current workflows.

To help create content more quickly and efficiently, the system leverages semantic caching with LLMs. Storing and reusing previously processed educational content reduces the computational resources required to deliver a scalable and affordable copilot experience. Throughout development, the project team followed established protocols regarding safety, reliability and trustworthiness.
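The general idea behind semantic caching can be sketched in a few lines. This is a toy illustration under stated assumptions, not Shiksha copilot's implementation: the `SemanticCache` class, the `generate` callback, and the similarity threshold are hypothetical, and the bag-of-words `embed` function stands in for the learned text embeddings a real system would likely use.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Reuse a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, generate, threshold=0.8):
        self.generate = generate    # expensive call, e.g. an LLM request
        self.threshold = threshold  # assumed cutoff; would need tuning in practice
        self.entries = []           # list of (embedding, response) pairs
        self.misses = 0

    def get(self, prompt):
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]          # cache hit: skip the expensive call
        self.misses += 1
        response = self.generate(prompt)
        self.entries.append((query, response))
        return response

cache = SemanticCache(generate=lambda p: "plan for: " + p)
first = cache.get("grade 7 science lesson plan on photosynthesis")
second = cache.get("lesson plan on photosynthesis grade 7 science")
third = cache.get("grade 9 math quiz on algebra")
```

In this sketch the second request is served from the cache because it uses the same words as the first, while the third triggers a new generation. A word-overlap measure like this can match prompts that differ in meaning, so the choice of embedding and threshold matters in any real deployment.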

“Extensive prompt designing, testing and rigorous responsible AI procedures, including content filtering and moderation, red team assessments and jailbreaking simulations, have been deployed to maximize safety and reliability. These measures are in place so that Shiksha copilot consistently produces factual and trustworthy content,” said Tanuja Ganu, principal research SDE manager at Microsoft Research.

Convincing the skeptics

Before the initial workshop, some teachers expressed skepticism about using AI for lesson planning. Students already have multiple digital learning tools. But for Mahalakshmi A, who teaches standard science in grades 4-8 at rural Government Higher Primary School, Basavana Halli, outside Bengaluru, the value for teachers was less clear. However, during a two-hour initial workshop session, Mahalakshmi found she could easily create multiple lesson plans using Shiksha copilot that would work well in her classroom.

Ms. Mahalakshmi standing in front of a classroom

“I felt very happy because it’s a totally different concept. Before now, I could see that technology could work for the students. But this is the first time that it felt like the teachers also had a tool for themselves.”

Mahalakshmi A., Teacher, Government Higher Primary School, Basavana Halli

Mahalakshmi could also see how the content assembled using Shiksha copilot would make her class more interesting for her students, which is an important goal. “Instead of giving them the same problems, the same experiments, and the same videos, we make learning interesting. And then they learn what we call shashwatha kalike, or permanent learning. With Shiksha copilot, we can make that permanent learning happen in our classroom,” she added.

Next steps

The initial pilot program for Shiksha copilot is underway at more than 10 schools in and around Bengaluru. The goal is to let the teachers experience how Shiksha copilot can best be used in their daily workflows to improve learning experiences and collect feedback. The early response has been highly positive, with teachers expressing great satisfaction in both the quality of the content generated and the time savings. To build on this successful pilot, researchers are gearing up to scale Shiksha copilot in schools across the state of Karnataka and beyond, in collaboration with Sikshana Foundation.

This copilot is being developed as part of Project VeLLM (Universal Empowerment with Large Language Models) at Microsoft Research India. VeLLM’s goal is to make inclusive and accessible copilots available to everyone by building a platform for developing population-scale copilots. Inclusive copilots must address various real-world challenges, such as a multilingual user base, varied skillsets, limited devices and connectivity, domain-specific understanding, guardrails, and safety principles. Shiksha is the first copilot developed using the VeLLM platform. The VeLLM team is working with collaborators across diverse domains, such as agriculture and healthcare, to develop tailored domain-specific copilot experiences utilizing the platform and addressing associated research problems. 

To learn more about the project or collaboration opportunities, email the team at

The Shiksha copilot team and collaborators (from left to right): Meena Elapulli (Microsoft Research), Ishaan Watts (Microsoft Research), Kavyansh Chourasia (Microsoft Research), Gireesh K.S. (GHPS, Tumkur), Srujana V S (Microsoft Research), Tanuja Ganu (Microsoft Research), Mahalakshmi A (GHPS, Basavana Halli), Parimala H.V. (GHPS, Santhe Beedi), Ravi R (GHPS, Gowdahalli), Maruthi K.R. (GHPS, Anedoddi), Smitha Venkatesh (Sikshana Foundation), Akshay Nambi (Microsoft Research), Somnath Kumar (Microsoft Research), Yash Gadhia (Microsoft Research), Sanchit Gupta (Microsoft Research)

The post Teachers in India help Microsoft Research design AI tool for creating great classroom content appeared first on Microsoft Research.

Data Formulator: A concept-driven, AI-powered approach to data visualization

This research paper was presented at the IEEE Visualization Conference (VIS 2023), the premier forum for advances in visualization and visual analytics.

The VIS2023 logo to the left of the first page of an accepted research paper

Effective data visualization plays a crucial role in data analysis. It enables data analysts and others to explore complex datasets, comprehend patterns, and convey meaningful insights to various stakeholders. Today, there are numerous tools for creating visual representations of data. However, these tools only work with tidy data, meaning that data points must be organized according to the specific categories required by the tool’s visualization format. This poses significant challenges for data analysts, requiring the use of additional tools to transform raw data into a compatible format before it is entered into one of these visualization tools.

For instance, consider a dataset displaying 2020 temperatures in Seattle and Atlanta. If an analyst aims to create a scatter plot comparing the temperatures of these two US cities on the x/y-axes, data transformation is essential. The visualization tool mandates separate columns for Seattle and Atlanta temperatures to map to the scatter plot’s axes. Consequently, the analyst must pivot the input table to generate these columns. Moreover, if the analyst intends to compare which city experiences warmer days or create a smoothed line chart illustrating Seattle’s 7-day moving average temperature, further computations on the transformed data are necessary. Fields like “Warmer” and “Seattle 7-day Moving Avg” need to be calculated to facilitate the visualization, as depicted in Figure 1. This intricate process highlights the complexity and expertise currently needed to prepare raw data for effective visualization.

A figure with upper left showing an input data table with three columns Date, City and Temperature showing temperatures of Seattle and Atlanta from 2020-01-01 to 2020-12-31. On its right side show three visualizations that the user wants to create: (1) a scatter plot to compare their temperatures, (2) a histogram to show number days each city is warmer, and (3) a line chart shows Seattle moving average temperature; and the user cannot create these visualizations because the input table is not in the right format. At the bottom of the figure, it shows a data table that the analyst needs to transform from the input table in order to create desired visualizations. This table contains six columns: Date, Seattle Temp, Atlanta Temp, Warmer, Difference and Seattle Temp Moving Average. There is an emoji of “confusion” to express that the data transformation process can be challenging.
Figure 1. A data analyst wants to compare 2020 temperatures in Seattle and Atlanta using visualizations like scatter plots and histograms. However, the original dataset lacks necessary columns (“Seattle Temp,” “Atlanta Temp,” “Warmer,” and “Seattle Temp Moving Average”) for these visualizations. Data transformation is needed to include these fields.

This hurdle is particularly daunting because it necessitates a certain level of programming expertise or familiarity with additional data processing tools. It highlights the complexities of data visualization and underscores the need for an easier and more seamless process for data analysts, enabling them to create impactful visualizations regardless of their technical background.
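
To make the preparation work in the Seattle/Atlanta example concrete, here is a minimal sketch of the three steps described above: pivoting the long table into one column per city, deriving “Warmer,” and computing a moving average for Seattle. The sample rows are hypothetical, the helper names (`pivot`, `add_warmer`, `add_moving_avg`) are illustrative, and the choice to average over fewer days at the start of the series is an assumption rather than something the example specifies.

```python
from collections import defaultdict

# Hypothetical sample rows in the long (Date, City, Temperature) layout.
rows = [
    ("2020-01-01", "Seattle", 51), ("2020-01-01", "Atlanta", 45),
    ("2020-01-02", "Seattle", 45), ("2020-01-02", "Atlanta", 47),
    ("2020-01-03", "Seattle", 48), ("2020-01-03", "Atlanta", 56),
    ("2020-01-04", "Seattle", 41), ("2020-01-04", "Atlanta", 41),
]


def pivot(rows):
    """Turn one row per (date, city) into one row per date with a column per city."""
    by_date = defaultdict(dict)
    for date, city, temp in rows:
        by_date[date][f"{city} Temp"] = temp
    return [{"Date": date, **by_date[date]} for date in sorted(by_date)]


def add_warmer(table):
    """Add a 'Warmer' column naming the warmer city, or 'Same' on a tie."""
    for row in table:
        seattle, atlanta = row["Seattle Temp"], row["Atlanta Temp"]
        if seattle > atlanta:
            row["Warmer"] = "Seattle"
        elif atlanta > seattle:
            row["Warmer"] = "Atlanta"
        else:
            row["Warmer"] = "Same"
    return table


def add_moving_avg(table, column, window=7):
    """Add a trailing moving average; early rows average over the days available so far."""
    values = [row[column] for row in table]
    for i, row in enumerate(table):
        chunk = values[max(0, i - window + 1): i + 1]
        row[f"{column} Moving Avg"] = sum(chunk) / len(chunk)
    return table


wide = add_moving_avg(add_warmer(pivot(rows)), "Seattle Temp", window=7)
for row in wide:
    print(row)
```

Even in this small form, the analyst has to know how to reshape, branch, and window the data before any chart can be drawn, which is the overhead the paragraph above describes.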

Against the backdrop of rapid advancements in large language models (LLMs) and programming-by-example techniques, researchers have made significant strides in breaking down these barriers. In this context, we share our paper, “Data Formulator: AI-powered Concept-driven Visualization Authoring,” presented at VIS 2023 and winner of the Best Paper Honorable Mention award. Data Formulator is an AI-powered visualization authoring tool developed through a collaboration between researchers studying AI and those studying human-computer interaction (HCI). The result is a new visualization paradigm that separates high-level visualization intents from low-level data transformation steps. The process begins with data analysts articulating their visualization ideas as data concepts. These concepts refer to specific data categories, or fields, that analysts want to visualize, even though they are not present in the raw input data. This way, they effectively convey their visualization intent to the AI agent, which, in turn, assists them in implementing their visualization.

Defining data concepts and creating visualizations

The way Data Formulator operates is straightforward. The analyst defines the specific data concepts they plan to visualize, either through natural language queries or by providing categories or example entries for the concept. Once these concepts are defined, they are linked to an appropriate visual representation, as illustrated in Figure 2.

A figure shows the user interface of Data Formulator and steps for an analyst to interact with the interface. At the right side shows the concept shelf, there is an annotation that reads “1. Concept Shelf: create and derive new concepts needed for visualization”. To its left is the Chart Builder panel, with an annotation “2. Chart Builder: encode data concepts to visual channels”. The bottom left side is a table view that shows the input data, the annotation reads “3. Data View: inspect the original and derive tables”. The top left is the visualization panel that shows visualizations generated by Data Formulator, the annotation reads “4. Visualization View: explore generated visualizations.”
Figure 2. The Data Formulator user interface. Data Formulator has four panels: (1) the Concept Shelf, for defining new data concepts to be visualized, (2) the Chart Builder, for specifying the visualization type, (3) the Table View, for analysts to inspect data automatically generated by Data Formulator, and (4) the Visualization Panel, for presenting final visualizations.

If the analyst defines concepts through examples, Data Formulator engages a program synthesizer, which generates a specialized data reshaping program, transforming the provided data to bring out the required data fields. Conversely, when an analyst introduces a new concept using natural language queries, Data Formulator calls on LLMs to generate code, which facilitates the creation of a new data category based on the provided description. In both cases, Data Formulator compiles the transformed data into a structured table and creates corresponding visualizations.
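
The example-driven path can be pictured as a small programming-by-example search. The sketch below is a toy, not Data Formulator's synthesizer: it enumerates candidate pivot programs over a hypothetical input table and keeps the first one whose output contains every example row the analyst supplied. The function names, the sample rows, and the restriction to a single pivot operation are all assumptions made for illustration, and the output columns are named after the city values rather than “Seattle Temp” and “Atlanta Temp.”

```python
from itertools import permutations

# Hypothetical input table in the long (Date, City, Temperature) layout.
rows = [
    {"Date": "2020-01-01", "City": "Seattle", "Temperature": 51},
    {"Date": "2020-01-01", "City": "Atlanta", "Temperature": 45},
    {"Date": "2020-01-02", "City": "Seattle", "Temperature": 45},
    {"Date": "2020-01-02", "City": "Atlanta", "Temperature": 47},
]


def pivot(rows, index, key, value):
    """Reshape long rows: one output row per `index` value, one column per `key` value."""
    wide = {}
    for row in rows:
        wide.setdefault(row[index], {})[row[key]] = row[value]
    return list(wide.values())


def is_consistent(output, examples):
    """True if every example row is contained in some output row."""
    return all(
        any(example.items() <= row.items() for row in output)
        for example in examples
    )


def synthesize(rows, examples):
    """Enumerate (index, key, value) column choices and return the first consistent pivot."""
    for index, key, value in permutations(list(rows[0]), 3):
        if is_consistent(pivot(rows, index, key, value), examples):
            return index, key, value
    return None  # no candidate program explains the examples


# Example rows the analyst might type to relate the two concepts.
examples = [{"Atlanta": 45, "Seattle": 51}, {"Atlanta": 47, "Seattle": 45}]
program = synthesize(rows, examples)
reshaped = pivot(rows, *program) if program else []
```

The point of the sketch is the division of labor: the analyst supplies only a few example entries, and the search works out which reshaping program produces them. A real synthesizer would explore a much richer space of table transformations.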

We recognize that analyst specifications can be ambiguous, so we designed Data Formulator to generate multiple visualization options to help them identify what they want. The tool also provides analysts with the AI-generated transformation program and the transformed data for inspection. This transparency helps analysts refine their intent for future iterations.

Continuing our Seattle/Atlanta temperature example, the following two figures show how analysts can use Data Formulator to create visualizations without reformatting raw data using an external tool. Instead, the analyst provides example entries in the form of temperature values to create the new data concepts “Seattle Temp” and “Atlanta Temp,” shown in Figure 3. The analyst then uses a natural language query to create the new concept “Warmer” and instructs Data Formulator to format the data so that it can be visualized, shown in Figure 4.

The figure shows the workflow of the analyst to create new data concepts “Atlanta Temp” and “Seattle Temp” using examples. The left figure shows that the user opens a panel in Data Formulator’s concept shelf, types the concept name “Atlanta Temp”, and provides example temperature values “45, 47, 56, 41” to define the concept. Then, the user drags the Atlanta Temp concept to the y-axis in the Chart Builder (the Seattle Temp concept is already placed in the x-axis box). The analyst then completes an example table with two columns, Atlanta Temp and Seattle Temp, and two rows (row 1 contains the values 45, 51; row 2 contains the values 47, 45) to demonstrate the relation between these two concepts. Finally, the analyst clicks the “Formulate” button and Data Formulator returns the transformed data (with columns “#”, “Seattle Temp”, “Atlanta Temp”, “Date”) and a scatter plot that visualizes the data with Seattle Temp on the x-axis and Atlanta Temp on the y-axis.
Figure 3. The analyst creates new data concepts “Atlanta Temp”, “Seattle Temp” using examples. The AI agent solves a programming-by-example problem to create the new concepts for visualization.
The figure shows the workflow of the analyst to create new data concepts “Warmer” using natural language query. The left figure shows that the user opens a panel in Data Formulator’s concept shelf. The user selected “derived from” two concepts “Seattle Temp” and “Atlanta Temp” and typed the concept name “Warmer”. The user also provides a natural language query “Which is the warmer city, or the same” to describe the concept. After clicking a “forge” icon, in the second box shows the concept with the instantiated concept which contains an example table: the example table has 5 rows and header “Seattle Temp, Atlanta Temp, Warmer”, and the rows show “51, 45, Seattle”, “38, 58, Atlanta”, “44, 65, Atlanta”, “42, 60, Atlanta”, “35, 62, Atlanta”. The user then clicks the inspect button, and Data Formulator opens a panel that shows the code that achieve the transformation. Finally, the analyst clicks “save” button after inspecting the code to confirm the code is correct.
Figure 4. The analyst creates a new data concept “Warmer” using natural language description. Data Formulator calls LLMs to generate a transformation program to derive the new concept.

Looking ahead: Analyst-AI collaboration in data analysis

AI-powered data analysis tools have the potential to significantly streamline the entire data analysis process by consolidating various tasks into a single tool. Beyond just visualization, this concept-driven technique can be applied to data cleaning, data integration, visual data exploration, and visual storytelling. Our vision is for an AI system to take high-level instruction from the user and automatically recommend the necessary steps across the entire data analysis pipeline, enabling collaboration between the user and the AI agent to achieve their data visualization goals.

Inevitably, data analysts will need to tackle more complex tasks beyond the scope mentioned here. For this reason, it’s crucial to consider how to design AI-powered tools that clearly communicate to the analyst when results are uncertain, ambiguous, or incorrect. This ensures that the analyst can trust the tool and collaborate effectively with AI to accomplish their objectives.

The post Data Formulator: A concept-driven, AI-powered approach to data visualization appeared first on Microsoft Research.
