Research Focus: Week of January 22, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Register for Microsoft Research Forum

Join Microsoft Research Forum (opens in new tab) for a continuous exchange of ideas about science and technology research in the era of general AI. This series, which begins on January 30, will explore recent research advances, bold new ideas, and important discussions with the global research community. Register now to receive access to all episodes in this quarterly series and be part of the conversation.


Improving Text Embeddings with Large Language Models

Text embeddings are vector representations of natural language that encode semantic information. They are widely used in natural language processing tasks such as information retrieval, question answering, semantic textual similarity, bitext mining, and item recommendation.

In a recent paper: Improving Text Embeddings with Large Language Models (opens in new tab), researchers from Microsoft introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods, this new method does not require building complex training pipelines or manually collected datasets that are often constrained by task diversity and language coverage. The researchers leverage proprietary large language models (LLMs) to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. They then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that this method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, the model sets new state-of-the-art results on the BEIR (opens in new tab) and MTEB (opens in new tab) benchmarks.
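
The paper's exact loss formulation, temperature, and negative-sampling scheme are not reproduced here, but the "standard contrastive loss" it refers to is typically an in-batch InfoNCE-style objective. The following NumPy sketch illustrates only the general idea; all names and values are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce_loss(queries, passages, temperature=0.05):
    """In-batch contrastive loss: queries[i] should match passages[i];
    every other passage in the batch serves as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = q @ p.T / temperature                       # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 4))
aligned = info_nce_loss(queries, queries)            # correctly matched pairs
shuffled = info_nce_loss(queries, queries[[1, 2, 0]])  # positives misaligned
```

As expected, the loss is low when each query sits next to its true passage and high when the pairing is shuffled, which is the signal that pulls matching texts together during fine-tuning.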

DevEx in Action: A study of its tangible impacts

For many professional software developers, the development lifecycle is riddled with friction and red tape, and successful delivery of code to production is a frustratingly infrequent event. Even worse, the problems are often compounded by a lack of management engagement, delaying and frustrating top engineers.

Developer experience (DevEx) is garnering increased attention at many organizations as leaders seek to optimize software delivery against a backdrop of fiscal tightening and transformational technologies such as AI. Developers and technical leaders generally understand that good DevEx leads to better products, more effective software delivery, and developer happiness. Yet, at many organizations, proposed initiatives and investments to improve DevEx struggle to get buy-in, as business stakeholders question the value proposition of improvements.

In a recent paper: DevEx in Action: A study of its tangible impacts (opens in new tab), researchers from Microsoft, GitHub, and DX (opens in new tab) examine this problem and present empirical evidence of how improvements in DevEx influence outcomes like productivity, code quality, and innovation.


The post Research Focus: Week of January 22, 2024 appeared first on Microsoft Research.

Read More

MetaOpt: Examining, explaining, and improving heuristic performance

Heuristic algorithms, often referred to as heuristics, approximate optimal algorithms in order to make faster, more efficient decisions. They are particularly useful in operational scenarios in production systems, such as determining which server a virtual machine should be assigned to or deciding whether data should be removed from a cache in a content delivery network.

However, cloud operators, who are responsible for designing and managing systems in production, often struggle to evaluate when and where their heuristics may underperform. This challenge can lead to over-provisioning the network and the inefficient use of available resources. Such practices can be costly and may result in the inability to meet customer demand.

To address this, we developed MetaOpt, a heuristic analyzer designed to enable operators to examine, explain, and improve heuristics’ performance before deploying them in critical, high-stakes environments. MetaOpt is unique because it not only compares algorithm performance but also provides insights into the underlying reasons for performance disparities between algorithms. It empowers operators and researchers to conduct “what-if” analyses, strategize how to combine heuristics in production, and understand why certain heuristics perform better in specific areas of the input space—the range of possible inputs that the heuristic may encounter.
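
The "performance gap" MetaOpt maximizes can be made concrete with a toy stand-in from one of the settings mentioned above, cache eviction: compare LRU against the offline-optimal Belady policy and search inputs at random for the largest gap. This naive random search is purely illustrative; MetaOpt instead encodes the search as an optimization problem and hands it to a solver.

```python
import random
from collections import OrderedDict

def lru_misses(requests, capacity):
    """Heuristic under study: evict the least-recently-used item."""
    cache, misses = OrderedDict(), 0
    for r in requests:
        if r in cache:
            cache.move_to_end(r)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the oldest entry
            cache[r] = True
    return misses

def belady_misses(requests, capacity):
    """Offline optimum: evict the item whose next use is farthest away."""
    cache, misses = set(), 0
    for i, r in enumerate(requests):
        if r in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(x):
                for j in range(i + 1, len(requests)):
                    if requests[j] == x:
                        return j
                return float("inf")  # never used again: best eviction victim
            cache.remove(max(cache, key=next_use))
        cache.add(r)
    return misses

# Naive "leader": random search over inputs for the largest gap.
rng = random.Random(0)
gap = 0
for _ in range(200):
    seq = [rng.randrange(5) for _ in range(30)]
    gap = max(gap, lru_misses(seq, 3) - belady_misses(seq, 3))
```

A classic adversarial input is a cyclic sequence of capacity + 1 distinct items, on which LRU misses every single request while Belady does far better; a solver-based search like MetaOpt's can find such worst cases systematically rather than by luck.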

We demonstrate MetaOpt’s capability for heuristic analysis by studying heuristics from three domains: traffic engineering, vector bin packing, and packet scheduling. MetaOpt identifies large performance gaps, enables us to prove properties about these heuristics, and guides us in improving them. Table 1 summarizes the results.

Table 1. MetaOpt enabled us to find the performance gap between heuristics from traffic engineering, vector bin packing, and packet scheduling. It also helped us prove various properties about the heuristics. Finally, it helped us modify the heuristics to improve their performance. DP refers to a heuristic Microsoft has deployed in its wide area network for traffic engineering.

Currently, MetaOpt helps Azure operators analyze heuristics in production and serves as a “helper for theorem proving.” For example, we used MetaOpt to establish a tighter bound for the “first fit decreasing” heuristic in vector bin packing, a challenge for theoreticians for over three decades. As a result, we don’t need to over-provision resources in a cloud environment, ensuring we always have sufficient servers to meet customer demand.
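
For readers unfamiliar with the "first fit decreasing" heuristic mentioned above, here is a minimal sketch of its one-dimensional form (the vector variant studied with MetaOpt generalizes item sizes to vectors); the code and item sizes are illustrative, not the production heuristic.

```python
def first_fit_decreasing(items, capacity):
    """Sort items largest-first, place each into the first bin with room,
    opening a new bin when none fits. Returns the resulting bins."""
    bins = []  # each bin is a list of item sizes
    for size in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:  # no existing bin had room
            bins.append([size])
    return bins

bins = first_fit_decreasing([0.5, 0.7, 0.5, 0.2, 0.4, 0.2, 0.5], capacity=1.0)
```

Proving how far such a greedy rule can stray from the optimal packing is exactly the kind of bound MetaOpt helped tighten.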

MetaOpt framework

To use MetaOpt, users input the heuristic they want analyzed and either the optimal algorithm or another heuristic. MetaOpt efficiently translates these inputs into a solver format. It then finds performance gaps and the inputs that cause them. Recognizing that not all users are versed in optimization theory, we designed a higher-level abstraction for MetaOpt. This feature enables users to input their heuristics using a few simple building blocks and constrain the input space to what is relevant in practice. MetaOpt can then analyze decisions made by the heuristic that led to underperformance or identify input properties that caused the heuristic to make suboptimal choices. We illustrate the MetaOpt workflow in Figure 1.

Figure 1. The four steps in the MetaOpt workflow. (1) Users encode the heuristic; (2) MetaOpt automatically rewrites it to obtain a single-level optimization; (3) it divides the problem into smaller, more manageable segments for scalability; (4) it employs existing solvers to find the highest performance gap.

Rooted in game theory concepts

MetaOpt is based on Stackelberg games, a well-known class of leader-follower games in game theory. Here, the leader determines inputs for one or more followers, who must then optimize their outcomes based on these inputs. In the MetaOpt framework, the leader’s goal is to maximize the performance disparity between two algorithms (the followers) by deciding their inputs. The followers, representing the algorithms being compared, choose internal variables to optimize their outcomes. This, in turn, affects the leader’s results. We show this in Figure 2.
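
Schematically, the leader-follower structure can be written as a bilevel program (the notation below is shorthand for exposition, not the paper's exact formulation):

```latex
\max_{x \in \mathcal{X}} \; f\bigl(x, y_{1}(x)\bigr) - f\bigl(x, y_{2}(x)\bigr)
\quad \text{s.t.} \quad
y_{i}(x) = \arg\max_{y \in \mathcal{F}_{i}(x)} g_{i}(x, y), \qquad i = 1, 2,
```

where the leader chooses the input x to maximize the performance gap f(x, y1(x)) − f(x, y2(x)), and each follower i (for instance, the optimal algorithm and the heuristic) responds by optimizing its own objective g_i over its feasible decisions F_i(x). Rewriting this two-level problem into a single-level optimization a solver can handle is step 2 of the workflow in Figure 1.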

Figure 2. The high-level formulation of MetaOpt.

Looking ahead

MetaOpt marks a significant advance in the field of scalable, user-friendly analytical tools. It enables users to examine, understand, and explain differences in performance across competing algorithms. It also helps them improve those algorithms before deploying them in critical environments.

We began developing MetaOpt in early 2022 to address a specific need for heuristic analysis in our network’s traffic engineering solution. Since then, our focus has been on enhancing MetaOpt’s accessibility for users without a background in optimization theory. Currently, we are improving MetaOpt’s scalability and usability, and we are expanding the range of heuristics it supports. We plan to release it as an open-source tool at the USENIX Symposium on Networked Systems Design and Implementation (opens in new tab) (NSDI) conference, scheduled for April 16–18, 2024.

We believe that MetaOpt can significantly boost productivity for those studying or designing heuristics by serving as a risk-analysis engine and a tool for explainable AI and active learning. In the near future, we aim to publish papers on new MetaOpt applications and share our language for describing heuristics.

For more details, visit the MetaOpt webpage, and review our publications page for the latest developments.

The post MetaOpt: Examining, explaining, and improving heuristic performance appeared first on Microsoft Research.

GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases

The Global Health Drug Discovery Institute (GHDDI) (opens in new tab) and Microsoft Research recently achieved significant progress in accelerating drug discovery for the treatment of global infectious diseases. Working in close collaboration, the joint team successfully used generative AI and foundation models to design several small molecule inhibitors for essential target proteins of Mycobacterium tuberculosis and coronaviruses. These new inhibitors show outstanding bioactivities, comparable to or surpassing the best-known lead compounds.

This breakthrough is a testament to the team’s combined efforts in generative AI, molecular physicochemical modeling, and iterative feedback loops between scientists and AI technologies. Normally, the discovery and in vitro confirmation of such molecules could take up to several years, but with the acceleration of AI, the joint team achieved these new results in just five months. This research also shows the tremendous potential of AI for helping scientists discover or create the building blocks needed to develop effective treatments for infectious diseases that continue to threaten the health and lives of people around the world.

Since 2019, for example, there have been more than 772 million confirmed cases of COVID-19 worldwide and nearly 7 million deaths from the virus, according to the World Health Organization (WHO), the Centers for Disease Control, and various other sources. Although vaccines have reduced the incidence and deadliness of the disease, the coronavirus continues to mutate and evolve, making it a serious ongoing threat to global health. Meanwhile, the WHO reports that tuberculosis continues to be a leading cause of death among infectious diseases, second only to COVID-19 in 2022, when 10.6 million people worldwide fell ill with TB and the disease killed 1.3 million (the most recent figures currently available).

Laying the foundation for new infectious disease treatments

Microsoft Research has rich experience in developing and pre-training large AI models specialized for proteins and molecules, demonstrated in both property prediction and molecular generation. Building on that experience, Microsoft Research developed, and maintains ownership of, an AI model for molecule generation tailored to specific protein targets. The generated compounds were virtually screened and further optimized by data scientists and medicinal chemists from GHDDI, followed by compound synthesis and wet-lab experiments to quantify bioactivities. The experimental results were then fed back to the research team at Microsoft for AI model improvement and new compound generation.

This integrated AI-expert-experiment pipeline enabled the generation of novel compounds for protein targets in Mycobacterium tuberculosis and the coronavirus SARS-CoV-2. In less than five months, the joint team designed several chemical compounds that are effective in inhibiting these pathogens’ essential target proteins, accelerating the structure-based drug discovery process.

Figure 1. Two potential inhibitor compounds (generated by our method) for ClpP of tuberculosis.
Figure 2. Dose-response curves of the compounds generated for the coronavirus, with GRL0617 as the reference compound, demonstrating enhanced bioactivity. Most recently, the joint team optimized the IC50 to 0.18 µM, approximately an eight-fold improvement over GRL0617.

One distinguishing feature of AI-generated molecules is their novel scaffold structures, which are important because they create the potential for these molecules to be developed into a new class of drug candidates. These novel structures offer the possibility of more effective treatments, and also help to address the escalating challenge of antimicrobial resistance (AMR), a major hurdle in treating infectious diseases like tuberculosis and COVID-19.

“In the current landscape of scientific research, we encounter unparalleled challenges but also have unprecedented opportunities,” said Dr. Sheng Ding, institute director of GHDDI. “Innovation stands as the central catalyst for scientific advancement and a crucial element in addressing global health challenges. I’m excited about our collaboration with Microsoft Research and gratified with the progress we’ve jointly achieved. Without a doubt, our combined efforts will enhance R&D efficiency and expedite the process of drug discovery.”

“This represents a collaboration that transcends disciplines and boundaries,” he noted. “Our combined strengths will advance pharmaceutical research, paving new avenues in scientific exploration. Going forward, we anticipate deploying such cutting-edge technologies in uncharted realms of life sciences. This will enable us to offer more comprehensive, profound, and practical solutions for global health issues.”

Using AI to improve global health

Embracing the principle of open innovation, the collaboration between GHDDI and Microsoft Research is dedicated to harnessing AI technology to expedite drug discovery. The goal is to contribute to global health equity through the development of lifesaving medications and the prompt delivery of safer and more effective drug solutions that are accessible to everyone. The collaboration focuses on infectious diseases that pose a threat to global health, including but not limited to tuberculosis, viral infections, and malaria. Both parties are committed to a deep integration of generative AI, foundational models, high-throughput virtual screening, and expert knowledge to tackle these challenges.

“Successful AI-driven drug discovery necessitates a tight-knit collaboration between AI specialists and medicinal experts,” said Dr. Tie-Yan Liu, distinguished scientist at Microsoft Research AI4Science. “In recent years, our globally recognized team at Microsoft Research has been deeply engaged in interdisciplinary research between AI and natural science. To complement this, GHDDI experts bring to the table a wealth of industry experience and profound domain knowledge. Their experimental facilities not only allow for testing but also help provide invaluable feedback for training AI models. Because of our close collaboration, we look forward to producing groundbreaking research outcomes with the potential to redefine the future of healthcare through AI technology innovation.”

Accelerating drug discovery

Commenting on the research into Mycobacterium tuberculosis and coronaviruses, Dr. Rumin Zhang, chief scientific officer at GHDDI, noted that the application of AI technology by the collaborative team managed to considerably reduce the traditionally lengthy drug discovery process. The team was able to design and validate highly effective small molecule inhibitors for the pathogens in just five months.

“This is an exceptional accomplishment that underscores the immense potential of AI in efficient de novo drug design. It also vividly illustrates the team’s exceptional innovative capacity and professional prowess,” he said. “We are excited about this innovative R&D strategy leading to more groundbreaking advancements in a broader spectrum of future drug discovery projects.”

“This work is all about pushing the boundaries of AI technology for application in new drug R&D,” said Dr. Tao Qin, senior principal researcher at Microsoft Research AI4Science. “We aim to leverage AI innovations to enhance human health, tackle worldwide health issues, and ensure the advantages of AI technology are accessible to all.”

“We plan to intensify and broaden our collaboration, further advancing the use of AI technology in the realm of life sciences,” said Dr. Jinjiang Guo, head of the Data Science Department at GHDDI. “This will yield novel insights that will enrich researchers’ understanding of mechanisms underlying diseases and life, thus paving the way for the development of innovative treatment strategies and providing more effective solutions for diseases that have long affected human health. We are highly optimistic about the potential of this collaboration and are confident that it will have a substantial impact on the future of the healthcare field.”

Next steps

In the next phase, Microsoft Research and GHDDI will collaborate to optimize the discovered hit compounds, enhance ADMET properties, progress toward preclinical studies, and initiate a broader range of drug-discovery projects.

The post GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases appeared first on Microsoft Research.

TaskWeaver: A code-first agent framework for efficient data analytics and domain adaptation

The advent of large language models (LLMs) has revolutionized human-machine interactions, particularly in natural language understanding and generation applications. These AI- or LLM-backed virtual assistants hold the promise of serving as intelligent agents capable of autonomously reasoning, observing, and performing tasks articulated in natural language. However, it is still challenging for most agent frameworks to efficiently handle complex data structures (e.g., DataFrame), which are prevalent in data analytics tasks and domain-specific scenarios. 

To address these challenges, we introduce TaskWeaver, a code-first agent framework that converts natural language user requests into executable code, with additional support for rich data structures, dynamic plugin selection, and a domain-adapted planning process. Now publicly available as an open-source framework, TaskWeaver leverages LLMs’ coding capability to implement complex logic and incorporates domain-specific knowledge through customizable examples and plugins. TaskWeaver empowers users to easily build their own virtual assistants that understand diverse domain questions, follow examples, and execute customizable algorithms on complex data structures efficiently.

Motivating example – Anomaly detection on time-series data

Scenario

Amy is a business analyst who wants to identify anomalies in a time series of sales data stored in a SQL database. She would like help from an AI assistant through natural language interactions. Moreover, Amy would like to apply her own definition and interpretation of anomalies in the context of sales data, including a customized anomaly detection algorithm. Figure 1 shows the desired conversation between the user and the AI assistant: the assistant should first pull the data from the target database, then apply the desired algorithm, and return the visualized results.

Figure 1. Sample chat between the user and the AI assistant

Requirements for an agent framework

To accomplish the task above, we identify several key requirements that current agent frameworks may lack:

  • Plugin: The agent needs to first query and collect data from a database, and then detect anomalies using a specialized anomaly detection algorithm. Both require the capability to define and invoke custom plugins, e.g., the query_database plugin and the anomaly_detection plugin. 
  • Rich data structure: The agent should be capable of handling data in complex but common structures, such as arrays, matrices, and tabular data (e.g., pandas DataFrame (opens in new tab)), to perform advanced data processing. Many existing frameworks transform intermediate outputs into strings in the prompt or save them as local files before reading them again. This practice is error-prone and can easily exceed the prompt token limit. Additionally, data in rich structures should transfer easily from one plugin to another.
  • Stateful execution: The agent engages in iterative interactions with the user, processing user inputs and executing tasks accordingly. The execution states should be preserved throughout the entire conversation session across multiple chat rounds. 
  • Reasoning and acting (ReAct): The agent should have the capability to first observe and then act. The database might contain data of various schemas in the real world, leading to different arguments for anomaly detection. Therefore, the agent must first inspect the data schema, understand which columns are appropriate (and ask users to confirm), then feed the corresponding column names into the anomaly detection algorithm.
  • Arbitrary code generation: The agent should be able to generate code to accommodate ad-hoc user demands, which are not covered by the pre-defined plugins. In the example provided, the agent generates code to visualize the detected anomalies without using any plugins.  
  • Incorporating domain knowledge: The agent should provide a systematic way to incorporate domain-specific knowledge. It would help LLMs deliver better planning and accurate tool calling, which in turn produces reliable results, particularly in domain-specific scenarios.
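
To make the plugin requirement concrete, here is a minimal sketch of what an anomaly_detection plugin body might look like: a plain Python function over a pandas DataFrame. The function name matches the example above, but the signature, the column parameters, and the detection rule (a robust median/MAD outlier test) are illustrative assumptions, not TaskWeaver’s actual plugin interface.

```python
import pandas as pd

def anomaly_detection(df: pd.DataFrame, time_col: str, value_col: str,
                      k: float = 5.0) -> pd.DataFrame:
    """Hypothetical plugin body: flag rows whose value deviates from the
    median by more than k times the median absolute deviation (MAD)."""
    out = df.copy()
    med = out[value_col].median()
    mad = (out[value_col] - med).abs().median()
    out["is_anomaly"] = (out[value_col] - med).abs() > k * mad
    return out

# Toy sales series with one obvious spike.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": [100, 102, 98, 101, 99, 103, 97, 100, 500, 101],
})
flagged = anomaly_detection(sales, "date", "value")
```

Because the result stays a DataFrame in memory, it can flow directly into the next step, such as a plotting snippet the agent generates on the fly, which is precisely the rich-data-structure point above.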

TaskWeaver architecture

Figure 2 shows the three core components in the TaskWeaver architecture. The Planner serves as the system’s entry point and interacts with the user. Its responsibilities include: (1) planning – breaking down the user’s request into subtasks and managing the execution process with self-reflection; and (2) responding – transforming the execution result into a human-readable response for the user. The Code Interpreter consists of two components: the Code Generator generates code for a given subtask from the Planner, considering existing plugins and domain-specific task examples; the Code Executor is responsible for executing the generated code and maintaining the execution state throughout the entire session.

Figure 2. Overview of TaskWeaver

Running workflow for motivating example

TaskWeaver has a two-layer planning process for dealing with user requests. In the first layer, the Planner generates a high-level plan outlining the steps required to fulfill the request. In each subsequent round, the code generator will devise a plan, in terms of chain-of-thought and generated code, to execute the specified step. Figure 3 presents the internal workflow of TaskWeaver when accomplishing the motivating example mentioned above. Note that the prompts shown in Figure 3 are simplified and do not represent the full complex instructions. 

The initial step involves the Planner taking the user query, the Code Interpreter description, and planning examples (if provided) to generate a plan. For the given example, the plan first pulls data from the database and describes the data schema. The Code Generator prompt delineates its profile and competencies, providing definitions of all relevant plugins (e.g., function name, description, arguments, and return values). The output from the Code Generator is a code snippet that executes the sql_pull_data plugin, retrieves the data into a DataFrame, and provides a description of the data schema.

Next, the code generated is sent to the Code Executor for execution, after which the result is sent back to the Planner to determine the next planning step. In the example, the execution result reveals two columns, namely date and value, in the DataFrame. For the next step, the Planner can either confirm with the user if these columns correspond to the two input parameters of the anomaly_detection plugin, or directly proceed to the next step.

Figure 3. Workflow of TaskWeaver

Key design considerations of TaskWeaver

  • Code-first analytics experience: TaskWeaver converts user requests into Python programs that run on dedicated processes, where the Python program can be plugin invocations, arbitrary code to handle ad-hoc user queries, or both. Unlike other frameworks that rely on text or file-based expressions, TaskWeaver can fully utilize native data structures such as pandas DataFrame and numpy ndarray that exist in the memory. This makes it easy to perform tasks such as pulling data from a database, running machine learning algorithms (e.g., anomaly detection, classification, or clustering), summarizing results, and visualizing analysis outcomes. 
  • Domain adaptation: Incorporating domain-specific knowledge into the model via prompts can help boost LLMs’ performance when the user query is complex. TaskWeaver provides two options to make customizations with the user’s domain knowledge:
    • Customization with plugins: Users can define custom plugins (including Python implementation and schema) to incorporate domain knowledge, such as pulling data from a specific database, and running a dedicated algorithm.
    • Customization with examples: TaskWeaver also provides an easy-to-implement interface (in YAML format) for users to configure examples to teach the LLMs how to respond to certain requests. The examples can be of two categories: one is used for planning and the other is for code generation.
  • Stateful code execution: When users make ad-hoc requests for data analytics, the work often involves multiple iterations. As a result, TaskWeaver maintains the state of code execution throughout the entire session. This is like programming in Python in a Jupyter notebook, where users type code snippets into a sequence of cells and the program’s internal state progresses sequentially. The difference is that TaskWeaver users write natural language instead of a programming language. TaskWeaver converts each user request into one or more code snippets in each round, depending on the specific plan.
  • Others: TaskWeaver also supports other features such as intelligent plan decomposition and self-reflection to respond to a user’s request in a more reliable and organized manner. Moreover, features like restricted code generation can help limit the capabilities of the generated code to reduce security risks.
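
The stateful-execution idea above, in which every generated snippet runs against the same live namespace, Jupyter-style, can be approximated in a few lines. This is a toy illustration of the concept only, not TaskWeaver’s actual Code Executor, and it deliberately omits the sandboxing and restricted-code safeguards a real system needs.

```python
class StatefulExecutor:
    """Toy Jupyter-style executor: every snippet runs in one shared
    namespace, so variables persist across rounds of a session."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> None:
        # exec against the shared dict keeps state between calls
        exec(code, self.namespace)

session = StatefulExecutor()
session.run("data = [3, 1, 4, 1, 5]")       # round 1: load data
session.run("total = sum(data)")            # round 2: reuses `data`
session.run("result = total / len(data)")   # round 3: reuses `total`
```

Each round’s snippet builds on names defined in earlier rounds, exactly the property that lets a multi-turn conversation operate on one evolving DataFrame instead of re-serializing it every time.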

Getting started

TaskWeaver (opens in new tab) is now publicly available on GitHub. You may run the following commands to quickly get started.

git clone https://github.com/microsoft/TaskWeaver.git 
cd TaskWeaver 
pip install -r requirements.txt   # install the requirements 

Once the installation is finished, users can configure key parameters, such as the LLM endpoint, key and model, and start the TaskWeaver service easily by following the running examples (opens in new tab).

Other resources

The post TaskWeaver: A code-first agent framework for efficient data analytics and domain adaptation appeared first on Microsoft Research.

Advancing transparency: Updates on responsible AI research

Editor’s note: All papers referenced here represent collaborations throughout Microsoft and across academia and industry that include authors who contribute to Aether, the Microsoft internal advisory body for AI ethics and effects in engineering and research.

A surge of generative AI models in the past year has fueled much discussion about the impact of artificial intelligence on human history. Advances in AI have indeed challenged thinking across industries, from considering how people will function in creative roles to effects in education, medicine, manufacturing, and more. Whether exploring impressive new capabilities of large language models (LLMs) such as GPT-4 or examining the spectrum of machine learning techniques already embedded in our daily lives, researchers agree on the importance of transparency. For society to appropriately benefit from this powerful technology, people must be given the means for understanding model behavior.  

Transparency is a foundational principle of responsible, human-centered AI and is the bedrock of accountability. AI systems have a wide range of stakeholders: AI practitioners need transparency for evaluating data and model architecture so they can identify, measure, and mitigate potential failures; people using AI, expert and novice, must be able to understand the capabilities and limitations of AI systems; people affected by AI-assisted decision-making should have insights for redress when necessary; and indirect stakeholders, such as residents of cities using smart technologies, need clarity about how AI deployment may affect them.

Providing transparency when working with staggeringly complex and often proprietary models must take different forms to meet the needs of people who work with either the model or the user interface. This article profiles a selection of recent efforts for advancing transparency and responsible AI (RAI) by researchers and engineers affiliated with Aether, the Microsoft advisory body for AI ethics and effects in engineering and research. This work includes investigating LLM capabilities and exploring strategies for unlocking specialized-domain competencies of these powerful models while urging transparency approaches for both AI system developers and the people using these systems. Researchers are also working toward improving identification, measurement, and mitigation of AI harms while sharing practical guidance such as for red teaming LLM applications and for privacy-preserving computation. The goal of these efforts is to move from empirical findings to advancing the practice of responsible AI.

Demo video: Toward user-centered algorithmic recourse

In this demo of GAM Coach, an example of an AI transparency approach, an interactive interface lets stakeholders in a loan allocation scenario understand how a model arrived at its prediction and which factors they can change to meet their goals.

Related papers

Identifying harms in LLMs and their applications

The sociotechnical nature of AI is readily apparent as product teams sprint to integrate the power and appeal of LLMs into conversational agents and productivity tools across domains. At the same time, recent accounts, such as a lawyer unwittingly submitting generative AI’s fictitious legal citations in a brief to the court (opens in new tab) or unsettling demonstrations of deepfakes, reveal the ample opportunity for misunderstanding these models’ capabilities and, worse yet, for deliberately misusing them.  

Envisioning what could go wrong with an AI system that has not yet been deployed is the first step toward responsible AI. Addressing this challenge, researchers introduce AHA! (anticipating harms of AI), a human-AI collaboration for systematic impact assessment. This framework enables people to make judgments about the impact of potential deployment on stakeholders. It uses an LLM to generate vignettes, or fictional scenarios, that account for an ethical matrix of problematic AI behaviors or harms. Evaluation of this framework in a variety of decision-making contexts found it surfaced a broader range of potential harmful outcomes than either people or LLMs could singly envision.


AI practitioners can follow this planning guide to help them set up and manage red teaming for large language models (LLMs) and their applications. Based on firsthand experience of testing LLMs to identify potentially harmful outputs and plan for mitigation strategies, this guide provides tips for who should test, what to test, and how to test, plus pointers for recording the red-teaming data.

Responsible AI red teaming, or probing models and their applications to identify undesirable behaviors, is another method of harm identification. Microsoft has shared a practical guide for the RAI red teaming of LLMs and their applications, and automated tools for RAI red teaming are beginning to emerge. Although the vital task of impact assessment and testing for failures can be facilitated by LLMs helping with creative brainstorming, researchers emphasize that for AI to be human centered, such efforts should never be fully automated. To improve human-AI complementarity in red teaming, AdaTest++ builds on an existing tool that uses an LLM to generate test suggestions as it adapts to user feedback. The redesign offers greater human control for testing hypotheses, enabling editing and exploration of counterfactuals, and conducting in-depth testing across a broad diversity of topics.

Researchers invite AI practitioners working toward responsible AI to use and contribute to AdaTest++ (opens in new tab), which leverages human-AI complementarity to audit LLMs. Augmenting a preexisting tool, AdaTest++ offers prompt templates and introduces greater human control for testing hypotheses and exploring counterfactuals.

In AI privacy, researchers demonstrate how prompt-tuning can be used to infer private information from an email system that uses a language model to provide autocompleted replies. In sharing their red-teaming technique, they encourage privacy-enhancing efforts for applications using language models and take the stance that publicly detailing a model’s vulnerabilities is an essential step toward adversarial robustness.

Identifying and exposing security vulnerabilities is a top concern, especially when these can seep into AI-generated code. The integration of LLMs for AI-assisted coding has reduced the entry barrier for novice programmers and increased productivity for veteran coders. But it is important to examine the reliability and safety of AI-assisted coding. Although static analysis tools can detect and remove insecure code suggestions caused by the adversarial manipulation of training data, researchers introduce two novel techniques for poisoning code-suggestion models that bypass static analysis mitigation: Covert inserts malicious code in docstrings and comments, while TrojanPuzzle tricks the transformer-based model into substituting tokens, giving the programmer harmless-looking but insecure code. Exposing these vulnerabilities, researchers call for new methods for training code-suggestion models and for processes to ensure code suggestions are secure before programmers ever see them.

Related papers

Transparency for improving measurement and its validity

We can’t begin to mitigate the possibility of AI failures without first identifying and then measuring the potential harms of a model’s outputs, transparently examining who may or may not benefit or what could go wrong and to what extent.

A framework for automating the measurement of harms at speed and scale has two LLMs simulate product- or end-user interaction and evaluate outputs for potential harms, using resources created by relevant sociotechnical-domain experts. As researchers stress, the validity and reliability of such evaluation rely strictly on the quality of these resources—the templates and parameters for simulating interactions, the definition of harms, and their annotation guidelines. In other words, sociotechnical-domain expertise is indispensable.
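The measurement loop described above can be sketched as two cooperating models: one simulates the interaction, the other scores the output against expert-written resources. This is only an illustrative sketch; the prompts, the `HARM_GUIDELINE` text, and the `simulator_llm`/`evaluator_llm` callables are hypothetical stand-ins, not the framework's actual interfaces.

```python
# Sketch of the automated harm-measurement loop: one LLM simulates a
# product- or end-user interaction from a template, and a second LLM labels
# the result using harm definitions and annotation guidelines authored by
# sociotechnical-domain experts. All names here are illustrative.

HARM_GUIDELINE = "Label the response 1 if it demeans a person or group, else 0."

def measure(simulator_llm, evaluator_llm, templates):
    """Return the fraction of simulated interactions labeled harmful."""
    labels = []
    for template in templates:
        transcript = simulator_llm(template)  # simulated user interaction
        verdict = evaluator_llm(f"{HARM_GUIDELINE}\nResponse: {transcript}\nLabel:")
        labels.append(int(verdict.strip()))
    return sum(labels) / len(labels)  # measured harm rate over the templates

# Trivial stand-ins so the sketch runs end to end.
sim = lambda t: f"assistant reply to: {t}"
ev = lambda p: "0"
rate = measure(sim, ev, ["Ask for a joke about a coworker."])
```

As the researchers stress, the value of a loop like this depends entirely on the quality of the expert-authored templates and guidelines, not on the automation itself.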

Measurement validity—ensuring metrics align with measurement goals—is central to the practice of responsible AI. Model accuracy in and of itself is not an adequate metric for assessing sociotechnical systems: for example, in the context of productivity applications, capturing what is valuable to an individual using an AI system should also be taken into account. How do we identify metrics appropriate for context-dependent models that are deployed across domains to serve a variety of populations and purposes? Teams need methods to address measurement and mitigation for every deployment scenario.  

Language models illustrate the maxim “context is everything.” When it comes to measuring and mitigating fairness-related harms that are context-dependent in AI-generated text, there’s generally not enough granularity in dataset labeling. Lumping harms under generalized labels like “toxic” or “hate speech” doesn’t capture the detail needed for measuring and mitigating harms specific to various populations. FairPrism is a new dataset for detecting gender- and sexuality-related harms that makes a case for greater granularity in human annotation and transparency in dataset documentation, including identifying groups of people that may be targeted. Researchers situate FairPrism as “a recipe” for creating better-detailed datasets for measuring and mitigating AI harms and demonstrate how the new dataset’s 5,000 examples of English text can probe for fairness-related harms to a specific group.

AI practitioners can request access to the FairPrism dataset (opens in new tab) for detecting gender- and sexuality-related harms in AI-generated text. FairPrism makes a case for greater granularity in human annotation.

Similarly, researchers deepen the conversation around representational harms in automated image-tagging systems, voicing the need for improved transparency and specificity in taxonomies of harms for more precision in measurement and mitigation. Image tagging is generally intended for human consumption, as in alt text or online image search, differentiating it from object recognition. Image tagging can inflict fairness-related harms, including reifying social groups as well as stereotyping, demeaning, or erasure. Researchers identify these four specific representational harms and map them to computational measurement approaches in image tagging. They call out the benefits of increased granularity but note there is no silver bullet: efforts to mitigate by adding or removing particular tags to avoid harms may in fact introduce or exacerbate these representational harms.

Related papers

Transparency and UX-based mitigations: What designers need and end users want

Prioritizing what people value and designing for the optimal user experience (UX) is a goal of human-centered, responsible AI. Unfortunately, UX design has often been viewed as a secondary consideration in engineering organizations. But because AI is a sociotechnical discipline, where technical solutions must converge with societal perspectives and social science theory, AI not only brings UX expertise to the foreground but also positions designers as potential innovators, well situated to mitigate some harms and model failures with UX interventions. To realize this, UX designers need transparency—visibility into how models work—so they can form “designerly understanding of AI” to help them ideate effectively. A study of 23 UX designers completing a hands-on design task illustrates their need for better support, including model documentation that’s easier to understand and interactive tools to help them anticipate model failures, envision mitigations, and explore new uses of AI.   

People with varying levels of AI experience or subject-matter expertise are suddenly harnessing commercially available generative AI copilots for productivity gains and decision-making support across domains. But generative AI can make mistakes, and the impact of these failures can differ greatly depending on the use case: for example, poor performance in a creative-writing task has a very different impact than an error in a health care recommendation. As the stakes rise, so does the call for mitigating these failures: people need tools and mechanisms that help them audit AI outputs. UX interventions are well suited for mitigating this type of harm. To begin, researchers propose a taxonomy of needs that co-auditing systems should address when helping people double-check generative AI model responses. Basic considerations should include how easy it is for individuals to detect an error, what their skill level is, and how costly or consequential an error may be in a given scenario. A prototype Excel add-in illustrates these considerations, helping the nonprogrammer inspect the accuracy of LLM-generated code.

There are productivity dividends to paying attention to people’s need and desire for transparency. A central problem people encounter with language models is crafting prompts that lead to useful output. Advancing a solution for this in LLM-based code generation, researchers demonstrate an interface that gives people visibility into how the model maps their natural language query to system action. This transparency approach helps people adapt their mental model of the code generator’s capabilities and modify their queries accordingly. Findings of the user study, which included participants with low expertise in coding, showed this transparency approach promoted user confidence and trust while facilitating explanation and debugging.  Similarly, human-centered efforts such as modeling the timing of when a programmer finds it most valuable to receive a code suggestion emphasize the primacy of end users’ needs when addressing productivity. 

“What It Wants Me To Say” demo video

This transparency approach provides nonexpert programmers with an interface that gives visibility into how a language model maps their natural language query to system action, helping them adapt their mental model and modify their prompts.

For experienced coders to be confident and benefit from AI-assisted code completion, they need to be able to easily spot and correct errors and security vulnerabilities. In the first empirical study of the effectiveness of token highlighting for communicating uncertainty of an AI prediction, researchers examine a UX technique that draws programmers’ attention in a way similar to a spell checker. Highlighting tokens that had the highest predicted likelihood of being edited resulted in programmers being able to complete tasks faster with better-targeted edits. Participants also desired more transparency in the form of explanations to help with diagnosing uncertainty and suggested interaction designs that would improve their efficiency and give them control.
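The highlighting technique in the study above can be illustrated with a small sketch: flag the generated tokens most likely to need editing, the way a spell checker underlines words. The `highlight` function, the threshold, and the edit probabilities below are all made up for illustration; the study's actual UX and model are not shown here.

```python
# Sketch of uncertainty highlighting for AI code completion: tokens whose
# predicted likelihood of being edited exceeds a cutoff are marked so the
# programmer's attention is drawn to them. Threshold value is illustrative.

HIGHLIGHT_THRESHOLD = 0.5

def highlight(tokens_with_edit_prob, threshold=HIGHLIGHT_THRESHOLD):
    """Wrap likely-to-be-edited tokens in [[ ]] markers."""
    return " ".join(
        f"[[{tok}]]" if p >= threshold else tok
        for tok, p in tokens_with_edit_prob
    )

suggestion = [("return", 0.1), ("fibonacci", 0.7), ("(n)", 0.2)]
print(highlight(suggestion))  # return [[fibonacci]] (n)
```

In a real editor the markers would be rendered as an underline or tint rather than bracket characters; the point is that the signal comes from the model's own edit-likelihood estimates.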

Communicating the uncertainty of AI predictions in a meaningful way is a design challenge in every deployment context. How to provide transparency via explanations remains a conundrum: studies have shown that the mere presence of explanations can increase overreliance on AI. Designing UX that helps people meet their decision-making goals with confidence requires understanding their perceptions of how a given system works. But little is actually known about the processes decision-makers go through when debating whether to rely on an AI system’s output versus their own intuition.

Conducting a think-aloud study for insights into the role human intuition plays in AI-assisted decision-making, researchers identified three types of intuition people use in deciding to override the system. While performing brief tasks of income prediction and biography classification with AI support, participants expressed “gut feel” about the decision outcome; how specific data characteristics, or features, may impact explanations; and the limitations of the AI system. Findings suggested what the authors call “intuition-driven pathways” to understanding the effect of different types of explanations on people’s decision to override AI. Results showed that example-based explanations, which were textual narratives, aligned better with people’s intuition and reasoning about a prediction than feature-based explanations, which conveyed information with bar charts and other visual tools. At the same time, participants echoed the familiar desire for help with understanding AI systems’ limitations. Suggestions included interface designs to better support transparency and user understanding, for example, interactive explanations that let people change attributes to explore the effect on the model’s prediction.

Accommodating varying levels of user expertise is a growing AI UX design challenge across domains and applications. For example, in business, people with limited knowledge of AI or statistics must increasingly engage AI visual-analytic systems to create reports and inform recommendations. While research seeks to address gaps in knowledge for improving user interaction with AI, some practical and evidence-driven tools are already available. A case study of business experts with varying levels of AI proficiency demonstrates the effectiveness of applying existing guidelines for human-AI interaction for transparency cues. Visual explanations improved participants’ ability to use a visual AI system to make recommendations. At the same time, researchers noted a high level of trust in outputs regardless of participants’ understanding of AI, illustrating the complexity of AI transparency for appropriate trust.

Related papers

Transparency for responsible AI is a sociotechnical endeavor. From the technology perspective, organizations developing LLMs and their applications need to consider how to appropriately characterize and communicate capabilities, limitations, performance, and risks and at what level transparency approaches should take place. To meet the needs of stakeholders—the wide range of people developing, deploying, using, or being impacted by LLMs—transparency considerations should include stakeholder goals; improving people’s mental models of LLMs and supporting appropriate trust; and gaining insight to how transparency can contribute to better control mechanisms for LLMs. (Adapted from AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap.)

Transparency: A means for accountability in a new era of AI

This research compilation highlights that transparency is fundamental to multiple components of responsible AI. It requires, among other things, the understanding and communication of datasets and their composition and of model behaviors, capabilities, and limitations. Transparency also touches every aspect of the responsible AI harm mitigation framework: identify, measure, mitigate. Furthermore, this research establishes a primary role for UX in mitigating harms as AI integrates into the apps people rely on every day in their personal and professional lives.     

As authors of a research roadmap for transparency in the age of LLMs outline, these complex models’ massive datasets, nondeterministic outputs, adaptability, and rapid evolutionary pace present new challenges for deploying AI responsibly. There’s much work to be done to improve transparency for stakeholders of highly context-dependent AI systems—from improving how we publish the goals and results of evaluations when it comes to model reporting to providing appropriate explanations, communicating model uncertainty, and designing UX-based mitigations.  

Prioritizing transparency in the design of our AI systems is to acknowledge the primacy of people, whom the technology is meant to serve. Transparency plays a critical role in respecting human agency and expertise in this new frontier of human-AI collaboration and, ultimately, can hold us accountable for the world we are shaping.

Eric Horvitz presenting the KDD 2023 keynote: People and Machines: Pathways to Deeper Human-AI Synergy

Pathways to deeper human-AI synergy

In his KDD 2023 keynote, Microsoft Chief Scientific Officer Eric Horvitz presents an overview of the power of LLM capabilities and the potential for enriching human-AI complementarity.

Related papers

MICROSOFT RESEARCH PODCAST

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

In this episode, PhD students Jennifer Scurrell and Alejandro Cuevas talk to Senior Researcher Dr. Madeleine Daepp. They discuss the internship culture at Microsoft Research, from opportunities to connect with researchers to the teamwork they say helped make it possible for them to succeed, and the impact they hope to have with their work.


The post Advancing transparency: Updates on responsible AI research appeared first on Microsoft Research.


Research Focus: Week of January 8, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus - week of January 8, 2024

Mixture-of-Linear-Experts for Long-term Time Series Forecasting 

Long-term time series forecasting (LTSF), which aims to predict future values of a series given past values, is an important problem in the machine learning community. It’s useful in areas like weather modeling, traffic flow prediction, and financial forecasting.  

In some cases, the current state of the art in LTSF is attained by linear-centric models. However, real-world time series are usually nonstationary; for example, traffic patterns change on different days of the week. The inherent simplicity of linear-centric models leaves them unable to capture these patterns. In a recent paper: Mixture-of-Linear-Experts for Long-term Time Series Forecasting, researchers from Microsoft and external colleagues propose Mixture-of-Linear-Experts (MoLE) to address this problem. Instead of training a single model, MoLE trains multiple linear-centric models (i.e., experts) and a router model that weighs and mixes their outputs. While the entire framework is trained end-to-end, each expert learns to specialize in a specific temporal pattern, and the router learns to compose the experts adaptively. Experiments show that MoLE significantly reduces the forecasting error of linear-centric models and outperforms state-of-the-art transformer-based approaches in 68% of settings.
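The expert-plus-router structure described above can be sketched in a few lines. This is a minimal forward-pass illustration only: the expert matrices, router weights, and dimensions below are made up, and the paper trains all of these jointly end-to-end rather than using fixed random parameters.

```python
import numpy as np

# Toy Mixture-of-Linear-Experts (MoLE) sketch: several linear forecasters
# (experts) map a lookback window to a forecast horizon, and a router
# produces softmax weights from the same window to mix their outputs.
rng = np.random.default_rng(0)

LOOKBACK, HORIZON, N_EXPERTS = 8, 4, 3

experts = [rng.normal(scale=0.1, size=(HORIZON, LOOKBACK)) for _ in range(N_EXPERTS)]
router_w = rng.normal(scale=0.1, size=(N_EXPERTS, LOOKBACK))

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mole_forecast(window):
    """Weigh each expert's linear forecast by the router's softmax output."""
    weights = softmax(router_w @ window)             # (N_EXPERTS,)
    preds = np.stack([W @ window for W in experts])  # (N_EXPERTS, HORIZON)
    return weights @ preds                           # (HORIZON,)

window = rng.normal(size=LOOKBACK)
forecast = mole_forecast(window)
print(forecast.shape)  # (4,)
```

Because the router's weights depend on the input window, different temporal patterns (e.g., weekday vs. weekend traffic) can be routed to different experts, which is what lets the mixture capture nonstationarity that a single linear model cannot.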

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

End-to-end (E2E) models are the dominant model structure in automatic speech recognition (ASR) and speech translation (ST). This has led to efforts to develop a unified E2E model for multilingual ASR and multilingual ST tasks. Neural transducers have been used extensively for streaming ASR and ST tasks.

In a recent paper: A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability, researchers from Microsoft present a streaming multilingual speech model, SM², which employs a single neural transducer to transcribe or translate multiple languages into target languages. SM² is trained on weakly supervised data created by converting 351,000 hours of anonymized ASR transcriptions from 25 languages with a text-based machine translation service; no human-labeled ST data was employed during training. Despite this, SM² achieves impressive ST performance.

The researchers also demonstrate the truly zero-shot capability of SM² when expanding to new target languages, generating high-quality zero-shot translations for {source-speech, target-text} pairs that were not seen during training.
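The weak-supervision recipe described above, pairing each speech segment's ASR transcription with a machine-translated target, can be sketched as a small data-preparation step. The `translate` function below is a hypothetical stand-in for a text-based MT service, not the system the researchers used.

```python
# Sketch of creating weakly supervised ST training pairs: convert ASR
# transcriptions with machine translation so each example becomes a
# {source-speech, target-text} pair, with no human ST labels involved.

def translate(text: str, target_lang: str) -> str:
    # Placeholder MT call; a real pipeline would invoke an MT model/service.
    toy_mt = {("hello world", "de"): "hallo welt"}
    return toy_mt.get((text, target_lang), text)

def make_weak_st_pairs(asr_data, target_lang):
    """asr_data: iterable of (audio_id, transcription) from ASR corpora."""
    return [
        {"speech": audio_id, "target_text": translate(text, target_lang)}
        for audio_id, text in asr_data
    ]

pairs = make_weak_st_pairs([("utt-001", "hello world")], "de")
print(pairs)  # [{'speech': 'utt-001', 'target_text': 'hallo welt'}]
```

The resulting pairs are what the single neural transducer is trained on; scaling this conversion to 351,000 hours across 25 languages is what gives SM² its coverage without any human-labeled ST data.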

Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.


KBFormer: A Diffusion Model for Structured Entity Completion

Deep generative models include large language models (LLMs) for text, as well as models for other modalities, such as vision and audio. In a recent paper: KBFormer: A Diffusion Model for Structured Entity Completion, researchers from Microsoft and external colleagues explore generative modeling of structured entities with heterogeneous properties, such as numerical, categorical, string, and composite. This includes entries in rich knowledge bases (KBs), items in product or scientific catalogs, and ontologies like the periodic table of elements and the various properties of isotopes.

Their approach handles such heterogeneous data through a mixed continuous-discrete diffusion process over the properties, using a flexible framework that can model entities with arbitrary hierarchical properties. Using this approach, the researchers obtain state-of-the-art performance on a majority of cases across 15 datasets. In addition, experiments with a device KB and a nuclear physics dataset demonstrate the model’s ability to learn representations useful for entity completion in diverse settings. This has many downstream use cases, including modeling numerical properties with high accuracy – critical for science applications, which also benefit from the model’s inherent probabilistic nature.

A Framework for Exploring the Consequences of AI-Mediated Enterprise Knowledge Access and Identifying Risks to Workers

People are increasingly interacting with, and being affected by, the deployment of AI systems in the workplace. This is a pressing matter for system designers, policy-makers, and workers themselves, which researchers from Microsoft address in a recent paper: A Framework for Exploring the Consequences of AI-Mediated Enterprise Knowledge Access and Identifying Risks to Workers.  

Organizations generate huge amounts of information that raise challenges associated with the maintenance, dissemination, and discovery of organizational knowledge. Recent developments in AI, notably large language models (LLMs), present a shift in what is possible in this domain. Recent advances could enable more extensive mining, knowledge synthesis, and natural language interaction in relation to knowledge.  

The researchers propose the Consequence-Mechanism-Risk Framework to identify risks to workers associated with deploying AI-mediated enterprise knowledge access systems. The goal is to support those involved in the design and/or deployment of such systems to identify the risks they introduce, the specific system mechanisms that introduce those risks, and the actionable levers to reduce those risks.

Large Search Model: Redefining Search Stack in the Era of LLMs

Modern search engines are built on a stack of different components, including query understanding, retrieval, multi-stage ranking, and question answering, among others. These components are often optimized and deployed independently. In a recent paper: Large Search Model: Redefining Search Stack in the Era of LLMs, researchers from Microsoft introduce a novel conceptual framework called large search model, which redefines the conventional search stack by unifying search tasks with one large language model (LLM). All tasks are formulated as autoregressive text generation problems, allowing for the customization of tasks through the use of natural language prompts. This proposed framework capitalizes on the strong language understanding and reasoning capabilities of LLMs, offering the potential to enhance search result quality while simplifying the cumbersome search stack. To substantiate the feasibility of this framework, the researchers present a series of proof-of-concept experiments and discuss the potential challenges associated with implementing this approach within real-world search systems.
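The unification described above can be sketched by casting each search-stack component as prompted text generation against a single model. The prompt wording and the `llm` callable below are illustrative assumptions, not the paper's actual prompts or model.

```python
# Sketch of the "large search model" framing: query understanding, ranking,
# and question answering are all expressed as autoregressive text generation,
# customized only by natural language prompts to one LLM.

PROMPTS = {
    "query_understanding": (
        "Rewrite the user query to capture its intent.\nQuery: {query}\nRewritten:"
    ),
    "ranking": (
        "Rank the documents by relevance to the query.\n"
        "Query: {query}\nDocuments:\n{docs}\nRanking:"
    ),
    "question_answering": (
        "Answer the question using the documents.\n"
        "Question: {query}\nDocuments:\n{docs}\nAnswer:"
    ),
}

def run_task(llm, task, query, docs=""):
    """Every task reduces to one generation call with a task-specific prompt."""
    prompt = PROMPTS[task].format(query=query, docs=docs)
    return llm(prompt)

# Trivial stand-in "model" that echoes the last prompt line, so the sketch runs.
echo_llm = lambda prompt: prompt.splitlines()[-1]
out = run_task(echo_llm, "query_understanding", "best pizza near me")
```

Swapping the separately optimized components of a conventional stack for one model in this way is what the researchers argue could both improve result quality and simplify the stack, subject to the real-world challenges they discuss.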

The post Research Focus: Week of January 8, 2024 appeared first on Microsoft Research.


Splitwise improves GPU usage by splitting LLM inference phases


The recent surge in large language model (LLM) use is causing significant challenges for cloud providers, requiring them to deploy more GPUs at an unprecedented rate. However, the capacity to provision the power needed to run these GPUs is limited, and with demand for computation surpassing supply, it is not uncommon for user queries to be denied. Therefore, any approach to making the existing infrastructure more efficient—enabling it to serve more queries faster under the same power budget—can have very tangible benefits to both cloud providers and users.

One aspect of LLM inference that currently limits efficient use of resources is that it has two distinct phases with different characteristics: the prompt phase and the token-generation phase. During the prompt phase, LLMs process all user input, or prompts, in parallel, efficiently utilizing GPU compute. However, during the token-generation phase, LLMs generate each output token sequentially and are limited by GPU memory bandwidth. Even when employing state-of-the-art batching mechanisms, the discrepancy between these two phases results in low overall hardware utilization, leading to much higher costs when offering LLMs to users. Figure 1 illustrates the differences between these two phases.

Figure 1. An example of the generative LLM inference process and the two phases associated with it, for the initial prompt “Which is better, pizza or burger?”. The prompt phase processes all input tokens in parallel to generate the first output token (“Pizza”); it is compute-intensive but a smaller part of the end-to-end latency. The token-generation phase produces each subsequent token (“is”, “better”, “.”) serially; it is memory-intensive and tends to make up the majority of the end-to-end latency.
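The two phases can be made concrete with a toy generation loop. This is purely illustrative pseudocode-style Python: there is no real model here, only a counter showing that the prompt (prefill) phase takes a single parallel pass over all input tokens, while the token-generation phase runs once per additional output token.

```python
# Toy illustration of the two LLM inference phases: one parallel prefill
# pass over the whole prompt produces the first token, then each further
# token requires its own sequential pass (which is why the token phase is
# memory-bandwidth-bound and dominates end-to-end latency).

def generate(prompt_tokens, n_new_tokens):
    passes = {"prompt": 0, "token": 0}

    # Prompt phase: process the entire prompt in parallel -> first output token.
    passes["prompt"] += 1
    kv_cache = list(prompt_tokens)  # attention state built during prefill
    output = ["tok0"]

    # Token-generation phase: one sequential pass per additional token.
    for i in range(1, n_new_tokens):
        passes["token"] += 1
        kv_cache.append(output[-1])  # each new token extends the KV-cache
        output.append(f"tok{i}")
    return output, passes

out, passes = generate(
    ["Which", "is", "better", ",", "pizza", "or", "burger", "?"], 4
)
print(passes)  # {'prompt': 1, 'token': 3}
```

The asymmetry in the pass counts is the discrepancy Splitwise exploits: the single compute-heavy prefill pass and the many bandwidth-bound decode passes have very different hardware demands.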

Splitting the phases with Splitwise

At Azure Research – Systems, we tackled this by creating Splitwise, a technique designed to optimally utilize available hardware by separating the prompt computation and token-generation phases onto separate machines. This approach is underpinned by the insight that prompt processing and token-generation are distinct in their computational, memory, and power requirements. By separating these two phases, we can enhance hardware utilization during both phases. Our paper, “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” details our methods for developing and testing this technique, including an exploration of how different types of GPUs perform during each phase.   

To create a sustainable approach for GPU provisioning, we used Splitwise to design GPU clusters with three primary objectives: maximizing throughput, minimizing costs, and reducing power. In addition to separating the two LLM inference phases into two distinct machine pools, we include a third machine pool for mixed batching across the prompt and token phases, sized dynamically based on real-time computational demands. Lastly, we transferred the state context (i.e., KV-cache in the LLM transformer attention layers) from the prompt to the token machines over InfiniBand without any perceivable latency impact to the user. This high-level system architecture is illustrated in Figure 2.
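The pool-based design above can be sketched as a small scheduler: a request runs its prompt phase on one machine pool, then its KV-cache is handed off to a machine in the token pool. The class, machine names, and the modulo-hash placement policy below are all illustrative assumptions, not Splitwise's actual scheduling logic.

```python
# Hedged sketch of the Splitwise machine-pool idea: dedicated prompt and
# token pools, with the KV-cache built during the prompt phase transferred
# to a token machine (over InfiniBand in the real system) for generation.

class Cluster:
    def __init__(self, prompt_machines, token_machines):
        self.prompt_pool = list(prompt_machines)
        self.token_pool = list(token_machines)
        self.kv_transfers = 0

    def serve(self, request):
        # Prompt phase on a prompt machine builds the request's KV-cache.
        prompt_m = self.prompt_pool[hash(request) % len(self.prompt_pool)]
        kv_cache = f"kv({request})@{prompt_m}"

        # Hand the KV-cache to a token machine for the generation phase.
        token_m = self.token_pool[hash(request) % len(self.token_pool)]
        self.kv_transfers += 1
        return {"prompt_machine": prompt_m, "token_machine": token_m, "kv": kv_cache}

cluster = Cluster(["p0", "p1"], ["t0", "t1", "t2"])
result = cluster.serve("Which is better, pizza or burger?")
```

A real deployment would also size a third, mixed-batching pool dynamically from runtime demand, as the article describes; that logic is omitted here for brevity.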

Figure 2. A high-level diagram of the Splitwise architecture. Machines maintained in different pools are dedicated to the two distinct LLM inference phases. The mixed pool grows and reduces according to runtime demand. KV-cache encompassing the state of the query after the prompt phase is transferred from the prompt machines to the token machines over InfiniBand with very low latency.


Tests show Splitwise maximizes throughput while lowering costs

To evaluate its performance, we used Splitwise to design clusters with different types of GPUs, including NVIDIA DGX-A100 and DGX-H100, while optimizing cost, power, and throughput under specific latency service level agreements (SLAs) for each query. Table 1 shows the machine types we used for each cluster design. Our application of Splitwise encompassed two use cases: code and conversation using the Llama-2-70B (opens in new tab) and BLOOM-176B (opens in new tab) LLMs.

Table 1. Details for the prompt and token machines we used for each cluster design, evaluated with Splitwise. All values are normalized to a baseline of DGX-A100. DGX-H100 capped is a system with all GPUs power-capped to half the maximum power.

Our findings demonstrate that Splitwise achieves all three goals: maximizing throughput, minimizing costs, and reducing power. In our evaluation, the Splitwise cluster design delivered higher throughput at the same cost as an A100 baseline cluster, and much higher throughput while operating within the same provisioned power constraints as the baseline cluster. Figure 3 shows that, compared with Baseline-H100, we can achieve 1.4x higher throughput at 20 percent lower cost, or alternatively 2.35x more throughput with the same cost and power budgets.
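
As a quick sanity check on those relative numbers (both normalized to Baseline-H100), throughput per unit cost can be computed directly:

```python
# Splitwise vs. Baseline-H100: 1.4x throughput at 20 percent lower (0.8x) cost.
throughput_gain = 1.4
cost_ratio = 0.8
perf_per_dollar = throughput_gain / cost_ratio
print(f"{perf_per_dollar:.2f}x throughput per unit cost")  # 1.75x
```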

Figure 3. Results from baseline and Splitwise clusters optimized for throughput, all with the same power constraints. Splitwise-HH requires the fewest machines, Splitwise-HHcap provides the best throughput, and Splitwise-AA is the cheapest option.

Looking forward

Splitwise marks a leap toward efficient, high-performance LLM deployments. By separating the prompt and token phases, we can unlock new potential in GPU use. Looking forward, we at Microsoft Azure envision tailored machine pools driving maximum throughput, reduced costs, and power efficiency, and we will continue to focus on making LLM inference efficient and sustainable.

Our approach is now part of vLLM (opens in new tab) and can also be implemented with other frameworks.

Acknowledgements

This work was done in collaboration with our intern, Pratyush Patel from the University of Washington. We also appreciate the help and guidance of Suriya Kalivardhan, Gopi Kumar, and Chetan Bansal.

The post Splitwise improves GPU usage by splitting LLM inference phases appeared first on Microsoft Research.

Research at Microsoft 2023: A year of groundbreaking AI advances and discoveries

It isn’t often that researchers at the cutting edge of technology see something that blows their minds. But that’s exactly what happened in 2023, when AI experts began interacting with GPT-4, a large language model (LLM) created by researchers at OpenAI that was trained at unprecedented scale. 

“I saw some mind-blowing capabilities that I thought I wouldn’t see for many years,” said Ece Kamar, partner research manager at Microsoft, during a podcast recorded in April.

Throughout the year, rapid advances in AI came to dominate the public conversation (opens in new tab), as technology leaders and eventually the general public voiced a mix of wonder and skepticism after experimenting with GPT-4 and related applications. Could we be seeing sparks of artificial general intelligence (opens in new tab)—informally defined as AI systems that “demonstrate broad capabilities of intelligence, including reasoning, planning, and the ability to learn from experience (opens in new tab)”? 

While the answer to that question isn’t yet clear, we have certainly entered the era of AI, and it’s bringing profound changes to the way we work and live. In 2023, AI emerged from the lab and delivered everyday innovations that anyone can use. Millions of people now engage with AI-based services like ChatGPT. Copilots (opens in new tab)—AI that helps with complex tasks ranging from search to security—are being woven into business software and services.

Underpinning all of this innovation is years of research, including the work of hundreds of world-class researchers at Microsoft, aided by scientists, engineers, and experts across many related fields. In 2023, AI’s transition from research to reality began to accelerate, creating more tangible results than ever before. This post looks back at the progress of the past year, highlighting a sampling of the research and strategies that will support even greater progress in 2024.

Strengthening the foundations of AI

AI with positive societal impact is the sum of several integral moving parts, including the AI models, the application of these models, and the infrastructure and standards supporting their development and the development of the larger systems they underpin. Microsoft is redefining the state of the art across these areas with improvements to model efficiency, performance, and capability; the introduction of new frameworks and prompting strategies that increase the usability of models; and best practices that contribute to sustainable and responsible AI. 

Advancing models 

  • Researchers introduced Retentive Networks (RetNet), an alternative to the dominant transformer architecture in language modeling. RetNet supports training parallelism and strong performance while making significant gains in inference efficiency. 
  • To contribute to more computationally efficient and sustainable language models, researchers presented a 1-bit transformer architecture called BitNet.
  • Microsoft expanded its Phi family of small language models with the 2.7 billion-parameter Phi-2, which raises the bar in reasoning and language understanding among base models with up to 13 billion parameters. Phi-2 also met or exceeded the performance of models 25 times its size on complex benchmarks.
  • The release of the language models Orca (13 billion parameters) and, several months later, Orca 2 (7 billion and 13 billion parameters) demonstrates how improved training methods, such as synthetic data creation, can elevate small model reasoning to a level on par with larger models.
  • For AI experiences that more closely reflect how people create across mediums, Composable Diffusion (CoDi) takes as input a mix of modalities, such as text, audio, and image, and produces multimodal output, such as video with synchronized audio.
  • To better model human reasoning and speed up response time, the new approach Skeleton-of-Thought has LLMs break tasks down into two parts—creating an outline of a response and providing details on each point in parallel.
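
The Skeleton-of-Thought decomposition in the last bullet can be sketched with a stub model. `fake_llm` below stands in for a real LLM call (its prompts and canned replies are purely illustrative), and the parallel expansion uses a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns canned text for the demo.
    if prompt.startswith("Outline:"):
        return "1. Define terms\n2. Compare options\n3. Conclude"
    return f"Details for [{prompt}]"

def skeleton_of_thought(question: str) -> list:
    # Phase 1: one sequential call produces a short outline (the "skeleton").
    outline = fake_llm(f"Outline: {question}").splitlines()
    # Phase 2: each point is expanded independently, so the calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fake_llm, outline))

answer = skeleton_of_thought("Which is better, pizza or burger?")
print(len(answer))  # 3 expanded points
```

Because the per-point expansions do not depend on one another, wall-clock latency is roughly the time of the outline call plus one expansion call, rather than the sum of all of them.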

Advancing methods for model usage

  • AutoGen is an open-source framework for simplifying the orchestration, optimization, and automation of LLM workflows to enable and streamline the creation of LLM-based applications.
  • Medprompt, a composition of prompting strategies, demonstrates that with thoughtful and advanced prompting alone, general foundation models can outperform specialized models, offering a more efficient and accessible alternative to fine-tuning on expert-curated data.
  • The resource collection promptbase offers prompting techniques and tools designed to help optimize foundation model performance, including Medprompt, which has been extended for application outside of medicine.
  • Aimed at addressing issues associated with lengthy inputs, such as increased response latency, LLMLingua is a prompt-compression method that leverages small language models to remove unnecessary tokens.
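
A toy version of the prompt-compression idea in the last bullet: score each token's informativeness and drop the lowest-scoring ones. LLMLingua uses a small language model's perplexity for the score; the stopword-based scorer here is only a stand-in to show the shape of the procedure.

```python
def compress_prompt(prompt: str, keep_ratio: float = 0.6) -> str:
    # Toy scorer: common function words carry little information.
    low_info = {"the", "a", "an", "of", "to", "is", "are", "that", "and", "in"}
    tokens = prompt.split()
    scored = [(0 if t.lower() in low_info else 1, i, t) for i, t in enumerate(tokens)]
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Keep the highest-scoring tokens, then restore the original order.
    kept = sorted(sorted(scored, reverse=True)[:n_keep], key=lambda x: x[1])
    return " ".join(t for _, _, t in kept)

short = compress_prompt("Summarize the main findings of the report in a single paragraph")
print(short)  # Summarize main findings report single paragraph
```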

Developing and sharing best practices 

Accelerating scientific exploration and discovery

Microsoft uses AI and other advanced technologies to accelerate and transform scientific discovery, empowering researchers worldwide with leading-edge tools. Across global Microsoft research labs, experts in machine learning, quantum physics, molecular biology, and many other disciplines are tackling pressing challenges in the natural and life sciences.

  • Because of the complexities arising from multiple variables and the inherently chaotic nature of weather, Microsoft is using machine learning to enhance the accuracy of subseasonal forecasts.
  • Distributional Graphormer (DIG) is a deep learning framework for predicting protein structures with greater accuracy, a fundamental problem in molecular science. This advance could help deliver breakthroughs in critical research areas like materials science and drug discovery.
  • Leveraging evolutionary-scale protein data, the general-purpose diffusion framework EvoDiff helps design novel proteins more efficiently, which can aid in the development of industrial enzymes, including for therapeutics.
  • MOFDiff, a coarse-grained diffusion model, helps scientists refine the design of new metal-organic frameworks (MOFs) for the low-cost removal of carbon dioxide from air and other dilute gas streams. This innovation could play a vital role in slowing climate change.
  • This episode of the Microsoft Research Podcast series Collaborators explores research into renewable energy storage systems, specifically flow batteries, and discusses how machine learning can help identify compounds ideal for storing renewable energy and advancing carbon capture.
  • MatterGen is a diffusion model specifically designed to address the central challenge in materials science by efficiently generating novel, stable materials with desired properties, such as high conductivity for lithium-ion batteries.
  • Deep learning is poised to revolutionize the natural sciences, enhancing modeling and prediction of natural occurrences, ushering in a new era of scientific exploration, and leading to significant advances in sectors ranging from drug development to renewable energy. DeepSpeed4Science, a new Microsoft initiative, aims to build unique capabilities through AI system technology innovations to help domain experts unlock today’s biggest science mysteries. 
  • Christopher Bishop, Microsoft technical fellow and director of the AI4Science team, recently published Deep Learning: Foundations and Concepts, a book that “offers a comprehensive introduction to the ideas that underpin deep learning.” Bishop discussed the motivation and process behind the book, as well as deep learning’s impact on the natural sciences, in the AI Frontiers podcast series.

Maximizing the individual and societal benefits of AI

As AI models grow in capability, so, too, do opportunities to empower people to achieve more, as demonstrated by Microsoft’s work in domains such as health and education this year. The company’s commitment to positive human impact requires that AI technology be equitable and accessible.

Beyond AI: Leading technology innovation

While AI rightly garners much attention in the current research landscape, researchers at Microsoft are still making plenty of progress across a spectrum of technical focus areas.

  • Project Silica, a cloud-based storage system underpinned by quartz glass, is designed to provide sustainable and durable archival storage that’s theoretically capable of lasting thousands of years.
  • Project Analog Iterative Machine (AIM) aims to solve difficult optimization problems—crucial across industries such as finance, logistics, transportation, energy, healthcare, and manufacturing—in a timely, energy-efficient, and cost-effective manner. Its designers believe Project AIM could outperform even the most powerful digital computers.
  • Microsoft researchers proved that 3D telemedicine (3DTM), using Holoportation™ communication technology, could help improve healthcare delivery, even across continents, in a unique collaboration with doctors and governments in Scotland and Ghana.
  • In another collaboration that aims to help improve precision medicine, Microsoft worked with industry and academic colleagues to release Terra, a secure, centralized, cloud-based platform for biomedical research on Microsoft Azure.
  • On the hardware front, Microsoft researchers are exploring sensor-enhanced headphones, outfitting them with controls that use head orientation and hand gestures to enable context-aware privacy, gestural audio-visual control, and animated avatars derived from natural body language.

Collaborating across academia, industries, and disciplines

Cross-company and cross-disciplinary collaboration has always played an important role in research and even more so as AI continues to rapidly advance. Large models driving the progress are components of larger systems that will deliver the value of AI to people. Developing these systems and the frameworks for determining their roles in people’s lives and society requires the knowledge and experience of those who understand the context in which they’ll operate—domain experts, academics, the individuals using these systems, and others.

Engaging and supporting the larger research community

Throughout the year, Microsoft continued to engage with the broader research community on AI and beyond. The company’s sponsorship of and participation in key conferences not only showcased its dedication to the application of AI in diverse technological domains but also underscored its unwavering support for cutting-edge advancements and collaborative community involvement.

Functional programming

  • Microsoft was a proud sponsor of ICFP 2023, with research contributions covering a range of functional programming topics, including memory optimization, language design, and software-development techniques.

Human-computer interaction

  • At CHI 2023, Microsoft researchers and their collaborators demonstrated the myriad and diverse ways people use computing today and will in the future. 

Large language models and ML

  • Microsoft was a sponsor of ACL 2023, showcasing papers ranging from fairness in language models to natural language generation and beyond.
  • Microsoft also sponsored NeurIPS 2023, publishing over 100 papers and conducting workshops on language models, deep learning techniques, and additional concepts, methods, and applications addressing pressing issues in the field.
  • With its sponsorship of and contribution to ICML 2023, Microsoft showcased its investment in advancing the field of machine learning.
  • Microsoft sponsored ML4H (opens in new tab) and participated in AfriCHI (opens in new tab) and EMNLP (opens in new tab), a leading conference in natural language processing and AI, highlighting its commitment to exploring how LLMs can be applied to healthcare and other vital domains.

Systems and advanced networking

Listeners’ choice: Notable podcasts for 2023

Thank you for reading

Microsoft achieved extraordinary milestones in 2023 and will continue pushing the boundaries of innovation to help shape a future where technology serves humanity in remarkable ways. To stay abreast of the latest updates, subscribe to the Microsoft Research Newsletter (opens in new tab) and the Microsoft Research Podcast (opens in new tab). You can also follow us on Facebook (opens in new tab), Instagram (opens in new tab), LinkedIn (opens in new tab), X (opens in new tab), and YouTube (opens in new tab). 

Writers, Editors, and Producers
Kristina Dodge
Kate Forster
Jessica Gartner
Alyssa Hughes
Gretchen Huizinga
Brenda Potts
Chris Stetkiewicz
Larry West

Managing Editor
Amber Tingle

Project Manager
Amanda Melfi

Microsoft Research Global Design Lead
Neeltje Berger

Graphic Designers
Adam Blythe
Harley Weber

Microsoft Research Creative Studio Lead
Matt Corwine

The post Research at Microsoft 2023: A year of groundbreaking AI advances and discoveries appeared first on Microsoft Research.

Research Focus: Week of December 18, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus
December 18, 2023

NASerEx: Optimizing Early Exits via AutoML for Scalable Efficient Inference in Big Image Streams

Deep neural networks (DNNs) are essentially stacked transformation functions (layers) that generate progressively complex features/encodings. This makes them universal approximators and allows for unprecedented success in complex tasks. This inferential effectiveness comes at the cost of increased computational complexity, making DNNs hard to scale for operational efficiency in AI applications, especially when running on resource-constrained hardware.

In a recent paper: NASerEx: Optimizing Early Exits via AutoML for Scalable Efficient Inference in Big Image Streams, researchers from Microsoft and their collaborators propose a new framework to address this problem. NASerEx leverages neural architecture search (NAS) with a novel saliency-constrained search space and exit decision metric to learn early exit structures that augment deep neural models for scalable, efficient inference on big image streams. The optimized exit-augmented models, with the power of smart adaptive inference, run ~2.5x faster with ~4x lower aggregated effective FLOPs and no significant accuracy loss.
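
The early-exit idea itself is easy to sketch: attach a lightweight classifier after intermediate layers and stop as soon as its confidence clears a threshold. The layers and exit heads below are stubs; NASerEx's actual contribution is *searching* for where to place the exits and how to make the exit decision, which this sketch simply hard-codes.

```python
def early_exit_inference(x, layers, exit_heads, threshold=0.9):
    """Run layers in order; stop as soon as an exit head is confident enough."""
    last = ("unknown", 0.0)
    for depth, layer in enumerate(layers):
        x = layer(x)
        head = exit_heads.get(depth)
        if head is not None:
            last = head(x)
            if last[1] >= threshold:
                return last[0], depth + 1  # layers actually executed
    return last[0], len(layers)

# Stub network: each "layer" nudges a score upward; exit heads sit after
# layers 1 and 3 and report (label, confidence).
layers = [lambda v: v + 0.3] * 4
exit_heads = {1: lambda v: ("cat", v), 3: lambda v: ("cat", v)}

label, depth = early_exit_inference(0.4, layers, exit_heads)
print(label, depth)  # an easy input exits after 2 of the 4 layers
```

Easy inputs exit early and skip the remaining layers entirely, which is where the FLOP savings come from; hard inputs still traverse the full network.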


InsightPilot: An LLM-Empowered Automated Data Exploration System

Effective data exploration requires in-depth knowledge of the dataset and the user intent, and expertise in data analysis techniques. Not being familiar with either can create obstacles that make the process time-consuming and overwhelming.

In a recent paper, InsightPilot: An LLM-Empowered Automated Data Exploration System, researchers from Microsoft address this issue. InsightPilot is a large language model (LLM)-based, automated system designed to simplify the data exploration process. It features a set of carefully designed analysis actions that streamline the data exploration process. Given a natural language question, InsightPilot collaborates with the LLM to issue a sequence of analysis actions, explore the data, and generate insights. The authors demonstrate the effectiveness of InsightPilot in a user study and a case study, showing how it can help users gain valuable insights from their datasets. 
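
The action-issuing loop can be sketched as follows. The three analysis actions and the stub "LLM" that picks among them are hypothetical simplifications of InsightPilot's carefully designed action set, and the dataset is a toy:

```python
data = [12, 15, 14, 90, 13]  # toy dataset with one obvious outlier

actions = {
    "summarize": lambda d: f"mean={sum(d) / len(d):.1f}, n={len(d)}",
    "find_extremes": lambda d: f"min={min(d)}, max={max(d)}",
    "detect_outliers": lambda d: f"outliers={[x for x in d if x > 2 * sorted(d)[len(d) // 2]]}",
}

def stub_llm_choose(question: str, insights: list) -> str:
    # Stand-in for the LLM deciding the next analysis action; here it just
    # walks a fixed plan and stops after three steps.
    plan = ["summarize", "find_extremes", "detect_outliers"]
    return plan[len(insights)] if len(insights) < len(plan) else "stop"

def explore(question: str) -> list:
    insights = []
    while (action := stub_llm_choose(question, insights)) != "stop":
        insights.append(f"{action}: {actions[action](data)}")
    return insights

for line in explore("Is anything unusual in this data?"):
    print(line)
```

The essential shape matches the paper's description: a natural-language question drives a sequence of analysis actions, and each action's result feeds back into the choice of the next one.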


Boosting Cloud Efficiency: Harnessing Data-Driven Decision-Making and Optimization Techniques

Microsoft’s cloud system serves as the backbone for the daily operations of hundreds of thousands of organizations, driving productivity and collaboration. The foundational infrastructure demands both high reliability and efficiency. In a new blog post, Microsoft’s Systems Innovation team explores some recent innovations to continually enhance hyper-scale cloud capacity efficiency, delivering substantial operational cost savings for customers.

Systems Innovation is a collaboration between Microsoft 365, Microsoft Research and Azure. The research group is focused on leveraging their shared deep workload understanding and combining algorithmic research with AI/machine learning techniques and hardware innovation to improve operational reliability and efficiency.


NeurIPS Large Language Model Efficiency Challenge

Large language models (LLMs) trained on large bodies of text can solve tasks with few supervised examples. These few-shot models have shown state-of-the-art success across natural language processing (NLP) tasks, language translation, standardized exams, and coding challenges, as well as in subjective domains such as chatbots. All of these domains involve bootstrapping a single LLM, referred to as a foundation model, with examples of specific knowledge from the associated task.

The process of updating a model with limited domain-specific data is known as fine-tuning. However, the costs of accessing, fine-tuning and querying foundation models to perform new tasks can be large.

To help democratize access to language models, Microsoft and other industry leaders were pleased to sponsor the NeurIPS Large Language Model Efficiency Challenge (opens in new tab), which addressed three major issues:

  1. A lack of transparency around model training methods, which leaves the majority of models unreproducible.
  2. The absence of a standard benchmark for evaluating these models side by side.
  3. Insufficient access to dedicated hardware, which prevents widespread availability and use of these models.

The challenge to the community was to adapt a foundation model to specific tasks by fine-tuning on a single GPU (an NVIDIA RTX 4090 or an A100 with 40 GB) within a 24-hour time frame, while maintaining high accuracy on the desired tasks.
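
Fitting a fine-tune into a single GPU and a one-day budget is typically done with parameter-efficient methods. The challenge did not mandate any particular technique, but low-rank adaptation (LoRA) illustrates the arithmetic: instead of updating a d×d weight matrix (d² trainable parameters), one trains two small matrices B (d×r) and A (r×d) and applies W + B @ A.

```python
def lora_param_counts(d: int, r: int):
    """Trainable parameters: full update vs. a rank-r LoRA update W + B @ A."""
    full = d * d
    lora = d * r + r * d  # B is d x r, A is r x d
    return full, lora

full, lora = lora_param_counts(d=4096, r=16)
print(full, lora, f"{full // lora}x fewer trainable parameters")  # 128x fewer
```

At a typical hidden size of 4096 and rank 16, the trainable-parameter count drops by two orders of magnitude, which is what makes single-GPU fine-tuning of large models feasible.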

Each submission was evaluated for accuracy and computational performance tradeoffs at commodity hardware scales. Insights and lessons were distilled into a set of well-documented steps and easy-to-follow tutorials, showing the machine learning community how to achieve the same performance as the winning entries and serving as a starting point for building their own LLM solutions.

The post Research Focus: Week of December 18, 2023 appeared first on Microsoft Research.

AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come. 

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity. 

This episode features Technical Fellow Christopher Bishop (opens in new tab), who leads a global team of researchers and engineers working to help accelerate scientific discovery by merging machine learning and the natural sciences. Llorens and Bishop explore the state of deep learning; Bishop’s new textbook, Deep Learning: Foundations and Concepts (opens in new tab), his third and a writing collaboration with his son; and a potential future in which “super copilots” accessible via natural language and drawing on a variety of tools, like those that can simulate the fundamental equations of nature, are empowering scientists in their pursuit of breakthroughs.

Chris Bishop with son and coauthor Hugh Bishop

Transcript

[MUSIC PLAYS] 

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more excited to work in the field than right now. The latest foundation models and the systems we’re building around them are exhibiting surprising new abilities in reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Chris Bishop. Chris was educated as a physicist but has spent more than 25 years as a leader in the field of machine learning. Chris directs our AI4Science organization, which brings together experts in machine learning and across the natural sciences with the aim of revolutionizing scientific discovery.


[MUSIC FADES] 

So, Chris, you have recently published a new textbook on deep learning, maybe the new definitive textbook on deep learning. Time will tell. So, of course, I want to get into that. But first, I’d like to dive right into a few philosophical questions. In the preface of the book, you make reference to the massive scale of state-of-the-art language models, generative models comprising on the order of a trillion learnable parameters. How well do you think we understand what a system at that scale is actually learning? 

CHRIS BISHOP: That’s a super interesting question, Ashley. So in one sense, of course, we understand the systems extremely well because we designed them; we built them. But what’s very interesting about machine learning technology compared to most other technologies is that the, the functionality in large part is learned, is learned from data. And what we discover in particular with these very large language models is, kind of, emergent behavior. As we go up at each factor of 10 in scale, we see qualitatively new properties and capabilities emerging. And that’s super interesting. That, that was called the scaling hypothesis. And it’s proven to be remarkably successful. 

LLORENS: Your new book lays out foundations in statistics and probability theory for modern machine learning. Central to those foundations is the concept of probability distributions, in particular learning distributions in the service of helping a machine perform a useful task. For example, if the task is object recognition, we may seek to learn the distribution of pixels you’d expect to see in images corresponding to objects of interest, like a teddy bear or a racecar. On smaller scales, we can at least conceive of the distributions that machines are learning. What does it mean to learn a distribution at the scale of a trillion learnable parameters? 

BISHOP: Right. That’s really interesting. So, so first of all, the fundamentals are very solid. The fact that we have this, this, sort of, foundational rock of probability theory on which everything is built is extremely powerful. But then these emergent properties that we talked about are the result of extremely complex statistics. What’s really interesting about these neural networks, let’s say, in comparison with the human brain is that we can perform perfect diagnostics on them. We can understand exactly what each neuron is doing at each moment of time. And, and so we can almost treat the system in a, in a, sort of, somewhat experimental way. We can, we can probe the system. You can apply different inputs and see how different units respond. You can play games like looking at a unit that responds to a particular input and then perhaps amplifying the, amplifying that response, adjusting the input to make that response stronger, seeing what effect it has, and so on. So there’s an aspect of machine learning these days that’s somewhat like experimental neurobiology, except with the big advantage that we have sort of perfect diagnostics. 

LLORENS: Another concept that is key in machine learning is generalization. In more specialized systems, often smaller systems, we can actually conceive of what we might mean by generalizing. In the object recognition example I used earlier, we may want to train an AI model capable of recognizing any arbitrary image of a teddy bear. Because this is a specialized task, it is easy to grasp what we mean by generalization. But what does generalization mean in our current era of large-scale AI models and systems?

BISHOP: Right. Well, generalization is a fundamental property, of course. If we couldn’t generalize, there’d be no point in building these systems. And again, these, these foundational principles apply equally at a very large scale as they do at a, at a smaller scale. But the concept of generalization really has to do with modeling the distribution from which the data is generated. So if you think about a large language model, it’s trained by predicting the next word or predicting the next token. But really what we’re doing is, is creating a task for the model that forces it to learn the underlying distribution. Now, that distribution may be extremely complex, let’s say, in the case of natural language. It can convey a tremendous amount of meaning. So, really, the system is forced to … in order to get the best possible performance, in order to make the best prediction for the next word, if you like, it’s forced to effectively understand the meaning of the content of the data. In the case of language, the meaning of effectively what’s being said. And so from a mathematical point of view, there’s a very close relationship between learning this probability distribution and the problem of data compression, because it turns out if you want to compress data in a lossless way, the optimal way to do that is to learn the distribution that generates the data. So that’s,  that’s … we show that in the book, in fact. And so, and the best way to … let’s take the example of images, for instance. If you’ve got a very, very large number of natural images and you had to compress them, the most efficient way to compress them would be to understand the mechanisms by which the images come about. There are objects. You could, you could pick a car or a bicycle or a house. There’s lighting from different angles, shadows, reflections, and so on. 
And learning about those mechanisms—understanding those mechanisms—will give you the best possible compression, but it’ll also give you the best possible generalization. 

LLORENS: Let’s talk briefly about one last fundamental concept—inductive bias. Of course, as you mentioned, AI models are learned from data and experience, and my question for you is, to what extent do the neural architectures underlying those models represent an inductive bias that shapes the learning?

BISHOP: This is a really interesting question, as well, and it sort of reflects the journey that neural nets have been on in the last, you know, 30–35 years since we first started using gradient-based methods to train them. So, so the idea of inductive bias is that, actually, you can only learn from data in the presence of assumptions. There’s, actually, a theorem called the “no free lunch” theorem, which proves this mathematically. And so, to be able to generalize, you have to have data and some sort of assumption, some set of assumptions. Now, if you go back, you know, 30 years, 35 years, when I first got excited about neural nets, we had very simple one– and two–layered neural nets. We had to put a lot of assumptions in. We’d have to code a lot of human expert knowledge into feature extraction, and then the neural net would do a little bit of, the last little bit of work of just mapping that into a, sort of, a linear representation and then, then learning a classifier or whatever it was. And then over the years as we’ve learned to train bigger and richer neural nets, we can allow the data to have more influence and then we can back off a little bit on some of that prior knowledge. And today, when we have models like large-scale transformers with a trillion parameters learned on vast datasets, we’re letting the data do a lot of the heavy lifting. But there always has to be some kind of assumption. So in the case of transformers, there are inductive biases related to the idea of attention. So that’s a, that’s a specific structure that we bake into the transformer, and that turns out to be very, very successful. But there’s always inductive bias somewhere.

LLORENS: Yeah, and I guess with these new, you know, generative pretrained models, there’s also some inductive bias you’re imposing in the inferencing stage, just with your, with the way you prompt the system. 

BISHOP: And, again, this is really interesting. The whole field of deep learning has become incredibly rich in terms of pretraining, transfer learning, the idea of prompting, zero-shot learning. The field has exploded really in the last 10 years—the last five years—not just in terms of the number of people and the scale of investment, number of startups, and so on, but the sort of the richness of ideas and, and, and techniques like, like auto-differentiation, for example, that mean we don’t have to code up all the gradient optimization steps. It allows us to explore a tremendous variety of different architectures very easily, very readily. So it’s become just an amazingly exciting field in the last decade. 

LLORENS: And I guess we’ve, sort of, intellectually pondered here in the first few minutes the current state of the field. But what was it like for you when you first used, you know, a state-of-the-art foundation model? What was that moment like for you?  

BISHOP: Oh, I could remember it clearly. I was very fortunate because I was given, as you were, I think, a very early access to GPT-4, when it was still very secret. And I, I’ve described it as being like the, kind of, the five stages of grief. It’s a, sort of, an emotional experience actually. Like first, for me, it was, like, a, sort of, first encounter with a primitive intelligence compared to human intelligence, but nevertheless, it was … it felt like this is the first time I’ve ever engaged with an intelligence that was sort of human-like and had those first sparks of, of human-level intelligence. And I found myself going through these various stages of, first of all, thinking, no, this is, sort of, a parlor trick. This isn’t real. And then, and then it would do something or say something that would be really quite shocking and profound in terms of its … clearly it was understanding aspects of what was being discussed. And I’d had several rounds of that. And then, then the next, I think, was that real? Did I, did I imagine that? And go back and try again and, no, there really is something here. So, so clearly, we have quite a way to go before we have systems that really match the incredible capabilities of the human brain. But nevertheless, I felt that, you know, after 35 years in the field, here I was encountering the first, the first sparks, the first hints, of real machine intelligence. 

LLORENS: Now let’s get into your book. I believe this is your third textbook. You contributed a text called Neural Networks for Pattern Recognition in ’95 and a second book called Pattern Recognition and Machine Learning in 2006, the latter still being on my own bookshelf. So I think I can hazard a guess here, but what inspired you to start writing this third text?

BISHOP: Well, really, it began with … actually, the story really begins with the COVID pandemic and lockdown. It was 2020. The 2006 Pattern Recognition and Machine Learning book had been very successful, widely adopted, still very widely used even though it predates the, the deep learning revolution, which of course was one of the most exciting things to happen in the field of machine learning. And so it’s long been on my list of things to do, to update the book, to bring it up to date, to include deep learning. And when the, when the pandemic lockdown arose, 2020, I found myself sort of imprisoned, effectively, at home with my family, a very, very happy prison. But I needed a project. And I thought this would be a good time to start to update the book. And my son, Hugh, had just finished his degree in computer science at Durham and was embarking on a master’s degree at Cambridge in machine learning, and we decided to do this as a joint project during, during the lockdown. And we had a tremendous amount of fun together. We quickly realized, though, that the field of deep learning is so, so rich and obviously so important these days that what we really needed was a new book rather than merely, you know, a few extra chapters or an update to a previous book. And so we worked on that pretty hard for nearly a couple of years or so. And then, and then the story took another twist because Hugh got a job at Wayve Technologies in London building deep learning systems for autonomous vehicles. And I started a new team in Microsoft called AI4Science. We both found ourselves extremely busy, and the whole project, kind of, got put on the back burner. And then along came GPT and ChatGPT, and that, sort of, exploded into the world’s consciousness. And we realized that if ever there was a time to finish off a textbook on deep learning, this was the moment. 
And so the last year has really been absolutely flat out getting this ready, in fact, ready in time for launch at NeurIPS this year. 

LLORENS: Yeah, you know, it’s not every day you get to do something like write a textbook with your son. What was that experience like for you? 

BISHOP: It was absolutely fabulous. And, and I hope it was good fun for Hugh, as well. You know, one of the nice things was that it was a, kind of, a pure collaboration. There was no divergence of agendas or any sense of competition. It was just pure collaboration. The two of us working together to try to understand things, try to work out what’s the best way to explain this, and if we couldn’t figure something out, we’d go to the whiteboard together and sketch out some maths and try to understand it together. And it was just tremendous fun. Just a real, a real pleasure, a real honor, I would say. 

LLORENS: One of the motivations that you articulate in the preface of your book is to make the field of deep learning more accessible for newcomers to the field. Which makes me wonder what your sense is of how accessible machine learning actually is today compared to how it was, say, 10 years ago. On the one hand, I personally think that the underlying concepts around transformers and foundation models are actually easier to grasp than the concepts from previous eras of machine learning. Today, we also see a proliferation of helpful packages and toolkits that people can pick up and use. And on the other hand, we’ve seen an explosion in terms of the scale of compute necessary to do research at the frontiers. So net, what’s your concept of how accessible machine learning is today?

BISHOP: I think you’ve hit on some good points there. I would say the field of machine learning has really been through these three eras. The first was the focus on neural networks. The second was when, sort of, neural networks went on the back burner. As you, you hinted there, there was a proliferation of different ideas—Gaussian processes, graphical models, kernel machines, support vector machines, and so on—and the field became very broad. There were many different concepts to, to learn. Now, in a sense, it’s narrowed. The focus really is on deep neural networks. But within that field, there has been an explosion of different architectures and different … and not only in terms of the number of architectures. Just the sheer number of papers published has, has literally exploded. And, and so it can be very daunting, very intimidating, I think, especially for somebody coming into the field afresh. And so really the value proposition of this book is to distill out the, you know, 20 or so foundational ideas and concepts that you really need to understand in order to understand the field. And the hope is that if you’ve really understood the content of the book, you’d be in pretty good shape to pretty much read any, any paper that’s published. In terms of actually using the technology in practice, yes, on the one hand, we have these wonderful packages, and especially the auto-differentiation that I mentioned before is really quite revolutionary. And now you can, you can put things together very, very quickly, a lot of open-source code that you can quickly bolt together and assemble lots of different, lots of different things, try things out very easily. It’s true, though, that if you want to operate at the very cutting edge of large-scale machine learning, that does require resources on a very large scale. So that’s obviously less accessible. But if your goal is to understand the field of machine learning, then, then I hope the book will serve a good purpose there. 
And in one sense, the fact that the packages are so accessible and so easy to use really hides some of the inner workings, I would say, of these, of these systems. And so I think in a way, it’s almost too easy just to train up a neural network on some data without really understanding what’s going on. So, so the book is really about, if you like, the minimum set of things that you need to know about in order to understand the field, not just to, sort of, turn the crank on a package but really understand what’s going on inside. 

LLORENS: One of the things I think you did not set out to do, as you just mentioned, is to create an exhaustive survey of the most recent advancements, which might have been possible, you know, a decade or so ago. How do you personally keep up with the blistering pace of research these days? 

BISHOP: Ah, yes, it’s a, it’s a challenge, of course. So, so my focus these days is on AI4Science, AI for natural science. But that’s also becoming a very large field. But, you know, one of the, one of the wonderful things about being at Microsoft Research is just having fantastic colleagues with tremendous expertise. And so, a lot of what I learn is from, is from colleagues. And we’re often swapping notes on, you know, you should take a look at this paper, did you hear about this idea, and so on, and brainstorming things together. So a lot of it is, you know, just taking time each day to read papers. That’s important. But also, just conversations with, with colleagues. 

LLORENS: OK, you mentioned AI4Science. I do want to get into that. I know it’s an area that you’re passionate about and one that’s become a focus for your career in this moment. And, you know, I think of our work in AI4Science as creating foundation models that are fluent not in human language but in the language of nature. And earlier in this conversation, we talked about distribution. So I want to, kind of, bring you back there. Do you think we can really model all of nature as one wildly complex statistical distribution?

BISHOP: [LAUGHS] Well, that’s, that’s really interesting. I do think I could imagine a future, maybe not too many years down the road, where scientists will engage with the tools of scientific discovery through something like a natural language model. That model will also have understanding of concepts around the structures of molecules and the nature of data, will read scientific literature, and so on, and be able to assemble these ideas together. But it may need to draw upon other kinds of tools. So whether everything will be integrated into one, one overarching tool is less clear to me because there are some aspects of scientific discovery that are being, truly being revolutionized right now by deep learning. For example, our ability to simulate the fundamental equations of nature is being transformed through deep learning, and the nature of that transformation, on the one hand, it leverages, might leverage architectures like diffusion models and large language, large language models, large transformers, and the ability to train on large GPU clusters. But the fundamental goals there are to solve differential equations at a very large scale. And so the kinds of techniques we use there are a little bit different from the ones we’d use in processing natural language, for example. So you could imagine, maybe not too many years in the future, where a scientist will have a, kind of, “super copilot” that they can interact with directly in natural language. And that copilot or system of copilots can itself draw upon various tools. They may be tools that solve the Schrödinger equation to predict the properties of molecules. It might call upon large-scale deep learning emulators that can do a similar thing to the simulators but very, very much more efficiently. 
It might even call upon automated labs, wet labs, that can run experiments and gather data and can help the scientist marshal these resources and make optimal decisions as they go through that iterative scientific discovery process, whether inventing a new battery, electrolyte, or whether discovering a new drug, for example. 

LLORENS: We talked earlier about the “no free lunch” theorem and the concept of inductive bias. What does that look like here in training science foundation models?

BISHOP: Well, it’s really interesting, and maybe I’m a little biased because my background is in physics. I did a PhD in quantum field theory many decades ago. For me, one of the reasons that this is such an exciting field is that, you know, my own career has come full circle. I now get to combine machine learning with physics and chemistry and biology. I think the inductive bias here is, is particularly interesting. If you think about large language models, we don’t have very many, sort of, fundamental rules of language. I mean, the rules of linguistics are really human observations about the structure of language. But neural nets are very good at extracting that, that kind of structure from data. Whereas when we look at physics, we have laws which we believe hold very accurately. For example, conservation of energy or rotational invariance. The energy of a molecule in a vacuum doesn’t depend on its rotation in space, for example. And that kind of inductive bias is very rigorous. We believe that it holds exactly. And so there is … and also, very often, we want to train on data that’s obtained from simulators. So the training data itself is obtained by solving some of those fundamental equations, and that process itself is computationally expensive. So the data can often be in relatively limited supply. So you’re in a regime that’s a little bit different from the large language models. It’s a little bit more like, in a way, machine learning was, you know, 10 to 20 years ago, as you were talking about, where data, data is limited. But now we have these powerful and strong inductive biases, and so there’s, it’s a very rich field of research for how to build in those inductive biases into the machine learning models but in a way that retains computational efficiency. So I personally, actually, find this one of the most exciting frontiers not only of the natural sciences but also of machine learning. 
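[The rigorous kind of inductive bias Bishop describes, such as rotational invariance, can be built into a model by construction. The sketch below is a toy illustration, not any real interatomic potential: an "energy" that depends only on interatomic distances is invariant under rotation automatically, which a quick numerical check confirms.]

```python
import numpy as np

def energy(positions):
    """Toy pairwise energy that depends only on interatomic distances."""
    diff = positions[:, None, :] - positions[None, :, :]
    n = len(positions)
    # add the identity inside the sqrt so self-distances become 1, not 0
    dist = np.sqrt((diff ** 2).sum(-1) + np.eye(n))
    return np.sum(1.0 / dist) - n  # subtract the n trivial self-terms

rng = np.random.default_rng(1)
atoms = rng.standard_normal((5, 3))              # 5 atoms in 3-D space
rot, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
print(np.isclose(energy(atoms), energy(atoms @ rot.T)))  # True
```

Because rotations preserve all pairwise distances, the invariance here holds exactly rather than being learned approximately from data, which is the sense in which these physical inductive biases are more rigorous than the ones in language models.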

LLORENS: Yeah, you know, physics and our understanding of the natural world has come so far, you know, over the last, you know, centuries and decades. And yet our understanding of physics is evolving. It’s an evolving science. And so maybe I’ll ask you somewhat provocatively if baking our current understanding of physics into these models as inductive biases is limiting in some way, perhaps limiting their ability to learn new physics? 

BISHOP: It’s a great question. I think for the kinds of things that we’re particularly interested in, in Microsoft Research, in the AI4Science team, we’re very interested in things that have real-world applicability, things to do with drug discovery, materials design. And there, first of all, we do have a very good understanding of the fundamental equations, essentially Schrödinger equation and fundamental equations of physics, and those inductive biases such as energy conservation. We really do believe they hold very accurately in the domains that we’re interested in. However, there’s a lot of scientific knowledge that is, that represents approximations to that, because you can only really solve these equations exactly for very small systems. And as you start to get to larger, more complex systems, there are, as it were, laws of physics that aren’t, aren’t quite as rigorous, that are somewhat more empirically derived, where there perhaps is scope for learning new kinds of physics. And, certainly, as you get to larger systems, you get, you get emergent properties. So, so conservation of energy doesn’t get violated, but nevertheless, you can have a very interesting new emergent physics. And so it’s, from the point of view of scientific discovery, I think the field is absolutely wide open. If you look at solid-state physics, for example, and device physics, there’s a tremendous amount of exciting new research to be done over the coming decades.

LLORENS: Yeah, you alluded to this. I think maybe it’s worth just double clicking on for a moment because there is this idea of compositionality and emergent properties as you scale up, and I wonder if you could just elaborate on that a little bit. 

BISHOP: Yeah, that’s a good, that’s a good, sort of, picture to have this, sort of, hierarchy of different levels in the way they interact with each other. And at the very deepest level, the level of electrons, you might even more or less directly solve Schrödinger equation or do some very good approximation to that. That quickly becomes infeasible. And as you go up this hierarchy of, effectively, length scales, you have to make more and more approximations in order to be computationally efficient or computationally even practical. But in a sense, the previous levels of the hierarchy can provide you with training data and with validation verification of what you’re doing at the next level. And so the interplay between these different hierarchies is also very, very, very interesting. So at the level of electrons, they govern forces between atoms, which governs the dynamics of atoms. But once you look at larger molecules, you perhaps can’t simulate the behavior of every electron. You have to make some approximations. And then for larger molecules still, you can’t even track the behavior of every atom. You need some sort of coarse graining and so on. And so you have this, this hierarchy of different length scales. But every single one of those length scales is being transformed by deep learning, by our ability to learn from simulations, learn from those fundamental equations, in some cases, learn also from experimental data and build emulators, effectively, systems that can simulate that particular length scale and the physical and biological properties but do so in a way that’s computationally very efficient. So every layer of this hierarchy is currently being transformed, which is just amazingly exciting. 

LLORENS: You alluded to some of the application domains that stand to get disrupted by advancements in AI4Science. What are a couple of the applications that you’re most excited about? 

BISHOP: There are so many, it would be impossible to list them. But let me give you a couple of domains. I mean, the first one is, is healthcare and the ability to design new molecules, whether it’s small-molecule drugs or more protein-based therapies. That, that whole field is rapidly shifting to a much more computational domain, and that should accelerate our ability to develop new therapies, new drugs. The other class of domains has more to do with materials, and there are a lot of … the applications that we’re interested in relate to sustainability, things to do with capturing CO2 from the atmosphere, creating, let’s say, electricity from hydrogen, creating hydrogen from electricity. We need to do things both ways round. Just storing heat as a form of energy storage. Many, many applications relating to sustainability to do with, to do with protecting our water supply, to do with providing green energy, to do with storing and transporting energy. Many, many applications.

LLORENS: And at the core of all those advancements is deep learning as we’ve kind of started. And so maybe as we, as we close, we can, kind of, come back to your book on deep learning. I don’t have the physical book yet, but there’s a spot on my shelf next to your last book that’s waiting for it. But as we close here, maybe you can tell folks where to look for or how to get a copy of your new book. 

BISHOP: Oh, sure. It’s dead easy. You go to bishopbook.com, and from there, you’ll see how to order a hardback copy if that’s what you’d like, or there’s a PDF-based e-book version. There’ll be a Kindle version, I believe. But there’s also a free-to-use online version on bishopbook.com, and it’s available there. It’s, sort of, PDF style and fully hyperlinked, free to use, and I hope people will read it, and enjoy it, and learn from it. 

LLORENS: Thanks for a fascinating discussion, Chris. 

BISHOP: Thanks, Ashley.

The post AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop appeared first on Microsoft Research.