TLA+ Foundation aims to bring math-based software modeling to the mainstream

TLA+ is a high-level, open-source, math-based language for modeling computer programs and systems, especially concurrent and distributed ones. It comes with tools that help eliminate fundamental design errors, which are hard to find and expensive to fix once they have been embedded in code or hardware.

The TLA language was first published in 1993 by the pioneering computer scientist Leslie Lamport, now a distinguished scientist with Microsoft Research. After years of Lamport’s stewardship and Microsoft’s support, TLA+ has found a new home. The TLA+ Foundation is launching this month as part of the Linux Foundation, with Microsoft, Amazon Web Services (AWS), and Oracle serving as founding members to help further refine the tools and spur commercial usage and additional research. 

“The foundation will help spread that work among more hands,” said Lamport. 

TLA+ is just one piece of Lamport’s impressive portfolio. He invented the document preparation system LaTeX and won the 2013 Turing Award for his work to clarify distributed systems, in which several autonomous computers communicate with each other by passing messages. 

Along the way he developed an idea to help programmers build systems more effectively by using algorithmic models to specify how the code should work. It’s the same idea as creating blueprints to guide the construction of a bridge. TLA+ (short for Temporal Logic of Actions) comes with a model checker that checks whether a design satisfying a program’s specification will actually do what it should.

“When programmers write systems, they should start by defining what they are supposed to do and check that their work will do it. That’s a better way than just sitting down to write the code, based on some vague outline,” Lamport said. 

For simple tasks, a trial-and-error approach may be fine. But for more complicated projects, or those where mistakes are unacceptable, a systematic approach makes more sense.

The challenge with writing large programs isn’t necessarily their size, it’s their complexity. They are often distributed across multiple systems and involve multiple processes that need to interact. The number of possible executions becomes astronomical. To reason about and check such a system, it helps to have a mathematical way to think about it ahead of time. Yet engineers often balk at the idea. 

“The difficulty that engineers have is more a fear of math than the math itself. The math, as math goes, is very basic,” Lamport said, though it’s worth noting he holds a PhD in mathematics. “I find that engineers, after using TLA+, understand the benefit.”

In fact, TLA+ has been adopted for industrial use at semiconductor makers, companies that build distributed and database systems, other tech companies, and in more mainstream applications like payment systems in retail stores. It’s likely that some applications aren’t made public—most companies don’t publicly discuss their engineering process or proprietary technology.

That’s where the foundation comes in. A formal system for contributing to the tools and defining their future direction may spur additional collaboration among engineers and facilitate commercial adoption. The foundation will create a steering committee, similar to other panels that look after public domain programming languages like C or Java.

“I would hope that the new stewards make more subtractions than additions to the language, to remove some things that aren’t needed,” Lamport said. 

Now 82 years old and nearing retirement, Lamport also hopes the foundation gets TLA+ closer to the mainstream of industrial and academic discussion.

“TLA+ is never going to be as popular as Java. And I’d be happy if someone else made it better at helping engineers think more mathematically,” Lamport said. “The ultimate goal is to get engineers to think rigorously at a higher level about what they are doing.”

Unifying learning from preferences and demonstration via a ranking game for imitation learning

For many people, opening door handles or moving a pen between their fingers is a movement that happens multiple times a day, often without much thought. For a robot, however, these movements aren’t always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or state. For the described movements, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.
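
As a concrete illustration, a sparse reward of this kind takes only a few lines. The success predicates below (is_door_open, pen_at_goal_orientation) are hypothetical stand-ins for whatever checks a particular simulator exposes.

```python
# A minimal sketch of the sparse reward described above. The success
# predicates are hypothetical stand-ins for a given simulator's checks.
def sparse_reward(state, task="door"):
    if task == "door":
        return 1.0 if is_door_open(state) else 0.0          # +1 only on success
    return 1.0 if pen_at_goal_orientation(state) else 0.0   # pen reorientation task
```

Because the signal is zero almost everywhere, the robot gets no indication of progress until it stumbles on a success, which is what makes exploration under such a reward so costly.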

The imitation learning (IL) paradigm was introduced to mitigate the amount of trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert from which it can try to learn the task and possibly gain information about the expert’s reward function, or the expert’s intent, similar to how people pick up various skills. Yet, learning remains difficult in instances where we only have access to the change enacted by the expert in the world, known as the expert observation, and not the precise actions the expert took to achieve the change. Another difficulty the robot faces is that even if it sees infinite expert demonstrations, it can’t fully reason about the intent of the expert—that is, compare whether one of its own learned behaviors is closer to the expert’s than another behavior—as it only knows the best behavior and has no notion of ordering over other behaviors.

In our paper “A Ranking Game for Imitation Learning,” being presented at Transactions on Machine Learning Research 2023 (TMLR), we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, which enabled us to propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled the training of a robot that can solve the tasks of opening a door and moving a pen between its fingers in simulation, a first in imitation learning with expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT are improving themselves by learning a reward function inferred via preferences over several samples of model responses in addition to learning from desired human conversational data.

Robotics has found a place in controlled environments where the tasks at hand are well-defined and repeatable, such as on a factory floor. Our framework has the potential to help enable robot learning of tasks in more dynamic environments, such as helping people with daily chores around the home.

With \(\texttt{rank-game}\), which combines learning from preferences and demonstrations via a two-player ranking-based game, robots in simulation were trained to manipulate a pen with a dexterous hand (left) and open a door with a parallel jaw gripper (right). The successful completion of these tasks marked a first in imitation learning with expert observations alone.

A ranking game for imitation learning

Inverse reinforcement learning (IRL) is a popular and effective method for imitation learning. IRL learns by inferring the reward function, also referred to as the intent of the expert, and a policy, which specifies what actions the agent—or, in our case, the robot—should take in a given state to successfully mimic the expert.

Notation: We use \(\pi\) and \(\pi^E\) to denote the policy of the agent and the expert, respectively, and \(R_{gt}\) to be the reward function of the expert, which is unknown to the agent/robot. \(\rho^\pi\) denotes the state-action/state visitation distribution of policy \(\pi\) in the environment—the probabilistic collection of states the policy visits in the environment. We use \(J(R;\pi)\) to denote the \(\textit{cumulative reward}\), or the performance of policy \(\pi\) under a reward function \(R\). We assume policy \(\pi\) belongs to function class \(\Pi\) and reward function \(R\) belongs to function class \(\mathcal{R}\).

The goal of imitation learning is to make the agent have the same performance as the expert under the expert’s unknown reward function \(R_{gt}\). The classical IRL formulation tackles this by minimizing the imitation gap under a reward function that makes the performance gap the largest. We denote this framework by \(\texttt{imit-game}\) and write it below formally:

\(\texttt{imit-game}(\pi,\pi^E): \text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

Simply stated, the \(\texttt{imit-game}\) tries to find a policy that has the lowest worst-case performance difference with the expert policy. This classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from preferences. In our work, we ask: does IRL really need to consider the worst-case performance difference? We find that relaxing this requirement allows us to incorporate preferences.

Our proposed method treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to map more preferred behaviors to a higher total reward for each of the pairwise preferences, while the policy agent learns to maximize the performance on this reward function by interacting with the environment. Contrary to the classical IRL framework, the reward function now has to get only the rankings correct and not optimize for the worst case (see Figure 1).

Figure 1: The proposed \(\texttt{rank-game}\) method treats imitation learning as a two-player ranking-based game between a policy and a reward. The policy agent maximizes the reward function by interacting with the environment. The reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent, automatically generated via data augmentation, or expert-annotated rankings obtained from a human or offline dataset.

To incorporate preferences, we need to quantify the behaviors in order to compare them. In this work, we choose the behaviors (\(\rho\)) to be the state-action or state-only visitation distribution of the agent. A ranking between behaviors is used to specify that the expert would prefer one behavior over the other. A reward function that satisfies the behavior rankings ensures that the average return under a lower-ranked behavior is smaller than that under the higher-ranked behavior. More formally, the ranking game is defined as a game where the policy agent \(\pi\) maximizes the expected return \(J(R;\pi)\) of the policy under reward function \(R\) when deployed in the environment. The reward player takes the dataset of pairwise rankings \(D^p\) (rankings are denoted as \(\rho^i\preceq\rho^j\)) as an input and attempts to learn a reward function that satisfies those rankings using a ranking loss (denoted by \(L(D^p;R)\)).

\(\underbrace{\text{argmax}_{\pi\in\Pi}J(R;\pi)}_{\text{Policy Agent}}~~~~~~~~~~~~~~~\underbrace{\text{argmin}_{R\in\mathcal{R}}L(D^p;R)}_{\text{Reward Agent}}\)

The ranking loss induces a reward function \(R\) that attempts to satisfy each pairwise preference in the dataset as follows:

\(\mathbb{E}_{\rho^i}[R(s,a)]\le\mathbb{E}_{\rho^j}[R(s,a)]~~,~~\forall \rho^i\preceq\rho^j \in D^p\)
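
On sampled data, this constraint can be checked directly. The sketch below assumes a generic reward_fn callable that maps batches of states and actions to per-sample rewards; it only illustrates what "satisfying a ranking" means, not any particular implementation.

```python
import numpy as np

def satisfies_ranking(reward_fn, lower, higher):
    """Check E_{rho^i}[R] <= E_{rho^j}[R] on samples from the two behaviors.

    lower / higher: (states, actions) arrays sampled from the lower- and
    higher-ranked behaviors; reward_fn maps them to per-sample rewards.
    """
    (s_i, a_i), (s_j, a_j) = lower, higher
    return np.mean(reward_fn(s_i, a_i)) <= np.mean(reward_fn(s_j, a_j))
```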

Generalizing prior imitation learning approaches with \(\texttt{rank-game}\)

The \(\texttt{rank-game}\) framework neatly encapsulates both prior work in IRL and prior work in learning from preferences. First, let’s see how classical IRL is a part of this framework. Recall that the classical IRL/\(\texttt{imit-game}\) optimization can be written as:

\(\text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

The inner optimization learns a reward function that maximizes the return gap between the current policy’s behavior and the expert behavior. Thus, \(\texttt{imit-game}\) can be seen as a special case of \(\texttt{rank-game}\) with: (1) a ranking dataset that prefers expert behavior over the current agent behavior and (2) a form of ranking loss that maximizes the performance gap (termed the \(\textit{supremum loss}\)). A number of prior methods in the imitation learning domain can be understood as special cases of \(\texttt{rank-game}\) under various ranking losses, classes of reward functions, and abilities to incorporate preferences (see Figure 2).

The table below summarizes imitation learning (IL) methods: the data modalities they can handle (expert demonstrations and/or preferences), their ranking loss functions, the assumptions they make on the reward function, and whether they require an external agent to provide preferences during training.

| Method(s) | Expert data | Offline preferences | Active human query | Ranking loss | Reward function |
|---|---|---|---|---|---|
| MaxEntIRL, AdRIL, GAN-GCL, GAIL, f-MAX, AIRL | LfD | No | No | Supremum | Non-linear |
| BCO, GAIfO, DACfO, OPOLO, f-IRL | LfO | No | No | Supremum | Non-linear |
| TREX, DREX | None | Yes | No | Bradley-Terry | Non-linear |
| BREX | None | Yes | No | Bradley-Terry | Linear |
| DemPref | LfO, LfD | Yes | Yes | Bradley-Terry | Linear |
| Ibarz et al. (2018) | LfD | Yes | Yes | Bradley-Terry | Non-linear |
| rank-game | LfO, LfD | Yes | No | New principled ranking loss that naturally incorporates rankings from diverse sources | Non-linear |

Figure 2: Previous methods that learn from expert demonstrations or preferences form a special case of \(\texttt{rank-game}\) under a specific choice of ranking loss and a reward function class. Also noted in the table is whether a method enables learning from demonstration (LfD)—that is, learning from both expert states and actions—or learning from observations (LfO), where an agent learns from expert states alone.

Setting up the ranking game

To develop a framework that successfully combines learning from demonstrations and learning from preferences, we addressed several questions:

  1. What is the ranking loss function that allows for the reward to satisfy the preferences in the dataset?
  2. Where do we get the dataset of pairwise preferences?
  3. How can we effectively optimize this two-player game?

Step 1: A new ranking loss function for reward learning

Our proposed framework requires learning a reward function such that the rankings in the dataset are satisfied. While several loss functions exist in prior literature to enable this, such as the Luce-Shepard rule, Lovász-Bregman divergences, and the supremum loss discussed earlier, we introduce a new loss function:

\(L_k(\mathcal{D}^p;R) = \mathbb{E}_{(\rho^{\pi^i},\rho^{\pi^j})\sim \mathcal{D}^p} \Big[\mathbb{E}_{s,a\sim\rho^{\pi^i}}{[(R(s,a)-0)^2]} + \mathbb{E}_{s,a\sim\rho^{\pi^j}}{[(R(s,a)-k)^2]}\Big]\)

The loss function is simple and intuitive: for every preference pair in the dataset, the less preferred behavior is regressed to a return of 0 and the more preferred behavior is regressed to a return of the user-defined parameter \(k\). This loss function allows us to learn a reward function with user-defined scale \(k\), which plays an important role in enabling better policy optimization; it is principled and facilitates near-optimal imitation learning; and, by design, it allows us to incorporate preferences.
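
In code, the loss is a pair of squared-error regressions per preference pair. The sketch below is a PyTorch rendering under the assumption that reward_net maps batches of states and actions to per-sample rewards; the batching and sampling details of the actual rank-game implementation are omitted.

```python
import torch

def ranking_loss(reward_net, preference_pairs, k=1.0):
    """L_k: regress the less preferred behavior's rewards toward 0 and the
    more preferred behavior's rewards toward k, averaged over preference pairs.

    preference_pairs: list of ((s_i, a_i), (s_j, a_j)) sample batches, where
    the first element is drawn from the less preferred behavior rho^i.
    """
    total = 0.0
    for (s_i, a_i), (s_j, a_j) in preference_pairs:
        r_less = reward_net(s_i, a_i)        # rewards under rho^i
        r_more = reward_net(s_j, a_j)        # rewards under rho^j
        total = total + (r_less ** 2).mean() + ((r_more - k) ** 2).mean()
    return total / len(preference_pairs)
```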

Step 2: Getting the ranking dataset

Besides conveying more information about the expert’s intent and being easy to obtain, preferences can also help learn a more informative, or shaped, reward function. This form of reward shaping can provide better guidance for policy optimization, reducing the burden of exploring the environment to find the optimal policy and increasing sample efficiency for IRL. Our initial ranking dataset is generated by the policy agent from its interactions with the environment; in these rankings, we always prefer the expert’s behavior to be better than or equal to the current policy’s behavior. To further leverage the benefits of preferences, we consider two methods for augmenting this ranking dataset:

  • Expert-annotated rankings: In situations where we have access to additional rankings, provided by humans or obtained from reward-annotated datasets, we can simply add them to our ranking dataset.
  • Automatically generated rankings: We can improve learning efficiency for imitation by using the rankings already present in the dataset of pairwise preferences to generate more preferences, in a procedure similar to Mixup regularization applied in trajectory space (see the sketch after this list).
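
Below is a hedged sketch of what such automatic augmentation could look like: given a ranked pair of behaviors represented by sampled states, mix them and rank the mixture between the two. The exact augmentation used in rank-game may differ; this only illustrates the Mixup-style idea.

```python
import numpy as np

def augment_ranking(rho_less, rho_more, rng=np.random.default_rng(0)):
    """Given samples from a ranked pair rho_less <= rho_more (arrays of shape
    [n, state_dim]), build a mixed behavior and rank it between the two."""
    lam = rng.uniform(0.0, 1.0)                       # mixing coefficient
    n = min(len(rho_less), len(rho_more))
    take_more = rng.random(n) < lam                   # sample states from each behavior
    mixed = np.where(take_more[:, None], rho_more[:n], rho_less[:n])
    # Two new preferences implied by the mixture's intermediate rank.
    return [(rho_less[:n], mixed), (mixed, rho_more[:n])]
```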

Step 3: Improving optimization stability with a Stackelberg game

Prior work has found the Stackelberg game framework to be a strong candidate for optimizing two-player games in various applications. A Stackelberg game is a bi-level optimization problem:

\(\text{max}_x f(x,y_x),~~~~\text{s.t.}~~y_x\in \text{argmin}_y g(x,y)\)

In this optimization, we have two players—Leader \(x\) and Follower \(y\)—that are trying to maximize and minimize their own payoffs \(f\) and \(g\), respectively. We cast \(\texttt{rank-game}\) as a Stackelberg game and propose two algorithms depending on which player is set to be the leader:

  • Policy as Leader (PAL): \(\text{max}_\pi J(R,\pi)~~~~~\text{s.t.}~~ R=\text{argmin}_R~L(D^p;R)\)
  • Reward as Leader (RAL): \(\text{min}_R L(D^p;R)~~~\text{s.t.}~~\pi = \text{argmax}_\pi~J(R;\pi)\)

Aside from improving training stability, both methods have complementary benefits in the non-stationary imitation learning setting. PAL can adjust more quickly when the intent of the expert changes, while RAL can handle environmental changes better.
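
As a structural illustration, here is a minimal Policy-as-Leader loop under assumed helpers (collect_rollouts, update_reward, rl_step). It shows only the alternation between the two players and is not the paper's training code.

```python
def pal_loop(policy, reward_net, expert_behavior, num_iters, reward_steps=50):
    """Policy-as-Leader sketch: the reward (follower) is fit to the current
    ranking dataset before each policy (leader) improvement step.
    collect_rollouts, update_reward, and rl_step are assumed helpers."""
    preferences = []
    for _ in range(num_iters):
        agent_behavior = collect_rollouts(policy)              # current policy's behavior
        preferences.append((agent_behavior, expert_behavior))  # agent ranked below expert
        for _ in range(reward_steps):
            update_reward(reward_net, preferences)             # minimize L(D^p; R)
        rl_step(policy, reward_net)                            # maximize J(R; pi)
    return policy, reward_net
```

Swapping the roles (fitting the reward with a single step and solving the policy to convergence) would give the Reward-as-Leader variant.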

How well does \(\texttt{rank-game}\) perform in practice?

In testing the capabilities of \(\texttt{rank-game}\), one of the scenarios we consider is the learning from observations alone (LfO) setting, in which only expert observations are provided with no expert actions. This more challenging setting better reflects the learning conditions robots will operate under if we want them to be more widely deployed in both controlled and dynamic environments. People can more naturally provide demonstrations by performing tasks themselves (observations only) versus performing the task indirectly by operating a robot (observations and precise actions). We investigate the LfO performance of \(\texttt{rank-game}\) on simulated locomotion tasks like hopping, walking, and running and benchmark it with respect to representative baselines. \(\texttt{Rank-game}\) approaches require fewer environment interactions to succeed and outperform recent methods in final performance and training stability.

Additionally, our experiments reveal that none of the prior LfO methods can solve complex manipulation tasks such as door opening with a parallel jaw gripper and pen manipulation with a dexterous hand. This failure is likely a result of the high exploration requirements of LfO: expert actions are unavailable, and successes are rarely observed in these tasks.

In this setting, we show that using only a handful of expert-annotated preferences in the \(\texttt{rank-game}\) framework can allow us to solve these tasks. We cannot solve these tasks using only expert data—adding preferences is key.

Next steps

Equipping agents to learn from different sources of information present in the world is a promising direction toward more capable agents that can better assist people in the dynamic environments in which they live and work. The \(\texttt{rank-game}\) framework has the potential to be extended directly to the setting where humans present their preferences interactively as the robot is learning. There are some promising future directions and open questions for researchers interested in this work. First, preferences obtained in the real world are usually noisy, and one limitation of \(\texttt{rank-game}\) is that it does not suggest a way to handle noisy preferences. Second, \(\texttt{rank-game}\) modifies reward learning to make the learned reward function more amenable to policy optimization, but the associated hyperparameters are set manually; future work can explore ways to automate this. Third, despite learning effective policies, we observed that \(\texttt{rank-game}\) did not learn reusable, robust reward functions.

For additional details, including experiments in the learning from demonstration (LfD) setting, non-stationary imitation setting, and further framework analysis, check out the paper, project page, code, and video presentation.

Acknowledgments

This research was supported in part by the National Science Foundation, Air Force Office of Scientific Research, and Army Research Office.

Automatic post-deployment management of cloud applications

Cloud Intelligence/AIOps blog series

In the first two blog posts in this series, we presented our vision for Cloud Intelligence/AIOps (AIOps) research, and scenarios where innovations in AI technologies can help build and operate complex cloud platforms and services effectively and efficiently at scale. In this blog post, we dive deeper into our efforts to automatically manage large-scale cloud services in deployment. In particular, we focus on an important post-deployment cloud management task that is pervasive across cloud services – tuning configuration parameters. And we discuss SelfTune, a horizontal reinforcement learning (RL) platform for successful configuration management of various cloud services in deployment.

Post-deployment management of cloud applications

Managing cloud applications includes mission-critical tasks such as resource allocation, scheduling, pre-provisioning, capacity planning and provisioning, and autoscaling. Currently, several of these tasks rely on hand-tuned and manually designed algorithms, heuristics, and domain knowledge. For a large cloud company like Microsoft, a hand-tuned, manually designed algorithm works well only to a certain extent, because deployments are extremely varied, large-scale, and involve complex interactions of various components. Moreover, user, customer, and application behavior can change over time, making yesterday’s hand-tuning not as relevant today and even less so in the future. The varied nature of today’s cloud technologies forces our engineers to spend an inordinate amount of time on special casing, introducing new configuration parameters, and writing or rewriting heuristics to set them appropriately. This also creates a lot of undocumented domain knowledge and dependence on a few individuals to solve significant problems. All of this, we believe, is unsustainable in the long term.

As we discussed in the earlier posts in this blog series, the right AI/ML formulations and techniques could help to alleviate this problem. Specifically, cloud management tasks are a natural fit for adopting the reinforcement learning paradigm. These tasks are repetitive in space and time; they run simultaneously on multiple machines, clusters, datacenters, and/or regions, and they run once every hour, day, week, or month. For instance, the VM pre-provisioning service for Azure Functions is a continuously running process, pre-provisioning for every application. Scheduling of background jobs on substrate runs separately on every machine. Reinforcement learning also needs a repetitive and iterative platform to converge on an optimized setup and, hence, can go together with the basic functioning of the cloud management task.

Our goal is to reduce manual effort in ensuring service efficiency, performance, and reliability by augmenting, complementing, or replacing existing heuristics for various management tasks with general RL-based solutions. In this blog post, we present our recent solution frameworks for cloud applications to automatically tune their configuration parameters and to design policies for managing the parameters over time. Our solutions require minimal engineering effort and no AI expertise from the application developers or cloud operators.

Example Microsoft scenarios

O365 Workload Manager: Workload Manager (WLM) is a process that runs on each of the backend Exchange Online (EXO) servers to help schedule resources (CPU, disk, network) to background jobs that periodically execute. WLM has several configuration parameters that need to be carefully set so that the throughput of the scheduler is maximized while also ensuring that the resources are not too strained to execute low-latency user-facing jobs (e.g., Outlook search). Could we help EXO infrastructure manage the various knobs that dictate the control logic implemented in the scheduler for optimizing resource management and user latency?

Azure ML/Spark: Spark is the platform for performing distributed data analytics, and it comes with various configuration knobs that need to be appropriately set by developers based on their job context: Does the query involve JOIN clauses? How big are the data shards? The workload patterns change over time, and pre-trained models for choosing optimal configurations may not suffice. Can we help developers dynamically choose the deployment configuration based on workload signals?

Azure Functions VM management: Can we tune the VM management policy implemented in Azure Functions for VM pre-fetching/eviction to minimize cold starts and memory wastage over time? Our results in simulations are quite encouraging. We want to engage with the Azure and MSR Redmond teams to discuss the possibility of tuning the policy in the production setting.

Azure Kubernetes Service: AKS is chosen by first-party as well as third-party Azure customers for facilitating containerized development and deployment of cloud applications. The in-built workload autoscaling policies in AKS use several configuration parameters, which can be far from optimal in several scenarios. Can we help automatically adjust the parameters that govern resource allocation to containers running microservices based on applications’ workload patterns?

Horizontal solution design for configuration tuning

We see three main reasons why this is the right time to design and incorporate an RL-based solution framework across cloud management tasks:

  1. As the size and complexity of services in the cloud continue to increase, as our hardware footprint continues to include many SKUs, and as configuration and code get larger and more complex, heuristics and hand-tuning cannot provide optimal operations at all times. Not without significant and proportionate investment in human experts and engineers.
  2. While we will have to rely on domain experts for key changes in systems and the services landscape on the cloud, using RL sub-systems can help reduce dependence on expert decisions and domain-knowledge over time.
  3. It is important to have a horizontal framework with a simple yet expressive API, with appropriate algorithms for tuning configuration parameters in an online fashion to optimize a developer-specific metric of interest or reward.

SelfTune framework

We have designed and deployed the SelfTune framework to help cloud service developers automatically tune the configuration parameters in their codebase, which would otherwise be manually set or heuristically tweaked. SelfTune is an RL-based framework that helps developers automate complex post-deployment cloud management tasks such as parameter tuning and performance engineering.

SelfTune is hosted as a service on the public Azure cloud. First-party applications that are interested in post-deployment parameter tuning can use RestAPI calls to access SelfTune endpoints. The SelfTune framework has two components:

  1. Client API provides necessary support to access the SelfTune endpoints via RestAPI calls, namely, Predict for getting the parameters from the framework and SetReward for providing reward/feedback to the framework.
  2. RL Engine implements a suite of ML/RL algorithms for periodically updating the parameters and returning the latest values to the clients as well as for periodically computing the reward metrics.

At the core of the SelfTune framework is the formulation of the post-deployment parameter tuning problem as that of “online learning from bandit feedback.” SelfTune assumes that the only interaction possible with the external system (i.e., the application being tuned) is a black-box access to some form of feedback (e.g., daily P95 latency of the service). The framework repeatedly deploys configuration parameters and observes the corresponding rewards after a developer-defined period. As the operational environment (e.g., production cluster running certain types of workloads) is constantly in flux, there is no single setting of parameters that will remain optimal throughout. Thus, SelfTune continuously runs the explore-exploit paradigm of RL techniques – explore new parameters in the vicinity of the currently deployed parameters, observe rewards, update its internal model based on the reward, and exploit parameters that tend to give high rewards over time.
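
To make that loop concrete, here is a minimal, generic explore-exploit sketch: perturb the currently deployed parameters, observe the reward after the configured period, and keep whichever setting scored better. The perturbation rule, step size, and the deploy_and_observe_reward callback are illustrative assumptions; this is not the Bluefin algorithm or the actual SelfTune implementation.

```python
import numpy as np

def tuning_loop(initial_params, deploy_and_observe_reward, num_rounds,
                step_size=0.05, rng=np.random.default_rng(0)):
    """Generic explore-exploit sketch over real-valued parameters.

    deploy_and_observe_reward is an assumed callback that deploys a parameter
    vector and returns the reward observed after the developer-defined period.
    """
    params = np.asarray(initial_params, dtype=float)
    best_reward = deploy_and_observe_reward(params)
    for _ in range(num_rounds):
        # Explore: propose parameters in the vicinity of the current ones.
        candidate = params + step_size * rng.standard_normal(params.shape)
        reward = deploy_and_observe_reward(candidate)
        # Exploit: keep the candidate only if it improved the observed reward.
        if reward > best_reward:
            params, best_reward = candidate, reward
    return params
```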

We have designed a bandit learning algorithm in SelfTune called Bluefin that crystallizes the aforementioned idea. Our algorithm has lower sample complexity, meaning it needs fewer rounds to converge to desired values when tuning multiple real-valued parameters simultaneously, compared to peer techniques such as multi-armed bandits (the basis of Azure Personalizer), Bayesian optimization (used by the MLOS framework), or genetic algorithms. This is provable under some assumptions on the reward function, but we observe, across multiple deployments, that the algorithm converges to good solutions in practice even when the theoretical assumptions are violated, as they often are.

We have open-sourced Bluefin through Vowpal Wabbit, a popular RL library for practitioners, which houses the core algorithms of Azure Personalizer. We are continuing to work on designing vertical RL algorithms and horizontal feature learning for the systems domain. Besides Bluefin, SelfTune supports a suite of black-box optimization (e.g. Bayesian Optimization) and RL techniques (e.g., Deep Deterministic Policy Gradients) that the cloud applications can choose from, based on their needs.

A simple integration use case: Consider the scenario of setting PySpark cluster configuration parameters for Azure ML jobs that are spawned for ML workloads in the O365 MS-AI organization. The workloads are composed of various data processing jobs and run on various Azure ML clusters with different capacities and hardware. It is non-trivial to set parameters for various jobs such that the workloads complete quickly and do not fail midway due to resourcing issues, losing all computation.

Basic SelfTune workflow: The basic integration of SelfTune in the Azure ML pipeline is illustrated in the figure below. Here, the developer wants to tune seven key Apache PySpark parameters per job, namely driver memory, driver cores, executor cores, number of executors, executor memory, spark.sql.shuffle.partitions, and spark.default.parallelism. A code sketch of this loop follows the numbered steps below.

Basic SelfTune workflow
  1. The developer invokes Predict on the SelfTune instance, asking for the parameters for the next job.
  2. The SelfTune service responds with the predicted parameters for the next job.
  3. The developer submits a job using SelfTune’s predicted parameters (outside SelfTune’s purview).
  4. Once the job is complete, the cluster sends job metadata to the data store (outside SelfTune’s purview).
  5. The developer queries rewards for previously completed jobs, if any, from the data store (e.g., Azure ML workspace).
  6. The data store responds with the rewards (e.g., job completion times, which are part of the job metadata) from previously completed jobs.
  7. If rewards exist in the store, the developer invokes SetReward for those jobs, which pushes the rewards to the SelfTune service endpoint.
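
The steps above map to a short client loop. The sketch below is hypothetical: the endpoint URL, request payloads, and the submit_spark_job and get_completed_rewards helpers are placeholders, and the real SelfTune REST API will differ in its details; only the Predict and SetReward call names come from this post.

```python
import requests

SELFTUNE_URL = "https://<selftune-endpoint>"   # placeholder, not a real endpoint

def run_one_job(submit_spark_job, get_completed_rewards):
    # Steps 1-2: ask SelfTune for the parameters to use for the next job.
    params = requests.post(f"{SELFTUNE_URL}/Predict",
                           json={"instance": "azureml-pyspark"}).json()
    # Step 3: submit the job with the predicted parameters (outside SelfTune).
    job_id = submit_spark_job(params)
    # Steps 5-7: later, fetch rewards (e.g., completion times) for finished
    # jobs from the data store and push them back to SelfTune.
    for finished_id, reward in get_completed_rewards():
        requests.post(f"{SELFTUNE_URL}/SetReward",
                      json={"job_id": finished_id, "reward": reward})
    return job_id
```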

Self-tuning substrate background jobs scheduler

User-level background job scheduling: All the substrate backend servers in EXO datacenters (that host user mailboxes) run hundreds of low-priority, latency-insensitive, periodic workloads locally (e.g., mailbox replication, encryption, event-driven assistants, etc.). Workload Management (WLM) is a core substrate service that runs on all such backend servers. It helps with the user-level scheduling of workloads on the servers: a) with the goal of completing the tasks when resources become available (at micro-granular timescales), and b) mindful of the fact that high-priority, latency-sensitive workloads will bypass this scheduler. Thus, ensuring high availability of resources especially during peak hours is critical, besides meeting workload SLAs.

Tuning real-valued configuration parameters: The scheduler is implemented today as part of a huge codebase in the substrate core. The scheduler trades off resource utilization and completion rates by dynamically ramping up and ramping down the number of concurrent background tasks requiring access to the resources. This is achieved by carefully setting hundreds of real-valued configuration parameters. At the server level, we can achieve better resource utilization and throughput by automatically tuning the key parameters based on the workloads a server receives and the ensuing resource health fluctuations.

Impact of using SelfTune in WLM: We have integrated SelfTune with the substrate background scheduler codebase (the change required is simple, on the order of tens of lines of code, as shown in the figure below). We first deployed it in the inner rings of substrate (over 3,000 servers). The results gathered over 4-5 weeks of deployment clearly indicate that tuning helps on most of the deployed servers, increasing throughput by at least 20% across multiple forests on their heavily throttled servers, with a marginal increase in CPU health and insignificant-to-mild degradation of disk health. Based on this validation, we have now rolled out SelfTune integration to most EXO backend servers (nearly 200,000) across the worldwide production ring.

SelfTune Application library contains the SelfTune client API and the RL/ML algorithms

Ongoing work and future AI+systems research

SelfTune is a general platform and can be readily applied to many RL-for-cloud scenarios without any additional feature engineering or onboarding effort (which is typically required in AIOps). We expect developers to define a suitable spatial and temporal tuning scope in the service or system, for example, tuning the parameters of the service running in a cluster at the level of individual machines, every hour of every day. Thus, instead of hand-coding the optimal operating points for the various machines or clusters that the service operates in, we could integrate SelfTune in the service codebase to dynamically figure them out over time, based on real-time feedback at the chosen temporal granularity.

Our work poses a lot of interesting design and algorithmic questions in this space. For instance, can we automatically scope the tuning problem based on some observed context such as cluster type, hardware, workload volumes, etc., and find optimal parameters per scope? Given that typical cloud applications have hundreds, if not thousands, of knobs to tune, can we automatically identify the knobs that impact the performance metric of interest, and then tune those knobs more efficiently?

A combination of system insights, ML formulations, and cross-layer optimization is vital for effective post-deployment management of cloud applications and services. We will post an update to this blog post on our ongoing work in this space soon. Meanwhile, the final blog post in this series will explore how AIOps can be made more comprehensive by spanning the entire cloud stack.

The post Automatic post-deployment management of cloud applications appeared first on Microsoft Research.


Read More

Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems

Microsoft has made significant contributions to the prestigious USENIX NSDI’23 conference, which brings together experts in computer networks and distributed systems. A silver sponsor for the conference, Microsoft is a leader in developing innovative technologies for networking, and we are proud to have contributed to 30 papers accepted this year. Our team members also served on the program committee, highlighting our commitment to advancing the field.

The accepted research papers span a wide range of topics, including networking for AI workloads, cloud networking, WAN, and wireless networks. These papers showcase some of the latest advancements in networking research.

The paper “DOTE: Rethinking (Predictive) WAN Traffic Engineering,” which revisits traffic engineering in the wide area network (WAN), was selected for one of the Best Paper Awards at the conference. This work was done jointly by researchers at Microsoft and academics at the Hebrew University of Jerusalem and the Technion.

Some other innovations on cloud networking infrastructure include:

Empowering Azure Storage with RDMA, which presents the findings from deploying intra-region Remote Direct Memory Access (RDMA) to support storage workloads in Azure. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings. This research is a testament to Microsoft’s ongoing commitment to providing customers with the best possible user experience.

Disaggregating Stateful Network Functions, which introduces a new approach for better reliability and performance at a lower per-server cost for cloud users. The core idea is to move the network function processing off individual servers and into shared resource pools. This technology is now shipping as part of Microsoft Azure Accelerated Connections.

Our colleagues from Microsoft Research Asia will present ARK: GPU-driven Code Execution for Distributed Deep Learning, which overcomes the overhead of GPU communication for large deep learning workloads by having GPUs run their own code and handle communication events autonomously, without CPU intervention.

Microsoft’s collective contributions to the USENIX NSDI’23 conference highlight our commitment to advancing the field of networking research and developing innovative solutions to real-world networking problems, leveraging strong academic collaborations. We look forward to continuing to push the boundaries of what is possible in networking research and delivering cutting-edge solutions to our customers.

A complete list of Microsoft papers accepted at USENIX NSDI’23:

  1. Understanding RDMA Microarchitecture Resources for Performance Isolation, Xinhao Kong and Jingrong Chen, Duke University; Wei Bai, Microsoft; Yechen Xu, Shanghai Jiao Tong University; Mahmoud Elhaddad, Shachar Raindel, and Jitendra Padhye, Microsoft; Alvin R. Lebeck and Danyang Zhuo, Duke University
  2. Empowering Azure Storage with RDMA, Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, and Brian Zill, Microsoft
  3. ARK: GPU-driven Code Execution for Distributed Deep Learning, Changho Hwang, KAIST, Microsoft Research; KyoungSoo Park, KAIST; Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong, Microsoft Research
  4. Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications, Inho Choi, National University of Singapore; Ellis Michael, University of Washington; Yunfan Li, National University of Singapore; Dan R. K. Ports, Microsoft Research; Jialin Li, National University of Singapore
  5. Waverunner: An Elegant Approach to Hardware Acceleration of State Machine Replication, Mohammadreza Alimadadi and Hieu Mai, Stony Brook University; Shenghsun Cho, Microsoft; Michael Ferdman, Peter Milder, and Shuai Mu, Stony Brook University
  6. Scalable Distributed Massive MIMO Baseband Processing, Junzhi Gong, Harvard University; Anuj Kalia, Microsoft; Minlan Yu, Harvard University
  7. Unlocking unallocated cloud capacity for long, uninterruptible workloads, Anup Agarwal, Carnegie Mellon University; Shadi Noghabi, Microsoft Research; Íñigo Goiri, Azure Systems Research; Srinivasan Seshan, Carnegie Mellon University; Anirudh Badam, Microsoft Research
  8. Invisinets: Removing Networking from Cloud Networks, Sarah McClure and Zeke Medley, UC Berkeley; Deepak Bansal and Karthick Jayaraman, Microsoft; Ashok Narayanan, Google; Jitendra Padhye, Microsoft; Sylvia Ratnasamy, UC Berkeley and Google; Anees Shaikh, Google; Rishabh Tewari, Microsoft
  9. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA
  10. OneWAN is better than two: Unifying a split WAN architecture, Umesh Krishnaswamy, Microsoft; Rachee Singh, Microsoft and Cornell University; Paul Mattes, Paul-Andre C Bissonnette, Nikolaj Bjørner, Zahira Nasrin, Sonal Kothari, Prabhakar Reddy, John Abeln, Srikanth Kandula, Himanshu Raj, Luis Irun-Briz, Jamie Gaudette, and Erica Lan, Microsoft
  11. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches, Aashaka Shah, University of Texas at Austin; Vijay Chidambaram, University of Texas at Austin and VMware Research; Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi, Microsoft Research; Rachee Singh, Microsoft and Cornell University
  12. Synthesizing Runtime Programmable Switch Updates, Yiming Qiu, Rice University; Ryan Beckett, Microsoft; Ang Chen, Rice University
  13. Formal Methods for Network Performance Analysis, Mina Tahmasbi Arashloo, University of Waterloo; Ryan Beckett, Microsoft Research; Rachit Agarwal, Cornell University
  14. Scalable Tail Latency Estimation for Data Center Networks, Kevin Zhao, University of Washington; Prateesh Goyal, Microsoft Research; Mohammad Alizadeh, MIT CSAIL; Thomas E. Anderson, University of Washington
  15. Addax: A fast, private, and accountable ad exchange infrastructure, Ke Zhong, Yiping Ma, and Yifeng Mao, University of Pennsylvania; Sebastian Angel, University of Pennsylvania & Microsoft Research
  16. RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics, Mehrdad Khani, MIT CSAIL and Microsoft; Ganesh Ananthanarayanan and Kevin Hsieh, Microsoft; Junchen Jiang, University of Chicago; Ravi Netravali, Princeton University; Yuanchao Shu, Zhejiang University; Mohammad Alizadeh, MIT CSAIL; Victor Bahl, Microsoft
  17. Tambur: Efficient loss recovery for videoconferencing via streaming codes, Michael Rudow, Carnegie Mellon University; Francis Y. Yan, Microsoft Research; Abhishek Kumar, Carnegie Mellon University; Ganesh Ananthanarayanan and Martin Ellis, Microsoft; K.V. Rashmi, Carnegie Mellon University
  18. Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge, Arthi Padmanabhan, UCLA; Neil Agarwal, Princeton University; Anand Iyer and Ganesh Ananthanarayanan, Microsoft Research; Yuanchao Shu, Zhejiang University; Nikolaos Karianakis, Microsoft Research; Guoqing Harry Xu, UCLA; Ravi Netravali, Princeton University
  19. On Modular Learning of Distributed Systems for Predicting End-to-End Latency, Chieh-Jan Mike Liang, Microsoft Research; Zilin Fang, Carnegie Mellon University; Yuqing Xie, Tsinghua University; Fan Yang, Microsoft Research; Zhao Lucis Li, University of Science and Technology of China; Li Lyna Zhang, Mao Yang, and Lidong Zhou, Microsoft Research
  20. SelfTune: Tuning Cluster Managers, Ajaykrishna Karthikeyan and Nagarajan Natarajan, Microsoft Research; Gagan Somashekar, Stony Brook University; Lei Zhao, Microsoft; Ranjita Bhagwan, Microsoft Research; Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal, Microsoft
  21. OpenLoRa: Validating LoRa Implementations through an Extensible and Open-sourced Framework, Manan Mishra, Daniel Koch, Muhammad Osama Shahid, and Bhuvana Krishnaswamy, University of Wisconsin-Madison; Krishna Chintalapudi, Microsoft Research; Suman Banerjee, University of Wisconsin-Madison
  22. ExoPlane: An Operating System for On-Rack Switch Resource Augmentation, Daehyeok Kim, Microsoft and University of Texas at Austin; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University
  23. Sketchovsky: Enabling Ensembles of Sketches on Programmable Switches, Hun Namkung, Carnegie Mellon University; Zaoxing Liu, Boston University; Daehyeok Kim, Microsoft Research; Vyas Sekar and Peter Steenkiste, Carnegie Mellon University
  24. Acoustic Sensing and Communication Using Metasurface, Yongzhao Zhang, Yezhou Wang, and Lanqing Yang, Shanghai Jiao Tong University; Mei Wang, UT Austin; Yi-Chao Chen, Shanghai Jiao Tong University and Microsoft Research Asia; Lili Qiu, UT Austin and Microsoft Research Asia; Yihong Liu, University of Glasgow; Guangtao Xue and Jiadi Yu, Shanghai Jiao Tong University
  25. Disaggregating Stateful Network Functions, Deepak Bansal, Gerald DeGrace, Rishabh Tewari, Michal Zygmunt, and James Grantham, Microsoft; Silvano Gai, Mario Baldi, Krishna Doddapaneni, Arun Selvarajan, Arunkumar Arumugam, and Balakrishnan Raman, AMD Pensando; Avijit Gupta, Sachin Jain, Deven Jagasia, Evan Langlais, Pranjal Srivastava, Rishiraj Hazarika, Neeraj Motwani, Soumya Tiwari, Stewart Grant, Ranveer Chandra, and Srikanth Kandula, Microsoft
  26. Doing More with Less: Orchestrating Serverless Applications without an Orchestrator, David H. Liu and Amit Levy, Princeton University; Shadi Noghabi and Sebastian Burckhardt, Microsoft Research
  27. NetPanel: Traffic Measurement of Exchange Online Service, Yu Chen, Microsoft 365, China; Liqun Li and Yu Kang, Microsoft Research, China; Boyang Zheng, Yehan Wang, More Zhou, Yuchao Dai, and Zhenguo Yang, Microsoft 365, China; Brad Rutkowski and Jeff Mealiffe, Microsoft 365, USA; Qingwei Lin, Microsoft Research, China
  28. DOTE: Rethinking (Predictive) WAN Traffic Engineering, Yarin Perry, Hebrew University of Jerusalem; Felipe Vieira Frujeri, Microsoft Research; Chaim Hoch, Hebrew University of Jerusalem; Srikanth Kandula and Ishai Menache, Microsoft Research; Michael Schapira, Hebrew University of Jerusalem; Aviv Tamar, Technion
  29. Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker, Yinfang Chen and Xudong Sun, University of Illinois at Urbana-Champaign; Suman Nath, Microsoft Research; Ze Yang and Tianyin Xu, University of Illinois at Urbana-Champaign
  30. Test Coverage for Network Configurations, Xieyang Xu and Weixin Deng, University of Washington; Ryan Beckett, Microsoft; Ratul Mahajan, University of Washington; David Walker, Princeton University

NSDI 2023 Program Committee members:

Members of other committees:

The post Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems appeared first on Microsoft Research.


Read More

Hunting speculative information leaks with Revizor

Revizor chart

Spectre and Meltdown are two security vulnerabilities that affect the vast majority of CPUs in use today. CPUs, or central processing units, act as the brains of a computer, directing the functions of its other components. By targeting a feature of the CPU implementation that optimizes performance, attackers could access sensitive data previously considered inaccessible. 

For example, Spectre exploits speculative execution—an aggressive strategy for increasing processing speed by postponing certain security checks. But it turns out that before the CPU performs the security check, attackers might have already extracted secrets via so-called side channels. This vulnerability went undetected for years before it was discovered and mitigated in 2018. Security researchers warned that thieves could use it to target countless computers, phones, and mobile devices. Researchers began hunting for more vulnerabilities, and they continue to find them. But this process is manual, and progress has been slow. With no tools available to help them search, researchers had to analyze documentation, read through patents, and experiment with different CPU generations.

A group of researchers from Microsoft and academic partners began exploring a method for systematically finding and analyzing CPU vulnerabilities. This effort would produce a tool called Revizor (REV-izz-or), which automatically detects microarchitectural leakage in CPUs—with no prior knowledge about the internal CPU components. Revizor achieves this by differentiating between expected and unexpected information leaks on the CPU. 

The Revizor process begins by describing what is expected from the CPU in a so-called “leakage contract.” Revizor then searches the CPU to find any violations of this contract. It creates random programs, runs them on the CPU, records the information they expose, and compares the information with the contract. When it finds a mismatch that violates the contract, it reports it as a potential vulnerability. 
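
Conceptually, that search loop can be sketched as follows. This is a simplified illustration of the contract-testing idea, not Revizor’s actual implementation; the program generator, the leakage contract, and the hardware measurements are all toy stand-ins.

```python
# Simplified illustration of the contract-testing loop described above, not Revizor's
# actual implementation: the program generator, the leakage contract, and the hardware
# side-channel measurements are all stubbed with toy stand-ins.
import random


def random_program(seed):
    """Stand-in for Revizor's random test-case generator."""
    rng = random.Random(seed)
    return [rng.choice(["LOAD", "STORE", "ADD", "BRANCH"]) for _ in range(8)]


def contract_trace(program):
    """What the leakage contract permits the program to expose (toy model)."""
    return [op for op in program if op == "LOAD"]


def hardware_trace(program):
    """Stand-in for the side-channel observations collected on the real CPU."""
    trace = [op for op in program if op == "LOAD"]
    if "BRANCH" in program:  # toy model of a speculative leak past a mispredicted branch
        trace.append("SPECULATIVE_LOAD")
    return trace


def fuzz(rounds=1000):
    """Yield programs whose hardware trace is not explained by the contract."""
    for seed in range(rounds):
        program = random_program(seed)
        if hardware_trace(program) != contract_trace(program):
            yield program  # potential contract violation, i.e., a candidate vulnerability


if __name__ == "__main__":
    print(next(fuzz(), "no violation found"))
```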

Details were published in 2022 in the paper: Revizor: Testing Black-box CPUs against Speculation Contracts

To demonstrate Revizor’s effectiveness, the researchers tested a handful of commercial CPUs and found several known vulnerabilities, including Spectre, MDS, and LVI, as well as several previously unknown variants. 

However, the search was still slow, which hindered the discovery of entirely new classes of leaks. The team identified the root causes of the performance limitations, and proposed techniques to overcome them, improving the testing speed by up to two orders of magnitude. The improvements are described in a newly published paper: Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

These improvements supported a testing campaign of unprecedented depth on Intel and AMD CPUs. In the process, the researchers found two types of previously unknown speculative leaks (affecting string comparison and division) that had escaped previous analyses—both manual and automated. These results show that work which previously required persistent hacking and painstaking manual labor can now be automated and rapidly accelerated. 

The team began working with the Microsoft Security Response Center and hardware vendors, and together they continue to find vulnerabilities so they can be closed before they are discovered by hackers—thereby protecting customers from risk. 

Revizor is part of Project Venice, which investigates novel mechanisms for the secure sharing and partitioning of computing resources, together with techniques for specifying and rigorously validating their resilience to side-channel attacks. 

The post Hunting speculative information leaks with Revizor appeared first on Microsoft Research.


Read More

AI Frontiers: Models and Systems with Ece Kamar

black and white photo of Ece Kamar, Partner Research Manager at Microsoft Research, next to the Microsoft Research Podcast

Episode 138 | April 13, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

The third episode features Ece Kamar, deputy lab director at Microsoft Research Redmond. Kamar draws on decades of experience in AI research and an opportunity she and Microsoft colleagues had to evaluate and experiment with GPT-4 prior to its release in discussing the capabilities and limitations of today’s large-scale models. She explores the short-term mitigation techniques she and her team are using to make these models viable components of the AI systems that give them purpose and shares the long-term research questions that will help maximize their value. 

Transcript

[MUSIC PLAYS]

Ashley Llorens: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale models is accelerating the advancement of AI. Most recently, GPT-4 is exhibiting surprising new abilities like problem-solving and translation across languages and domains.

In this podcast series, I’ll share conversations with fellow researchers about our impressions of GPT-4, the nature of intelligence, and ultimately how innovations like these can have the greatest benefit for humanity.

Today we’re sitting down with Ece Kamar, deputy lab director at Microsoft Research in Redmond. In the months leading up to the release of GPT-4, Ece and her team leveraged their many years of experience in AI research to evaluate the model and to help understand and mitigate its limitations, so the experiences it powers can bring the greatest benefit to the people who use them.


Welcome to AI Frontiers.

All right, why don’t we just jump right in.

Ece Kamar: Okay.

Llorens: Okay.

Kamar: Take it over.

[MUSIC FADES]

Llorens: All right, so I want to start at a place that I think will be close to your heart, and that is with the difference between a model and a system. But let me, let me paint the picture a little bit, right. So machine learning is a process through which we create something called a model, which is learned from data. The model is a kind of program that maps inputs to outputs, be it language, images, etc. In deep learning, the models are some variant of an artificial neural network. And finally, in the current era of large-scale AI, these models can have hundreds of billions of parameters or more. But there’s a model, and then there’s a system. The system is the thing that gets deployed when we put out a product or something. So, Ece, from your perspective, what’s the difference between a model as described here and a system?

Ece Kamar: Yeah, that’s, that’s something that I’m thinking so much about these days because we are all getting very excited about the emerging capabilities we see in the latest models—what they can do, what kind of questions we can ask them, the generalizability, the interactive power, even some of the reasoning capabilities that are surprising just to be able to get them with that input-output mapping that, Ashley, you’ve been talking about. However, when you think about it, these models on their own, they don’t really have a purpose. They are just trying to replicate what they have seen in these massive data sources. And the thing that has been driving me as a researcher, even from my earlier days, has been the purpose: why are we building technology, and what is the purpose behind it? And the main difference between a system and a model is a system has a purpose. We build these systems for a particular reason that—in particular, the reason I care very much about is providing value to people who use these systems. So in terms of that distinction, I am spending a lot of time these days thinking about system design with the purpose of enabling, augmenting people, and these systems will have these latest models as building blocks. No question about it. They are so powerful in terms of synthesizing information, having a cohesive, interesting conversation. But at the same time, they are not enough. To be helpful to people, we have additional capabilities like knowing about that individual, learning from that individual, having an understanding of the goals that the individual would like to have. So we are trying to get to that system architecture, the system design that can actually make that input-output model a very crucial part of a much bigger, uh, purpose.

Llorens: Maybe next we can go into the system lifecycle. So there’s a way that a system component like a model becomes part, uh, of a larger system that eventually gets deployed. So tell me about that lifecycle. What’s that like from your experience?

Kamar: From my experience, actually, the larger system you really care about is the hybrid human-AI system because at the end of the day, what we really care about is not how great a system is alone, like an AI system is alone, but we care very much about how well that partnership is working between the human and the AI system. And right now, we have some systems out in the world that are actually already providing a lot of value for. For example, Copilot is a great example of this—the GitHub Copilot—where as you’re writing code, it can make suggestions for you and you can accept or reject them. At the same time, this is really missing some very crucial abilities in it because we are still in the very early days of this copilot-AI revolution. So what are some of the capabilities we are missing? Copilot still doesn’t really have a very good understanding of me as a developer. What are the particular habits I have? What kind of code I love to write? Maybe I care very much about the interpretability of my code by others when I’m not in that project anymore. It is not necessarily a preference that Copilot has about me. I think soon enough it will because I think we are going to get to a world where these AI systems will know a lot about us, our goals, our preferences, our intentions, our habits. And then they are going to become a lot more helpful to us. The other thing that’s not happening with the current systems is that they are not learning from feedback. As individuals, when we are part of teams—let’s say I’m working with you, which we do, all the time—I learn about you; you give me your feedback. You say, “Next time, don’t do that. Maybe don’t consider doing it that way.” I take that into account. I get better at what I do because I learn from you. So the more we build these self-feeding feedback loops into our AI systems, they are going to have a better understanding of us as users, but also they are going to be able to provide more value for us.

Llorens: The first time I used GPT-4, I asked it a question that was inspired by my own work in underwater robotics. I asked it how far away I could hear a sound generated underwater in the ocean. The response took me completely by surprise. The model pointed out that more information was needed, like how temperature would affect the speed of sound through the water. It suggested I consider using a sonar array. It went ahead and made its own assumptions about those things and then gave me an answer. The reasoning was breathtaking to me. I knew for a fact it hadn’t been explicitly trained to do any of that. It challenged my notion of the value of being able to do this kind of reasoning as a researcher.

So maybe we can actually start with the model and your experience of it. The capabilities and limitations. But why don’t we just start with your first impressions of it?

Kamar: It was surprising, mainly because I have been working in the AI space for almost like, I don’t want to say it, but two decades. So we have all been thinking about new models, new architectures, what is coming in AI; we always had in mind these kinds of ambitious goals for AI. For me, it has always been these AI assistants that come and help us with whatever we are doing, even from the early days it has been that. But that aspiration never really landed because when we tried to build these systems, they became so narrow that they did not really match what, as users, we needed from them. And when I saw GPT-4 and started interacting with it, I saw some mind-blowing capabilities that I thought I wouldn’t see for many years to come. And one of the surprises was how quickly we got here. So that’s kind of No. 1. And we can talk a lot more about what those surprising abilities are, but second, immediately, my mind goes to, what can we do with this? Because first of all, there’s so much potential we now have in terms of bringing that vision of helping people into reality.

But second of all, because I also care a lot about responsibility, I thought, “Oh, my god, this powerful model will come with so much responsibility.” What we, as Microsoft, build with this, plus what others will be able to build with this model or maybe models [that] will come next, that’s going to matter a lot not only for us as researchers, not only for users, but for our society overall.

So the other reaction I had was like, what can go wrong and what can we do to prevent those negative consequences from happening? And that’s going to be a long journey that we are going to be on.

Llorens: Sure. Let’s get further into those surprising capabilities.

Kamar: Yeah, sure. So one of the very surprising capabilities is how general-purpose these models are at the moment. I can prompt it to write code, write poems. I can ask—I’m Turkish. I can ask questions in Turkish and I can get very fluid responses in Turkish. It can actually write me beautiful poems about the sunset in Cappadocia in Turkish, which is like, oh my god, this is already creating an emotional reaction, right, when I’m interacting with it. And then you get into much more practical tasks. For example, how do I turn some of my thoughts into writing? Um, how can I customize my voice for different audiences? And the model seems to know about these things and can help me, not by producing a final result, but by bringing me to a point where I can be a lot more productive.

So that general-purpose nature of it, like I can go from writing a poem—which I’m terrible at—to writing academic papers—I think I’m better at that—and it helps me throughout the spectrum, when I’m not good at something and when I’m kind of good at something. That is just showing me so much potential, such a big spectrum.

But the other thing is the interactivity. It is not this static tool where I basically ask one thing, it gives me one answer, and I’m kind of done, like whatever I can do with that one turn is all I get. It is actually the opposite. It gives me a response and I can actually instruct it further. I can talk about my preferences, how I would like that to be changed for something that’s a much better fit for my needs.

And as a person, I may not be able to articulate my needs clearly at the beginning, so that interaction of being able to see what it can do and asking further is just making it a much, much more capable tool. And the other thing is the reasoning capabilities. What I mean by that is that, you know, for the last few years, as larger and larger models came out, we all said, OK, this is pretty powerful, but it is still just repeating patterns it has seen on the internet. And one of the terms—you know, I think some of my colleagues used it—was “stochastic parrots.” It’s just repeating things back to you. And what we are seeing with GPT-4—and I think it’s just the phase transition; we’re at this point in this phase transition and these capabilities are going to get stronger and stronger—is a capability for synthesis, for compiling information together to get to new insights that may not already exist. I’m not claiming all of those insights are correct, but they are giving people sparks that they can further think about and build on. Also, it can reason about multiple steps. It’s not a planner yet, but it has the basics of top-level reasoning, where we can start from a point, head towards the goal, and collaborate to work towards a plan to get there.

And those are all very powerful things, especially when we think about building an AI system that can take somebody’s goals and turn them into actions.

Llorens: So you mentioned planning as a limitation of the model, but let’s talk, you know, maybe more fully about the limitations that you see in the current model, the current state of the art.

Kamar: You know, a lot of people, when they think about these limitations, they see these as reasons not to invest in these technologies at all. I look at it from a different perspective. I see these as pieces of the puzzle that we need to invent and put in place. So we started this conversation with the distinction between the model and the system. The model is a very powerful piece of this puzzle, but as we are building these systems—like Bing is a great example, the GitHub Copilot is another example—we are seeing what they can do, but we are also seeing a lot about what they cannot do, and that is giving us, as researchers, ideas about new puzzle pieces we need to invent so that we can get to this architecture.

So a huge limitation: hallucinations. I think that is top of mind for a lot of us. These models are learning from large datasets on the internet; they don’t have fresh information. They are not able to separate reliable information from unreliable information. And also because these models are general-purpose tools, sometimes we want to use them for creating something new that doesn’t exist on the internet, for example, writing a brand-new poem that nobody else wrote before. But sometimes you want them as information retrieval engines, where the biggest requirement is being correct in terms of that information coming back. So we are all learning, like, how can we understand the purpose, turn it into prompts, and then figure out the best way to instruct these models so that we are getting our desired behavior in return, but also, how can we actually, in the future, specialize these models so that we can have versions that are much less prone to hallucinations?

How can we ground them with the right context and know how to communicate that intent well, so that I can be assured that whenever they are giving me information, giving me facts when I need the facts, they are giving me the right facts? We are at the very beginning of solving this puzzle. But in my mind, this is not a limitation.

This is actually showing me a lot of problems, research problems, to invest in.

Llorens: So, Ece, you’re a leader here at Microsoft Research. You’ve got a team, and your team, uh, is instrumental in this process of turning the model into a system, uh, for some of these applications. And I guess you’ve talked about understanding the purpose—systems have a purpose—and maybe there’s aspects of the system design that mitigate or deal with some of the limitations in order to make it fit for that purpose.

You mentioned grounding, for example, as one of those methods, but can you just get deeper maybe into grounding and some of the other techniques that you use to, again, turn the model into a system?

Kamar: Yeah, definitely. We have been working with different teams across Microsoft as some of these technologies find their way into products, both understanding the limitations and helping to overcome those limitations, um, with existing techniques. Of course, there’s a lot to be invented, but right now we still have some things in our capabilities list that we can apply to mitigate these problems to some extent.

Just to give a few examples, right, when we are giving search results, instead of just using GPT-4 to produce that result, we are actually getting better, more accurate results when the top search results are provided as context to the models for them to create their generations. So that is one technique that is currently implemented. That is an example of grounding, grounding with that context. You can imagine that for another application, let’s say for writing an email for you, here we can ground with your past emails written to the same person; we can ground based on your personal documents. For example, if I’m writing you an email about this podcast, you probably have an outline or a document where we have previously discussed some of these ideas. That becomes important grounding so that that email represents my voice, represents my thoughts, but it actually becomes a way for me to just do things faster and more efficiently. So those are some examples of the grounding. The other thing we have in our toolbox these days is how we talk to the model. This is called prompting. A lot of people are talking about prompting because we are discovering new ways to communicate with these models as developers.
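
To make the grounding idea concrete, here is a minimal sketch of the pattern Kamar describes: retrieved passages are prepended to the prompt so the model answers from supplied context rather than from memory alone. The names search_top_results and chat_complete are hypothetical stand-ins for whatever retrieval backend and chat-completion client a system uses; this is not the production Bing or Copilot code.

```python
from typing import Dict, List


def search_top_results(query: str, k: int = 3) -> List[str]:
    """Hypothetical stand-in for a search/retrieval call returning the top-k passages."""
    raise NotImplementedError("wire this to your search index or web search API")


def chat_complete(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for a chat-completion call to a large language model."""
    raise NotImplementedError("wire this to your model client")


def grounded_answer(query: str) -> str:
    # Retrieve context first, then ask the model to answer only from that context.
    passages = search_top_results(query)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    messages = [
        {
            "role": "system",
            "content": (
                "Answer using ONLY the numbered passages below. "
                "If they do not contain the answer, say you don't know.\n\n" + context
            ),
        },
        {"role": "user", "content": query},
    ]
    return chat_complete(messages)
```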

If you remember back in the day, um, the way a developer would talk to a machine learning model was giving labeled data. Here’s an example: True, false. Here’s an example: True, false. Now our communication channel with the model in terms of developing systems is increasing. Our bandwidth is so much higher. Now we can talk to the model in natural language.

The problem with this is that it is, uh, not a perfect specification. However, still, the way I can instruct the model carries a lot of power. So when we are building systems with prompting, we can tell the model, instruct the model, that whenever it is talking about a fact, it should cite the source of that material. This has two particular benefits. One benefit is that this is instructing the model that everything it says should be coming from a source and the links should be there. Of course, I’m not claiming that we are doing this perfectly, but it’s a step in that direction. But second, and the even more important reason, is that we are giving people the ability to check. As I said, none of the systems we are trying to build are there to automate the role of the human being.

It is all about complementarity and augmentation and enablement. So when we are building a system, giving results to the human, the goal is always having the human in the driver’s seat, having the human control what is being generated, and by providing sources in the results, that is one way we can enable the user, because then the user can go to these links and check.
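
As a rough illustration of the citation instruction described above, the sketch below asks the model to cite the numbered sources it was given and then flags citation markers that do not match any supplied passage, so a human reviewer knows which claims to verify. chat_complete is the same hypothetical stand-in as before; none of this reflects the actual product prompts.

```python
import re
from typing import Callable, Dict, List

ChatFn = Callable[[List[Dict[str, str]]], str]


def cited_answer(query: str, passages: List[str], chat_complete: ChatFn) -> str:
    # Instruct the model that every factual statement must cite a numbered source.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    messages = [
        {
            "role": "system",
            "content": (
                "Every factual statement must cite one of the numbered sources below, "
                "e.g. 'The speed of sound varies with temperature [2].'\n" + context
            ),
        },
        {"role": "user", "content": query},
    ]
    answer = chat_complete(messages)

    # Flag citation markers that don't match any supplied source, so a human
    # reviewer knows which claims to verify by hand.
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    dangling = sorted(cited - set(range(1, len(passages) + 1)))
    if dangling:
        print(f"Warning: answer cites unknown sources {dangling}")
    return answer
```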

These are just some of the things that we are currently inventing as, you know, short-term ideas to mitigate these problems as much as possible. But also we have to think about long-term solutions that can really make these problems go away. We are not there yet, but as a researcher, I’m very excited about the potential.

Llorens: I’d love to just drill into this notion of specification for a moment. You mentioned the complementarity, you mentioned the intent to have these systems amplify human agency, and with that stewardship of the system comes the expression of intent. And you know, you mentioned that, even in the era before machine learning, the way to express intent was through a very explicitly written program, and with, you know, machine learning for more narrow systems, it was identifying labels for data. And now we have natural language as a means of specification, and you called it an imperfect means of specification. So can you just maybe take us a little deeper into that thought?

Kamar: Yeah. So we have been talking about what we are seeing in the latest models in GPT-4 as a phase transition. We haven’t arrived at the best possible model, and we haven’t arrived at the best possible way to communicate with that model. We are at this very specific point in our history where we are saying, “OK, our models are getting really capable and that communication channel has opened up.

Now I can talk to it in natural language.” I personally don’t think that this very noisy way of just communicating things in natural language as a way of prompts is the final product of how we are going to be talking to our AI systems. However, it is a way, and with iteration, we can become more precise. So let me tell you this.

Let’s say I want this AI system to write me an email to you. The simple prompt could be, “Write me an email to Ashley, and it should talk about this and this.” I can see the result. Immediately, I can see what I don’t like about it. Imagine I could add more specification, right? I can say, “Oh, don’t mention this; include this, as well. The tone should be this way and not that way.”

These are all additional specifications I may not think about when I’m just prompting the model, but over time, I may get better and better in terms of really specifying my preferences, my intent. So right now, we’re in this very noisy process of almost like trial and error. We are trying something, looking at the result; if we don’t like it, we come up with a correction. I think over time we can really compile these experiences—how people are specifying things to these models—and that can pave the way for much better communication methods. Again, we don’t have the answers yet, but I’m also not thinking that these prompts are the best way to interact.
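
The trial-and-error specification loop Kamar describes might be pictured as a short multi-turn exchange like the sketch below. It is purely illustrative: chat_complete again stands in for a chat-completion client, and the corrections represent preferences the author only articulates after seeing a draft.

```python
from typing import Callable, Dict, List

ChatFn = Callable[[List[Dict[str, str]]], str]


def refine_email(chat_complete: ChatFn) -> str:
    # Start with a rough instruction and look at the first draft.
    messages = [
        {"role": "user", "content": "Write an email to Ashley summarizing our podcast discussion."}
    ]
    draft = chat_complete(messages)

    # Each correction is a preference the author only thinks of after seeing a draft.
    for correction in [
        "Don't mention scheduling; do include the part about system design.",
        "Make the tone more informal.",
    ]:
        messages += [
            {"role": "assistant", "content": draft},
            {"role": "user", "content": correction},
        ]
        draft = chat_complete(messages)
    return draft
```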

Llorens: And as I learn to specify my intent to a particular model, how much does that knowledge or that skill of prompting this model in an effective way translate when I pick up another model or maybe, you know, another iteration of the same model? Do I have to relearn it every time?

Kamar: Ideally not, because we all want to be consistent. Uh, we don’t want our experiences to go away just because we are starting over with a new model. Again, so far, a lot of the model developments have been guided by numbers—how big the models are, how accurate they are, how they do on certain benchmarks. Now, as these models are enabling real systems for humans, we need to bring in other criteria that are human-centered, criteria that cannot be explained only by how well you predict the next word but are about what you said: How can I get consistency in the way I communicate with this model? How does this model learn about me better? How can this model capture the right context about me? So I think we are at the beginning of understanding those human-centered considerations we want to have in these models and somehow incorporating them into the way these models are trained.

Llorens: Earlier you mentioned responsibility, you know, that Microsoft has a responsibility when we put these systems out in the world. As researchers and engineers, um, we have some stewardship of that responsibility in the design process and throughout the lifecycle. How has that manifested here, you know, for GPT-4 in the applications that you’ve worked on? How does that aspect of responsibility enter into the system design and engineering for you?

Kamar: In a very similar way to how we have been thinking about responsible AI for the last five, six years. It is a journey, and with every model, including GPT-4, the first step is understanding—understanding the capabilities, understanding the limitations, understanding what can go wrong and what we can do in the short term to keep those negative effects as small as possible.

So from the early days of Microsoft’s interaction with GPT-4, uh, many of my colleagues and I have been involved. We started playing with it. We started observing what it can do, what it cannot do, started documenting all of those capabilities. And now you need to take a step back and say, “OK, what can I say about the risks?” Because you observe the instances, but there are these higher-level risks that you should be thinking about. So it became obvious that hallucination was an issue. The other issue is something we call manipulation: these models don’t have a good understanding of what they don’t know, but at the same time, they also cannot admit that they don’t have the right answer, and they may actually even try to convince you as the user that what they are providing is the right one.

So we started thinking about what kind of mitigations we could put in place to make these problems as small as possible. Of course, another consideration is offensive language, biases, content moderation. So that’s another factor that a lot of my colleagues have been involved with from the early days. And we worked closely across the company in terms of putting practices in place.

Sometimes this is content moderation modules. Sometimes this is prompt engineering to keep hallucinations as low as possible. Sometimes it is really thinking about those high-level guidelines you can give to the systems to make these risks as low as possible. So we have been very heavily involved from the beginning, and we are also putting our ideas into publications to share with the wider world, because we are aware that not everybody will have as much experience with these models as we have.

So how can we actually capture our experience and share with our academic colleagues so that we can all think about these problems together? So now I think we have some understanding. Again, now this is distilling the longer-term research questions and getting our teams to focus on those.
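
One of the mitigations mentioned above is a content moderation module that screens model output before it reaches the user. The sketch below shows such a gate in its simplest form; moderation_score is a hypothetical stand-in for a moderation classifier or service, and this is not a description of Microsoft’s actual moderation stack.

```python
from typing import Callable, Dict, List

ChatFn = Callable[[List[Dict[str, str]]], str]
ScoreFn = Callable[[str], float]  # hypothetical moderation classifier: text -> risk score


def safe_reply(prompt: str, chat_complete: ChatFn, moderation_score: ScoreFn,
               threshold: float = 0.5) -> str:
    # Generate a reply, then screen it before it reaches the user.
    reply = chat_complete([{"role": "user", "content": prompt}])
    if moderation_score(reply) > threshold:
        return "Sorry, I can't share that response."
    return reply
```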

Llorens: You know, another important phase of the research lifecycle or the system lifecycle is test and evaluation. So you design a system; you conceptualize it; you develop it. At some point, you know, you put some mitigations in place, perhaps like the ones you suggested. Um, at some point, then, you have to test it. How does that happen, uh, with this kind of a system, this kind of general-purpose system?

Kamar: Yeah. So, you know, just thinking about traditional machine learning, testing was always a very core part of the way we built machine learning. You would collect some data, you would make part of that data the training set and part of that data the test set, and then you would have a way to measure every model you’re building from Day 1.

That is no longer the case with these generative models, especially as we get into this “prompt something and you have your application development” culture. There are really big questions about how we evaluate these models. The insight there is that because these models are generative, they can also be used for creating test data. So on the topic of hallucination, for example, we have been using GPT-4 for simulating dialogues fed by, um, common queries, and also to get the model to check if certain risks like hallucinations are happening.

So this is giving us a partly automated, GPT-4–powered evaluation pipeline that, of course, needs to have human eyes on it because not everything the machine generates or validates is always correct. But this gives us a loop to be able to generate data at scale and do evaluation. But, of course, not all problems are equally vital for our society.

There are certain things that carry a lot more weight than others. For example, even on the topic of hallucinations, if a search engine is providing wrong guidance on a critical health query, that is a much bigger risk. So this is why another important part of the evaluation is red teaming. How can we bring human eyes onto the system in the most critical ways and actually get them to check what the systems are doing?

So again, we are at the early days of figuring out what evaluation is going to look like for this new generation of models. Again, human-AI partnership is going to play a key role in the way we evaluate these systems. We see that generative capabilities of these models are powerful for creating data. Human eyes are always going to be important as the final checkers of what is right and what is wrong.

And we just need to build these techniques and make them part of the way we build AI systems with these latest models.
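
The partly automated evaluation loop described in this exchange could be sketched roughly as follows: the model generates test dialogues from seed queries and then acts as a judge of whether a reply is supported by the grounding passages, with flagged cases routed to human reviewers. All names here are hypothetical, and a real pipeline would be considerably more elaborate.

```python
from typing import Callable, Dict, List

ChatFn = Callable[[List[Dict[str, str]]], str]


def simulate_dialogue(seed_query: str, chat_complete: ChatFn) -> List[Dict[str, str]]:
    # Generate a short user/assistant exchange starting from a seed query.
    messages = [{"role": "user", "content": seed_query}]
    messages.append({"role": "assistant", "content": chat_complete(messages)})
    return messages


def flag_unsupported_claims(dialogue: List[Dict[str, str]], passages: List[str],
                            chat_complete: ChatFn) -> str:
    # Ask the model to judge whether the last reply is supported by the passages.
    # Anything flagged here should still go to human reviewers.
    judge_prompt = (
        "Given these source passages:\n"
        + "\n".join(passages)
        + "\n\nDoes the assistant reply below contain claims not supported by the passages? "
          "Answer SUPPORTED or UNSUPPORTED and list any unsupported claims.\n\n"
        + dialogue[-1]["content"]
    )
    return chat_complete([{"role": "user", "content": judge_prompt}])
```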

Llorens: I want to ask you about a term, uh, the term agent. Um, you, you kind of referenced it earlier, but I want to come back to it, and I want to come back to it in the context of what your vision for the future is for, I’ll say, AI models and systems that we use, that we create from those models.

What is that vision, and what does that vision have to do with agents?

Kamar: You know, the word agent comes from agency, and the question is, what does agency mean for an AI system? It is the fact that they are aware, they can act, and they can learn. So those are the three main capabilities we want to have in our AI systems. Just to go a bit deeper into this: being aware—again, we are building these agents not to act independently in the world. We are building them to partner with people and help people with their tasks. So when we talk about them being aware, we are talking about being aware of their users, being aware of their needs, being aware of their goals, and also being aware of the information about the world so that they don’t have to start from scratch. The other part is action—taking action on behalf of their users.

And here I think we are going to see a lot more interesting scenarios going forward in terms of what AI systems can do in partnership with people. Right now, we are seeing writing documents, collecting information from the web, and presenting it, but in the future, what other creative things can AI systems and humans do together?

What other tasks are there that you just don’t want to do and would want the AI to take over, with your accountability and control, of course? So that’s the part of the acting we need to figure out. And the other part that is very important is learning. We talked about GitHub Copilot, which is a wonderful AI application that so many people in the world are getting value from.

At the same time, we are not only talking about GitHub Copilot getting better at code completion; we are talking about GitHub Copilot getting better in terms of providing value for people. So in terms of getting better, we have to figure out what human-centered reward we can provide to these AI systems in terms of the value people get—what has been good, what has been bad—and use that reward signal to teach the machine how to act better in the world. Those are all part of the framework we have for this AI agent. And just to reiterate, this is always going to have these very powerful models as a building block. But as you can imagine, we will need other components to get there.

[MUSIC]

Llorens: Thanks, Ece. Well, I’m certainly excited by the technologies we have today, and I’m excited for the vision that you’ve articulated for the future. So, yeah, really appreciate you sharing that vision with us today, and thanks for spending the time.

Kamar: Thank you.
