A game-theoretic approach to provably correct and scalable offline RL

Despite the increasingly widespread use of machine learning (ML) in all aspects of our lives, a broad class of scenarios still relies on automation designed by people, not artificial intelligence (AI). In real-world applications that involve making sequences of decisions with long-term consequences, from allocating beds in an intensive-care unit to controlling robots, decision-making strategies to this day are carefully crafted by experienced engineers. But what about reinforcement learning (RL), which gave machines supremacy in games as distinct as Ms. Pac-Man and Go? For all its appeal, RL – specifically, its most famous flavor, online RL – has a significant drawback outside of scenarios that can be simulated and have well-defined behavioral rules. Online RL agents learn by trial and error. They need opportunities to try various actions, observe their consequences, and improve as a result. Making wildly suboptimal decisions just for learning’s sake is acceptable when the biggest stake is the premature demise of a computer game character or showing an irrelevant ad to a website visitor. For tasks such as training a self-driving car’s AI, however, it is clearly not an option.

Offline reinforcement learning (RL) is a paradigm for designing agents that can learn from large existing datasets – possibly collected by recording data from existing reliable but suboptimal human-designed strategies – to make sequential decisions. Unlike conventional online RL, offline RL can learn policies without collecting online data, and even without interacting with a simulator. Moreover, because offline RL does not blindly mimic the behaviors seen in the data, as imitation learning (an alternative strategy for learning from data) does, it does not require expensive expert-quality decision examples, and the learned policy can potentially outperform the best data-collection policy. This means, for example, that an offline RL agent in principle can learn a competitive driving policy from logged datasets of regular driving behavior. Therefore, offline RL offers great potential for large-scale deployment and real-world problem solving.

Two figures and two arrows connecting them. The left figure shows examples of real-world sequential decision-making problems such as robotic manipulation, health care, and autonomous driving. The right figure shows that these problems only have non-exploratory logged data. An arrow pointing from the left figure to the right figure shows data collection is costly and risky in these applications. Another arrow pointing from the left figure to the right figure highlights the question “How to make decisions under systematic uncertainty due to missing data coverage?”.

However, offline RL faces a fundamental challenge: the data we can collect in large quantities lacks diversity, so it cannot be used to reliably estimate how well a policy would perform in the real world. While we often associate the term “Big Data” with diverse datasets in ML, this association breaks down when the data concerns real-world “sequential” decision making. In fact, curating diverse datasets for these problems can range from difficult to nearly impossible, because it would require running unacceptable experiments in extreme scenarios (like staging the moments just before a car crash, or conducting unethical clinical trials). As a result, the data that gets collected in large quantities, counterintuitively, lacks diversity, which limits its usefulness.

In this post, we introduce a generic game-theoretic framework for offline RL. We frame the offline RL problem as a two-player game in which a learning agent competes with an adversary that simulates the uncertain decision outcomes due to missing data coverage. Through this game analogy, we obtain a systematic and provably correct way to design offline RL algorithms that can learn good policies with state-of-the-art empirical performance. Finally, we show that this framework provides a natural connection between offline RL and imitation learning through the lens of generative adversarial networks (GANs). This connection ensures that the policies learned by the game-theoretic framework are always guaranteed to be no worse than the data collection policies. In other words, with this framework, we can use existing data to robustly learn policies that improve upon the human-designed strategies currently running in the system.

The content of this post is based on our recent papers Bellman-consistent Pessimism for Offline Reinforcement Learning (Oral Presentation, NeurIPS 2021) and Adversarially Trained Actor Critic for Offline Reinforcement Learning (Outstanding Paper Runner-up, ICML 2022).

Fundamental difficulty of offline RL and version space

A major limitation of making decisions with only offline data is that existing datasets do not include all possible scenarios in the real world. Hypothetically, suppose that we have a large dataset of how doctors treated patients over the past 50 years, and we want to use RL to design an AI agent that makes treatment recommendations. If we run a typical RL algorithm on this dataset, the agent might come up with absurd treatments, such as treating a patient with pneumonia through amputation. This is because the dataset has no examples of what happens to pneumonia patients after amputation, so “amputation cures pneumonia” may seem to be a plausible scenario to the learning agent; no information in the data would falsify such a hypothesis.

To address this issue, the agent needs to carefully consider uncertainties due to missing data. Rather than fixating on a particular data-consistent outcome, the agent should be aware of the different possibilities (e.g., while amputating the leg might cure the pneumonia, it also might not, since we do not have sufficient evidence for either scenario) before committing to a decision. Such deliberately conservative reasoning is especially important when the agent’s decisions can cause negative consequences in the real world.

A formal way to express the idea above is through the concept of version space. In machine learning, a version space is a set of hypotheses consistent with data. In the context of offline RL, we will form a version space for each candidate decision, describing the possible outcomes if we are to make the decision. Notably, the version space may contain multiple possible outcomes for a candidate decision whose data is missing.

To better understand what a version space is, we use the figure below to visualize the version space of a simplified RL problem where the decision horizon is one. Here we want to select actions that obtain the highest reward (such as deciding which drug treatment has the highest chance of curing a disease). Suppose we have data of action-reward pairs, and we want to reason about the reward obtained by taking an action. If the underlying reward function is linear, then we have the version space shown in red, which is the subset of (reward) hypotheses that are consistent with the data. In general, for a full RL problem, a version space can be more complicated than a set of reward functions – e.g., we would need to start thinking about hypotheses over world models or value functions. Nonetheless, concepts similar to the one shown in the figure below still apply.

A figure that illustrates the concept of the version space in a bandit example. It is a 2D plot where the x-axis denotes actions, and the y-axis denotes reward. It shows data of sampled reward values of different actions as dots, and different hypotheses of how reward depends on action as a function. The functions that are consistent with the observed data form the version space.
Figure 1: Illustration of version space and hypotheses in a bandit example.
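
To make the picture concrete, here is a minimal sketch (with made-up data; hypotheses are lines of the form reward = w × action, as in the figure) that computes a version space by keeping exactly the hypotheses whose average error on the observed data is below a small tolerance:

import numpy as np

actions = np.array([0.1, 0.4, 0.6])      # observed actions
rewards = np.array([0.12, 0.35, 0.61])   # sampled (noisy) rewards
slopes = np.linspace(-2.0, 2.0, 401)     # candidate hypotheses: reward = w * action

def data_loss(w):
    # Mean squared error of hypothesis w on the observed data.
    return np.mean((rewards - w * actions) ** 2)

tolerance = 0.005
version_space = [w for w in slopes if data_loss(w) <= tolerance]
print(min(version_space), max(version_space))  # range of data-consistent slopes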

Thinking of offline RL as a two-player game

If we think about uncertainties as hypotheses in a version space, then a natural strategy for designing offline RL agents is to optimize for the worst-case scenario among all hypotheses in the version space. In this way, the agent treats every scenario in the version space as possible, avoiding decisions that rely on a single, potentially delusional, outcome.

Below we show that this worst-case approach to offline RL can be elegantly described by a two-player game called a Stackelberg game. To this end, let us first introduce some math notation and explain what a Stackelberg game is.

Notation We use \(\pi\) to denote a decision policy and \(J(\pi)\) to denote the performance of \(\pi\) in the application environment. We use \(\Pi\) to denote the set of policies that the learning agent is considering, and \(\mathcal{H}\) to denote the set of all hypotheses. We also define a loss function \(\psi: \Pi \times \mathcal{H} \to [0, \infty)\) such that if \(\psi(\pi, H)\) is small, then \(H\) is a data-consistent hypothesis with respect to \(\pi\); conversely, \(\psi(\pi, H)\) gets larger for data-inconsistent ones (e.g., we can treat each \(H\) as a potential model of the world and \(\psi(\pi, H)\) as the modeling error on the data). Consequently, the version space above is defined as \(\mathcal{V}_\pi = \{ H : \psi(\pi, H) \leq \varepsilon \}\) for some small \(\varepsilon\).

Given a hypothesis \(H \in \mathcal{H}\), we use \(H(\pi)\) to denote the performance of \(\pi\) predicted by \(H\), which may differ from \(\pi\)’s true performance \(J(\pi)\). As a standard assumption, we suppose that for every \(\pi \in \Pi\) there is some \(H_\pi^* \in \mathcal{H}\) that describes the true outcome of \(\pi\), that is, \(J(\pi) = H_\pi^*(\pi)\) and \(\psi(\pi, H_\pi^*) = 0\).

Stackelberg Game In short, a Stackelberg game is a bilevel optimization problem,

\(\displaystyle \max_{x} f(x, y_x), \quad \text{s.t.} \;\; y_x \in \arg\min_{y} g(x, y).\)

In this two-player game, the Leader \(x\) and the Follower \(y\) maximize and minimize the payoffs \(f\) and \(g\), respectively, under the rule that the Follower plays after the Leader. (This is reflected in the subscript of \(y_x\): \(y\) is decided based on the value of \(x\).) In the special case of \(f = g\), a Stackelberg game reduces to a zero-sum game \(\displaystyle \max_{x} \min_{y} f(x, y)\).
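
To make the structure concrete, here is a tiny numeric sketch (with made-up payoff functions, not from the papers) that solves a Stackelberg game by brute force over discretized plays:

import numpy as np

xs = np.linspace(-1.0, 1.0, 21)   # Leader's candidate plays
ys = np.linspace(-1.0, 1.0, 21)   # Follower's candidate plays

f = lambda x, y: -(x - 0.3) ** 2 + x * y   # Leader's payoff (maximized)
g = lambda x, y: (y - x) ** 2              # Follower's payoff (minimized)

def follower_best_response(x):
    # The Follower plays after seeing x, minimizing g.
    return min(ys, key=lambda y: g(x, y))

# The Leader maximizes f, anticipating the Follower's best response.
best_x = max(xs, key=lambda x: f(x, follower_best_response(x)))
print(best_x, follower_best_response(best_x))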

A figure shows a learner that tries to compete with an adversary in a two player game, which resembles a chess game. The learner is thinking about whether to use absolute or relative pessimism as the strategy to choose the policy. The adversary is thinking about which hypothesis from the version space to choose.
Figure 2: Offline RL as two-player Stackelberg game.

We can think about offline RL as a Stackelberg game: we let the learning agent be the Leader and introduce a fictitious adversary as the Follower, which chooses hypotheses from the version space \(\mathcal{V}_\pi\) based on the Leader’s policy \(\pi\). We then define the payoffs above as performance estimates of a policy in view of a hypothesis. By this construction, solving the Stackelberg game means finding policies that maximize the worst-case performance, which is exactly our starting goal.

Now we give details on how this game can be designed, using absolute pessimism or relative pessimism, so that we can have performance guarantees in offline RL.

A two-player game based on absolute pessimism

Our first instantiation is a two-player game based on the concept of absolute pessimism, introduced in the paper Bellman-consistent Pessimism for Offline Reinforcement Learning (NeurIPS 2021).

\(\displaystyle \max_{\pi \in \Pi} H_\pi(\pi), \quad \text{s.t.} \;\; H_\pi \in \arg\min_{H \in \mathcal{H}} H(\pi) + \beta \psi(\pi, H)\)

where \(\beta \ge 0\) is a hyperparameter that controls how strongly we want the hypothesis to be data consistent. That is, the larger \(\beta\) is, the smaller the version space \(\mathcal{V}_\pi = \{ H : \psi(\pi, H) \leq \varepsilon \}\) is (since \(\varepsilon\) is effectively smaller), so we can think of \(\beta\) as trading off conservatism and generalization in learning.

This two-player game aims to optimize for the worst-case absolute performance of the agent, as we can treat \(H_\pi\) as the most pessimistic hypothesis in \(\mathcal{H}\) that is data consistent. Specifically, we can show \(H_\pi(\pi) \le J(\pi)\) for any \(\beta \ge 0\); as a result, the learner is always optimizing a performance lower bound.

In addition, we can show that, by the design of \(\psi\), this lower bound is tight when \(\beta\) is chosen well (that is, it underestimates the performance only when the policy runs into situations for which we lack data). As a result, with a properly chosen \(\beta\) value, the policy found by the above game formulation is optimal whenever the data covers the relevant scenarios that an optimal policy would visit, even if the data was collected by a suboptimal policy.

In practice, we can define the hypothesis space \(\mathcal{H}\) as a set of value critics (model-free) or world models (model-based). For example, if we set \(\mathcal{H}\) to be candidate Q-functions and \(\psi\) to be the Bellman error of a Q-function with respect to a policy, we get the actor-critic algorithm PSPI. If we set \(\mathcal{H}\) to be candidate world models and \(\psi\) to be the model-fitting error, we get the CPPO algorithm.
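
For intuition, here is a minimal sketch of this bilevel optimization over finite policy and hypothesis sets (illustrative only; PSPI and CPPO solve it with function approximation and gradient-based optimization rather than enumeration). Hypotheses are callables mapping a policy to a predicted performance, and psi is the data-consistency loss defined above:

def pessimistic_value(pi, hypotheses, psi, beta):
    # Adversary: pick the data-consistent hypothesis most pessimistic about pi.
    H_pi = min(hypotheses, key=lambda H: H(pi) + beta * psi(pi, H))
    return H_pi(pi)  # a lower bound on J(pi) for any beta >= 0

def absolute_pessimism_solve(policies, hypotheses, psi, beta):
    # Learner: maximize the worst-case performance estimate.
    return max(policies, key=lambda pi: pessimistic_value(pi, hypotheses, psi, beta))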

A two-player game based on relative pessimism

While the above game formulation guarantees learning optimality for a well-tuned hyperparameter \(\beta\), the policy performance can be arbitrarily bad if the \(\beta\) value is off. To address this issue, we introduce an alternative two-player game based on relative pessimism, proposed in the paper Adversarially Trained Actor Critic for Offline Reinforcement Learning (ICML 2022).

\(\displaystyle \max_{\pi \in \Pi} H_\pi(\pi) - H_\pi(\mu), \quad \text{s.t.} \;\; H_\pi \in \arg\min_{H \in \mathcal{H}} H(\pi) - H(\mu) + \beta \psi(\pi, H)\)

where we use \(\mu\) to denote the data collection policy (also known as the behavior policy).

Unlike the absolute pessimism version above, this relative pessimism version is designed to optimize the worst-case performance relative to the behavior policy \(\mu\). Specifically, we can show \(H_\pi(\pi) - H_\pi(\mu) \leq J(\pi) - J(\mu)\) for all \(\beta \ge 0\). And again, we can instantiate the hypotheses in model-free or model-based manners, as in the discussion above. When instantiated model-free, we get the ATAC (Adversarially Trained Actor Critic) algorithm, which achieves state-of-the-art empirical results on offline RL benchmarks.

An important benefit of this relative pessimism version is robustness to the choice of \(\beta\): we can guarantee that the learned policy is always no worse than the behavior policy, despite data uncertainty and regardless of the \(\beta\) choice – a property we call robust policy improvement. The intuition behind this is that \(\pi = \mu\) achieves a zero objective in this game, so the agent has an incentive to deviate from \(\mu\) only if it finds a policy \(\pi\) that is uniformly better than \(\mu\) for all possible data-consistent hypotheses.

At the same time, the relative pessimism game also guarantees learning optimality when \(\beta\) is chosen correctly, just like the absolute pessimism game above. Therefore, in some sense, we can view the relative pessimism game (e.g., ATAC) as a robust version of the absolute pessimism game (e.g., PSPI).
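
The same enumeration-style sketch adapts to the relative pessimism game; the comments mark why \(\pi = \mu\) always scores exactly zero, which is where robust policy improvement comes from:

def relative_gain_lower_bound(pi, mu, hypotheses, psi, beta):
    # Adversary: most pessimistic data-consistent hypothesis about pi relative to mu.
    H_pi = min(hypotheses, key=lambda H: H(pi) - H(mu) + beta * psi(pi, H))
    return H_pi(pi) - H_pi(mu)  # <= J(pi) - J(mu); equals 0 when pi == mu

def relative_pessimism_solve(policies, mu, hypotheses, psi, beta):
    # Learner: deviate from mu only if some policy beats it under every
    # data-consistent hypothesis.
    return max(policies,
               key=lambda pi: relative_gain_lower_bound(pi, mu, hypotheses, psi, beta))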

A connection between offline RL and imitation learning

A figure that shows a spectrum unifying offline reinforcement learning and imitation learning through the lens of ATAC’s Stackelberg game. The left of the spectrum is offline reinforcement learning, which corresponds to ATAC with a good β value; the right of the spectrum is imitation learning, which corresponds to ATAC with β equal to zero. On top of the spectrum, there are figures visualizing how the hypothesis space and the learned policy behave in each scenario. Overall, when β is larger, the hypothesis space is smaller; on the other hand, when β is smaller, the learned policy is closer to the behavior policy that collects the data.
Figure 3: ATAC based on a Stackelberg game of relative pessimism provides a natural connection between offline RL and imitation learning.

An interesting takeaway from the discussion above is a clean connection between offline RL and imitation learning (IL) based on GANs or integral probability metrics. Like offline RL, IL tries to learn good policies from offline data, but IL does not use reward information, so the best strategy for IL is to mimic the data collection policy. Among modern IL algorithms, one effective strategy is to use GANs, where we train the policy (the generator) against an adversarial discriminator that tries to separate the actions generated by the policy from the actions in the data.

Now that we understand how GAN-based IL works, we can see the connection between offline RL and IL through the model-free version of the relative pessimism game, that is, ATAC. We can view the actor and the critic of ATAC as the generator and the discriminator in GAN-based IL. By choosing different \(\beta\) values, we can control the strength of the discriminator: when \(\beta = 0\), the discriminator is the strongest, and the best generator is naturally the behavior policy, which recovers the imitation learning behavior. On the other hand, using a larger \(\beta\) weakens the discriminator through Bellman regularization (i.e., \(\psi\)) and leads to offline RL.

In conclusion, our game-theoretic framework shows

Offline RL + Relative Pessimism = IL + Bellman Regularization.

We can view both as solving a version of the GAN problem! The only difference is that offline RL uses more restricted discriminators that must be consistent with the observed rewards, since reward labels are exactly the extra information offline RL has compared with imitation learning.

The high-level takeaway from this connection is that the policies learned by offline RL with the relative pessimism game are guaranteed to be no worse than the data collection policy. In other words, we look forward to exploring applications that robustly improve upon existing human-designed strategies running in the system, using only existing data despite its lack of diversity.


Collaborative machine learning that preserves privacy

Training a machine-learning model to effectively perform a task, such as image classification, involves showing the model thousands, millions, or even billions of example images. Gathering such enormous datasets can be especially challenging when privacy is a concern, such as with medical images. Researchers from MIT and the MIT-born startup DynamoFL have now taken one popular solution to this problem, known as federated learning, and made it faster and more accurate.

Federated learning is a collaborative method for training a machine-learning model that keeps sensitive user data private. Hundreds or thousands of users each train their own model using their own data on their own device. Then users transfer their models to a central server, which combines them to come up with a better model that it sends back to all users.

A collection of hospitals located around the world, for example, could use this method to train a machine-learning model that identifies brain tumors in medical images, while keeping patient data secure on their local servers.

But federated learning has some drawbacks. Transferring a large machine-learning model to and from a central server involves moving a lot of data, which has high communication costs, especially since the model must be sent back and forth dozens or even hundreds of times. Plus, each user gathers their own data, so those data don’t necessarily follow the same statistical patterns, which hampers the performance of the combined model. And that combined model is made by taking an average — it is not personalized for each user.

The researchers developed a technique that can simultaneously address these three problems of federated learning. Their method boosts the accuracy of the combined machine-learning model while significantly reducing its size, which speeds up communication between users and the central server. It also ensures that each user receives a model that is more personalized for their environment, which improves performance.

The researchers were able to reduce the model size by nearly an order of magnitude when compared to other techniques, which led to communication costs that were between four and six times lower for individual users. Their technique was also able to increase the model’s overall accuracy by about 10 percent.

“A lot of papers have addressed one of the problems of federated learning, but the challenge was to put all of this together. Algorithms that focus just on personalization or communication efficiency don’t provide a good enough solution. We wanted to be sure we were able to optimize for everything, so this technique could actually be used in the real world,” says Vaikkunth Mugunthan PhD ’22, lead author of a paper that introduces this technique.

Mugunthan wrote the paper with his advisor, senior author Lalana Kagal, a principal research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL). The work will be presented at the European Conference on Computer Vision.

Cutting a model down to size

The system the researchers developed, called FedLTN, relies on an idea in machine learning known as the lottery ticket hypothesis. This hypothesis says that within very large neural network models there exist much smaller subnetworks that can achieve the same performance. Finding one of these subnetworks is akin to finding a winning lottery ticket. (LTN stands for “lottery ticket network.”)

Neural networks, loosely based on the human brain, are machine-learning models that learn to solve problems using interconnected layers of nodes, or neurons.

Finding a winning lottery ticket network is more complicated than a simple scratch-off. The researchers must use a process called iterative pruning. If the model’s accuracy is above a set threshold, they remove nodes and the connections between them (just like pruning branches off a bush) and then test the leaner neural network to see if the accuracy remains above the threshold.
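
As a rough sketch of that loop (generic magnitude pruning applied to a Keras model, not the FedLTN implementation; evaluate stands for a hypothetical accuracy-measuring function), iterative pruning might look like this:

import numpy as np

def iterative_prune(model, evaluate, accuracy_threshold, prune_fraction=0.1):
    # Repeatedly zero out the smallest-magnitude connections, then test the
    # leaner network; stop once accuracy falls below the threshold.
    while evaluate(model) > accuracy_threshold:
        for w in model.trainable_weights:
            vals = w.numpy()
            nonzero = np.abs(vals[vals != 0.0])
            if nonzero.size == 0:
                continue
            cutoff = np.quantile(nonzero, prune_fraction)
            vals[np.abs(vals) < cutoff] = 0.0  # prune the weakest connections
            w.assign(vals)
    return model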

Other methods have used this pruning technique for federated learning to create smaller machine-learning models which could be transferred more efficiently. But while these methods may speed things up, model performance suffers.

Mugunthan and Kagal applied a few novel techniques to accelerate the pruning process while making the new, smaller models more accurate and personalized for each user.

They accelerated pruning by avoiding a step where the remaining parts of the pruned neural network are “rewound” to their original values. They also trained the model before pruning it, which makes it more accurate so it can be pruned at a faster rate, Mugunthan explains.

To make each model more personalized for the user’s environment, they were careful not to prune away layers in the network that capture important statistical information about that user’s specific data. In addition, when the models were all combined, they made use of information stored in the central server so it wasn’t starting from scratch for each round of communication.

They also developed a technique to reduce the number of communication rounds for users with resource-constrained devices, like a smart phone on a slow network. These users start the federated learning process with a leaner model that has already been optimized by a subset of other users.

Winning big with lottery ticket networks

When they put FedLTN to the test in simulations, it led to better performance and reduced communication costs across the board. In one experiment, a traditional federated learning approach produced a model that was 45 megabytes in size, while their technique generated a model with the same accuracy that was only 5 megabytes. In another test, a state-of-the-art technique required 12,000 megabytes of communication between users and the server to train one model, whereas FedLTN only required 4,500 megabytes.

With FedLTN, the worst-performing clients still saw a performance boost of more than 10 percent. And the overall model accuracy beat the state-of-the-art personalization algorithm by nearly 10 percent, Mugunthan adds.

Now that they have developed and finetuned FedLTN, Mugunthan is working to integrate the technique into a federated learning startup he recently founded, DynamoFL.

Moving forward, he hopes to continue enhancing this method. For instance, the researchers have demonstrated success using datasets that had labels, but a greater challenge would be applying the same techniques to unlabeled data, he says.

Mugunthan is hopeful this work inspires other researchers to rethink how they approach federated learning.

“This work shows the importance of thinking about these problems from a holistic aspect, and not just individual metrics that have to be improved. Sometimes, improving one metric can actually cause a downgrade in the other metrics. Instead, we should be focusing on how we can improve a bunch of things together, which is really important if it is to be deployed in the real world,” he says.


ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer

Generating robust and reliable correspondences across images is a fundamental task for a diversity of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher that is built on hierarchical attention structure, adopting a novel attention operation which is capable of adjusting attention span in a self-adaptive manner. To achieve this goal, first, flow maps are regressed in each cross attention phase to locate the center of search region. Next, a sampling grid is generated around the center, whose size, instead of… (Apple Machine Learning Research)

Multi-objective Hyper-parameter Optimization of Behavioral Song Embeddings

Song embeddings are a key component of most music recommendation engines. In this work, we study the hyper-parameter optimization of behavioral song embeddings based on Word2Vec on a selection of downstream tasks, namely next-song recommendation, false neighbor rejection, and artist and genre clustering. We present new optimization objectives and metrics to monitor the effects of hyper-parameter optimization. We show that single-objective optimization can cause side effects on the non optimized metrics and propose a simple multi-objective optimization to mitigate these effects. We find that… (Apple Machine Learning Research)

What’s new in TensorFlow 2.10?

Posted by the TensorFlow Team

TensorFlow 2.10 has been released! Highlights of this release include user-friendly features in Keras to help you develop transformers, deterministic and stateless initializers, updates to the optimizers API, and new tools to help you load audio data. We’ve also made performance enhancements with oneDNN, expanded GPU support on Windows, and more. This release also marks TensorFlow Decision Forests 1.0! Read on to learn more.

Keras

Expanded, unified mask support for Keras attention layers

Starting from TensorFlow 2.10, mask handling for Keras attention layers, such as tf.keras.layers.Attention, tf.keras.layers.AdditiveAttention, and tf.keras.layers.MultiHeadAttention, has been expanded and unified. In particular, we’ve added two features:

Causal attention: All three layers now support a use_causal_mask argument to call (Attention and AdditiveAttention used to take a causal argument to __init__).

Implicit masking: Keras Attention, AdditiveAttention, and MultiHeadAttention layers now support implicit masking (set mask_zero=True in tf.keras.layers.Embedding).

Combined, this simplifies the implementation of any Transformer-style model since getting the masking right is often a tricky part.

A basic Transformer self-attention block can now be written as:

import tensorflow as tf

embedding = tf.keras.layers.Embedding(
    input_dim=10,
    output_dim=3,
    mask_zero=True)  # Infer a correct padding mask.

# Instantiate a Keras multi-head attention (MHA) layer,
# a layer normalization layer, and an `Add` layer object.
mha = tf.keras.layers.MultiHeadAttention(key_dim=4, num_heads=1)
layernorm = tf.keras.layers.LayerNormalization()
add = tf.keras.layers.Add()

# Test input.
x = tf.constant([[1, 2, 3, 4, 5, 0, 0, 0, 0],
                 [1, 2, 1, 0, 0, 0, 0, 0, 0]])
# The embedding layer sets the mask.
x = embedding(x)

# The MHA layer uses and propagates the mask.
a = mha(query=x, key=x, value=x, use_causal_mask=True)
x = add([x, a])  # The `Add` layer propagates the mask.
x = layernorm(x)

# The mask made it through all layers.
print(x._keras_mask)

And here’s the output:

tf.Tensor(
[[ True  True  True  True  True False False False False]
 [ True  True  True False False False False False False]], shape=(2, 9), dtype=bool)

Try out the new Keras Optimizers API

In the previous release, TensorFlow 2.9, we published a new version of the Keras Optimizer API, in tf.keras.optimizers.experimental, which will replace the current tf.keras.optimizers namespace in TensorFlow 2.11. To prepare for the upcoming formal switch of the optimizer namespace to the new API, we’ve also exported all of the current Keras optimizers under tf.keras.optimizers.legacy in TensorFlow 2.10.

Most users won’t be affected by this change, but please check the API doc to see if any API used in your workflow has changed. If you decide to keep using the old optimizer, please explicitly change your optimizer to the corresponding tf.keras.optimizers.legacy.Optimizer.
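
For example, pinning a model to the old implementation is a one-line change (a sketch; model stands for your existing Keras model):

model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy")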

You can also find more details about new Keras Optimizers in this article.

Deterministic and Stateless Keras initializers

In TensorFlow 2.10, we’ve made Keras initializers (the tf.keras.initializers API) stateless and deterministic, built on top of stateless TF random ops. Starting in TensorFlow 2.10, both seeded and unseeded Keras initializers will always generate the same values every time they are called (for a given variable shape). The stateless initializer enables Keras to support new features such as multi-client model training with DTensor.

init = tf.keras.initializers.RandomNormal()
a = init((3, 2))
b = init((3, 2))
# a == b

init_2 = tf.keras.initializers.RandomNormal(seed=1)
c = init_2((3, 2))
d = init_2((3, 2))
# c == d
# a != c

init_3 = tf.keras.initializers.RandomNormal(seed=1)
e = init_3((3, 2))
# e == c

init_4 = tf.keras.initializers.RandomNormal()
f = init_4((3, 2))
# f != a

For unseeded initializers (seed=None), a random seed will be created and assigned at initializer creation (different initializer instances get different seeds). An unseeded initializer will raise a warning if it is reused (called) multiple times. This is because it would produce the same values each time, which may not be intended.

BackupAndRestore checkpoints with step level granularity

In the previous release, TensorFlow 2.9, the tf.keras.callbacks.BackupAndRestore Keras callback would back up the model and training state at epoch boundaries. In TensorFlow 2.10, the callback can also back up the model every N training steps. However, keep in mind that when BackupAndRestore is used with tf.distribute.MultiWorkerMirroredStrategy, the distributed dataset iterator state will be reinitialized and won’t be restored when restoring the model. More information and code examples can be found in the migrate the fault tolerance mechanism guide.
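
A minimal sketch of step-level backups (the directory, frequency, and the model and train_dataset objects are illustrative, assuming the integer save_freq described above):

backup = tf.keras.callbacks.BackupAndRestore(
    backup_dir="/tmp/train_backup",  # where backups are written
    save_freq=100)                   # back up every 100 training steps
model.fit(train_dataset, epochs=10, callbacks=[backup])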

Easily generate an audio classification dataset from a directory of audio files

You can now use a new utility, tf.keras.utils.audio_dataset_from_directory, to easily generate audio classification datasets from directories of .wav files. Just sort your audio files into one different directory per file class, and a single line of code will get you a labeled tf.data.Dataset you can pass to a Keras model. You can find an example here.
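
For example (the directory name is hypothetical; each class gets its own subdirectory of .wav files):

train_ds = tf.keras.utils.audio_dataset_from_directory(
    "sounds/",                     # e.g. sounds/alarm/*.wav, sounds/no_alarm/*.wav
    batch_size=32,
    output_sequence_length=16000)  # pad or trim every clip to the same length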

The EinsumDense layer is no longer experimental

The einsum function is the swiss army knife of linear algebra. It can efficiently and explicitly describe a wide variety of operations. The tf.keras.layers.EinsumDense layer brings some of that power to Keras.

Operations like einsum, einops.rearrange, and the EinsumDense layer operate based on a string “equation” that describes the axes of the inputs and outputs. For EinsumDense the equation lists the axes of the input argument, the axes of the weights, and the axes of the output. A basic Dense layer can be written as:

dense = keras.layers.Dense(units=10, activation='relu')

dense = keras.layers.EinsumDense('...i, ij -> ...j', output_shape=(10,), activation='relu')

Notes:

  • ...i – This only works on the last axis of the input; that axis is called i.
  • ij – The weights are a matrix with shape (i, j).
  • ...j – The result sums out the i axis and leaves j.

For example, here is a stack of 5 Dense layers with 10 units each:

dense = keras.layers.EinsumDense('...i, nij -> ...nj', output_shape=(5, 10))

Here is a stack of Dense layers, where each one operates on a different input vector:

dense = keras.layers.EinsumDense('...ni, nij -> ...nj', output_shape=(5, 10))

Here is a stack of Dense layers where each one operates on each input vector independently:

dense = keras.layers.EinsumDense('...ni, mij -> ...nmj', output_shape=(None, 5, 10))

Performance and collaborations

Improved aarch64 CPU performance: ACL/oneDNN integration

We have worked with Arm, AWS, and Linaro to integrate Compute Library for the Arm® Architecture (ACL) with TensorFlow through oneDNN to accelerate performance on aarch64 CPUs. Starting with TensorFlow 2.10, you can try these experimental optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1 before running your TensorFlow program.

There may be slightly different numerical results due to different computation and floating-point round-off approaches. If this causes issues for you, turn the optimizations off by setting TF_ENABLE_ONEDNN_OPTS=0 before running your program.

To verify that the optimizations are on, look for a message beginning with “oneDNN custom operations are on” in your program log. We welcome feedback on GitHub and the TensorFlow Forum.

Expanded GPU support on Windows

TensorFlow can now leverage a wider range of GPUs on Windows through the TensorFlow-DirectML plug-in. To enable model training on DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm, install the plug-in alongside standard TensorFlow CPU packages on native Windows or WSL2. The preview package currently supports a limited number of basic machine learning models, with a goal to increase model coverage in the future. You can view the open-source code and leave feedback at the TensorFlow-DirectML GitHub repository.

New features in tf.data

Create tf.data Dataset from lists of elements

TensorFlow 2.10 introduces a convenient new experimental API, tf.data.experimental.from_list, which creates a tf.data.Dataset from a given list of elements. The returned dataset produces the items in the list one by one. The functionality is identical to tf.data.Dataset.from_tensor_slices when elements are scalars, but different when elements have structure.

Consider the following example:

dataset = tf.data.experimental.from_list([(1, 'a'), (2, 'b'), (3, 'c')])
list(dataset.as_numpy_iterator())
[(1, 'a'), (2, 'b'), (3, 'c')]

In contrast, to get the same output with `from_tensor_slices`, the data needs to be reorganized:

dataset = tf.data.Dataset.from_tensor_slices(([1, 2, 3], ['a', 'b', 'c']))
list(dataset.as_numpy_iterator())
[(1, 'a'), (2, 'b'), (3, 'c')]

Unlike the from_tensor_slices method, from_list supports non-rectangular input (achieving the same with from_tensor_slices requires the use of ragged tensors).
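
For instance, a quick sketch with elements of different lengths:

elements = [tf.constant([1, 2, 3]), tf.constant([4, 5])]  # non-rectangular
dataset = tf.data.experimental.from_list(elements)
list(dataset.as_numpy_iterator())
# [array([1, 2, 3], dtype=int32), array([4, 5], dtype=int32)]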

Sharing tf.data service with concurrent trainers

If you run multiple trainers concurrently using the same training data, it could save resources to cache the data in one tf.data service cluster and share the cluster with the trainers. For example, if you use Vizier to tune hyperparameters, the Vizier jobs can run concurrently and share one tf.data service cluster.

To enable this feature, each trainer needs to generate a unique trainer ID, and you pass the trainer ID to tf.data.experimental.service.distribute. Once a job has consumed the data, the data remains in the cache and is re-used by jobs with different trainer_ids. Requests with the same trainer_id do not re-use data. For example:

dataset = expensive_computation()
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode=tf.data.experimental.service.ShardingPolicy.OFF,
    service=FLAGS.tf_data_service_address,
    job_name="job",
    cross_trainer_cache=data_service_ops.CrossTrainerCache(
        trainer_id=trainer_id())))

tf.data service uses a sliding-window cache to store shared data. When one trainer consumes data, the data remains in the cache. When other trainers need data, they can get it from the cache instead of repeating the expensive computation. The cache has a bounded size, so some workers may not read the full dataset. To ensure all the trainers get sufficient training data, we require the input dataset to be infinite. This can be achieved, for example, by repeating the dataset and performing random augmentation on the training instances.

TensorFlow Decision Forests 1.0

In conjunction with the release of TensorFlow 2.10, TensorFlow Decision Forests (TF-DF) reaches version 1.0. With this milestone we want to communicate more broadly that TensorFlow Decision Forests has become a more stable and mature library. We’ve improved our documentation and established more comprehensive testing to make sure that TF-DF is ready for professional environments.

The new release of TF-DF also offers a first look at the Javascript and Go APIs for inference of TF-DF models. While these APIs are still in beta, we are actively looking for feedback for them. TF-DF 1.0 improves performance of oblique splits. Oblique splits allow decision trees to express more complex patterns by conditioning on multiple features at the same time – learn more in our Decision Forests class on developers.google.com. Benchmarks and real-world observations show that oblique splits outperform classical axis-aligned splits on the majority of datasets. Finally, the new release includes our latest bug fixes.

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub or post to the TensorFlow Forum. Thank you!


Detect audio events with Amazon Rekognition

When most people think of using machine learning (ML) with audio data, the use case that usually comes to mind is transcription, also known as speech-to-text. However, there are other useful applications, including using ML to detect sounds.

Using software to detect a sound is called audio event detection, and it has a number of applications. For example, suppose you want to monitor the sounds from a noisy factory floor, listening for an alarm bell that indicates a problem with a machine. In a healthcare environment, you can use audio event detection to passively listen for sounds from a patient that indicate an acute health problem. Media workloads are a good fit for this technique, for example to detect when a referee’s whistle is blown in a sports video. And of course, you can use this technique in a variety of surveillance workloads, like listening for a gunshot or the sound of a car crash from a microphone mounted above a city street.

This post describes how to detect sounds in an audio file even if there are significant background sounds happening at the same time. What’s more, perhaps surprisingly, we use computer vision-based techniques to do the detection, using Amazon Rekognition.

Using audio data with machine learning

The first step in detecting audio events is understanding how audio data is represented. For the purposes of this post, we deal only with recorded audio, although these techniques work with streaming audio.

Recorded audio is typically stored as a sequence of sound samples, which measure the intensity of the sound waves that struck the microphone during recording, over time. There are a wide variety of formats with which to store these samples, but a common approach is to store 10,000, 20,000, or even 40,000 samples per second, with each sample being an integer from 0–65535 (two bytes). Because each sample measures only the intensity of sound waves at a particular moment, the sound data generally isn’t helpful for ML processes because it doesn’t have any useful features in its raw state.

To make that data useful, the sound sample is converted into an image called a spectrogram, which is a representation of the audio data that shows the intensity of different frequency bands over time. The following image shows an example.

The X axis of this image represents time, meaning that the left edge of the image is the very start of the sound, and the right edge is the end. The Y axis represents frequency bands (indicated by the scale on the left side of the image), so each column of data is a snapshot of all the frequency bands at one moment, and the color at each point represents the intensity of that frequency at that moment in time.

Vertical scaling for spectrograms can be changed to other representations. For example, linear scaling means that the Y axis is evenly divided over frequencies, logarithmic scaling uses a log scale, and so forth. The problem with using these representations is that the frequencies in a sound file are usually not evenly distributed, so most of the information we might be interested in ends up being clustered near the bottom of the image (the lower frequencies).

To solve that problem, our sample image is an example of a Mel spectrogram, which is scaled to closely approximate how human beings perceive sound. Notice the frequency indicators along the left side of the image—they give an idea of how they are distributed vertically, and it’s clear that it’s a non-linear scale.

Additionally, we can modify the measurement of intensity by frequency by time to enhance various features of the audio being measured. As with the Y axis scaling implemented by a Mel spectrogram, other representations emphasize features such as the intensity of the 12 distinctive pitch classes that are used to study music (chroma). Another class emphasizes horizontal (harmonic) features or vertical (percussive) features. The type of sound being detected should drive the type of spectrogram used for the detection system.
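
To make this concrete, here is a minimal sketch of computing a Mel spectrogram with the librosa library (the file name is hypothetical, and this post’s accompanying code may use different tooling):

import librosa
import numpy as np

samples, rate = librosa.load("recording.wav", sr=None)  # keep the native sampling rate
mel = librosa.feature.melspectrogram(y=samples, sr=rate, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scaled intensity, as in the figures here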

The earlier example spectrogram represents a music clip that is just over 2 minutes long. Zooming in reveals more detail, as is shown in the following image.

The numbers along the top of the image show the number of seconds from the start of the audio file. You can clearly see a sequence of sounds that seems to repeat more than four times per second, indicated by the bright colors near the bottom of the image.

As you can see, this is one of the benefits of converting audio to a spectrogram—distinct sounds are often easily visible with the naked eye, and even if they aren’t, they can frequently be detected using computer vision object detection algorithms. In fact, this is exactly the process we follow in order to detect sounds.

Looking for discrete sounds in a spectrogram

Depending on the length of the audio file that we’re searching, finding a discrete sound that lasts just a second or two is a challenge. Refer to the first spectrogram we shared—because we’re viewing an entire 3:30 minutes of data, details that last only a second or so aren’t visible. We zoomed in a great deal in order to see the rhythm that is shown in the second image. Clearly, with larger sound files (and therefore much larger spectrograms), we quickly run into problems unless we use a different approach. That approach is called windowing.

Windowing refers to using a sliding window that moves across the entire spectrogram, isolating a few seconds (or less) at a time. By repeatedly isolating portions of the overall image, we get smaller images that are searchable for the presence of the sound to be detected. Because each window could result in only part of the image we’re looking for (as in the case of searching for a sound that doesn’t start exactly at the start of a window), windowing is often performed with succeeding windows being overlapped. For example, the first window starts at 0:00 and extends 2 seconds, then the second window starts at 0:01 and extends 2 seconds, and the third window starts at 0:02 and extends 2 seconds, and so on.

Windowing splits a spectrogram image horizontally. We can improve the effectiveness of the detection process by also isolating certain frequency bands, cropping or searching only certain vertical parts of the image. For example, if you know that the alarm bell you want to detect creates sounds that range from one specific frequency to another, you can modify the current window to consider only those frequency ranges. That vastly reduces the amount of data to be manipulated, and results in a much faster search. It also improves accuracy, because it eliminates possible false-positive matches occurring in frequency bands outside of the desired range. The following images compare a full Y axis (left) with a limited Y axis (right).

Full Y Axis

Limited Y Axis
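
Putting windowing and frequency filtering together, here is a sketch of overlapping windows with an optional band crop over a spectrogram array (rows are frequency bins, columns are time frames; all parameters are illustrative):

def spectrogram_windows(spec, width, stride, band=None):
    # spec: 2-D array of shape (freq_bins, time_frames)
    lo, hi = band if band is not None else (0, spec.shape[0])
    for start in range(0, spec.shape[1] - width + 1, stride):
        yield start, spec[lo:hi, start:start + width]

For example, with frames_per_second frames per second of audio, spectrogram_windows(mel_db, width=2 * frames_per_second, stride=frames_per_second, band=(10, 60)) yields 2-second windows overlapped by 1 second, keeping only frequency bins 10 through 59.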

Now that we know how to iterate over a spectrogram with a windowing approach and filter to certain frequency bands, the next step is to do the actual search for the sound. For that, we use Amazon Rekognition Custom Labels. The Rekognition Custom Labels feature builds on the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred; the optimal training dataset size should be determined experimentally based on the specific use case to avoid under- or over-training the model) that are specific to your use case via the Rekognition Custom Labels console.

If your images are already labeled, Amazon Rekognition training is accessible with just a few clicks. Alternatively, you can label the images directly within the Amazon Rekognition labeling interface, or use Amazon SageMaker Ground Truth to label them for you. When Amazon Rekognition begins training from your image set, it produces a custom image analysis model for you in just a few hours. Behind the scenes, Rekognition Custom Labels automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Rekognition Custom Labels API and integrate it into your applications.

Assembling training data and training a Rekognition Custom Labels model

In the GitHub repo associated with this post, you’ll find code that shows how to listen for the sound of a smoke alarm going off, regardless of background noise. In this case, our Rekognition Custom Labels model is a binary classification model, meaning that the results are either “smoke alarm sound was detected” or “smoke alarm sound was not detected.”

To create a custom model, we need training data. That training data comprises two main types: environmental sounds, and the sounds you wish to detect (like a smoke alarm going off).

The environmental data should represent a wide variety of soundscapes that are typical for the environment you want to detect the sound in. For example, if you want to detect a smoke alarm sound in a factory environment, start with sounds recorded in that factory environment under a variety of situations (without the smoke alarm sounding, of course).

The sounds you want to detect should be isolated if possible, meaning the recordings should just be the sound itself without any environmental background sounds. For our example, that’s a sound of a smoke alarm going off.

After you’ve collected these sounds, the code in the GitHub repo shows how to combine the environmental sounds with the smoke alarm sounds in various ways (and then convert them to spectrograms) in order to create a number of images that represent the environmental sounds with and without the smoke alarm sounds overlaid on them. The following image is an example of some environmental sounds with a smoke alarm sound (the bright horizontal bars) overlaid on top of it.
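
A sketch of that overlay step (both clips as 1-D sample arrays at the same sampling rate; the offset and gain are illustrative):

import numpy as np

def overlay(environment, alarm, offset, gain=0.5):
    # Add a scaled alarm clip into the environmental audio at a sample offset.
    mixed = environment.astype(np.float32).copy()
    end = min(len(mixed), offset + len(alarm))
    mixed[offset:end] += gain * alarm[: end - offset].astype(np.float32)
    return mixed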

The training and test data is stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following directory structure is a good starting point to organize data within the bucket.

The sample code in the GitHub repo allows you to choose how many training images to create. Rekognition Custom Labels doesn’t require a large number of training images. A training set of 200–500 images should be sufficient.

Creating a Rekognition Custom Labels project requires that you specify the URIs of the S3 folder that contains the training data, and (optionally) test data. When specifying the data sources for the training job, one of the options is Automatic labeling, as shown in the following screenshot.

Using this option means that Amazon Rekognition uses the names of the folders as the label names. For our smoke alarm detection use case, the train and test folders each contain two subfolders, alarm and no_alarm, as in the layout shown earlier.

The training data images go into those folders, with spectrograms containing the sound of the smoke alarm going in the alarm folder, and spectrograms that don't contain the smoke alarm sound in the no_alarm folder. Amazon Rekognition uses those folder names as the output class names for the Custom Labels model.

Training a Custom Labels model usually takes 30–90 minutes. When training completes, you must start the trained model so it becomes available for use.
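With boto3, starting the model is a single call; the ARN below is a placeholder for the version ARN reported when training finishes:

    import boto3

    rekognition = boto3.client("rekognition")

    rekognition.start_project_version(
        ProjectVersionArn="arn:aws:rekognition:region:account:project/smoke-alarm-detector/version/v1/1234567890",
        MinInferenceUnits=1,  # the model accrues charges while it is running
    )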

End-to-end architecture for sound detection

After we create our model, the next step is to set up an inference pipeline, so we can use the model to detect if a smoke alarm sound is present in an audio file. To do this, the input sound must be turned into a spectrogram and then windowed and filtered by frequency, as was done for the training process. Each window of the spectrogram is given to the model, which returns a classification that indicates if the smoke alarm sounded or not.

The following diagram shows an example architecture that implements this inference pipeline.

This architecture waits for an audio file to be placed into an S3 bucket, which then causes an AWS Lambda function to be invoked. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. You can trigger a Lambda function from over 200 AWS services and software as a service (SaaS) applications, and only pay for what you use.

The Lambda function receives the name of the bucket and the key (or file name) of the audio file. The function downloads the file from Amazon S3 into memory, converts it into a spectrogram, and performs windowing and frequency filtering. Each windowed portion of the spectrogram is then sent to Amazon Rekognition, which uses the previously trained Rekognition Custom Labels model to detect the sound. If the sound is found, the Lambda function signals that by sending an Amazon Simple Notification Service (Amazon SNS) notification. Amazon SNS offers a pub/sub approach where notifications can be sent to Amazon Simple Queue Service (Amazon SQS) queues, Lambda functions, HTTPS endpoints, email addresses, mobile push, and more.
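A condensed sketch of such a Lambda handler follows. The ARNs are placeholders, and make_window_images is a hypothetical helper wrapping the spectrogram, windowing, and frequency-filtering steps shown earlier:

    import json
    import boto3

    s3 = boto3.client("s3")
    rekognition = boto3.client("rekognition")
    sns = boto3.client("sns")

    # Placeholder ARNs; substitute your own resources.
    MODEL_ARN = "arn:aws:rekognition:region:account:project/smoke-alarm-detector/version/v1/1234567890"
    TOPIC_ARN = "arn:aws:sns:region:account:smoke-alarm-alerts"

    def handler(event, context):
        # The S3 put event carries the bucket and key of the uploaded audio file.
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]
        audio_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # make_window_images (hypothetical) applies the spectrogram, windowing,
        # and frequency-filtering steps and yields PNG-encoded image bytes.
        for png in make_window_images(audio_bytes):
            result = rekognition.detect_custom_labels(
                ProjectVersionArn=MODEL_ARN,
                Image={"Bytes": png},
                MinConfidence=80,
            )
            if any(label["Name"] == "alarm" for label in result["CustomLabels"]):
                sns.publish(TopicArn=TOPIC_ARN,
                            Message=json.dumps({"file": key, "detected": "alarm"}))
                return {"alarm_detected": True}
        return {"alarm_detected": False}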

Conclusion

You can use machine learning with audio data to determine when certain sounds occur, even when other sounds are occurring at the same time. Doing so requires converting the sound into a spectrogram image, and then homing in on different parts of that spectrogram by windowing and filtering by frequency band. Rekognition Custom Labels makes it easy to train a custom model for sound detection.

You can use the GitHub repo containing the example code for this post as a starting point for your own experiments. For more information about audio event detection, refer to Sound Event Detection: A Tutorial.


About the authors

Greg Sommerville is a Senior Prototyping Architect on the AWS Prototyping and Cloud Engineering team, where he helps AWS customers implement innovative solutions to challenging problems with machine learning, IoT and serverless technologies. He lives in Ann Arbor, Michigan and enjoys practicing yoga, catering to his dogs, and playing poker.

Jeff Harman is a Senior Prototyping Architect on the AWS Prototyping and Cloud Engineering team, where he helps AWS customers implement innovative solutions to challenging problems. He lives in Unionville, Connecticut and enjoys woodworking, blacksmithing, and Minecraft.

Digitizing Smell: Using Molecular Maps to Understand Odor

Did you ever try to measure a smell? …Until you can measure their likenesses and differences you can have no science of odor. If you are ambitious to found a new science, measure a smell.
— Alexander Graham Bell, 1914.

How can we measure a smell? Smells are produced by molecules that waft through the air, enter our noses, and bind to sensory receptors. Potentially billions of molecules can produce a smell, so cataloging or predicting which ones produce which smells is difficult. Sensory maps can help us solve this problem. Color vision has the most familiar examples of these maps, from the color wheel we each learn in primary school to more sophisticated variants used to perform color correction in video production. While these maps have existed for centuries, useful maps for smell have been missing, because smell is a harder problem to crack: molecules vary in many more ways than photons do; data collection requires physical proximity between the smeller and the smell (we don’t have good smell “cameras” or smell “monitors”); and the human eye has only three sensory receptors for color while the human nose has more than 300 for odor. As a result, previous efforts to produce odor maps have failed to gain traction.

In 2019, we developed a graph neural network (GNN) model trained on thousands of examples of distinct molecules paired with the smell labels they evoke, e.g., “beefy”, “floral”, or “minty”, to learn the relationship between a molecule’s structure and the probability that the molecule would have each smell label. The embedding space of this model contains a representation of each molecule as a fixed-length vector describing that molecule in terms of its odor, much as the RGB value of a visual stimulus describes its color.
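To make the analogy concrete, one could treat the embedding as odor coordinates and compare molecules by vector distance. In this sketch, embed is a hypothetical stand-in for the trained GNN:

    import numpy as np

    def odor_distance(molecule_a, molecule_b, embed):
        """embed (hypothetical) maps a molecule, e.g., a SMILES string,
        to its fixed-length odor embedding vector."""
        return float(np.linalg.norm(embed(molecule_a) - embed(molecule_b)))

    # Nearby embeddings correspond to perceptually similar smells, so a small
    # odor_distance suggests two molecules should smell alike.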

Left: An example of a color map (CIE 1931) in which coordinates can be directly translated into values for hue and saturation. Similar colors lie near each other, and specific wavelengths of light (and combinations thereof) can be identified with positions on the map. Right: Odors in the Principal Odor Map operate similarly. Individual molecules correspond to points (grey), and the locations of these points reflect predictions of their odor character.

Today we introduce the “Principal Odor Map” (POM), which identifies the vector representation of each odorous molecule in the model’s embedding space as a single point in a high-dimensional space. The POM has the properties of a sensory map: first, pairs of perceptually similar odors correspond to two nearby points in the POM (by analogy, red is nearer to orange than to green on the color wheel). Second, the POM enables us to predict and discover new odors and the molecules that produce them. In a series of papers, we demonstrate that the map can be used to prospectively predict the odor properties of molecules, understand these properties in terms of fundamental biology, and tackle pressing global health problems. We discuss each of these promising applications of the POM and how we test them below.

Test 1: Challenging the Model with Molecules Never Smelled Before
First, we asked if the underlying model could correctly predict the odors of new molecules that no one had ever smelled before and that were very different from molecules used during model development. This is an important test — many models perform well on data that looks similar to what the model has seen before, but break down when tested on novel cases.

To test this, we collected the largest-ever dataset of odor descriptions for novel molecules. Our partners at the Monell Center trained panelists to rate the smell of each of 400 molecules using 55 distinct labels (e.g., “minty”) that were selected to cover the space of possible smells while being neither redundant nor too sparse. Unsurprisingly, we found that different people characterized the same molecule differently; this is why sensory research typically uses panels of dozens or hundreds of people, and it highlights why smell is such a hard problem to solve. Rather than see if the model could match any one person, we asked how close it came to the consensus: the average across all of the panelists. We found that the predictions of the model were closer to the consensus than the average panelist’s ratings were. In other words, the model demonstrated an exceptional ability to predict odor from a molecule’s structure.

Predictions made by two models, our GNN model (orange) and a baseline chemoinformatic random forest (RF) model (blue), compared with the mean ratings given by trained panelists (green) for the molecule 2,3-dihydrobenzofuran-5-carboxaldehyde. Each bar corresponds to one odor character label (with only the top 17 of 55 shown for clarity). The top five are indicated in color; our model correctly identifies four of the top five, with high confidence, vs. only three of five, with low confidence, for the RF model. The correlation (R) to the full set of 55 labels is also higher in our model.


Unlike alternative benchmark models (RF and nearest-neighbor models trained on various sets of chemoinformatic features), our GNN model outperforms the median human panelist at predicting the panel mean rating. In other words, our GNN model better reflects the panel consensus than the typical panelist.
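As a sketch of that comparison, assume a ratings matrix with one row per panelist and one column per label; the correlation metric and leave-one-out scheme here are illustrative, not the exact protocol in the papers:

    import numpy as np

    def model_vs_panel(panel, model_pred):
        """Compare a model to a human panel for one molecule.

        panel:      array of shape (n_panelists, n_labels), individual ratings
        model_pred: array of shape (n_labels,), the model's predicted ratings
        """
        consensus = panel.mean(axis=0)
        model_r = np.corrcoef(model_pred, consensus)[0, 1]
        # Score each panelist against the consensus of everyone else.
        panelist_r = [
            np.corrcoef(panel[i], np.delete(panel, i, axis=0).mean(axis=0))[0, 1]
            for i in range(panel.shape[0])
        ]
        # The model "beats" the panel if it correlates with the consensus
        # better than the median panelist does.
        return model_r, float(np.median(panelist_r))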


The POM also exhibited state-of-the-art performance on alternative human olfaction tasks like detecting the strength of a smell or the similarity of different smells. Thus, with the POM, it should be possible to predict the odor qualities of any of billions of as-yet-unknown odorous molecules, with broad applications to flavor and fragrance.

Test 2: Linking Odor Quality Back to Fundamental Biology
Because the Principal Odor Map was useful in predicting human odor perception, we asked whether it could also predict odor perception in animals, and the brain activity that underlies it. We found that the map could successfully predict the activity of sensory receptors, neurons, and behavior in most animals that olfactory neuroscientists have studied, including mice and insects.

What common feature of the natural world makes this map applicable to species separated by hundreds of millions of years of evolution? We realized that the common purpose of the ability to smell might be to detect and discriminate between metabolic states, i.e., to sense when something is ripe vs. rotten, nutritious vs. inert, or healthy vs. sick. We gathered data about metabolic reactions in dozens of species across the kingdoms of life and found that the map corresponds closely to metabolism itself. When two molecules are far apart in odor, according to the map, a long series of metabolic reactions is required to convert one to the other; by contrast, similarly smelling molecules are separated by just one or a few reactions. Even long reaction pathways containing many steps trace smooth paths through the map. And molecules that co-occur in the same natural substances (e.g., an orange) are often very tightly clustered on the map. The POM shows that olfaction is linked to our natural world through the structure of metabolism and, perhaps surprisingly, captures fundamental principles of biology.

Left: We aggregated metabolic reactions found in 17 species across 4 kingdoms to construct a metabolic graph. In this illustration, each circle is a distinct metabolite molecule and an arrow indicates that there is a metabolic reaction that converts one molecule to another. Some metabolites have an odor (color) and others do not (gray), and the metabolic distance between two odorous metabolites is the minimum number of reactions necessary to convert one into the other. In the path shown in bold, the distance is 3. Right: Metabolic distance was highly correlated with distance in the POM, an estimate of perceived odor dissimilarity.
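The computation behind this comparison is essentially shortest paths in a graph. Here is a minimal sketch, assuming a hypothetical edge list of reactions and a dict of POM coordinates for the odorous metabolites:

    import networkx as nx
    import numpy as np
    from scipy.stats import spearmanr

    def metabolic_vs_pom(reactions, pom_coords):
        """Correlate metabolic distance with POM distance.

        reactions:  list of (metabolite_a, metabolite_b) reaction pairs
        pom_coords: dict mapping odorous metabolites to POM embedding vectors
        Both inputs are hypothetical stand-ins for the data described above.
        """
        graph = nx.Graph(reactions)
        met_dist, pom_dist = [], []
        odorous = [m for m in pom_coords if m in graph]
        for i, a in enumerate(odorous):
            for b in odorous[i + 1:]:
                try:
                    d = nx.shortest_path_length(graph, a, b)  # reaction count
                except nx.NetworkXNoPath:
                    continue
                met_dist.append(d)
                pom_dist.append(np.linalg.norm(
                    np.asarray(pom_coords[a]) - np.asarray(pom_coords[b])))
        return spearmanr(met_dist, pom_dist)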


Test 3: Extending the Model to Tackle a Global Health Challenge
A map of odor that is tightly connected to perception and biology across the animal kingdom opens new doors. Mosquitoes and other insect pests are drawn to humans in part by their sense of smell. Since the POM can be used to predict animal olfaction generally, we retrained it to tackle one of humanity’s biggest problems: the scourge of diseases transmitted by mosquitoes and ticks, which kill hundreds of thousands of people each year.

For this purpose, we improved our original model with two new sources of data: (1) a long-forgotten set of experiments conducted by the USDA on human volunteers beginning 80 years ago and recently made discoverable by Google Books, which we subsequently made machine-readable; and (2) a new dataset collected by our partners at TropIQ, using their high-throughput laboratory mosquito assay. Both datasets measure how well a given molecule keeps mosquitos away. Together, the resulting model can predict the mosquito repellency of nearly any molecule, enabling a virtual screen over huge swaths of molecular space. We validated this screen experimentally using entirely new molecules and found over a dozen of them with repellency at least as high as DEET, the active ingredient in most insect repellents. Less expensive, longer lasting, and safer repellents can reduce the worldwide incidence of diseases like malaria, potentially saving countless lives.
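Conceptually, the virtual screen reduces to ranking candidate molecules by predicted repellency against a known baseline. In this sketch, predict_repellency is a hypothetical stand-in for the fine-tuned model:

    # predict_repellency (hypothetical) maps a molecule's SMILES string to a
    # repellency score produced by the fine-tuned model.
    DEET = "CCN(CC)C(=O)c1cccc(C)c1"  # SMILES for DEET, the usual baseline

    def screen(candidate_smiles, predict_repellency):
        baseline = predict_repellency(DEET)
        hits = []
        for smiles in candidate_smiles:
            score = predict_repellency(smiles)
            if score >= baseline:  # keep molecules at least as repellent as DEET
                hits.append((smiles, score))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)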

We digitized USDA mosquito repellency data for thousands of molecules previously scanned by Google Books, and used it to refine the learned representation (the map) at the heart of the model. We added additional layers, specifically to predict repellency in a mosquito feeder assay, and iteratively trained the model to improve assay predictions while running computational screens for candidate repellents.
Many molecules showing mosquito repellency in the laboratory assay also showed repellency when applied to humans. Several showed repellency greater than the most common repellents used today (DEET and picaridin).

The Road Ahead
We discovered that our modeling approach to smell prediction could be used to draw a Principal Odor Map for tackling odor-related problems more generally. This map was the key to measuring smell: it answered a range of questions about novel smells and the molecules that produce them, it connected smells back to their origins in evolution and the natural world, and it is helping us tackle important human-health challenges that affect millions of people. Going forward, we hope that this approach can be used to find new solutions to problems in food and fragrance formulation, environmental quality monitoring, and the detection of human and animal diseases.

Acknowledgements
This work was performed by the ML olfaction research team, including Benjamin Sanchez-Lengeling, Brian K. Lee, Jennifer N. Wei, Wesley W. Qian, and Jake Yasonik (the latter two were partly supported by the Google Student Researcher program) and our external partners including Emily Mayhew and Joel D. Mainland from the Monell Center, and Koen Dechering and Marnix Vlot from TropIQ. The Google Books team brought the USDA dataset online. Richard C. Gerkin was supported by the Google Visiting Faculty Researcher program and is also an Associate Research Professor at Arizona State University.
