Inside Chirpy Cardinal: Stanford's Open-Source Social Chatbot that Won 2nd place in the Alexa Prize

Last year, Stanford won 2nd place in the Alexa Prize Socialbot Grand Challenge 3 for social chatbots. In this post, we look into building a chatbot that combines the flexibility and naturalness of neural dialog generation with the reliability and practicality of scripted dialogue. We also announce an open-source version of our socialbot with the goal of enabling future research.

Our bot, Chirpy, is a modern social chatbot, tested and validated by real users, capable of discussing a broad range of topics. We can’t wait to introduce it to you!

What makes Chirpy special?

Social conversations – such as one you would have with a friend – challenge chatbots to demonstrate human traits: emotional intelligence and empathy, world knowledge, and conversational awareness. They also challenge us – researchers and programmers – to imagine novel solutions and build fast and scalable systems. As an Alexa Prize team, we created Chirpy, a social chatbot that interacted with hundreds of thousands of people who rated the conversations and validated our approaches. We were able to successfully incorporate neural generation into Chirpy and here we explain what went into making that happen.

Recently, there has been enormous progress in building large neural net models that can produce fluent, coherent text by training them on a very large amount of text [1, 2]. One might wonder, “Why not just extend these models and fine-tune them (or train them) on large dialogue corpora?” In fact, Meena [3], BlenderBot [4], and DialoGPT [5] attempt to do exactly this, yet they are not being used to chat with people in the real world. Why? First, they lack controllability and can respond unpredictably to new or out-of-domain inputs. For example, they can generate toxic language and cause safety issues. Furthermore, they lack consistency over long conversations: they forget, repeat, and contradict themselves. Although neurally generated dialog is more flexible and natural than scripted dialog, errors compound as the number of neural turns increases, and so does the likelihood of producing inconsistent or illogical responses.

Since neural generation is unreliable, practical conversational agents are dominated by hand-written rules: dialogue trees, templated responses, topic ontologies, and so on. Figure 1 shows an example of this type of conversation. The bot gives a series of scripted responses, choosing responses based on fixed criteria – for example, whether or not a song name is detected. This design has benefits: developers writing these rules can interpret the system’s choices, control the direction of the conversation, and ensure consistency. But these rules are brittle! An unanticipated user utterance that gets misclassified into a wrong branch can lead to absurd responses and lead down a path that is hard to recover from. Additionally, depending on a predetermined set of templated responses limits the range of possibilities and can make the bot sound unnatural.

Figure 1: Example of a hand-written dialogue tree. Unanticipated or misclassified user responses can lead to absurd responses and paths that are hard to recover from.

Both neural and scripted dialog have clear benefits and drawbacks. We designed Chirpy to take advantage of both, choosing a modular architecture that lets us fluidly combine elements of neural and scripted conversations. We refer to our bot’s modules as response generators, or RGs. Each response generator is designed for a specific type of conversation such as talking about music, exchanging opinions, or sharing factual information. Based on their roles, some response generators are scripted, some are entirely neural, and others use a combination of neural and scripted dialog. We track entities, topics, emotions, and opinions using rules that maintain consistency across response generators, without sacrificing the flexibility of our neural components. This modular design allows users to add new response generators without needing to alter large parts of the codebase every time they want to extend Chirpy’s coverage. In the following sections, we highlight a few response generators and how they fit into our broader system.

Response Generators

Our response generators range from fully rule-based to fully neural. First, we’ll highlight three to demonstrate this range: the music response generator, which is entirely rule-based; the personal chat response generator, which relies entirely on a neural generative model; and the Wikipedia response generator, which combines rule-based and neurally generated content. In the next section, we’ll discuss how Chirpy decides which RG to use, so that it can leverage their different strengths.

Music Response Generator

Chirpy’s music response generator uses rule-based, scripted dialog trees, like the example shown in Figure 1. It asks the user a series of questions about their musical preferences and, depending on their answers, selects a response. All possible responses are hand-written in advance, so this RG is highly effective at handling cases where the user responds as expected, but has more difficulty when they say something it doesn’t have a rule for.

Personal Chat Response Generator

We wanted Chirpy to have the ability to discuss users’ personal experiences and emotions. Since these are highly varied, we used neural generation because of its flexibility when handling previously unseen utterances. As shown in Figure 2, neural generation models input a context and then generate the response word-by-word. We fine-tuned a GPT-2 medium model [6] on the EmpatheticDialogues dataset [7], which consists of conversations between a speaker describing an emotional personal experience and a listener who responds to the speaker. Since the conversations it contains are relatively brief, grounded in specific emotional situations, and focused on empathy, this dataset is well suited to the personal chat RG.

To keep neural conversations focused and effective, we begin each personal chat discussion by asking the user a scripted starter question, e.g., What do you like to do to relax? On each subsequent turn, we pass the conversation history as context to the GPT-2 model, and then sample 20 diverse responses. When selecting the final response, we prioritize generations that contain questions. However, if fewer than one third of the responses contain questions, we assume that the model no longer has a clear path forward, select a response without a question, and hand over to another response generator. Not continuing neurally generated conversation segments for too long is a simple but effective strategy for preventing the overall conversation quality from degrading.
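The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Chirpy's actual code: the function names, the "contains a question mark" test, and the choice of simply taking the first matching candidate are all assumptions made for the example. Only the one-third threshold and the hand-over behavior come from the text.

```python
from typing import List, Tuple

# From the post: hand over when fewer than one third of samples contain questions.
QUESTION_FRACTION_THRESHOLD = 1 / 3


def contains_question(response: str) -> bool:
    """Crude question detector (assumption: a '?' marks a question)."""
    return "?" in response


def select_personal_chat_response(candidates: List[str]) -> Tuple[str, bool]:
    """Pick a response from sampled generations.

    `candidates` stands in for the sampled GPT-2 responses.
    Returns (response, hand_over), where hand_over=True means another
    response generator should take over after this turn.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")

    questions = [c for c in candidates if contains_question(c)]
    statements = [c for c in candidates if not contains_question(c)]

    if len(questions) / len(candidates) >= QUESTION_FRACTION_THRESHOLD:
        # Enough questions: prefer one so the neural conversation continues.
        # Taking the first one is a placeholder for whatever ranking is used.
        return questions[0], False

    # Too few questions: the model seems to have no clear path forward.
    # Below the threshold, at least one non-question candidate must exist.
    return statements[0], True
```

For example, with four candidates of which two are questions, a question is returned and the personal chat RG keeps the floor; with only one question out of four, a statement is returned together with the hand-over flag.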

Wiki Response Generator

We wanted Chirpy to be able to discuss a broad range of topics in depth. One source of information for a broad range of topics is Wikipedia, which provides in-depth content for millions of entities. Chirpy tracks the entity under discussion and, if it is able to find a corresponding Wikipedia article, the Wiki RG searches for relevant sentences using TF-IDF, a standard technique used by search engines to find relevant documents based on text overlap with an underlying query. To encourage such overlap, we have our bot ask a handwritten open-ended question that is designed to evoke a meaningful response, e.g., in Figure 2: “I’m thinking about visiting the Trevi fountain. Do you have any thoughts about what I should do?”
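To make the retrieval step concrete, here is a small, self-contained TF-IDF ranking sketch in plain Python. It is only an illustration of the general technique: the tokenizer, the smoothed IDF formula, the cosine scoring, and the function names are assumptions for this example, not a description of how the Wiki RG is implemented.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors using a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(user_utterance: str, sentences: List[str]) -> List[str]:
    """Return article sentences ordered by TF-IDF overlap with the utterance."""
    docs = [tokenize(s) for s in sentences] + [tokenize(user_utterance)]
    vectors = tfidf_vectors(docs)
    query_vec, sentence_vecs = vectors[-1], vectors[:-1]
    ranked = sorted(
        zip(sentences, sentence_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked]
```

With a user reply such as "you should throw coins over your shoulder", a sentence about the coin-throwing tradition would rank above sentences that share no words with the reply, which is why an open-ended question that invites a content-rich answer helps retrieval.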

Figure 2: In the first utterance, we have our bot ask a handwritten question. The user in response provides a meaningful answer which we can use to find related content from the Wikipedia page on the Trevi Fountain. The neural rephrasing model takes the two prior turns and the snippet as input to produce a response that weaves factual sentences into the conversational narrative.

However, quoting sentences from Wikipedia is insufficient, due to its encyclopedic, stiff writing style. When people share information, they connect it to the conversation thus far and style it conversationally. Thus, we use a neural rephrasing model that takes the conversational history and the retrieved Wikipedia sentence as input and generates a conversationally phrased reply. This model is trained on a modified version of the Topical-Chat dataset [8], which contains conversations paired with factual sentences. Unfortunately, the model isn’t perfect and makes mistakes from time to time. We handle users’ confusion with a few handwritten rules.

How do the response generators fit together?

There are clear benefits to all three types of dialog – entirely scripted, partially scripted, and entirely neural. So on a given turn, how do we decide which kind of response to give?

Many other chatbots use a dialog management strategy where on a given turn, the bot gets a user utterance, decides which module can handle it best, and then returns the next response from this module. Our bot delays that decision until after generation, so that it can make a more informed choice. On each turn, every module within the bot generates a response and a self-assessed priority using module specific context. Once every response generator has produced some output, the bot will then use these priorities to select the highest priority response.

Figure 3: Example conversation with our bot. In the second utterance, our bot uses a response from the Launch RG appended with a prompt from the Personal Chat RG. When the topic of conversation shifts to music, the Music RG takes over. In the last turn, the user expresses a lack of interest, so the Music RG produces a response to acknowledge this and hands over to the Categories RG, which provides a prompt.

Since a module may decide it has finished discussing a topic, we allow another module to append a prompt and take over on the same turn. The first module’s response acknowledges the user’s previous utterance, and the second module’s prompt gives the user direction for the next turn. For example, the user might receive a response “I also like avocados” from the opinions response generator, which is used for casual exchange of personal opinions, and then a prompt “Would you like to know more about the history of avocados?” from the Wikipedia response generator, which is used for sharing factual information. Figure 3 shows an example of the type of conversation our bot has, with different RGs handling their own specialized types of dialog.
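The "generate first, then choose" strategy with an optional appended prompt can be sketched as follows. This is a hypothetical illustration: the `RGOutput` fields, the integer priorities, and the `select_turn` function are invented for the example and are not taken from the Chirpy codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RGOutput:
    """What one response generator proposes on a turn (hypothetical schema)."""
    rg_name: str
    response: Optional[str] = None   # reply to the user's last utterance
    response_priority: int = 0       # self-assessed priority of the response
    needs_prompt: bool = False       # True if this RG is done with its topic
    prompt: Optional[str] = None     # suggestion for where to go next
    prompt_priority: int = 0         # self-assessed priority of the prompt


def select_turn(outputs: List[RGOutput]) -> str:
    """Pick the highest-priority response, then append a prompt if needed."""
    responders = [o for o in outputs if o.response is not None]
    if not responders:
        raise ValueError("no response generator produced a response")
    best = max(responders, key=lambda o: o.response_priority)

    if not best.needs_prompt:
        return best.response

    # The winning RG has finished its topic: let another RG take over.
    prompters = [o for o in outputs if o.prompt is not None and o.rg_name != best.rg_name]
    if not prompters:
        return best.response
    best_prompt = max(prompters, key=lambda o: o.prompt_priority)
    return f"{best.response} {best_prompt.prompt}"
```

In the avocado example, the opinions RG would win the response with `needs_prompt=True`, and the Wikipedia RG's prompt would be appended to it in the same turn.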

The animations below show how the response and prompt selection works. The opinion, Wikipedia, and personal chat modules use the state and annotations to generate responses, the bot selects the best response and prompt, and then the bot updates its state based on this choice.

How can I use Chirpy?

We’re open-sourcing Chirpy Cardinal, so that others can expand on the existing socialbot, create their own, or simply have a social conversation. This release is unique for several reasons.

User-tested design
Our bot has already been tested over hundreds of thousands of conversations during the Alexa Prize Competition. Its strategies are verified and can appeal to a broad range of users with diverse interests.
Essential building blocks
We’ve implemented time-consuming but essential basics, such as entity-linking and dialog management, so that you won’t have to. This allows new developers to move chatbot design forward faster, focusing on higher-level research and design.
Customizable architecture
Our bot’s flexible architecture allows easier customization for your own unique use cases. You can introduce new areas of content by creating specialized response generators, and you can choose what your bot prioritizes by adjusting the settings of the dialog manager.
Experiment framework
Finally, our bot was designed to enable dialog research. Users can create their own experiments, which are stored as parameters of the state, determine how often these experiments should be triggered, and then use the collected data to compare different strategies or models.

To get started, you can try the live demo of Chirpy yourself before diving into the code in our GitHub repo.

You can find more details about our system in a 30 minute overview presentation or our technical paper.

Our team continues to work on improving open-domain dialogue. You can find more about our current Alexa Prize team, publications and other updates at


We thank our colleagues at Stanford’s Alexa Prize Team: Abigail See (co-lead), Kathleen Kenealy, Haojun Li, Peng Qi, Kaushik Ram Sadagopan, Nguyet Minh Phu, Dilara Soylu, Christopher D. Manning (faculty advisor).

We also thank Haojun Li and Dilara Soylu for helping us with open sourcing of the codebase.

Thanks to Siddharth Karamcheti, Megha Srivastava, and the rest of the SAIL Blog Team for reviewing and publishing our blog post.

This research was supported in part by Stanford Open Virtual Assistant Lab (OVAL) (for Amelia Hardy) and Alexa Prize Stipend Award (for Haojun Li, Kaushik Ram Sadagopan, Nguyet Minh Phu, Dilara Soylu).

  1. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461. 

  2. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, … and Dario Amodei. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165 

  3. Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, … Quoc V. Le. (2020). Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. 

  4. Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, … Jason Weston. (2020). Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. 

  5. Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, & Bill Dolan. (2020). DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv preprint arXiv:1911.00536. 

  6. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. (2019). Language Models are Unsupervised Multitask Learners. 

  7. Hannah Rashkin, Eric Michael Smith, Margaret Li and Y-Lan Boureau. (2019). Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset. arXiv preprint arXiv:1811.00207. 

  8. Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, & Dilek Hakkani-Tür (2019). Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In Proc. Interspeech 2019 (pp. 1891–1895). 

Neural Mechanics: Symmetry and Broken Conservation Laws In Deep Learning Dynamics

Just like the fundamental laws of classical and quantum mechanics taught us how to control and optimize the physical world for engineering purposes, a better understanding of the laws governing neural network learning dynamics can have a profound impact on the optimization of artificial neural networks. This raises a foundational question: what, if anything, can we quantitatively understand about the learning dynamics of state-of-the-art deep learning models driven by real-world datasets?

In order to make headway on this extremely difficult question, existing works have made major simplifying assumptions on the architecture, such as restricting to a single hidden layer [1], linear activation functions [2], or infinite width layers [3]. These works have also ignored the complexity introduced by the optimizer through stochastic and discrete updates. In the present work, rather than introducing unrealistic assumptions on the architecture or optimizer, we identify combinations of parameters with simpler dynamics (as shown in Fig. 1) that can be solved exactly!

Fig. 1. We plot the per-parameter dynamics (left) and per-filter squared Euclidean norm dynamics (right) for the convolutional layers of a VGG-16 model (with batch normalization) trained on Tiny ImageNet with SGD with learning rate $\eta$, weight decay $\lambda$, and batch size $S$. Each color represents a different convolutional block. While the parameter dynamics are noisy and chaotic, the neuron dynamics are smooth and patterned.

Symmetries in the loss shape gradient and Hessian geometry

While we commonly initialize neural networks with random weights, their gradients and Hessians at all points in training, no matter the loss or dataset, obey certain geometric constraints. Some of these constraints have been noticed previously as a form of implicit regularization, while others have been leveraged algorithmically in applications from network pruning to interpretability. Remarkably, all these geometric constraints can be understood as consequences of numerous symmetries in the loss introduced by neural network architectures.

A set of parameters observes a symmetry in the loss if the loss doesn’t change under a certain transformation of these parameters. This invariance introduces associated geometric constraints on the gradient and Hessian. We consider three families of symmetries (translation, scale, and rescale) that commonly appear in modern neural network architectures.

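  • Translation symmetry is defined by the transformation $\theta \mapsto \theta + \alpha \mathbb{1}_{\mathcal{A}}$, where $\alpha \in \mathbb{R}$ and $\mathbb{1}_{\mathcal{A}}$ is the indicator vector for some subset $\mathcal{A}$ of the parameters $\{\theta_1, \dots, \theta_m\}$. Any network using the softmax function gives rise to translation symmetry for the parameters immediately preceding the function.
  • Scale symmetry is defined by the transformation $\theta \mapsto \alpha_{\mathcal{A}} \odot \theta$, where $\alpha_{\mathcal{A}} = \alpha \mathbb{1}_{\mathcal{A}} + \mathbb{1}_{\mathcal{A}^c}$ and $\alpha > 0$. Batch normalization leads to scale invariance for the parameters immediately preceding the function.
  • Rescale symmetry is defined by the transformation $\theta \mapsto \alpha_{\mathcal{A}_1} \odot \alpha_{\mathcal{A}_2}^{-1} \odot \theta$, where $\mathcal{A}_1$ and $\mathcal{A}_2$ are two disjoint sets of parameters. For networks with continuous, homogeneous activation functions (e.g. ReLU, Leaky ReLU, linear), this symmetry emerges at every hidden neuron by considering all incoming and outgoing parameters to the neuron.

These symmetries enforce geometric constraints on the gradient of a neural network, $\nabla \mathcal{L}$. Writing $\theta_{\mathcal{A}} = \mathbb{1}_{\mathcal{A}} \odot \theta$ for the parameters restricted to the subset $\mathcal{A}$, differentiating each invariance with respect to $\alpha$ at the identity transformation gives

$$\text{Translation: } \langle \nabla \mathcal{L}, \mathbb{1}_{\mathcal{A}} \rangle = 0, \qquad \text{Scale: } \langle \nabla \mathcal{L}, \theta_{\mathcal{A}} \rangle = 0, \qquad \text{Rescale: } \langle \nabla \mathcal{L}, \theta_{\mathcal{A}_1} - \theta_{\mathcal{A}_2} \rangle = 0.$$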

Fig. 2. We visualize the vector fields associated with simple network components that have translation, scale, and rescale symmetry. On the right we consider the vector field associated with a neuron whose nonlinearity is the softmax function. In the middle we consider the vector field associated with a neuron that applies the batch normalization function. On the left we consider the vector field associated with a linear path.

Symmetry leads to conservation laws under gradient flow

We now consider how geometric constraints on gradients and Hessians, arising as a consequence of symmetry, impact the learning dynamics given by stochastic gradient descent (SGD). We will consider a model parameterized by $\theta$, a training dataset of size $N$, and a training loss $\mathcal{L}(\theta)$ with corresponding gradient $g(\theta) = \nabla \mathcal{L}(\theta)$. The gradient descent update with learning rate $\eta$ is $\theta^{(n+1)} = \theta^{(n)} - \eta\, g(\theta^{(n)})$, which is a forward Euler discretization with step size $\eta$ of the ordinary differential equation (ODE) $\frac{d\theta}{dt} = -g(\theta)$. In the limit as $\eta \to 0$, gradient descent exactly matches the dynamics of this ODE, which is commonly referred to as gradient flow. Equipped with a continuous model for the learning dynamics, we now ask: how do the dynamics interact with the geometric properties introduced by symmetries?

Strikingly similar to Noether’s theorem, which describes a fundamental relationship between symmetry and conservation for physical systems governed by Lagrangian dynamics, every symmetry of a network architecture has a corresponding “conserved quantity” through training under gradient flow. Just as the total kinetic and potential energy is conserved for an idealized spring in harmonic motion, certain combinations of parameters are constant under gradient flow dynamics.

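Consider some subset of the parameters $\theta_{\mathcal{A}}$ (or, for rescale symmetry, two disjoint subsets $\theta_{\mathcal{A}_1}$ and $\theta_{\mathcal{A}_2}$) that respects either a translation, scale, or rescale symmetry. As discussed earlier, the gradient of the loss $\nabla \mathcal{L}$ is always perpendicular to the vector field that generates the symmetry. Projecting the gradient flow learning dynamics $\frac{d\theta}{dt} = -\nabla \mathcal{L}$ onto the generator vector field yields a differential equation whose right-hand side is zero. Integrating this equation through time results in the conservation laws,

$$\text{Translation: } \langle \theta_{\mathcal{A}}(t), \mathbb{1} \rangle = \langle \theta_{\mathcal{A}}(0), \mathbb{1} \rangle$$

$$\text{Scale: } \|\theta_{\mathcal{A}}(t)\|^2 = \|\theta_{\mathcal{A}}(0)\|^2$$

$$\text{Rescale: } \|\theta_{\mathcal{A}_1}(t)\|^2 - \|\theta_{\mathcal{A}_2}(t)\|^2 = \|\theta_{\mathcal{A}_1}(0)\|^2 - \|\theta_{\mathcal{A}_2}(0)\|^2$$

Each of these equations defines a conserved constant through training, effectively restricting the possible trajectory the parameters take through learning. For parameters with translation symmetry, their sum is conserved, effectively constraining their dynamics to a hyperplane. For parameters with scale symmetry, their Euclidean norm is conserved, effectively constraining their dynamics to a sphere. For parameters with rescale symmetry, their difference in squared Euclidean norm is conserved, effectively constraining their dynamics to a hyperbola.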
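As a quick numerical sanity check of the scale case, the sketch below runs small-step gradient descent on a toy scale-invariant loss and tracks the squared norm. The loss $1 - \langle \theta, u \rangle / \|\theta\|$, the target direction `u`, and all function names are assumptions chosen for illustration; they are not from the paper. Because the gradient is orthogonal to the parameters, each discrete step changes the squared norm only by the tiny amount $\eta^2 \|g\|^2$, so with a small learning rate the norm is nearly conserved while the loss still decreases.

```python
import math
from typing import List, Tuple


def loss_and_grad(theta: List[float], u: List[float]) -> Tuple[float, List[float]]:
    """Toy scale-invariant loss L(theta) = 1 - <theta, u> / |theta|, with |u| = 1.

    Rescaling theta by any positive constant leaves the loss unchanged.
    """
    norm = math.sqrt(sum(x * x for x in theta))
    proj = sum(x * y for x, y in zip(theta, u))
    loss = 1.0 - proj / norm
    grad = [-(y / norm) + proj * x / norm**3 for x, y in zip(theta, u)]
    return loss, grad


def run_gd(theta: List[float], u: List[float], lr: float, steps: int) -> List[float]:
    """Plain gradient descent; a small lr approximates gradient flow."""
    theta = list(theta)
    for _ in range(steps):
        _, grad = loss_and_grad(theta, u)
        theta = [x - lr * g for x, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    theta0, u = [3.0, 4.0], [1.0, 0.0]
    theta_t = run_gd(theta0, u, lr=1e-3, steps=1000)
    print("squared norm at start:", sum(x * x for x in theta0))
    print("squared norm at end:  ", sum(x * x for x in theta_t))
    print("loss start/end:", loss_and_grad(theta0, u)[0], loss_and_grad(theta_t, u)[0])
```

Increasing `lr` makes the drift in the squared norm visible, which previews why finite learning rates break the conservation law.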

Fig. 3. Associated with each symmetry is a conserved quantity constraining the gradient flow dynamics to a surface. For translation symmetry (right) the flow is constrained to a hyperplane where the intercept is conserved. For scale symmetry (middle) the flow is constrained to a sphere where the radius is conserved. For rescale symmetry (left) the flow is constrained to a hyperbola where the axes are conserved. The color represents the value of the conserved quantity, where blue is positive and red is negative, and the black lines are level sets.

A realistic continuous model for stochastic gradient descent

While the conservation laws derived with gradient flow are quite striking, empirically we know they are broken, as demonstrated in Fig. 1. Gradient flow is too simple a continuous model for realistic SGD training: it fails to account for the effect of hyperparameters such as weight decay and momentum, the effect of stochasticity introduced by random batches of data, and the effect of discrete updates due to a finite learning rate. Here, we consider how to address these effects individually to construct more realistic continuous models of SGD.

Modeling weight decay. Explicit regularization through the addition of an $L_2$ penalty on the parameters, with regularization constant $\lambda$, is a very common practice when training neural networks. Weight decay modifies the gradient flow trajectory, pulling the network towards the origin in parameter space.

Modeling momentum. Momentum is a common extension to SGD that uses an exponential moving average of gradients to update parameters rather than a single gradient evaluation. The method introduces an additional hyperparameter $\beta$, which controls how past gradients are used in future updates, resulting in a form of “inertia” that accelerates the learning dynamics by rescaling time but leaves the gradient flow trajectory intact.

Modeling stochasticity. Stochastic gradients arise when we consider a batch $\mathcal{B}$ of size $S$ drawn uniformly from the indices $\{1, \dots, N\}$ of the training examples, forming the unbiased gradient estimate $\hat{g}_{\mathcal{B}}(\theta) = \frac{1}{S} \sum_{i \in \mathcal{B}} \nabla \ell(\theta, x_i)$, where $\ell(\theta, x_i)$ is the loss on training example $x_i$. We can model the batch gradient as a noisy version of the true gradient $g(\theta)$. However, because both the batch gradient and true gradient observe the same geometric properties introduced by symmetry, this noise has a special low-rank structure. In other words, stochasticity introduced by random batches does not affect the gradient flow dynamics in the directions associated with symmetry.

Modeling discretization. Gradient descent always moves in the direction of steepest descent on a loss function at each step. However, due to the finite nature of the learning rate, it fails to remain on the continuous steepest descent path given by gradient flow. In order to model this discrepancy, we borrow tools from the numerical analysis of partial differential equations. In particular, we use modified equation analysis [4], which determines how to model the numerical artifacts introduced by a discretization of a PDE. In our paper we present two methods based on modified equation analysis and recent works [5, 6], which modify gradient flow, with either higher order derivatives of the loss or higher order temporal derivatives of the parameters, to account for the effect of discretization on the learning dynamics.
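A one-dimensional example makes the idea tangible. The sketch below compares plain gradient descent on the quadratic $\mathcal{L}(\theta) = a\theta^2/2$ against two continuous models evaluated at time $t = n\eta$: gradient flow, and gradient flow on the modified loss $\mathcal{L} + \frac{\eta}{4}\|\nabla \mathcal{L}\|^2$ from the implicit gradient regularization work of Barrett and Dherin. Treat this as one illustrative instance of a "higher order derivatives of the loss" correction; the constants `a`, `lr`, and the function names are assumptions for the example, and the paper's own modified equations may differ in detail.

```python
import math


def gd_iterate(theta0: float, a: float, lr: float, steps: int) -> float:
    """Discrete gradient descent on L(theta) = a * theta^2 / 2."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * a * theta  # gradient of the quadratic is a * theta
    return theta


def gradient_flow(theta0: float, a: float, t: float) -> float:
    """Exact solution of d(theta)/dt = -a * theta."""
    return theta0 * math.exp(-a * t)


def modified_flow(theta0: float, a: float, lr: float, t: float) -> float:
    """Gradient flow on the modified loss L + (lr / 4) * |grad L|^2.

    For the quadratic this adds (lr / 2) * a^2 * theta to the gradient,
    so the decay rate becomes a + lr * a^2 / 2.
    """
    return theta0 * math.exp(-(a + 0.5 * lr * a * a) * t)


if __name__ == "__main__":
    theta0, a, lr, steps = 1.0, 1.0, 0.1, 20
    t = lr * steps
    gd = gd_iterate(theta0, a, lr, steps)
    print("gradient descent:", gd)
    print("gradient flow error:", abs(gradient_flow(theta0, a, t) - gd))
    print("modified flow error:", abs(modified_flow(theta0, a, lr, t) - gd))
```

The modified flow tracks the discrete iterates more closely because $(1 - \eta a)^n = \exp\big(-t\,(a + \eta a^2/2 + \dots)\big)$, so the correction term captures the leading-order discretization error.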

Fig. 4. We visualize the trajectories of gradient descent with momentum (black dots), gradient flow (blue line), and the modified dynamics (red line) on a quadratic loss. The modified continuous dynamics visually track the discrete dynamics much better than the original gradient flow dynamics.

Combining symmetry and modified gradient flow to derive exact learning dynamics

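We now study how weight decay, momentum, stochastic gradients, and finite learning rates all interact to break the conservation laws of gradient flow. Remarkably, even when using a more realistic continuous model for stochastic gradient descent, we can derive exact learning dynamics for the previously conserved quantities. To do this we (i) consider a realistic continuous model for SGD, (ii) project these learning dynamics onto the generator vector fields associated with each symmetry, (iii) harness the geometric constraints introduced by symmetry to derive simplified ODEs, and (iv) solve these ODEs to obtain exact dynamics for the previously conserved quantities. We first consider the continuous model of SGD without momentum incorporating weight decay, stochasticity, and a finite learning rate. In this setting, with weight decay $\lambda$, learning rate $\eta$, and $\hat{g}_{\theta_{\mathcal{A}}}$ denoting the batch gradient with respect to the parameters $\theta_{\mathcal{A}}$, the exact dynamics for the parameter combinations tied to the symmetries are,

$$\text{Translation: } \langle \theta_{\mathcal{A}}(t), \mathbb{1} \rangle = e^{-\lambda t} \langle \theta_{\mathcal{A}}(0), \mathbb{1} \rangle$$

$$\text{Scale: } \|\theta_{\mathcal{A}}(t)\|^2 = e^{-2\lambda t} \|\theta_{\mathcal{A}}(0)\|^2 + \eta \int_0^t e^{-2\lambda (t - \tau)} \|\hat{g}_{\theta_{\mathcal{A}}}(\tau)\|^2 \, d\tau$$

$$\text{Rescale: } \|\theta_{\mathcal{A}_1}(t)\|^2 - \|\theta_{\mathcal{A}_2}(t)\|^2 = e^{-2\lambda t} \left( \|\theta_{\mathcal{A}_1}(0)\|^2 - \|\theta_{\mathcal{A}_2}(0)\|^2 \right) + \eta \int_0^t e^{-2\lambda (t - \tau)} \left( \|\hat{g}_{\theta_{\mathcal{A}_1}}(\tau)\|^2 - \|\hat{g}_{\theta_{\mathcal{A}_2}}(\tau)\|^2 \right) d\tau$$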

Notice how these equations are equivalent to the conservation laws when the learning rate and weight decay both vanish ($\eta = \lambda = 0$). Remarkably, even in typical hyperparameter settings (weight decay, stochastic batches, finite learning rates), these solutions match nearly perfectly with empirical results from modern neural networks (VGG-16) trained on real-world datasets (Tiny ImageNet), as shown in Fig. 5.

Fig. 5. We plot the column sum of the final linear layer (left) and the difference between squared channel norms of the fifth and fourth convolutional layer (right) of a VGG-16 model without batch normalization. We plot the squared channel norm of the second convolution layer (middle) of a VGG-16 model with batch normalization. Both models are trained on Tiny ImageNet with SGD with learning rate , weight decay , batch size , for epochs. Colored lines are empirical and black dashed lines are the theoretical predictions.

Translation dynamics. For parameters with translation symmetry, this equation implies that the sum of these parameters decays exponentially to zero at a rate proportional to the weight decay. In particular, the dynamics do not directly depend on the learning rate nor any information of the dataset due to the lack of curvature in the gradient field for these parameters (as shown in Fig. 2).
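The translation case is easy to check numerically. In the sketch below the parameters are taken to be the logits fed directly into a softmax cross-entropy loss (an assumption made to keep the example tiny), so the gradient always sums to zero. Each SGD step with weight decay then multiplies the parameter sum by exactly $(1 - \eta\lambda)$, regardless of which labels are sampled, and after $n$ steps this is close to $e^{-\lambda t}$ with $t = n\eta$. The function names and hyperparameter values are illustrative only.

```python
import math
import random
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_with_weight_decay(
    theta: List[float], lr: float, wd: float, steps: int, seed: int = 0
) -> List[float]:
    """SGD on softmax cross-entropy where theta are the logits themselves.

    A random label is drawn each step to mimic stochastic batches.
    The gradient softmax(theta) - onehot(y) always sums to zero.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        label = rng.randrange(len(theta))
        probs = softmax(theta)
        grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
        theta = [x - lr * (g + wd * x) for x, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    theta0, lr, wd, steps = [1.0, -0.5, 2.0], 0.01, 0.1, 1000
    final = sgd_with_weight_decay(theta0, lr, wd, steps)
    print("empirical sum:     ", sum(final))
    print("discrete prediction:", sum(theta0) * (1 - lr * wd) ** steps)
    print("continuous e^{-wd t}:", sum(theta0) * math.exp(-wd * lr * steps))
```

Changing the seed changes the individual parameters but not their sum, which mirrors the claim that the translation dynamics do not depend on the data.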

Scale dynamics. For parameters with scale symmetry, this equation implies that the norm for these parameters is the sum of an exponentially decaying memory of the norm at initialization and an exponentially weighted integral of gradient norms accumulated through training. Compared to the translation dynamics, the scale dynamics do depend on the data through the gradient norms accumulated throughout training.

Rescale dynamics. For parameters with rescale symmetry, this equation is the sum of an exponentially decaying memory of the difference in norms at initialization and an exponentially weighted integral of difference in gradient norms accumulated through training. Similar to the scale dynamics, the rescale dynamics do depend on the data through the gradient norms, however unlike the scale dynamics we have no guarantee that the integral term is always positive.


Despite being the central guiding principle in the exploration of the physical world, symmetry has been underutilized in understanding the mechanics of neural networks. In this paper, we constructed a unifying theoretical framework harnessing the geometric properties of symmetry and realistic continuous equations for SGD that model weight decay, momentum, stochasticity, and discretization. We use this framework to derive exact dynamics for meaningful combinations of parameters, which we experimentally verified on large scale neural networks and datasets. Overall, our work provides a first step towards understanding the mechanics of learning in neural networks without unrealistic simplifying assumptions.

For more details check out our ICLR paper or this seminar presentation!


We would like to thank our collaborator Javier Sagastuy-Brena and advisors Surya Ganguli and Daniel Yamins.
We would also like to thank Megha Srivastava for very helpful feedback on this post.

  1. David Saad and Sara Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. Advances in neural information processing systems, 8:302–308, 1995.

  2. Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proc. Natl. Acad. Sci. U. S. A., May 2019.

  3. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.

  4. RF Warming and BJ Hyett. The modified equation approach to the stability and accuracy analysis of finite-difference methods. Journal of computational physics, 14(2):159–179, 1974.

  5. David GT Barrett and Benoit Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020.

  6. Nikola B Kovachki and Andrew M Stuart. Analysis of momentum methods. arXiv preprint arXiv:1906.04285, 2019.

Do Language Models Know How Heavy an Elephant Is?

How heavy is an elephant? How expensive is a wedding ring?

Humans have a pretty good sense of scale, or reasonable ranges of these
numeric attributes, of different objects, but do pre-trained language
representations? Although pre-trained Language Models (LMs) like
BERT have shown a remarkable ability to learn all kinds of knowledge,
it remains unclear whether their representations can capture these types
of numeric attributes from text alone without explicit training data.

In our recent paper,
we measure the amount of scale information that is captured in several
kinds of pre-trained text representations and show that, although
generally a significant amount of such information is captured, there is
still a large gap between their current performance and the theoretical
upper bound. We identify that specifically those text representations
that are contextual and good at numerical reasoning capture scale
better. We also come up with a new version of BERT, called NumBERT, with
improved numerical reasoning by replacing numbers in the pretraining
text corpus with their scientific notation, which more readily exposes
the magnitude to the model, and demonstrate that NumBERT representations
capture scale significantly better than all those previous text
representations.

Scalar Probing

In order to understand to what extent pre-trained text representations, like
BERT representations, capture scale information, we propose a task
called scalar probing: probing the ability to predict a
distribution over values of a scalar attribute of an object. In this
work, we focus specifically on three kinds of scalar attributes: weight,
length, and price.

Here is the basic architecture of our scalar probing task:

In this example, we are trying to see whether the representation of
“dog” extracted by a pre-trained encoder can be used to predict/recover
the distribution of the weight of a dog through a linear model. We probe
three baseline language representations: Word2vec, ELMo, and BERT.
Since the latter two are contextual representations that operate on
sentences instead of words, we feed in sentences constructed using fixed
templates. For example, for weight, we use the template “The X is
heavy”, where X is the object of interest.

We explore the kind of probe that predicts a point estimate of the value
and the kind that predicts the full distribution. For predicting a point
estimate, we use a standard linear ReGRession (we denote as “rgr”)
trained to predict the log of the median of all values for each object
for the scale attribute under consideration. We predict the log because,
again, we care about the general scale rather than the exact value. The
loss is calculated using the prediction and the log of the median of the
ground-truth distribution. For predicting the full distribution, we use
a linear softmax Multi-Class Classifier (we denote as “mcc”) producing a
categorical distribution over the 12 orders of magnitude. The
categorical distribution predicted using the NumBERT (our improved
version of BERT; will be introduced later) representations is shown as
the orange histogram in the above example.
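
The two probe types described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the embeddings are assumed to be precomputed and frozen, and the rgr probe is fit here with ordinary least squares rather than whatever optimizer was actually used.

```python
import numpy as np

NUM_BUCKETS = 12  # one bucket per order of magnitude, as in the text

def _with_bias(embeddings):
    """Append a constant column so the linear probe has a bias term."""
    return np.hstack([embeddings, np.ones((len(embeddings), 1))])

def fit_rgr(embeddings, log_medians):
    """rgr probe: least-squares linear map from embedding to log10(median)."""
    weights, *_ = np.linalg.lstsq(_with_bias(embeddings), log_medians, rcond=None)
    return weights

def predict_rgr(weights, embeddings):
    """Point estimate of the log10 median for each embedding."""
    return _with_bias(embeddings) @ weights

def predict_mcc(W, b, embeddings):
    """mcc probe: linear layer + softmax over the 12 magnitude buckets."""
    logits = embeddings @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(pred, target):
    """Mean cross-entropy between predicted and ground-truth distributions."""
    return float(-(target * np.log(pred + 1e-12)).sum(axis=1).mean())
```

Because both probes are linear, any scale information they recover has to be readily available in the representation itself rather than computed by the probe.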

The ground-truth distributions we use come from the Distributions over
Quantities (DoQ) dataset, which consists of empirical counts of scalar attribute values
associated with >350K nouns, adjectives, and verbs over 10 different
attributes, automatically extracted from a large web text corpus. Note
that during the construction of the dataset, all units for a certain
attribute are first unified to a single one (e.g.
centimeter/meter/kilometer -> meter) and the numeric values are scaled
accordingly. We convert the collected counts for each object-attribute
pair in DoQ into a categorical distribution over 12 orders of magnitude.
In the above example of the weight of a dog, the ground-truth
distribution is shown as the grey histogram, which is concentrated
around 10-100kg.
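
As a rough illustration of that conversion, the sketch below maps raw value counts to a categorical distribution over 12 orders of magnitude. The function name and the choice of the lowest bucket (`MIN_EXPONENT`) are assumptions made for the example; the text does not say which range the 12 buckets cover.

```python
import math

NUM_BUCKETS = 12
MIN_EXPONENT = -2  # hypothetical: the lowest bucket covers [10^-2, 10^-1)

def counts_to_distribution(value_counts):
    """Map {value: count} to probabilities over order-of-magnitude buckets."""
    buckets = [0.0] * NUM_BUCKETS
    for value, count in value_counts.items():
        if value <= 0:
            continue  # log10 is undefined for non-positive values
        idx = math.floor(math.log10(value)) - MIN_EXPONENT
        idx = min(max(idx, 0), NUM_BUCKETS - 1)  # clamp out-of-range values
        buckets[idx] += count
    total = sum(buckets)
    return [b / total for b in buckets] if total else buckets

# Made-up counts of dog weights in kg, only to show the shape of the output.
dog_weight = counts_to_distribution({5: 1, 20: 6, 35: 3})
```

With these made-up counts, most of the probability mass lands in the 10-100kg bucket.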

The better the predictive performance is across all the object-attribute
pairs we are dealing with, the better the pre-trained representations
encode the corresponding scale information.


Before looking at the scalar probing results of these different language
representations, let’s also think about what kind of representations might
be good at capturing scale information and how to improve existing LMs
to capture scale better. All of these models are trained using large
online text corpora like Wikipedia, news, etc. How can their
representations pick up scale information from all this text?

Here is a piece of text from the first document I got when I searched on
Google “elephant weight”:

“…African elephants can range from 5,000 pounds to more than 14,000 pounds (6,350 kilograms)…”

So it is highly likely that the learning of scale is partly mediated by
the transfer of scale information from the numbers (here “5,000”,
“14,000”, etc.) to nouns (here “elephants”), and numeracy, i.e. the
ability to reason about numbers, is probably important for representing

However, previous work has
shown that existing pre-trained text representations, including BERT,
ELMo, and Word2Vec, are not good at reasoning over numbers. For example,
beyond the magnitude of ~500, they cannot even decode a number from its
word embedding, e.g. embedding(“710”) ↛ 710. Thus, we propose to improve
the numerical reasoning abilities of these representations by replacing
every instance of a number in the LM training data with its scientific
notation, and re-pretraining BERT (which we call NumBERT). This enables
the model to more easily associate objects in the sentence directly with
the magnitude expressed in the exponent, ignoring the relatively
insignificant mantissa.
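
To make the preprocessing idea concrete, here is a minimal sketch that rewrites numbers in a sentence in scientific notation. The output format (Python's `e` notation with two mantissa digits) and the function name are assumptions for illustration; the exact token format used to pre-train NumBERT is not given here.

```python
import re

# Matches comma-grouped numbers such as "14,000" as well as plain ones like "710".
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def to_scientific(text, digits=2):
    """Replace every number in `text` with its scientific notation."""
    def repl(match):
        value = float(match.group().replace(",", ""))
        return f"{value:.{digits}e}"
    return NUMBER.sub(repl, text)

sentence = "African elephants can range from 5,000 pounds to more than 14,000 pounds"
converted = to_scientific(sentence)
# converted == "African elephants can range from 5.00e+03 pounds to more than 1.40e+04 pounds"
```

After the rewrite, the exponent (`03` versus `04`) carries the order of magnitude explicitly, which is the part the model is meant to associate with the noun.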


Scalar Probing

The above table shows the results of scalar probing on the DoQ data. We
use three evaluation metrics: Accuracy, Mean Squared Error (MSE), and
Earth Mover’s distance (EMD), and we do the experiments in four domains:
Lengths, Masses, Prices and Animal Masses (a subset of Masses). For MSE
and EMD, the best possible score is 0, while we compute a loose upper
bound of accuracy by sampling from the ground-truth distribution and
evaluating against the mode. This upper bound achieves accuracies of
0.570 for lengths, 0.537 for masses, and 0.476 for prices.
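
Two of these quantities are easy to sketch for categorical distributions over ordered buckets. The code below assumes unit spacing between adjacent buckets for EMD and reads "evaluating against the mode" as checking whether a sampled bucket equals the ground-truth mode; both are assumptions about details the text leaves open, and the function names are hypothetical.

```python
import numpy as np

def emd_1d(p, q):
    """Earth Mover's Distance between two distributions over ordered buckets.

    In one dimension with unit bucket spacing this equals the sum of absolute
    differences between the two cumulative distributions.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.abs(np.cumsum(p - q)).sum())

def accuracy(pred, truth):
    """1.0 if the predicted mode matches the ground-truth mode, else 0.0."""
    return float(np.argmax(pred) == np.argmax(truth))

def sampling_upper_bound(truth):
    """Expected accuracy of sampling from the ground truth and scoring
    against its mode, i.e. the probability mass of the mode."""
    return float(np.max(truth))
```

Unlike accuracy, EMD gives partial credit to a prediction that is off by one bucket compared with one that is off by five.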

For the Aggregate baseline, for each attribute, we compute the empirical
distribution over buckets across all objects in the training set, and
use that as the predicted distribution for all objects in the test set.
Compared with this baseline, we can see that the mcc probe over the best
text representations captures about half (as measured by accuracy) to a
third (by MSE and EMD) of the distance to the upper bound mentioned
above, suggesting that while a significant amount of scalar information
is available, there is a long way to go to support robust commonsense

Specifically, NumBERT representations do consistently better than all
the others on Earth Mover’s Distance (EMD), which is the most robust
metric because of its better convergence properties and robustness to
adversarial perturbations of the data. Word2Vec performs significantly
worse than the contextual representations, even
though the task is noncontextual (since we do not have different
ground-truths for an object occurring in different contexts in our
setting). Also, despite being weaker than BERT on downstream NLP tasks,
ELMo does better on scalar probing, consistent with it being better at
numeracy due to its character-level tokenization.

Zero-shot transfer

We note that DoQ is derived heuristically from web text and contains
noise. So we also evaluate probes trained on DoQ on 2 datasets
containing ground truth labels of scalar attributes: VerbPhysics and the
Amazon Price Dataset.
The first is a human-labeled dataset of relative comparisons, e.g.
(person, fox, weight, bigger). Predictions for this task are made by
comparing the point estimates for rgr and highest-scoring buckets for
mcc. The second is a dataset of empirical distributions of product
prices on Amazon. We retrained a probe on DoQ prices using 12 power-of-4
buckets to support finer grained predictions.
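
A small sketch of the two mechanics mentioned above: turning a pair of probe outputs into a relative comparison, and assigning a price to a power-of-4 bucket. The `tolerance` parameter, the label strings, and the assumption that the lowest price bucket starts at 1 are all hypothetical choices made for the example.

```python
import math

def compare(score_a, score_b, tolerance=0.0):
    """Relation of object A to object B on one attribute.

    The scores can be rgr point estimates (log values) or the indices of the
    highest-scoring mcc buckets.
    """
    if score_a - score_b > tolerance:
        return "bigger"
    if score_b - score_a > tolerance:
        return "smaller"
    return "similar"

def power_of_4_bucket(price, num_buckets=12):
    """Index of the power-of-4 bucket containing `price` (clamped to range)."""
    idx = math.floor(math.log2(price) / 2)  # log base 4 via log2
    return min(max(idx, 0), num_buckets - 1)
```

Comparing bucket indices collapses any two objects that share a bucket into "similar", which is one way to see why a fine-grained distribution adds little for a 3-class comparison.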

The results are shown in the tables above. On VerbPhysics (the table on
the top), rgr+NumBERT performed best, approaching the performance of
using DoQ as an oracle, though short of specialized models for
this task. Scalar probes trained with mcc perform poorly, possibly
because a finer-grained model of predicted distribution is not useful for
the 3-class comparative task. On the Amazon Price Dataset (the table on
the bottom) which is a full distribution prediction task, mcc+NumBERT did
best on both distributional metrics. On both zero-shot transfer tasks,
NumBERT representations were the best across all configurations of
metrics/objectives, suggesting that manipulating numeric representations
of the text in the pre-training corpora can significantly improve
performance on scale prediction.

Moving Forward

In the work above, we introduce a new task called scalar probing, used to
measure how much information about the numeric attributes of objects
pre-trained text representations have captured. We find that while
there is a significant amount of scale information in object
representations (half to a third of the way to the theoretical upper bound), these
models are far from achieving common sense scale understanding. We also
come up with an improved version of BERT, called NumBERT, whose
representations capture scale information significantly better than all
the previous ones.

Scalar probing opens up new exciting research directions to explore. For
example, lots of work has pre-trained large-scale vision & language
models. Probing their representations to see how much scale information
they capture, and systematically comparing them with representations
learned by language-only models, is one such direction.

Also, models learning text representations that predict scale better can
have a great real-world impact. Consider a web query like:

“How tall is the tallest building in the world?”

With a common sense understanding of what a reasonable range of heights
for “building” is, we can detect errors in the current web QA system when there are mistakes in
retrieval or parsing, e.g. when a Wikipedia sentence about a building is
mistakenly parsed as being 19 miles high instead of meters.
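
One way such a check could look, as a sketch only: given a predicted distribution over orders of magnitude for an attribute, flag extracted values that fall in a bucket with negligible probability. The distribution below is made up for illustration, and the function name, threshold, and bucket range are assumptions.

```python
import math

METERS_PER_MILE = 1609.344

def is_plausible(value, distribution, min_exponent=0, threshold=0.01):
    """True if `value` falls in a magnitude bucket with enough probability.

    `distribution[i]` is the probability of [10^(min_exponent+i), 10^(min_exponent+i+1)).
    """
    if value <= 0:
        return False
    idx = math.floor(math.log10(value)) - min_exponent
    if idx < 0 or idx >= len(distribution):
        return False
    return distribution[idx] >= threshold

# Hypothetical height distribution for "building" in meters (made-up numbers):
# buckets [1,10), [10,100), [100,1000), [1000,10^4), [10^4,10^5), [10^5,10^6).
building_height = [0.05, 0.55, 0.38, 0.02, 0.0, 0.0]

ok = is_plausible(19, building_height)                          # 19 meters
suspicious = is_plausible(19 * METERS_PER_MILE, building_height)  # 19 miles
```

Here the 19-meter reading passes while the 19-mile reading is flagged, because it lands several orders of magnitude outside the plausible range.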

Check out the paper Do Language Embeddings Capture
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan

Removing Spurious Feature can Hurt Accuracy and Affect Groups Disproportionately


Machine learning models are susceptible to learning irrelevant patterns.
In other words, they rely on some spurious features that we humans know
to avoid. For example, assume that you are training a model to predict
whether a comment is toxic on social media platforms. You would expect
your model to predict the same score for similar sentences with
different identity terms. For example, “some people are Muslim” and
“some people are Christian” should have the same toxicity score.
However, as shown in [1], training a convolutional
neural net leads to a model which assigns different toxicity scores to
the same sentences with different identity terms. Reliance on spurious
features is prevalent among many other machine learning models. For
instance, [2] shows that state-of-the-art models in object
recognition like ResNet-50 [3] rely heavily on background, so
changing the background can also change their predictions.

(Left) Machine learning models assign different toxicity scores to the
same sentences with different identity terms.
(Right) Machine learning models make different predictions on the same
object against different backgrounds.

Machine learning models rely on spurious features such as background in an image or identity terms in a comment. Reliance on spurious features conflicts with fairness and robustness goals.

Of course, we do not want our model to rely on such spurious features
due to fairness as well as robustness concerns. For example, a model’s
prediction should remain the same for different identity terms
(fairness); similarly its prediction should remain the same with
different backgrounds (robustness). The first instinct to remedy this
situation would be to try to remove such spurious features, for example,
by masking the identity terms in the comments or by removing the
backgrounds from the images. However, removing spurious features can
lead to drops in accuracy at test time [4, 5]. In this
blog post, we explore the causes of such drops in accuracy.

There are two natural explanations for accuracy drops:

  1. Core (non-spurious) features can be noisy or not expressive enough
    so that even an optimal model has to use spurious features to
    achieve the best accuracy
  2. Removing spurious features can corrupt the core features

One valid question to ask is whether removing spurious features leads to
a drop in accuracy even in the absence of these two reasons. We answer
this question affirmatively in our recently published work in the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) [11]. Here, we explain our results.

Removing spurious features can lead to a drop in accuracy even when spurious features are removed properly and core features exactly determine the target!

(Left) When core features are not representative (blurred image), the
spurious feature (the background) provides extra information to identify
the object. (Right) Removing spurious features (gender
information) in the sport prediction task has corrupted other core
features (the weights and the bar).

Before delving into our result, we note that understanding the reasons
behind the accuracy drop is crucial for mitigating such drops. Focusing
on the wrong mitigation method fails to address the accuracy drop.

Before trying to mitigate the accuracy drop resulting from the removal of the spurious features, we must understand the reasons for the drop.

| | Previous work | Previous work | This work |
|---|---|---|---|
| Removing spurious features causes drops in accuracy because… | core features are noisy and not sufficiently expressive. | spurious features are not removed properly and thus corrupt core features. | a lack of training data causes spurious connections between some features and the target. |
| We can mitigate such drops by… | focusing on collecting more expressive features (e.g., high-resolution images). | focusing on more accurate methods for removing spurious features. | focusing on collecting more diverse training data. We show how to leverage unlabeled data to achieve such diversity. |

This work in a nutshell:

  • We study overparameterized models that fit training data perfectly.
  • We compare the “core model” that only uses core features (non-spurious) with the “full model” that uses both core features and spurious features.
  • Using the spurious feature, the full model can fit training data with a smaller norm.
  • In the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data (unseen directions).
  • Though both models fit the training data perfectly, they have different “assumptions” for the unseen directions. This difference can lead to
    • Drop in accuracy
    • Affecting different test distributions (we also call them groups) disproportionately (increasing accuracy in some while decreasing accuracy in others).

Noiseless Linear Regression

Over the last few years, researchers have observed some surprising
phenomena about deep networks that conflict with classical machine
learning. For example, training models to zero training loss leads to
better generalization instead of overfitting [12]. A line
of work [13, 14] found that these unintuitive
results happen even for simple models such as linear regression if the
number of features is greater than the number of training examples, known
as the overparameterized regime.

Accuracy drops due to the removal of spurious features are also
unintuitive. Classical machine learning tells us that removing spurious
features should decrease generalization error (since these features are,
by definition, irrelevant for the task). Analogous to the mentioned
work, we will explain this unintuitive result in overparameterized
linear regression as well.

Accuracy drop due to removal of the spurious feature can be explained in overparameterized linear regression.

Let’s first formalize the noiseless linear regression setup. Recall
that we are going to study a setup in which the target is completely
determined by the core features, and the spurious feature is a single
feature that can be removed perfectly without affecting predictive
performance. Formally, we assume there are \(d\) core features
\(z \in \mathbb{R}^d\) that determine the target \(y \in
\mathbb{R}\) perfectly, i.e., \(y = {\theta^\star}^\top z\).
In addition, we assume there is a single spurious feature \(s\) that
can also be determined by the core features, \(s =
{\beta^\star}^\top z\). Note that the spurious feature can have
information about features that determine the target or it can be
completely unrelated to the target (i.e., for all \(i\),
\(\beta^\star_i \theta^\star_i=0\)).

We consider a setup where the target (\(y\)) is a deterministic function
of the core features (\(z\)). In addition, there is a spurious feature
(\(s\)) that can also be determined by the core features. We compare
two models, the core model that only uses \(z\) to predict \(y\) and the full model which uses both \(z\) and \(s\) to predict
\(y\).
We consider two models:

  • Core model that only uses the core features \(z\) to predict the
    target \(y\), and it is parametrized by
    \({\theta^\text{-s}}\). For a data point with core features
    \(z\), its prediction is \(\hat y =
    {\theta^\text{-s}}^\top z\).
  • Full model that uses the core features \(z\) and also uses the
    spurious feature \(s\), and it is parametrized by
    \({\theta^\text{+s}}\) and \(w\). For a data point with
    core features \(z\) and a spurious feature \(s\), its
    prediction is \(\hat y = {\theta^\text{+s}}^\top z + ws\).

In this setup, the mentioned two reasons that naturally can cause
accuracy drop after removing the spurious feature (depicted in the table
above)  do not exist.

  1. The spurious feature \(s\) adds no information about the target
    \(y\) beyond what already exists in the core features
    \(z\) (reason 1),
  2. Removing \(s\) does not corrupt \(z\) (reason 2).

Motivated by recent work in deep learning, which speculates that
gradient descent converges to the minimum-norm solution that fits
training data perfectly [15], we consider the
minimum-norm solution.

  • Training data: We assume we have \(n < d\) triples of
    \((z_i, s_i, y_i)\).
  • Test data: We assume core features in the test data are from a
    distribution with covariance matrix \(\Sigma =
    \mathbb{E}[zz^\top]\) (we use group and test data distribution
    interchangeably).

In this simple setting, one might conjecture that removing the spurious
feature should only help accuracy. However, we show that this is not
always the case. We exactly characterize the test distributions that are
negatively affected by removing spurious features, as well as the ones
that are positively affected by it.


Let’s first look at a simple example with only one training example and
three core features (\(z_1, z_2\), and \(z_3\)). Let the true
parameters be \(\theta^\star =[2,2,2]^\top\), which results in
\(y=2\), and let the spurious feature parameter be \({\beta^\star}
= [1,2,-2]^\top\), which results in \(s=1\).

First, note that the smallest L2-norm vector that can fit the training
data for the core model is \({\theta^\text{-s}}=[2,0,0]\). On
the other hand, in the presence of the spurious feature, the full model
can fit the training data perfectly with a smaller norm by assigning
weight \(1\) to the feature \(s\)
(\(\|{\theta^\text{-s}}\|_2^2 = 4\) while
\(\|{\theta^\text{+s}}\|_2^2 + w^2 = 2 < 4\)).

Generally, in the overparameterized regime, since the number of training
examples is less than the number of features, there are some directions
of data variation that are not observed in the training data. In this
example, we do not observe any information about the second and third
features.  The core model assigns weight \(0\) to the unseen
directions (weight \(0\) for the second and third features in this
example). However, the non-zero weight for the spurious feature leads to
a different assumption for the unseen directions. In particular, the
full model does not assign weight \(0\) to the unseen directions.
Indeed, by substituting \(s\) with \({\beta^\star}^\top
z\), we can view the full model as not using \(s\) but
implicitly assigning weight \(\beta^\star_2=2\) to the second
feature and \(\beta^\star_3=-2\) to the third feature (unseen
directions at training).

Let’s now look at different examples and the prediction of these two
models.
In this example, removing \(s\) increases the error for a test
distribution with high deviations from zero on the second feature,
where \(\beta^\star_2\) and \(\theta^\star_2\) agree, whereas removing
\(s\) reduces the error for a test distribution
with high deviations from zero on the third feature, where they have opposite signs.
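
The numbers in this example can be checked directly. The sketch below assumes the single training point is \(z = [1, 0, 0]\), which is inferred from the stated values \(y=2\) and \(s=1\) and from the fact that the second and third features are unseen; it is not given explicitly above.

```python
import numpy as np

theta_star = np.array([2.0, 2.0, 2.0])
beta_star = np.array([1.0, 2.0, -2.0])
z_train = np.array([[1.0, 0.0, 0.0]])   # inferred training point (y = 2, s = 1)
y_train = z_train @ theta_star
s_train = z_train @ beta_star

# Minimum-norm interpolating solutions via the pseudoinverse.
theta_core = np.linalg.pinv(z_train) @ y_train                  # ≈ [2, 0, 0]
full = np.linalg.pinv(np.column_stack([z_train, s_train])) @ y_train
theta_full, w = full[:3], full[3]                               # ≈ [1, 0, 0], 1

# Substituting s = beta_star^T z gives the full model's implicit weights on z.
implicit_full = theta_full + w * beta_star                      # ≈ [2, 2, -2]

def squared_error(theta, z):
    """Squared prediction error of parameter `theta` on a test point `z`."""
    return float((theta_star @ z - theta @ z) ** 2)

e2 = np.array([0.0, 1.0, 0.0])   # test point varying only in the second feature
e3 = np.array([0.0, 0.0, 1.0])   # test point varying only in the third feature
errors = {
    "core, 2nd feature": squared_error(theta_core, e2),
    "full, 2nd feature": squared_error(implicit_full, e2),
    "core, 3rd feature": squared_error(theta_core, e3),
    "full, 3rd feature": squared_error(implicit_full, e3),
}
```

Along the second feature the full model's implicit weight matches \(\theta^\star_2\), so it makes no error there while the core model does; along the third feature the full model's implicit weight has the wrong sign, so its error is larger than the core model's.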

Main result

As we saw in the previous example, by using the spurious feature, the
full model incorporates \({\beta^\star}\) into its estimate. The
true target parameter (\(\theta^\star\)) and the true spurious
feature parameters (\({\beta^\star}\)) agree on some of the
unseen directions and do not agree on the others. Thus, depending on
which unseen directions are weighted heavily at test time, removing
\(s\) can increase or decrease the error.

More formally, the weight assigned to the spurious feature is
proportional to the projection of \(\theta^\star\) on
\({\beta^\star}\) on the seen directions. If this number is close
to the projection of \(\theta^\star\) on \({\beta^\star}\)
on the unseen directions (in comparison to 0), removing \(s\)
increases the error, and it decreases the error otherwise. Note that
since we are assuming noiseless linear regression and choose models that
fit training data, the model predicts perfectly in the seen directions
and only variations in unseen directions contribute to the error.

(Left) The projection of \(\theta^\star\) on
\(\beta^\star\) is positive in the seen direction, but it is
negative in the unseen direction; thus, removing \(s\) decreases the
error. (Right) The projection of \(\theta^\star\) on
\(\beta^\star\) is similar in both seen and unseen directions;
thus, removing \(s\) increases the error.

The drop in accuracy at test time depends on the relationship between the true target parameter (\(\theta^\star\)) and the true spurious feature parameters (\({\beta^\star}\)) in the seen and unseen directions.

Let’s now formalize the conditions under which removing the spurious
feature (\(s\)) increases the error. Let \(\Pi =
Z^\top(ZZ^\top)^{-1}Z\), where the rows of \(Z\) are the training core features, denote the projection onto the span of the training data (seen
directions); thus \(I-\Pi\) denotes the projection onto the null space of the training data
(unseen directions). The equation below determines when removing the
spurious feature decreases the error.

The left side is the difference between the projection of \(\theta^\star\) on \(\beta^\star\) in the seen direction
and their projection in the unseen direction scaled by test time
covariance. The right side is the difference between 0 (i.e., not using
spurious features) and the projection of \(\theta^\star\) on
\(\beta^\star\) in the unseen direction scaled by test time
covariance. Removing \(s\) helps if the left side is greater than
the right side.
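
The roles of the seen and unseen directions can be checked numerically. The sketch below assumes random Gaussian data and takes \(Z\) to be the \(n \times d\) matrix whose rows are the training core features. The closed form for \(w\) is derived here from the minimum-norm objective (it is not quoted from the paper), and the code checks it against the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 10
Z = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
beta_star = rng.normal(size=d)

Pi = Z.T @ np.linalg.inv(Z @ Z.T) @ Z   # projection onto seen directions
unseen = np.eye(d) - Pi                 # projection onto unseen directions

# Minimum-norm full model fitted on (z, s) with s = Z @ beta_star.
full = np.linalg.pinv(np.column_stack([Z, Z @ beta_star])) @ (Z @ theta_star)
theta_full, w = full[:-1], full[-1]

# Weight on s: proportional to the projection of theta* on beta* in seen directions.
w_closed = (beta_star @ Pi @ theta_star) / (1.0 + beta_star @ Pi @ beta_star)

# Parameter gaps of the two models; both live entirely in the unseen directions.
core_gap = theta_star - np.linalg.pinv(Z) @ (Z @ theta_star)
full_gap = theta_star - (theta_full + w * beta_star)
```

The core model's gap is \((I-\Pi)\theta^\star\) and the full model's gap is \((I-\Pi)(\theta^\star - w\beta^\star)\), so whether removing \(s\) helps depends on how \(\theta^\star\) and \(w\beta^\star\) compare in the unseen directions that the test covariance emphasizes.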


While the theory applies only to linear models, we now show that in
non-linear models trained on real-world datasets, removing a spurious
feature reduces the accuracy and affects groups disproportionately.

Datasets. We are going to study the CelebA dataset [16], which
contains photos of celebrities along with 40 different attributes.
(See our paper for the results on the
comment-toxicity-detection and MNIST datasets.) We choose wearing
lipstick (indicating if a celebrity is wearing lipstick) as the target
and wearing earrings (indicating if a celebrity is wearing earrings) as
the spurious feature.

Note that although wearing earrings is correlated with wearing lipstick,
we expect our model to not change its prediction if we tell the model
the person is wearing earrings.

In the CelebA dataset, wearing earrings is correlated with wearing
lipstick. In this dataset, if a celebrity wears earrings, they are almost
five times more likely to wear lipstick than not. Similarly, if a
celebrity does not wear earrings, they are almost two times more likely
not to wear lipstick than to wear lipstick.

Setup. We train a two-layer neural network with 128 hidden units. We
flatten the picture and concatenate the binary variable of wearing
earrings to it (we tuned a multiplier for it).  We also want to know how
much each model relies on the spurious feature. In other words, we want
to know how much the model prediction changes as we change the wearing
earrings variable. We call this attacking the model (i.e., swapping the
value of the binary feature of wearing earrings). We run each experiment
50 times and report the average.
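
The attack and the per-group evaluation can be sketched as follows. This is an illustration only: `predict` stands for any trained model that takes the flattened pixels and the binary earrings variable, and the function names are hypothetical.

```python
import numpy as np

def accuracy(predict, pixels, earrings, lipstick):
    """Fraction of examples whose predicted lipstick label is correct."""
    return float(np.mean(predict(pixels, earrings) == lipstick))

def attacked_accuracy(predict, pixels, earrings, lipstick):
    """Accuracy after swapping the binary wearing-earrings feature."""
    return accuracy(predict, pixels, 1 - earrings, lipstick)

def group_accuracies(predict, pixels, earrings, lipstick):
    """Accuracy for each (earrings, lipstick) group that is present in the data."""
    result = {}
    for e in (0, 1):
        for l in (0, 1):
            mask = (earrings == e) & (lipstick == l)
            if mask.any():
                result[(e, l)] = accuracy(
                    predict, pixels[mask], earrings[mask], lipstick[mask]
                )
    return result
```

A model that ignores the earrings input (such as the core model) has identical `accuracy` and `attacked_accuracy` by construction, so the gap between the two measures how much a model relies on the spurious feature.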

Results. The diagram below shows the accuracy of different models, and
their accuracies when they are attacked. Note that, because our attack
focuses on the spurious feature, the core model’s accuracy will remain
the same.

Removal of the wearing earrings feature decreases the overall accuracy. The
change in accuracy is not uniform across groups. The
accuracy decreases in the group where people wear neither
lipstick nor earrings and in the group where they wear both. On the
other hand, accuracy increases for the groups that
wear only one of them.

Let’s break down the diagram and analyze each section.

  • All celebrities together have a reasonable accuracy of 82%. The overall accuracy drops 1% when we remove the spurious feature (core model accuracy). The full model relies on the spurious feature a lot; thus, attacking the full model leads to a ~17% drop in overall accuracy.
  • The celebrities who follow the stereotype (people who wear neither earrings nor lipstick, and people who wear both) have good accuracy overall (both above 85%). The accuracy of both groups drops when we remove the wearing earrings feature (i.e., core model accuracy). Using the spurious feature helps their accuracy; thus, attacking the full model leads to a ~30% drop in their accuracy.
  • The celebrities who do not follow the stereotype have very low accuracy; it is especially low for people who only wear earrings (33% accuracy in comparison to the average of 85%). Removing the wearing earrings feature increases their accuracy substantially. Using the spurious feature does not help their accuracy; thus, attacking the full model does not change accuracy for these groups.

 In non-linear models trained on real-world datasets, removing a spurious feature reduces the accuracy and affects groups disproportionately.

Q&A (Other results):

I know about my problem setting, and I am certain that disjoint features
determine the target and the spurious feature (i.e., for all \(i\),
\(\theta^\star_i\beta^\star_i=0\)). Can I be sure that my
model will not rely on the spurious feature, and removing the spurious
feature definitely reduces the error?
No! Actually, for any
\(\theta^\star\) and \({\beta^\star}\), we can construct a
training set and two test sets with \(\theta^\star\) and
\({\beta^\star}\) as the true parameters and the spurious feature
parameter, such that removing the spurious feature reduces the error in
one but increases the error in the other one (see Corollary 1 in our
paper).

I am collecting a balanced dataset such that the spurious feature and
the target are completely independent (i.e., \(p[y,s]= p[y]p[s]\)).
Can I be sure that my model will not rely on the spurious feature, and
removing the spurious feature definitely reduces the error?
No! For any
\(S \in \mathbb{R}^n\) and \(Y \in \mathbb{R}^n\), we can
generate a training set and two test sets with \(S\) and \(Y\)
as their spurious feature and targets, respectively, such that removing
the spurious feature reduces the error in one but increases the error in
the other (see Corollary 2 in our paper).

What happens when we have many spurious features? Good question! Let’s
say \(s_1\) and \(s_2\) are two spurious features. We show that:

  1. Removing \(s_1\) makes the model more sensitive to
    \(s_2\), and
  2. If a group has high error because of the new assumption about unseen
    directions enforced by using \(s_2\), then it will have an even
    higher error after removing \(s_1\)
    (see Proposition 3 in our paper).

Is it possible to have the same model (a model with the same assumptions
on unseen directions as the full model) without relying on the spurious
feature (i.e., be robust against the spurious feature)?
Yes! You can
recover the same model as the full model without relying on the spurious
feature via robust self-training and unlabeled data (See Proposition 4).


In this work, we first showed that overparameterized models are
incentivized to use spurious features in order to fit the training data
with a smaller norm. Then we demonstrated how removing these spurious
features altered the model’s assumption on unseen directions.
Theoretically and empirically, we showed that this change could hurt the
overall accuracy and affect groups disproportionately. We also proved
that robustness against spurious features (or error reduction by
removing the spurious features) cannot be guaranteed under any condition
of the target and spurious feature. Consequently, balanced datasets do
not guarantee a robust model and practitioners should consider other
features as well. Studying the effect of removing noisy spurious
features is an interesting future direction.


I would like to thank Percy Liang, Jacob Schreiber and Megha Srivastava for their useful comments. The images in the introduction are from [17, 18, 19, 20].

  1. Dixon, Lucas, et al. “Measuring and mitigating unintended bias in text classification.” Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 2018. 

  2. Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). 

  3. He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. 

  4. Zemel, Rich, et al. “Learning fair representations.” International Conference on Machine Learning. 2013. 

  5. Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. 

  6. Khani, Fereshte, and Percy Liang. “Feature Noise Induces Loss Discrepancy Across Groups.” International Conference on Machine Learning. PMLR, 2020. 

  7. Kleinberg, Jon, and Sendhil Mullainathan. “Simplicity creates inequity: implications for fairness, stereotypes, and interpretability.” Proceedings of the 2019 ACM Conference on Economics and Computation. 2019. 

  8. photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169-191. 

  9. Zhao, Han, and Geoff Gordon. “Inherent tradeoffs in learning fair representations.” Advances in neural information processing systems. 2019. 

  10. photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. 

  11. Khani, Fereshte, and Percy Liang. “Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately.” arXiv preprint arXiv:2012.04104 (2020). 

  12. Nakkiran, Preetum, et al. “Deep double descent: Where bigger models and more data hurt.” arXiv preprint arXiv:1912.02292 (2019). 

  13. Hastie, Trevor, et al. “Surprises in high-dimensional ridgeless least squares interpolation.” arXiv preprint arXiv:1903.08560 (2019). 

  14. Raghunathan, Aditi, et al. “Understanding and mitigating the tradeoff between robustness and accuracy.” arXiv preprint arXiv:2002.10716 (2020). 

  15. Gunasekar, Suriya, et al. “Implicit regularization in matrix factorization.” 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018. 

  16. Liu, Ziwei, et al. “Deep learning face attributes in the wild.” Proceedings of the IEEE international conference on computer vision. 2015. 

  17. Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). 

  18. Garg, Sahaj, et al. “Counterfactual fairness in text classification through robustness.” Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019. 

  19. photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169-191. 

  20. photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. 

Blue People v. City of Ney


Discriminatory behavior towards certain groups by machine learning (ML) models is especially concerning in critical applications such as hiring. This blog post explains one source of discrimination: the reliance of ML models on different groups’ data distributions. We will show that when ML models use noisy features (which are pervasive in the real world, e.g., exam scores), they’re incentivized to devalue a good candidate from a lower-performing group. This blog post is based on:

Fereshte Khani and Percy Liang, “Feature Noise Induces Loss Discrepancy
Across Groups.” International Conference on Machine Learning. PMLR, 2020

The findings are illustrated by reviewing the hiring process in the
fictitious city of Ney, where recently a group of people has accused the
government of discrimination.

Hiring people in Ney

The government of Ney wants to hire qualified people. Each person in Ney has a skill level that is normally distributed with a mean of \(\mu\) and a standard deviation
of \(\sigma_\text{skill}\). A person is qualified if their skill level is greater than 0 and non-qualified
otherwise. The government wants to hire qualified people (all people
with skills greater than 0). For example, Alice, with a skill level of 2, is
qualified, but Bob, with a skill level of -1, is not qualified.

The skill levels of the people in Ney are normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma_\text{skill}\).

To assess people’s skills, the government created an exam. The exam score is a noisy indicator of the applicant’s skill since it cannot capture the true skill of a person (e.g., the same applicant would score differently on different versions of the SAT). In the city of Ney, exam noise is nice and simple: if an individual has skill \(z\), then their
score is distributed as \(\mathcal{N}(z, \sigma_\text{noise}^2)\),
where \(\sigma_\text{noise}^2\) indicates the variance of noise
on the exam.

The exam score of an individual with a skill of \(z\) is a random variable normally distributed with a mean of \(z\) and a standard deviation of \(\sigma_\text{noise}\).

The government wants to choose a threshold \(\tau\), and hire all
people whose exam scores are greater than \(\tau\). There are two
kinds of errors that the government can make:

  1. Not hiring a qualified person (\(z > 0 \land x \le \tau\))
  2. Hiring a non-qualified person (\(z \le 0 \land x > \tau\))

For simplicity, let’s assume the government cares about these two types
of errors equally and wants to minimize the overall error, i.e., the
number of non-qualified hired people plus the number of qualified
non-hired people.

The government’s goal is to find a cut-off threshold such that it minimizes the error.
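
Written out from the two error types listed above, and assuming both are weighted equally as stated, the quantity the government minimizes can be sketched as follows. The name \(\mathrm{Error}(\tau)\) is introduced here for illustration only.

```latex
\mathrm{Error}(\tau)
  = \mathbb{P}(z > 0 \land x \le \tau)
  + \mathbb{P}(z \le 0 \land x > \tau)
```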

Given all exam scores and knowledge of the skill distribution of the people,
what cut-off threshold should the government use to minimize the error (the above equation)?
Is it a good strategy for the government to simply use 0 as the
threshold and hire all individuals with scores greater than zero?

Let’s consider an example where the skill distribution
is \(\mathcal{N}(-1,1)\), and the exam noise
has a standard deviation of \(\sigma_\text{noise}=1\). The following lines of code plot
the average error for various thresholds for this example. As
illustrated, 0 is not the best threshold to use. In fact, in this
example, a threshold of \(\tau=1\) leads to minimum error.

A simple example with \(\mu=-1\) and \(\sigma_\text{skill}=\sigma_\text{noise}=1\). As shown on the right, accepting individuals with a score higher than \(0\) does not result in the minimum error.
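
The threshold sweep described above can be approximated with a small Monte Carlo simulation. This is a minimal sketch rather than the original plotting code: the sample size, the seed, the threshold grid, and the helper names `simulate` and `average_error` are all assumptions made for illustration.

```python
import random


def simulate(mu, sigma_skill, sigma_noise, n, seed=0):
    """Draw n (skill, exam score) pairs: skill ~ N(mu, sigma_skill^2), score = skill + noise."""
    rng = random.Random(seed)
    skills = [rng.gauss(mu, sigma_skill) for _ in range(n)]
    scores = [z + rng.gauss(0.0, sigma_noise) for z in skills]
    return skills, scores


def average_error(tau, skills, scores):
    """Fraction of people who are qualified but rejected, or non-qualified but hired."""
    mistakes = sum((z > 0) != (x > tau) for z, x in zip(skills, scores))
    return mistakes / len(skills)


# Running example from the post: mu = -1, sigma_skill = sigma_noise = 1.
skills, scores = simulate(mu=-1.0, sigma_skill=1.0, sigma_noise=1.0, n=100_000)

# Hypothetical grid of candidate thresholds from -2.0 to 4.0 in steps of 0.25.
thresholds = [t / 4 for t in range(-8, 17)]
errors = {tau: average_error(tau, skills, scores) for tau in thresholds}
best_tau = min(errors, key=errors.get)

print(f"error at tau=0: {errors[0.0]:.4f}")
print(f"error at tau=1: {errors[1.0]:.4f}")
print(f"best threshold on the grid: {best_tau}")
```

Because the error curve is nearly flat around its minimum, the best grid point can vary slightly with the seed, but it should sit close to 1 and clearly away from 0.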

The government wants to minimize the number of hired people with negative skill levels + the number of non-hired people with positive skill levels. Hiring all people with positive exam scores (a noisy indicator of the skill) is not optimal.

If 0 is not always the optimal threshold, then what is the optimal
threshold for minimizing error for different values of \(\mu,
\sigma_\text{skill}\) and \(\sigma_\text{noise}\)?
Generally, given a person’s exam score (\(x\)) and the skill level distribution (\(\mathbb{P}(z)\)), what can we infer
about their real skill (\(z\))? Here is where Bayesian inference
comes in.

Bayesian inference  

Let’s see what we can infer about a person’s skill given their exam score and knowing the skill level distribution
\(\mathbb{P}(z)\) (known as the prior distribution since it shows the prior over a person’s skill). Using Bayes’ rule, we can calculate \(\mathbb{P}(z \mid x)\) (known as the posterior distribution since it shows the distribution over a person’s skill after observing their score).

Let’s first consider two extreme cases:

  1. If the exam is completely precise
    (i.e., \(\sigma_\text{noise}=0\)), then the exam score is
    the exact indicator of a person’s skill (irrespective of the prior
    distribution).
  2. If the exam is pure noise (i.e., \(\sigma_\text{noise} \rightarrow \infty\)),
    then the exam score is meaningless, and
    the best estimate for a person’s skill is the average
    skill \(\mu\) (irrespective of the exam score).

Intuitively, when the noise variance has a value between \(0\) and \(\infty\), the best estimate of a person’s skill is a number
between their exam score (\(x\)) and the average skill
(\(\mu\)). The figure below shows the standard formulation of the
posterior distribution \(\mathbb{P}(z \mid x)\) after observing
an exam score (\(x_0\)). For more details on how to derive this
formula, see

Posterior distribution of a person’s skill after observing their exam score (\(x_0\)).
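
For reference, the standard result for a normal prior combined with normally distributed noise can be written as below. This is the textbook normal-normal posterior in the post's notation, offered as a sketch rather than a reproduction of the original figure.

```latex
z \mid x_0 \;\sim\; \mathcal{N}\!\left(
  \frac{\sigma_\text{skill}^2 \, x_0 + \sigma_\text{noise}^2 \, \mu}
       {\sigma_\text{skill}^2 + \sigma_\text{noise}^2},
  \;
  \frac{\sigma_\text{skill}^2 \, \sigma_\text{noise}^2}
       {\sigma_\text{skill}^2 + \sigma_\text{noise}^2}
\right)
```

Setting \(\sigma_\text{noise}=0\) gives a mean of \(x_0\), and letting \(\sigma_\text{noise} \rightarrow \infty\) gives a mean of \(\mu\), matching the two extreme cases above.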

Based on this formula (and as hypothesized), depending on the amount of noise, \(\mathbb{E}[z \mid x]\) is a number between \(x\) and \(\mu\).

An applicant’s expected skill level is between their exam score and the average skill among Ney people. If the exam is noisier, it is closer to the average skill; if the exam is more precise, it is closer to the exam score.

Optimal threshold

Now that we have exactly characterized the posterior distribution
(\(\mathbb{P}(z \mid x)\)), the government can find the optimal
threshold. For any exam score \(x\), if the government hires people
with score \(x\), it incurs \(\mathbb{P}(z \le 0 \mid x)\)
error (probability of hiring non-qualified people). On the other hand,
if it does not hire people with score \(x\), it
incurs \(\mathbb{P}(z > 0 \mid x)\) error (probability of
not hiring qualified people). Thus, in order to minimize the error, the
government should hire a person iff \(\mathbb{P}(z > 0 \mid x) >
\mathbb{P}(z \le 0 \mid x)\). Since the posterior distribution is a
normal distribution, which is symmetric about its mean, the government must hire an applicant
iff \(\mathbb{E}[z \mid x] > 0\).

Using the formulation in the previous section, we have:

Therefore, the optimal threshold is:
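
Assuming the standard normal-normal posterior mean (and \(\sigma_\text{skill} > 0\)), the two steps can be sketched as follows. The symbol \(\tau^*\) for the optimal threshold is introduced here for illustration.

```latex
\mathbb{E}[z \mid x]
  = \frac{\sigma_\text{skill}^2 \, x + \sigma_\text{noise}^2 \, \mu}
         {\sigma_\text{skill}^2 + \sigma_\text{noise}^2} > 0
\;\iff\;
x > -\mu \, \frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\qquad\Longrightarrow\qquad
\tau^* = -\mu \, \frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
```

With \(\mu=-1\) and \(\sigma_\text{skill}=\sigma_\text{noise}=1\), this gives \(\tau^* = 1\).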

In our running example with average skill \(\mu=-1\)
and \(\sigma_\text{skill} = \sigma_\text{noise}=1\), the optimal threshold is 1.
The figure below shows how the optimal threshold varies according
to \(\mu\) and \(\sigma_\text{noise}\).
As \(\sigma_\text{noise}\) increases or \(\mu\) decreases,
the optimal threshold moves farther away from \(0\).

(left) The optimal threshold increases as the average of the prior distribution decreases (with a fixed exam noise \(\sigma_\text{noise} > 0\)). (right) The optimal threshold increases if the exam noise increases (with a fixed average skill \(\mu < 0\)). Note that, if exam scores are not noisy or the average skill is zero, then the optimal threshold is zero.

As exams become more noisy or the average skill becomes more negative, the optimal threshold moves further away from 0.

What does machine learning have to do with all of this?

So far, we precisely identified the optimal cut-off threshold given the
exact knowledge of \(\mu, \sigma_\text{skill}\),
and \(\sigma_\text{noise}\). But how can the government find the
optimal threshold using observational data? This is where machine
learning (ML) comes into the picture.

Let’s imagine very favorable conditions. Let’s assume everyone (an infinite number of them!) takes the exam, the government hires all of them and observes their true skills. Further, assume the modeling assumption is perfectly correct (i.e., both the true prior distribution and conditional distribution are normal). What would happen if the government trains a model with an infinite number of \((x,z)\) pairs?

The government has collected lots of data and now wants to use ML models to predict the best threshold that minimizes the error.

Before delving into this, we would like to note that in real-world
scenarios, we do not have infinite data (finite data issues); the
government does not hire everyone (selection bias issues), and the true
skill is not perfectly observable (target noise/biases issues).
Furthermore, the modeling assumptions are often incorrect (model
misspecification issues). Each of these issues may affect the model
adversely; however, in this blog post our goal is to analyze the model
decisions when none of these issues exist. In the next section, we will show that discrimination occurs even under these ideal conditions.

Under these very favorable conditions and the right loss function,
machine learning algorithms can perfectly predict \(\mathbb{E}[z
\mid x]\) from \(x\); therefore, they can find the optimal threshold
that minimizes the error. The following few lines of Python code show
how linear regression and logistic regression fit the data. In this
example, we set \(\mu = -1,
\sigma_\text{skill}=\sigma_\text{noise}=1\), and as shown in
the figure on the right, the cut-off threshold predicted by the model is
one, which matches the optimal threshold as we observed previously.

A simple example along with the predicted cut-off
threshold for linear and logistic regression. The predicted cut-off
threshold results in the minimum error, as previously discussed.
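
The linear regression half of this experiment can be sketched with plain least squares on simulated data. This is an illustrative reconstruction, not the original code: the sample size, the seed, and the helper `fit_line` are assumptions, and the logistic regression fit is omitted to keep the example dependency-free.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Simulate (exam score, skill) pairs for the running example.
rng = random.Random(0)
mu, sigma_skill, sigma_noise = -1.0, 1.0, 1.0
skills = [rng.gauss(mu, sigma_skill) for _ in range(200_000)]
scores = [z + rng.gauss(0.0, sigma_noise) for z in skills]

# Regress skill on score: the fitted line estimates E[z | x].
slope, intercept = fit_line(scores, skills)

# The learned cut-off is the score at which the predicted skill crosses 0.
learned_threshold = -intercept / slope

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"learned threshold={learned_threshold:.3f}")
```

With enough samples the fitted slope approaches 0.5 and the intercept approaches -0.5, so the learned threshold approaches 1, the same value obtained from the posterior mean.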

Under very favorable conditions, machine learning models find the optimal threshold, which is a function of average skill, exam noise, and skill variance among people.

Optimal thresholds for different groups

So far, we have shown how to calculate the optimal threshold and
illustrated that ML models also recover this threshold. Let’s now
analyze the optimal threshold when different groups exist in the
population. There are two kinds of people in the city of Ney: blue and red. The
blue people’s skills are normally distributed centered
on \(\mu_\text{blue}\), and the red people’s skills are normally
distributed centered on \(\mu_\text{red}\). The standard deviation for
both groups is \(\sigma_\text{skill}\). There can be various
reasons for disparities between groups, for example historically blue
people might not have been allowed to attend school.

In Ney, people are divided into two groups: blue and red. The blue people have a lower average skill level than the red people.

First of all, let’s see what happens if the exam is completely precise. As
previously discussed, in this case the optimal threshold to use is 0 for
both groups, independent of their distributions. Thus, both groups are
held to the same standard, and the error for the government is 0.

If there is no noise in the exam, then zero is the optimal threshold for both groups and leads to zero error.

Now let’s analyze the case where the exam is noisy
(\(\sigma_\text{noise} > 0\)). As discussed in the prior
sections, the optimal threshold depends on the average of the prior
distribution, thus the optimal threshold differs between blue and red
groups. Therefore, if the government knows the demographic information,
then it’s a better strategy for the government to classify different
groups separately (in order to minimize the error). In particular, the
government can calculate the optimal threshold for blue and red people
using Bayesian inference.
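
As a minimal sketch of the per-group calculation, the snippet below applies the threshold that follows from the normal-normal posterior mean, \(-\mu\,\sigma_\text{noise}^2/\sigma_\text{skill}^2\), to each group. The group means used here (-1.0 for blue, -0.5 for red) are hypothetical values chosen only for illustration.

```python
def optimal_threshold(mu, sigma_skill, sigma_noise):
    """Score above which the posterior mean skill E[z | x] is positive."""
    return -mu * sigma_noise**2 / sigma_skill**2


# Hypothetical group means; both groups share the same skill spread and exam noise.
mu_blue, mu_red = -1.0, -0.5
sigma_skill, sigma_noise = 1.0, 1.0

tau_blue = optimal_threshold(mu_blue, sigma_skill, sigma_noise)
tau_red = optimal_threshold(mu_red, sigma_skill, sigma_noise)

# With a perfectly precise exam, both groups get the same threshold of 0.
tau_blue_precise = optimal_threshold(mu_blue, sigma_skill, sigma_noise=0.0)
tau_red_precise = optimal_threshold(mu_red, sigma_skill, sigma_noise=0.0)

print(f"noisy exam:   blue={tau_blue}, red={tau_red}")
print(f"precise exam: blue={tau_blue_precise}, red={tau_red_precise}")
```

Under these assumed values the blue threshold is twice the red one, and the gap disappears only when the exam noise is zero.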

People in a group that has lower average skills need to pass a higher bar for hiring! Not only do blue people need to overcome other associated effects of being in a group with lower average skills, they also need to pass a higher bar to get hired.                  

The cut-off threshold for hiring is higher for blue people in comparison to the red people.

As stated, the government uses a higher threshold for people in a group
with a lower average skill! Consider two individuals with the same skill
level but from different groups. The blue person is less likely to get
hired by the government than the red person. Surprisingly, blue people
who are already in a group with a lower average skill (which probably
affects their confidence and society’s view of them) need to also pass a
higher bar to get hired!
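
To make the comparison concrete, the sketch below computes the chance of being hired for two people with the same skill but different groups. The group means (-1.0 and -0.5), the shared skill level of 1.0, and the helper names are assumptions for illustration; the thresholds come from the normal-normal posterior mean.

```python
import math


def normal_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def optimal_threshold(mu, sigma_skill, sigma_noise):
    """Score above which the posterior mean skill E[z | x] is positive."""
    return -mu * sigma_noise**2 / sigma_skill**2


def hire_probability(skill, threshold, sigma_noise):
    """P(exam score > threshold) for a person whose true skill is `skill`."""
    return 1.0 - normal_cdf((threshold - skill) / sigma_noise)


sigma_skill = sigma_noise = 1.0
tau_blue = optimal_threshold(-1.0, sigma_skill, sigma_noise)  # hypothetical blue mean
tau_red = optimal_threshold(-0.5, sigma_skill, sigma_noise)   # hypothetical red mean

skill = 1.0  # both applicants are equally (and genuinely) qualified
p_blue = hire_probability(skill, tau_blue, sigma_noise)
p_red = hire_probability(skill, tau_red, sigma_noise)

print(f"blue applicant hired with probability {p_blue:.3f}")
print(f"red applicant hired with probability  {p_red:.3f}")
```

Under these assumed numbers the blue applicant is hired about half the time and the red applicant roughly 69% of the time, despite identical skill.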

Finally, note that the gap between thresholds for the different groups
grows as the noise increases.

As the exam noise increases, the gap between the optimal thresholds among different groups widens. Blue people need to get a better score than red people on the exam to get hired.

A blue person has a lower chance of getting hired in comparison with a red person with the same skill.


We examined the discriminatory effect of relying on noisy features. When ML models use noisy features, they’re naturally incentivized to devalue a good score when the candidate in question comes from an overall lower-performing group. Note that noisy features are prevalent in any real-world application (here, we assumed that noise is the same among all individuals, but it’s usually worse for disadvantaged groups). Ideally, we would like to improve the features to better reflect a candidate’s skill/potential or make the features more closely approximate the job requirements. If that’s not possible, it’s important to be conscious that the “optimal decision” is to discriminate, and we should adjust our process (e.g., hiring) in acknowledgment that group membership can shade an individual’s evaluation.

Frequently asked questions

Can we just remove the group membership information, so the model treats individuals from both groups similarly?

Unlike this example where group membership is a removable feature,
real-world datasets are more complex. Usually, datasets contain many
features such that the group membership can be predicted from them
(recall that ML models benefit from predicting group membership since it
lowers error). Thus, it is not obvious how to remove group membership in
these datasets. See
[1,2,3] for some efforts on removing group information.

Why should we treat these two groups similarly when their distributions are inherently different? Utilizing group membership information reduces error overall and for both groups!

Fairness in machine learning usually studies the impact of ML algorithms
on groups according to protected attributes such as sex, sexual
orientation, race, etc. Usually, there has been some discrimination
towards these groups throughout history, which leads to huge disparities
among their distributions. For example, women (because of their sex)
were not allowed to go to universities. Thus, these disparities are not
inherent and could (and probably should!) change over time. For
instance, see women in the labor force

Another reason to avoid relying on disparities among protected groups in
models is feedback loops. Feedback loops might exacerbate distributional
disparities among protected groups over time (e.g., few women get
accepted → self-doubt among women increases → women perform
worse on the exam → fewer women get accepted, and so on). For
instance, see
[5] and

Finally, note that although the government objective may be to minimize the
error by weighting the costs of hiring non-qualified and non-hiring
qualified candidates similarly, it is not clear whether the group
objectives should be the same. For example, a group might be worse off
as a result of the government not hiring its qualified members than if
the government had hired its non-qualified members (for example, in
settings where the lack of minority role models in higher-level
positions leads to a lower perceived sense of belonging in other members
of a group). Thus, using group membership to minimize the error is not
necessarily the most beneficial outcome for a group; and depending on
the context we might need to minimize other objectives.

What about other notions of fairness in machine learning?

In this blog post, we studied the ML model’s prediction for two similar individuals (here, with the same \(z\)) from different groups (blue vs. red). This is referred to as the counterfactual notion of fairness. There is another common notion of fairness known as the statistical notion of fairness, which looks at the groups as a whole and compares their incurred error (it is also common to compare the error incurred by qualified members of different groups, known as equal opportunity [7]). Statistical and counterfactual notions of fairness are independent of each other, and satisfying one does not guarantee satisfying the other. Another consequence of feature noise is a trade-off between these two notions of fairness, which is beyond this blog post’s scope. See our paper [8] for critiques regarding these two notions and the effect of feature noise on statistical notions of fairness.


I would like to thank Percy Liang, Megha Srivastava, Frieda Rong, Rishi Bommasani, Yeganeh Alimohammadi, and Michelle Lee for their useful comments.


A Model-Based Approach Towards Identifying the Brain's Learning Algorithms



One of the tenets of modern neuroscience is that the brain modifies the
strengths of its synaptic connections (“weights”) during learning in
order to better adapt to its environment. However, the underlying
learning rules (“weight updates”) in the brain are currently unknown.
Many proposals have been suggested. They range from Hebbian-style
mechanisms, which seem biologically plausible but are not very effective
as learning algorithms because they prescribe purely local changes to
the weights between two neurons that increase only if they activate
together, to backpropagation, which is effective from a learning
perspective because it assigns credit to neurons along the entire downstream
path from outputs to inputs, but has numerous biologically implausible
components.

A major long-term goal of computational neuroscience is to identify
which learning rules actually drive learning in the brain. A further
difficulty is that we do not even have strong ideas for what needs to be
measured in the brain to quantifiably assert that one learning rule is
more consistent with those measurements than another learning rule. So
how might we approach these issues? We take a simulation-based approach,
meaning that experiments are done on artificial neural networks rather
than real brains. We train over a thousand artificial neural networks
across a wide range of possible learning rule types (conceived of as
“optimizers”), system architectures, and tasks, where the ground truth
learning rule is known, and quantify the impact of these choices. Our
work suggests that recording activities from several hundred neurons,
measured semi-regularly during learning, may provide a good basis to
identify learning rules — a testable hypothesis within reach of
current neuroscience tools!

Background: A Plethora of Theories and a Paucity of Evidence

The brain modifies the connections between neurons during learning to
improve behavior; however, the underlying rules that govern these
modifications are unknown. The most famous proposed learning rule is
“Hebbian learning”, also known by the mantra: “neurons that fire
together, wire together”. In this proposal, a synaptic connection
strengthens if one neuron (“pre-synaptic”) consistently sends a signal
to another neuron (“post-synaptic”). The changes prescribed by Hebbian
learning are “local” in that they do not take into account a synapse’s
influence further downstream in the network. This locality makes
learning rather slow even in the cases where additional issues, such as
the weight changes becoming arbitrarily large, are mitigated. Though
there have been many suggested theoretical strategies to deal with this
problem, commonly involving simulations with artificial neural networks
(ANNs), these strategies appear difficult to scale up to solve
large-scale tasks such as ImageNet categorization.

This property of local changes is in stark contrast to backpropagation,
the technique commonly used to optimize artificial neural networks. In
backpropagation, as the name might suggest, an error signal is
propagated backward along the entire downstream path from the outputs of
a model to the inputs of the model. This allows credit to be effectively
assigned to every neuron along the path.

Although backpropagation has long been a standard component of deep
learning, its plausibility as a biological learning rule (i.e. how the
brain modifies the strengths of its synaptic connections) is called into
question for several reasons. Chief among them is that backpropagation
requires perfect symmetry, whereby the backward error-propagating
weights are the transpose of the forward inference weights, for which
there is currently little biological support.

Avoiding weight symmetry. Backpropagation naturally couples the
forward and backward weights. This constraint can be relaxed by
uncoupling them, thereby generating a spectrum of learning rule
hypotheses about how the backward weights may be updated.
For more details, see our recent prior work.
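
The idea of uncoupling forward and backward weights can be sketched on a tiny two-layer linear network. This is an illustrative toy, not the setup used in the experiments: the layer sizes, the random data, and the function name `hidden_layer_update` are assumptions. Backpropagation sends the output error back through the transpose of the forward weights, while a feedback-alignment-style rule sends it back through a separate fixed random matrix.

```python
import numpy as np


def hidden_layer_update(x, target, w1, w2, feedback):
    """Update direction for w1 in the network output = w2 @ (w1 @ x).

    `feedback` carries the output error back to the hidden layer:
    backpropagation uses w2.T, feedback alignment uses a fixed random matrix.
    """
    hidden = w1 @ x                   # forward pass: hidden activity
    output = w2 @ hidden              # forward pass: network output
    error = output - target           # output error for a squared loss
    hidden_error = feedback @ error   # error signal delivered to hidden units
    return np.outer(hidden_error, x)  # same shape as w1


rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2       # hypothetical layer sizes
x = rng.normal(size=n_in)
target = rng.normal(size=n_out)
w1 = rng.normal(size=(n_hidden, n_in))
w2 = rng.normal(size=(n_out, n_hidden))
b = rng.normal(size=(n_hidden, n_out))  # fixed random backward weights

backprop_update = hidden_layer_update(x, target, w1, w2, feedback=w2.T)
fa_update = hidden_layer_update(x, target, w1, w2, feedback=b)

print("backprop update:\n", backprop_update)
print("feedback alignment update:\n", fa_update)
```

The two rules differ only in the `feedback` argument, which is what makes a spectrum of intermediate hypotheses possible: any rule for how the backward weights are set or updated plugs into the same slot.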

Recent approaches, from us and others
[5], introduce approximate
backpropagation strategies that do not require this symmetry, and can
still succeed at large-scale learning as backpropagation does. However,
given the number of proposals, a natural question to ask is how
realistic they are. At the moment, our hypotheses are governed by domain
knowledge that specifies what “can” and “cannot” be biologically
plausible (e.g. “exact weight symmetry is likely not possible” or
“separate forward and backward passes during learning seem
implausible”), as well as characterizations of ANN task performance
under a given learning rule (which is not always directly measurable
from animal behavior). In order to be able to successfully answer this
question, we need to be able to empirically refute hypotheses. In
other words, we would ideally want to know what biological data to
collect in order to claim that one hypothesis is more likely than
another.

More concretely, we can ask: what specific measurements from the brain,
in the form of individual activation patterns over time, synaptic
strengths, or paired-neuron input-output relations, would allow one to
draw quantitative comparisons of whether the observations are more
consistent with one or another specific learning rule? For example,
suppose we record neural responses (“activation patterns”) while an
animal is learning a task. Would these data be sufficient to enable us
to broadly differentiate between learning rule hypotheses, e.g. by
reliably indicating that one learning rule’s changes over time more
closely match the changes measured from real data than those prescribed
by another learning rule?

Some potential observables to measure on which to separate candidate
learning rule hypotheses. (Pyramidal neuron schematic adapted from Figure
4 of [6])

Answering this question turns out to be a substantial challenge, because
it is difficult on purely theoretical grounds to identify which patterns
of neural changes arise from given learning rules, without also knowing
the overall network connectivity and reward target (if any) of the
learning system.

But, there may be a silver lining. While ANNs consist of units that are
highly simplified with respect to biological neurons, recent progress
within the past few years has shown that the internal representations that
emerge in trained deep ANNs often overlap strongly with representations
in the brain, and are in fact quantifiably similar to many
neurophysiological and behavioral observations in animals
[7]. For
instance, task-optimized, deep convolutional neural networks (CNNs) have
emerged as quantitatively accurate models of encoding in primate visual
cortex [8,
10]. This is
due to (1) their cortically-inspired architecture, a cascade of
spatially-tiled linear and nonlinear operations; and (2) their being
optimized to perform certain behaviors that animals must perform to
survive, such as object recognition
[11]. CNNs trained
to recognize objects on ImageNet predict neural responses of primate
visual cortical neurons better than any other model class. Thus, these
models are, at the moment, some of our current best algorithmic
“theories” of the brain — a system that was ultimately not designed by
us, but rather the product of millions of years of evolution. On the
other hand, ANNs are designed by us — so the ground truth learning
rule is known and every unit (artificial “neuron”) can be measured up to
machine precision.

Can we marry what we can measure in neuroscience with what we can
conclude from machine learning, in order to identify what experimentally
measurable observables may be most useful for inferring the underlying
learning rule? If we can’t do this in our models, then it seems very
unlikely to be able to do this in the real brain. But if we can do this
in principle, then we are in a position to generate predictions as to
what data to collect, and whether that is even within reach of current
experimental neuroscience tools.


We adopt a two-stage “virtual experimental” approach. In the first
stage, we train ANNs with different learning rules, across a variety of
architectures, tasks, and associated hyperparameters. These will serve
as our “model organisms” on which we will subsequently perform idealized
neuroscience measurements. In the second stage, we calculate aggregated
statistics (“measurements”) from each layer of the models as features
from which to train simple classifiers that classify the category that a
given learning rule belongs to (specified below). These classifiers
include the likes of a linear SVM, as well as simple non-linear ones
such as a Random Forest and a 1D convolutional two-layer perceptron.

Overall approach. Observable statistics are generated from each
neural network’s layer, through the model training process for each
learning rule. We take a quantitative approach whereby a classifier is
cross-validated and trained on a subset of these trajectories and
evaluated on the remaining data.

Generating a large-scale dataset is crucial to this endeavor, in order
to both emulate a variety of experimental neuroscience scenarios and be
able to derive robust conclusions from them. Thus, in the first stage,
we train ANNs on tasks and architectures that have been shown to explain
variance in neural responses from sensory (visual and auditory)
brain areas [8].
These include supervised tasks across vision and audition, as well as
self-supervised ones. We consider both shallow and deep feedforward
architectures on these tasks, with depths comparable to what is
considered reasonable for shallower non-primate (e.g.
[13]) and
deeper primate sensory systems.

The learning rules, tasks, architectures, and hyperparameters from which
we generate data, comprising over a thousand training experiments in total.

In the second stage, we train classifiers on the observable statistics from these ANNs to predict the learning rules (as specified in the table above) used to train them.
The four learning rules were chosen as they span the space of commonly
used variants of backpropagation (SGDM and Adam), as well as potentially
more biologically-plausible “local” learning rules (Feedback
Alignment (FA)
and Information Alignment (IA)) that efficiently
train networks at scale to varying degrees of performance but avoid exact weight
symmetry.

Because the primary aim of this study is to determine the extent to which
different learning rules lead to different encodings within ANNs, we
begin by defining representative features that can be drawn from the
course of model training. For each layer in a model, we consider three
measurements: weights of the layer, activations from the layer, and
layer-wise activity change of a given layer’s outputs relative to its
inputs. We choose ANN weights to analogize to synaptic strengths in the
brain, activations to analogize to post-synaptic firing rates, and
layer-wise activity changes to analogize to paired measurements that
involve observing the change in post-synaptic activity with respect to
changes induced by pre-synaptic input.

Defining observable statistics.

For each measure, we consider three functions applied to it: “identity”,
“absolute value”, and “square”. Finally, for each function of the
weights and activations, we consider seven statistics, and for the
layer-wise activity change observable, we only use the mean statistic
due to computational restrictions. This results in a total of 45
continuous valued observable statistics for each layer, though 24
observable statistics are ultimately used for training the classifiers,
since we remove any statistic that has a divergent value during the
course of model training. We also use a ternary indicator of layer
position in the model hierarchy: “early”, “middle”, or “deep”
(represented as a one-hot categorical variable).
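
The count of 45 can be checked with a short sketch that builds one feature per (measure, function, statistic) combination. The seven statistics below are a hypothetical choice made only for illustration, since the text above gives their number but not their names, and the toy input values are likewise made up.

```python
import statistics

# The three functions applied to each measure.
TRANSFORMS = {
    "identity": lambda v: v,
    "absolute value": abs,
    "square": lambda v: v * v,
}

# Hypothetical set of seven summary statistics (assumption, not from the post).
SEVEN_STATS = {
    "mean": statistics.fmean,
    "variance": statistics.pvariance,
    "median": statistics.median,
    "min": min,
    "max": max,
    "first quartile": lambda v: statistics.quantiles(v, n=4)[0],
    "third quartile": lambda v: statistics.quantiles(v, n=4)[2],
}


def layer_observables(weights, activations, activity_change):
    """Observable statistics for one layer at one point in training."""
    features = {}
    for t_name, transform in TRANSFORMS.items():
        # Weights and activations: seven statistics per function.
        for m_name, values in (("weights", weights), ("activations", activations)):
            transformed = [transform(v) for v in values]
            for s_name, stat in SEVEN_STATS.items():
                features[(m_name, t_name, s_name)] = stat(transformed)
        # Layer-wise activity change: mean only.
        changed = [transform(v) for v in activity_change]
        features[("activity change", t_name, "mean")] = statistics.fmean(changed)
    return features


# Toy values standing in for one layer's measurements.
features = layer_observables(
    weights=[0.1, -0.2, 0.3],
    activations=[1.0, 0.0, 2.0],
    activity_change=[0.5, -0.5, 0.25],
)
print(len(features))  # 2 measures x 3 functions x 7 statistics + 3 means
```

Recording these features at each checkpoint of training, for each layer, would give the trajectories that the classifiers consume.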

We Can Separate Learning Rules from Aggregate Statistics of the Weights, Activations, or Layer-wise Activity Changes

Across tasks, different learning rules give rise to perceptible
differences in observable statistics.

Already by eye, one can pick up distinctive differences across the
learning rules for each of the training trajectories of these metrics.
Of course, this is not systematic enough to clearly judge one set of
observables versus another, but provides some initial assurance that
these metrics seem to capture some inherent differences in learning
dynamics across rules.

So these initial observations seem promising, but we want to make this
approach more quantitative. Suppose for each layer we concatenate the
trajectories of each observable and the position in the model hierarchy
that this observable came from. Can we generalize well across held-out
examples?

It turns out that the answer is in fact, yes. Across all classes of
observables, the Random Forest attains the highest test accuracy, and
all observable measures perform similarly under this classifier.

Test set confusion matrices. Random Forest performs the best and differences in learning rate policy
(Adam vs. SGDM) are more difficult to distinguish.

Looking at confusion matrices on the test set, we see that the Random
Forest hardly ever mistakes one learning rule for any of the others. And
when the classifiers do make mistakes, they generally tend to confuse
Adam vs. SGDM more so than IA vs. FA, suggesting that they are able to
pick up more on differences (reflected in the observable statistics) due
to the high-dimensional direction of the gradient tensor than to the magnitude
of the gradient tensor (the latter being directly tied to learning rate
policy).

Adding Back Some Experimental Neuroscience Realism

Up until this point, we have had access to all input types, the full learning trajectory, and noiseless access to all units when making our virtual measurements of ANN observable statistics.
But in a real experiment where someone were to
collect such data from a neural circuit, the situation would be far from
this ideal scenario. We therefore explore experimental realism in
several ways, in order to identify which observable measures are robust
across these scenarios.

Access to only portions of the learning trajectory: subsampling observable trajectories

The results presented thus far were obtained with access to the entire
learning trajectory of each model. Often however, an experimentalist
collects data throughout learning at regularly spaced intervals. We
capture this variability by randomly sampling a fixed number of points
at a fixed temporal spacing for each trajectory, which we refer to as a
“subsample period”.
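
One way to read this procedure is sketched below: pick a random starting point, then take a fixed number of samples spaced one subsample period apart. The random-start interpretation, the function name, and the toy trajectory are assumptions made for illustration.

```python
import random


def subsample_trajectory(trajectory, num_points, period, rng):
    """Take `num_points` samples spaced `period` steps apart, from a random start."""
    span = (num_points - 1) * period
    if span >= len(trajectory):
        raise ValueError("trajectory too short for this many points at this spacing")
    start = rng.randrange(len(trajectory) - span)
    return [trajectory[start + i * period] for i in range(num_points)]


# Toy trajectory: one observable statistic recorded at 100 training checkpoints.
trajectory = list(range(100))
rng = random.Random(0)

dense = subsample_trajectory(trajectory, num_points=5, period=2, rng=rng)
sparse = subsample_trajectory(trajectory, num_points=5, period=20, rng=rng)

print("short subsample period:", dense)
print("long subsample period: ", sparse)
```

Both calls return the same number of points; the longer period simply spreads them over more of the learning trajectory.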

Sparse subsampling across learning trajectory is most robust to
trajectory undersampling.

We find across observable measures that robustness to undersampling of
the trajectory is largely dependent on the subsample period length. As
the subsample period length increases (in the middle and right-most
columns), the Random Forest classification performance increases
compared to the same number of sampled points for a smaller period
(depicted in the left-most column).

Taken together, these results suggest that data consisting of
measurements collected temporally further apart across the learning
trajectory is more robust to undersampling than data collected closer
together in training time. Furthermore, across individual observable
measures, the weights are overall the most robust to undersampling of
the trajectory, but with enough frequency of samples we can achieve
comparable performance with the activations.

Incomplete and noisy measurements: subsampling units and Gaussian noise before collecting observables

The aggregate statistics computed from the observable measures thus far
have operated under the idealistic assumption of noiseless access to
every unit in the model. However, in most datasets, there is a
significant amount of unit undersampling as well as non-zero measurement
noise. How do these two factors affect learning rule identification, and
in particular, how robust to noise and subsampling are particular observable
measures?

Addressing this question would provide insight into the types of
experimental neuroscience paradigms that may be most useful for
identifying learning rules, and predict how certain experimental tools
may fall short for given observables. For instance, optical imaging
techniques can use fluorescent indicators of electrical activities of
neurons to give us simultaneous access to thousands of neurons.
But these techniques can have lower temporal resolution and signal-to-noise than
electrophysiological recordings that more directly measure the
electrical activities of neurons, which in turn may lack the same coverage.

Activations are the most robust to measurement noise and unit
undersampling. Reported here is Random Forest test set accuracy in
separating IA vs. FA, averaged over 10 train/test splits per random
sampling and simulated measurement noise seed.

To account for these tradeoffs, we model measurement noise as an
additive white Gaussian noise process added to units of ResNet-18
trained on the ImageNet and self-supervised SimCLR tasks. We choose IA
vs. FA since the differences between them are conceptually stark: IA
imposes dynamics on the feedback error weights during learning, whereas
FA keeps them fixed. If there are scenarios of measurement noise and
unit subsampling where we are at chance accuracy for this problem (50%),
then it may establish a strong constraint on learning rule
separability more generally.
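As a rough illustration of this measurement model, the sketch below keeps a random subset of units and adds zero-mean Gaussian noise to each kept value. The function name `observe_units`, the toy activations, and the specific noise level are assumptions for illustration, not the settings used in our experiments.

```python
import random


def observe_units(activations, num_units, noise_std, rng=None):
    """Simulate an imperfect recording of one layer's units.

    Keeps `num_units` randomly chosen units (unit subsampling) and adds
    independent zero-mean Gaussian noise with standard deviation
    `noise_std` to each kept value (additive white Gaussian noise).
    """
    rng = rng or random.Random()
    kept = sorted(rng.sample(range(len(activations)), num_units))
    noisy = [activations[i] + rng.gauss(0.0, noise_std) for i in kept]
    return kept, noisy


# Toy layer with 1000 units; record 100 of them with moderate noise.
layer = [0.001 * i for i in range(1000)]
kept, noisy = observe_units(layer, num_units=100, noise_std=0.1, rng=random.Random(0))
```

Aggregate statistics would then be computed from `noisy` rather than from the full, noiseless layer.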

Our results suggest that if one makes experimental measurements by
imaging synaptic strengths, it is still crucial that the optical imaging
readout not be very noisy, since even with the number of units typically
recorded currently (on the order of several hundred to several thousand
synapses), a noisy imaging strategy of synaptic strengths may be
rendered ineffective.

Instead, current electrophysiological techniques that measure the
activities from hundreds of units could form a good set of neural data
to separate learning rules. Recording more units with these techniques
can improve learning rule separability from the activities, but it does
not seem necessary, at least in this setting, to record a majority of
units to perform this separation effectively.


As experimental techniques in neuroscience continue to advance, we will
be able to record data from more neurons with higher temporal
resolution. But even if we had the perfect measurement tools, it is not
clear ahead of time what should be measured in order to identify the
learning rule(s) operative within a given neural circuit, or whether
this is even possible in principle. Our model-based approach
demonstrates that we can identify learning rules solely on the basis of
standard types of experimental neuroscience measurements from the
weights, activations, or layer-wise activity changes, without knowledge
of the architecture or loss target of the learning system.

Additionally, our results suggest the following prescription for the type of
experimental neuroscience data to be collected towards this goal:

Electrophysiological recordings of post-synaptic activities
from a neural circuit on the order of several hundred units, frequently
measured at wider intervals during the course of learning, may provide a
good basis on which to identify learning rules.

We have made our dataset, code, and interactive tutorial
available so that others can analyze these properties without needing to
train neural networks themselves. Our dataset may also be of interest to
researchers theoretically or empirically investigating learning in deep
neural networks. For further details, check out our NeurIPS 2020 paper.


I would like to thank my collaborator Sanjana Srivastava
and advisors Surya Ganguli and Daniel Yamins. I would also like to
thank Jacob Schreiber, Sidd Karamcheti, and Andrey Kurenkov for their
editorial suggestions on this post.

iGibson: A Simulation Environment to Train AI Agents in Large Realistic Scenes

Why simulation for AI?

We are living in a Golden Age of simulation environments in AI and robotics. Looking back ten years, simulation environments were rare, with only a handful of available solutions, and were complex and used only by experts. Today, there are many available simulation environments and most papers in AI and robotics at first tier conferences such as NeurIPS, CoRL or even ICRA and IROS, make some use of them. What has changed?

This extensive use of simulation environments is the result of several trends:

  • First, the increasing role of machine learning in robotics creates a demand for more data (for example, interactive experiences) than what can be generated in real time [1, 2, 3, 4]. Also, the initial data collection process often involves random exploration that may be dangerous for physical robots or their surroundings.
  • Second, simulation environments have matured to be more robust, realistic (visually and physically), user friendly and accessible to all types of users, and the necessary computation to simulate complex physics is reasonably fast on most modern machines. Therefore, simulation environments have the potential to lower the barrier to entry in robotics, even for researchers without the funds to acquire expensive real robot platforms.
  • Finally, the increasing number of robotic solutions to tasks such as grasping, navigation or manipulation has brought more attention to a critical absence in our community: the lack of repeatable benchmarks. Mature sciences are based on experiments that can be easily and reliably replicated, so that different techniques, theories, and solutions can be compared in fair conditions. Simulation environments can help us establish repeatable benchmarks, which are very difficult to achieve with real robots, and these benchmarks can in turn help us understand the status of our field.

Why iGibson?

These ideas motivated us in the Stanford Vision and Learning Lab to develop a simulation environment that can serve as a “playground” to train and test interactive AI agents – an environment we call iGibson [5]. What makes iGibson special? To understand this, let’s first define what a simulation environment is and how it is different from a physics simulator. A physics simulator is an engine capable of computing the physical effect of actions on an environment (e.g. motion of bodies when a force is applied, or flow of liquid particles when being poured). There are many existing physics simulation engines. The best known in robotics are Bullet and its Python extension, PyBullet, MuJoCo, Nvidia PhysX and Flex, UnrealEngine, DART, Unity, and ODE. Given a physical problem (objects, forces, particles, and physics parameters), these engines compute the temporal evolution of the system. On the other hand, a simulation environment is a framework that includes a physics simulator, a renderer of virtual signals, and a set of assets (i.e. models of scenes, objects, and robots) that can be used to create simulations of problems to study and develop solutions for different tasks. The decision on what physics engine to use is based on the type of physical process that dominates the problem, for example rigid body physics or motion of fluids. However, to decide on what simulation environment to use, researchers are guided by the application domain they are interested in, and the research questions they want to explore. With iGibson, we aim to support the study of interactive tasks in large realistic scenes, guided by high quality virtual visual signals.

Comparison to existing simulators

No existing simulation environments support developing solutions for problems involving interactions in large scale scenes like full houses. There are several simulation environments for tasks with stationary arms, such as meta-world, RLBench, RoboSuite or DoorGym, but none of them include large realistic scenes like homes with multiple rooms for tasks that include navigation. For navigation, our previous version, Gibson (v1) and Habitat have proven to be great environments that allow researchers to study visual and language guided navigation. However, the included assets (scenes) are single meshes that cannot change when interactions are applied, like opening doors or moving objects.

Finally, a set of recent simulation environments allow for scene-level interactive tasks, such as Sapien, AI2Thor and ThreeDWorld (TDW). Sapien focuses on interaction with articulated objects (doors, cabinets, and drawers). TDW is a multi-modal simulator with audio, high quality visuals, and simulation of flexible materials and liquids via Nvidia Flex. But neither Sapien nor TDW includes fully interactive scenes aligned with real object distribution and layout as part of the environment. AI2Thor includes fully interactive scenes, but the interactions are scripted: interactable objects are annotated with the possible actions they can receive. When the agent is close enough to an object and the object is in the right state (precondition), the agent can select a predefined action, and the object is “transitioned” to the next state (postcondition). RoboThor, an alternative version of AI2Thor, enables continuous interactions but focuses on navigation. It provides only limited sensory signals (RGB-D images) to the agent, which is always embodied as a locobot, a low-cost platform with limited interaction capabilities. Here at SVL, we want to study complex, long-horizon mobile manipulation tasks such as tidying a house or searching for objects, which requires access to fully interactive realistic large-scale scenes.
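To illustrate what "scripted" interaction means here, the toy sketch below models an object whose state changes only through annotated actions with a precondition and a postcondition. It is not AI2Thor's API: the class `ScriptedObject`, its fields, and the `max_reach` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScriptedObject:
    """A toy interactable object with annotated, predefined actions."""

    name: str
    state: str
    transitions: dict  # action -> (required state, resulting state)

    def apply(self, action, agent_distance, max_reach=1.0):
        """Apply `action` if the agent is close enough and the precondition holds."""
        if action not in self.transitions:
            return False
        precondition, postcondition = self.transitions[action]
        if agent_distance > max_reach or self.state != precondition:
            return False
        self.state = postcondition  # discrete jump, no physics simulated
        return True


door = ScriptedObject(
    name="door",
    state="closed",
    transitions={"open": ("closed", "open"), "close": ("open", "closed")},
)
opened = door.apply("open", agent_distance=0.5)
```

The state jumps directly from one label to the next, so nothing about forces, contacts, or partial opening is modeled. Fully physical interaction, by contrast, has to simulate those effects.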

iGibson’s new features

The main focus of iGibson is interactivity: enabling realistic interactions in large scenes. For that, we have included several key features:

  • Fifteen fully interactive visually realistic scenes representing real world homes with furniture and articulated object models annotated with materials and dynamics properties.
  • Capabilities to import models from CubiCasa5K [6] and 3D-Front [7], giving access to more than 8000 additional interactive home scenes.
  • Realistic virtual sensor signals, including high quality RGB images from a physics-based renderer, depth maps, 1-beam and 16-beam virtual LiDAR signals, semantic/instance/material segmentation, optical and scene flow, and surface normals.
  • Domain randomization for visual texture, dynamics properties and object instances for endless variations of scenes.
  • Human-computer interface for humans to provide demonstrations of fully physical interactions with the scenes.
  • Integration with sampling-based motion planners to facilitate motion of robotic bases (navigation in 2D layout) and arms (interaction in 3D space).

Using iGibson for robot learning

These novel features in iGibson allow us to study and develop solutions for new interactive tasks in large environments. One of these new problems is Interactive Navigation, where the agents need to interact with the environment to change its configuration, for example, to open doors or push obstacles away. This is a common type of navigation in our homes and offices, but non-interactive simulation environments cannot be used to study it. In iGibson we have developed hierarchical reinforcement learning solutions for interactive navigation that decide explicitly what part of the body to use in the next phase of the task: the arm (for interactions), the base (for navigation) or the combination of both [8]. We also propose a new learning solution for interactive navigation that integrates a motion planner: the learning algorithm decides on the next point to interact, and the motion planner finds a collision free path to that point of interaction [9]. But these are just the tip of the iceberg: many of SVL’s projects are leveraging iGibson to study a wide variety of interactive tasks in large realistic scenes.


Simulation environments have the potential to support researchers in their study of robotics and embodied AI problems. With iGibson, SVL contributes to the community with an open source, fully academically developed simulation environment for interactive tasks in large realistic scenes. If you want to start using it, visit our website and download – setup should be straightforward, and we’re happy to answer any questions about getting the simulator up and running for your research! We hope we can facilitate new avenues of research in robotics and AI.

  1. Andrychowicz, Marcin, et al. “Learning dexterous in-hand manipulation.” The International Journal of Robotics Research 39.1 (2020): 3-20. 

  2. Rajeswaran, Aravind, et al. “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.” Robotics: Science and Systems, 2017 

  3. Peng, Xue Bin, et al. “Sfv: Reinforcement learning of physical skills from videos.” ACM Transactions on Graphics (TOG) 37.6 (2018): 1-14. 

  4. Zhu, Yuke, et al. “robosuite: A modular simulation framework and benchmark for robot learning.” arXiv preprint arXiv:2009.12293 (2020). 

  5. A note on Gibson – Our simulation environment takes the name from James J. Gibson [1904-1979]. Gibson was an influential psychologist and cognitive scientist with, at the time, disruptive ideas. He pushed forward a new concept of perception to be considered 1) an ecological process that cannot and should not be studied in isolation from the environment, and 2) an active process that needs agency and interactivity. This was in contrast to the predominant view of the time of perception to be a passive process where signals “arrive” and “are processed” by the brain. Instead, he argued that agents seek for information, interacting and revealing it. He also coined the term “affordance” as the opportunity the environment offers to an agent to perform a task. This is a quote from a colleague summarizing his research that directly connects to the guiding principle behind our work in the iGibson team: “ask not what’s inside your head, but what your head is inside of”. 

  6. Kalervo, Ahti, et al. “Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis.” Scandinavian Conference on Image Analysis. Springer, Cham, 2019. 

  7. Fu, Huan, et al. “3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics.” arXiv preprint arXiv:2011.09127 (2020). 

  8. Li, Chengshu, et al. “Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators.” Conference on Robot Learning. PMLR, 2020. 

  9. Xia, Fei, et al. “Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation.” arXiv preprint arXiv:2008.07792 (2020). 

Stanford AI Lab Papers and Talks at NeurIPS 2020

The Neural Information Processing Systems (NeurIPS) 2020 conference is being hosted virtually from Dec 6th – Dec 12th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration

Authors: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill


Keywords: reinforcement learning, function approximation, exploration

Acceleration with a Ball Optimization Oracle

Authors: Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, Kevin Tian


Award nominations: Oral presentation

Links: Paper

Keywords: convex optimization, local search, trust region methods

BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits

Authors: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony


Links: Paper | Video

Keywords: clustering, k-means, k-medoids, multi-armed bandits

CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations

Authors: Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, Leonidas J. Guibas


Links: Paper | Video | Website

Keywords: 3d vision, dynamic point clouds, representation learning

Compositional Explanations of Neurons

Authors: Jesse Mu, Jacob Andreas


Award nominations: Oral presentation

Links: Paper

Keywords: interpretability, explanation, deep learning, computer vision, natural language processing, adversarial examples

Continuous Meta-Learning without Tasks

Authors: James Harrison, Apoorva Sharma, Chelsea Finn, Marco Pavone


Links: Paper

Keywords: meta-learning, continuous learning, changepoint detection

Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

Authors: Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli


Links: Paper

Keywords: loss landscape, neural tangent kernel, linearization, taylorization, basin, nonlinear advantage

Diversity can be Transferred: Output Diversification for White- and Black-box Attacks

Authors: Yusuke Tashiro, Yang Song, Stefano Ermon


Links: Paper | Website

Keywords: adversarial examples, deep learning, robustness

Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders

Authors: Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, and Marco Pavone


Links: Paper | Website

Keywords: sparse distributions, generative models, discrete latent spaces, behavior prediction, image generation

Federated Accelerated Stochastic Gradient Descent

Authors: Honglin Yuan, Tengyu Ma


Award nominations: Best Paper Award of Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2020 (FL-ICML’20)

Links: Paper | Website

Keywords: federated learning, local sgd, acceleration, fedac

Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics

Authors: Alex Michael Tseng, Avanti Shrikumar, Anshul Kundaje


Links: Paper | Website

Keywords: deep learning, interpretability, attribution prior, computational biology, genomics

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Authors: Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Re


Links: Paper | Video | Website

Keywords: hierarchical clustering, hyperbolic embeddings

FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply

Authors: Lingjiao Chen, Matei Zaharia, James Zou


Links: Paper | Blog Post | Website

Keywords: machine learning as a service, ensemble learning, meta learning, systems for machine learning

Generative 3D Part Assembly via Dynamic Graph Learning

Authors: Jialei Huang, Guanqi Zhan, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas Guibas, Hao Dong


Links: Paper

Keywords: 3d part assembly, dynamic graph learning

Generative 3D Part Assembly via Dynamic Graph Learning

Authors: Jialei Huang*, Guanqi Zhan*, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas J. Guibas, Hao Dong


Links: Paper | Website

Keywords: 3d part assembly, graph neural network

Gradient Surgery for Multi-Task Learning

Authors: Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn


Links: Paper | Website

Keywords: multi-task learning, deep reinforcement learning

HiPPO: Recurrent Memory with Optimal Polynomial Projections

Authors: Albert Gu*, Tri Dao*, Stefano Ermon, Atri Rudra, Chris Ré


Links: Paper | Blog Post

Keywords: representation learning, time series, recurrent neural networks, lstm, orthogonal polynomials

Identifying Learning Rules From Neural Network Observables

Authors: Aran Nayebi, Sanjana Srivastava, Surya Ganguli, Daniel L.K. Yamins


Award nominations: Spotlight Presentation

Links: Paper | Website

Keywords: computational neuroscience, learning rule, deep networks

Improved Techniques for Training Score-Based Generative Models

Authors: Yang Song, Stefano Ermon


Links: Paper

Keywords: score-based generative modeling, score matching, deep generative models

Language Through a Prism: A Spectral Approach for Multiscale Language Representations

Authors: Alex Tamkin, Dan Jurafsky, Noah Goodman


Links: Paper

Keywords: bert, signal processing, self-supervised learning, interpretability, multiscale

Large-Scale Methods for Distributionally Robust Optimization

Authors: Daniel Levy, Yair Carmon, John Duchi, Aaron Sidford


Links: Paper

Keywords: robustness, dro, optimization, large-scale, optimal

Learning Physical Graph Representations from Visual Scenes

Authors: Daniel Bear, Chaofei Fan, Damian Mrowca, Yunzhu Li, Seth Alter, Aran Nayebi, Jeremy Schwartz, Li F. Fei-Fei, Jiajun Wu, Josh Tenenbaum, Daniel L. Yamins


Links: Paper | Blog Post | Website

Keywords: structure learning, graph learning, visual scene representations, unsupervised learning, unsupervised segmentation, object-centric representation, intuitive physics

MOPO: Model-based Offline Policy Optimization

Authors: Tianhe Yu*, Garrett Thomas*, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma


Links: Paper | Website

Keywords: offline reinforcement learning, model-based reinforcement learning

MOPO: Model-based Offline Policy Optimization

Authors: Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma


Links: Paper

Keywords: model-based rl, offline rl, batch rl

Measuring Robustness to Natural Distribution Shifts in Image Classification

Authors: Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, Ludwig Schmidt


Award nominations: Spotlight

Links: Paper | Website

Keywords: machine learning, robustness, image classification

Minibatch Stochastic Approximate Proximal Point Methods

Authors: Hilal Asi, Karan Chadha, Gary Cheng, John Duchi


Award nominations: Spotlight talk

Links: Paper

Keywords: stochastic optimization, sgd, aprox

Model-based Adversarial Meta-Reinforcement Learning

Authors: Zichuan Lin, Garrett Thomas, Guangwen Yang, Tengyu Ma


Links: Paper

Keywords: model-based rl, meta-rl, minimax

Multi-Plane Program Induction with 3D Box Priors

Authors: Yikai Li, Jiayuan Mao, Xiuming Zhang, William T. Freeman, Joshua B. Tenenbaum, Noah Snavely, Jiajun Wu


Links: Paper | Video | Website

Keywords: visual program induction, 3d vision, image editing

Multi-label Contrastive Predictive Coding

Authors: Jiaming Song, Stefano Ermon


Links: Paper

Keywords: representation learning, mutual information

Neuron Shapley: Discovering the Responsible Neurons

Authors: Amirata Ghorbani, James Zou


Links: Paper

Keywords: interpretability, deep learning, shapley value

No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems

Authors: Nimit Sharad Sohoni, Jared Alexander Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré


Links: Paper | Blog Post | Video

Keywords: classification, robustness, clustering, neural feature representations

Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding

Authors: Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky, Emma Brunskill


Links: Paper

Keywords: off-policy policy evaluation, unobserved confounding, reinforcement learning

One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL

Authors: Saurabh Kumar, Aviral Kumar, Sergey Levine, Chelsea Finn


Links: Paper

Keywords: robustness, diversity, reinforcement learning

Point process models for sequence detection in high-dimensional neural spike trains

Authors: Alex H. Williams, Anthony Degleris, Yixin Wang, Scott W. Linderman


Award nominations: Selected for Oral Presentation

Links: Paper | Website

Keywords: bayesian nonparametrics, unsupervised learning

Predictive coding in balanced neural networks with noise, chaos and delays

Authors: Jonathan Kadmon, Jonathan Timcheck, Surya Ganguli


Links: Paper

Keywords: neuroscience, predictive coding, chaos

Probabilistic Circuits for Variational Inference in Discrete Graphical Models

Authors: Andy Shih, Stefano Ermon


Links: Paper

Keywords: variational inference, discrete, high-dimensions, sum product networks, probabilistic circuits, graphical models

Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration

Authors: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill


Links: Paper

Keywords: reinforcement learning, off-policy, batch reinforcement learning

Pruning neural networks without any data by iteratively conserving synaptic flow

Authors: Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli


Links: Paper | Video | Website

Keywords: network pruning, sparse initialization, lottery ticket

Robust Sub-Gaussian Principal Component Analysis and Width-Independent Schatten Packing

Authors: Arun Jambulapati, Jerry Li, Kevin Tian


Award nominations: Spotlight presentation

Links: Paper

Keywords: robust statistics, principal component analysis, positive semidefinite programming

Self-training Avoids Using Spurious Features Under Domain Shift

Authors: Yining Chen*, Colin Wei*, Ananya Kumar, Tengyu Ma (*equal contribution)


Links: Paper

Keywords: self-training, pseudo-labeling, domain shift, robustness

Wasserstein Distances for Stereo Disparity Estimation

Authors: Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao


Award nominations: Spotlight

Links: Paper | Video | Website

Keywords: depth estimation, disparity estimation, autonomous driving, 3d object detection, statistical learning

We look forward to seeing you at NeurIPS 2020!

Learning from Language Explanations

Imagine you’re a machine learning practitioner and you want to solve some classification problem, like classifying groups of colored squares as being either 1s or 0s. Here’s what you would typically do: collect a large dataset of examples, label the data, and train a classifier:

But humans don’t learn like this. We have a very powerful and intuitive mechanism for communicating information about the world – language!

With just the phrase at least 2 red squares, we’ve summarized the entire dataset presented above in a much more efficient manner.

Language is a crucial medium for human learning: we use it to convey beliefs about the world, teach others, and describe things that are hard to experience directly. Thus, language ought to be a simple and effective way to supervise machine learning models. Yet past approaches to learning from language have struggled to scale up to the general tasks targeted by modern deep learning systems and the freeform language explanations used in these domains. In two short papers presented at ACL 2020 this year, we use deep neural models to learn from language explanations to help tackle a variety of challenging tasks in natural language processing (NLP) and computer vision.

What’s the challenge?

Given that language is such an intuitive interface for humans to teach others,
why is it so hard to use language for machine learning?

The principal challenge is the grounding problem: understanding language
explanations in the context of other inputs. Building models that can
understand rich and ambiguous language is tricky enough, but building models
that can relate language to the surrounding world is even more challenging. For
instance, given the explanation at least two red squares, a model must not
only understand the terms red and square, but also how they refer to
particular parts of (often complex) inputs.

Past work (1, 3) has relied on semantic parsers to
convert natural language statements (e.g. at least two red squares) to formal
logical representations (e.g. Count(Square AND Red) >= 2). If we can easily
check whether explanations apply to our inputs by executing these logical
formulas, we can use our explanations as features to train our model.
However, semantic parsers only work on simple domains
where we can hand-engineer a logical grammar of explanations we might expect to
see. They struggle to handle richer and vaguer language or scale up to more
complex inputs, such as images.
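To see what executing a logical form looks like, here is a small sketch of the "at least two red squares" explanation as a hand-written program. The dictionary representation of objects and the helper names (`count`, `is_square`, `is_red`) are assumptions for illustration. A real semantic parser would produce such a program automatically from the sentence.

```python
def count(objects, *predicates):
    """Count the objects that satisfy every predicate (a logical AND)."""
    return sum(all(pred(obj) for pred in predicates) for obj in objects)


def is_square(obj):
    return obj["shape"] == "square"


def is_red(obj):
    return obj["color"] == "red"


def at_least_two_red_squares(objects):
    """Executable form of the explanation "at least two red squares"."""
    return count(objects, is_square, is_red) >= 2


example = [
    {"shape": "square", "color": "red"},
    {"shape": "square", "color": "blue"},
    {"shape": "square", "color": "red"},
]
label = int(at_least_two_red_squares(example))
```

This works only because the input is already a clean list of symbolic objects. On raw images or free-form text, every predicate would itself need to be specified or learned, which is where hand-engineered grammars stop scaling.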

Fortunately, modern deep neural language models such as
BERT are beginning to show promise at
solving many language understanding tasks. Our papers propose to alleviate the
grounding problem by using neural language models that are either trained to
ground language explanations in the domain of interest, or come pre-trained
with general-purpose “knowledge” that can be used to interpret explanations. We
will show that these neural models allow us to learn from richer and more
diverse language for more challenging settings.

Representation Engineering with Natural Language Explanations

In our first paper, we examine how to build text classifiers with language
explanations. Consider the task of relation extraction, where we are given a
short paragraph and must identify whether two people mentioned in the
paragraph are married. While state-of-the-art NLP models can likely solve
this task from data alone, humans might use language to describe ways to tell
whether two people are married—for example, people who go on honeymoons are
typically married. Can such language explanations be used to train better
classifiers?

In the same way that we might take an input and extract features (e.g.
the presence of certain words) to train a model, we can use explanations to
provide additional features. For example, knowing that honeymoons are relevant
for this task, if we can create a honeymoon feature that reliably activates
whenever the two people in a paragraph are described as going on a honeymoon,
this should be a useful signal for training a better model.

But creating such features requires some sort of explanation interpretation
mechanism that tells us whether an explanation is true for an input. Semantic
parsers are one such tool: given an explanation stating that the two people
went on honeymoon, we could
parse this explanation into a logical form which, when run on an input,
produces 1 if the word honeymoon appears between the two people's names. But what about
a vaguer explanation, such as one stating that the two people are in love? How can we parse this?
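The indicator feature described above can be sketched in a few lines. The function name `word_between`, the whitespace tokenization, and the example sentences are all hypothetical, chosen only to show how literal such a rule is.

```python
def word_between(tokens, person_a, person_b, word):
    """Return 1 if `word` occurs strictly between the two mentions, else 0."""
    if person_a not in tokens or person_b not in tokens:
        return 0
    first, second = sorted((tokens.index(person_a), tokens.index(person_b)))
    return int(word in tokens[first + 1:second])


hit = word_between(
    "Alice left for her honeymoon with Bob".split(), "Alice", "Bob", "honeymoon"
)
miss = word_between(
    "Alice and Bob went on a honeymoon".split(), "Alice", "Bob", "honeymoon"
)
```

The second sentence clearly describes a honeymoon, yet the feature stays at 0 because the word falls after both names. An explanation about being in love has no keyword to anchor on at all.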

While semantic parsing is efficient and accurate in small domains, it can be
overly brittle, as it can only interpret explanations which adhere to a fixed
set of grammatical rules and functions that we must specify in advance (e.g.
contains and extract_text).
Instead, we turn to the soft reasoning
capabilities of BERT, a neural language model. BERT is particularly effective
at the task of textual entailment: determining whether a sentence implies or
contradicts another sentence (e.g. does She ate pizza imply that She ate
food? Yes!). In our proposed ExpBERT model, we take a BERT model
trained for textual entailment, and instead ask it to identify whether a
paragraph in our task entails an explanation. The features produced by BERT
during this process replace the indicator features produced by the semantic
parser above.

Does the soft reasoning power of BERT improve over semantic parsing? On the
marriage identification task, we find that ExpBERT leads to substantial
improvements over a classifier that is trained on the input features only (No
Explanations). Importantly, using a semantic parser to try to parse
explanations doesn’t help much, since there are general explanations
that are difficult to convert to logical forms.

In the full paper, we compare to more baselines, explore larger relation
extraction tasks (e.g. TACRED),
conduct ablation studies to understand what kinds of explanations are
important, and examine how much more efficient explanations are compared to
additional data.

Shaping Visual Representations with Language

The work we’ve just described uses natural language explanations for a single
task like marriage identification. However, work in cognitive science
suggests that
language also equips us with the right features and abstractions that help us
solve future tasks.
For example, explanations that indicate whether one person is married to another
also highlight other concepts that are crucial to human relationships:
children, daughters, honeymoons, and more. Knowing these additional
concepts is not just useful for identifying married people; it is also
important if we would later like to identify other relationships
(e.g. siblings, mother, father).

In machine learning, we might ask: how can language point out the right
features for challenging and underspecified domains, if we
ultimately wish to solve new tasks where no language is available? In our
second paper, we explore this setting,
additionally increasing the challenge by seeing whether language can improve
the learning of representations across modalities—here, vision.

We’re specifically interested in few-shot visual reasoning tasks like the following (here, from the ShapeWorld dataset):

Given a small training set of examples of a visual concept, the task is to
determine whether a held-out test image expresses the same concept. Now, what
if we assume access to language explanations of the relevant visual concepts at
training time? Can we use these to learn a better model, even if no language
is available at test time?

We frame this as a meta-learning task:
instead of training and testing a model on a single task, we
train a model on a set of tasks, each with a small training set and
an accompanying language description (the meta-train set). We then test
generalization to a meta-test set of unseen tasks, for which no language is
available.
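
As a rough illustration of this setup, the sketch below represents each task as a small record. The `Task` class, its field names, the file names, and the descriptions are all hypothetical and chosen only for this example; the point is that language is present for meta-train tasks and absent for meta-test tasks.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Task:
    support: List[str]          # the few training images for one visual concept
    query: str                  # held-out image to classify
    label: bool                 # does the query express the same concept?
    description: Optional[str]  # language explanation, or None if unavailable

# Meta-train tasks come with a language description of the concept.
meta_train = [
    Task(["a1.png", "a2.png"], "a3.png", True, "a red square above a circle"),
    Task(["b1.png", "b2.png"], "b3.png", False, "a blue triangle"),
]

# Meta-test tasks are unseen concepts and carry no language.
meta_test = [Task(["c1.png", "c2.png"], "c3.png", True, None)]
```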

First, let’s look at how we might solve this task without language. One typical
approach is Prototypical Networks, where we learn some model
(here, a deep convolutional neural network)
that embeds the training images, averages them, and compares to an embedding of
the test image:
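
The embed, average, and compare steps can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than our implementation: `embed` is a hypothetical one-layer stand-in for the convolutional network, images are flat vectors, and similarity is a dot product passed through a sigmoid.

```python
import numpy as np

def embed(images, W):
    """Hypothetical stand-in for the image model: a linear map followed by ReLU."""
    return np.maximum(images @ W, 0.0)

def prototype_score(support, query, W):
    """Score how well a query image matches the concept shown by the support set."""
    prototype = embed(support, W).mean(axis=0)  # average the training embeddings
    q = embed(query[None, :], W)[0]             # embed the single test image
    return float(prototype @ q)                 # higher means more similar

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))        # toy model parameters
support = rng.normal(size=(4, 8))  # four training images of one concept
query = rng.normal(size=8)         # one held-out test image

score = prototype_score(support, query, W)
prob = 1.0 / (1.0 + np.exp(-score))  # probability that the query shares the concept
```

In training, the parameters `W` would be updated so that `prob` matches the true label across many tasks.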

To use language, we propose a simple approach called Language Shaped Learning
(LSL): if we have access to explanations at training time, we encourage the
model to learn representations that are not only helpful for classification,
but are predictive of the language explanations. We do this by introducing an
auxiliary training objective (i.e. it is not related to the ultimate task of
interest), where we simultaneously train a recurrent neural network (RNN)
decoder to predict the explanation(s) from the representation of the
input images. Crucially, training this decoder depends on the
parameters of our image model, so this process should encourage the image
model to better encode the features and abstractions exposed in language.

In effect, we are training the model to “think out loud” when representing
concepts at training time. At test time, we simply discard the RNN decoder, and
do classification as normal with the “language-shaped” image embeddings.
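
One way to picture the combined objective is as a classification loss plus a weighted language loss. The sketch below is a simplified illustration: the function names, the weight `lam`, and the use of per-token log-probabilities as input are assumptions made for this example, and the decoder itself is not modeled.

```python
import numpy as np

def classification_loss(logit, label):
    """Binary cross-entropy for a single query prediction."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return float(-(label * np.log(p) + (1 - label) * np.log(1.0 - p)))

def language_loss(token_log_probs):
    """Negative log-likelihood of the explanation tokens under the decoder."""
    return float(-np.sum(token_log_probs))

def lsl_loss(logit, label, token_log_probs, lam=1.0):
    """Training objective: classification loss plus weighted auxiliary language loss."""
    return classification_loss(logit, label) + lam * language_loss(token_log_probs)

# Toy values: an uncertain classifier and a decoder giving each token probability 0.5.
log_probs = np.log([0.5, 0.5])
train_loss = lsl_loss(0.0, 1, log_probs, lam=1.0)  # used at training time
test_loss = classification_loss(0.0, 1)            # the language term is dropped at test time
```

Because the language term is computed from the image representation, its gradient also flows into the image model, which is what shapes the embeddings.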

We apply this model to both the ShapeWorld dataset described above, and a more
realistic Birds
dataset, with real images and human language:

In both cases, this auxiliary training objective improves performance over both a
no-explanation baseline (Meta) and Learning with Latent Language
(L3), a similar model proposed
for this setting that uses language as a discrete bottleneck (see the paper for
details).
In the full paper, we also explore which parts of language are most important
(spoiler: a little bit of everything), and how much language is needed for
LSL to improve over models that don’t use language (spoiler: surprisingly little!).

Moving Forward

As NLP systems grow in their ability to understand and produce language, so too
grows the potential for machine learning systems to learn from language to
solve other challenging tasks. In the papers above, we’ve shown that deep
neural language models can be used to successfully learn from language
explanations to improve generalization across a variety of tasks in vision and
NLP.

We think this is an exciting new avenue for training machine learning models,
and similar ideas are already being explored in areas such as reinforcement
learning (4,
5). We envision a future where in order to
solve a machine learning task, we no longer have to collect a large labeled
dataset, but instead interact naturally and expressively with a model in the
same way that humans have interacted with each other for millennia: through
language.


Thanks to our coauthors (Pang Wei Koh, Percy Liang, and Noah Goodman), and to
Nelson Liu, Pang Wei Koh, and the rest of the SAIL blog team for reviewing and
publishing this blog post. This research was supported in part by the Facebook
Fellowship (to Pang Wei Koh), the NSF Graduate Research Fellowship (to Jesse Mu), Toyota Research
Institute, and the Office of Naval Research.


Stanford AI Lab Papers and Talks at CoRL 2020

The Conference on Robot Learning (CoRL) 2020 is being hosted virtually from November 16th – November 18th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Learning 3D Dynamic Scene Representations for Robot Manipulation

Authors: Zhenjia Xu, Zhanpeng He, Jiajun Wu, Shuran Song


Links: Paper | Video | Website

Keywords: scene representations, 3d perception, robot manipulation

Learning Latent Representations to Influence Multi-Agent Interaction

Authors: Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh


Links: Paper | Blog Post | Website

Keywords: multi-agent systems, human-robot interaction, reinforcement learning

Learning Object-conditioned Exploration using Distributed Soft Actor Critic

Authors: Ayzaan Wahid (Google), Austin Stone (Google), Brian Ichter (Google Brain), Kevin Chen (Stanford), Alexander Toshev (Google)


Links: Paper

Keywords: object navigation, visual navigation

MATS: An Interpretable Trajectory Forecasting Representation for Planning and Control

Authors: Boris Ivanovic, Amine Elhafsi, Guy Rosman, Adrien Gaidon, Marco Pavone


Links: Paper | Video

Keywords: trajectory forecasting, learning dynamical systems, motion planning, autonomous vehicles

Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous

Authors: Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust


Links: Paper | Video | Website

Keywords: multiagent systems; model-based reinforcement learning

Reinforcement Learning with Videos: Combining Offline Observations with Interaction

Authors: Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn


Links: Paper | Website

Keywords: reinforcement learning, learning from observation

Sampling-based Reachability Analysis: A Random Set Theory Approach with Adversarial Sampling

Authors: Thomas Lew, Marco Pavone


Links: Paper

Keywords: reachability analysis, robust planning and control, neural networks


Walking the Boundary of Learning and Interaction (Dorsa Sadigh)

Overview: There have been significant advances in the field of robot learning in the past decade. However, many challenges still remain when considering how robot learning can advance interactive agents such as robots that collaborate with humans. This includes autonomous vehicles that interact with human-driven vehicles or pedestrians, service robots collaborating with their users at homes over short or long periods of time, or assistive robots helping patients with disabilities. This introduces an opportunity for developing new robot learning algorithms that can help advance interactive autonomy.

In this talk, I will discuss a formalism for human-robot interaction built upon ideas from representation learning. Specifically, I will first discuss the notion of latent strategies: low-dimensional representations sufficient for capturing non-stationary interactions. I will then talk about the challenges of learning such representations when interacting with humans, and how we can develop data-efficient techniques that enable actively learning computational models of human behavior from demonstrations, preferences, or physical corrections. Finally, I will introduce an intuitive control paradigm that enables seamless collaboration based on learned representations, and discuss how it can be used to further influence humans.

Live Event: November 17th, 7:00AM – 7:45AM PST

We look forward to seeing you at CoRL!
