Deep Double Descent


We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.


Many classes of modern deep learning models, including CNNs, ResNets, and transformers, exhibit the previously-observed double descent phenomenon when not using early stopping or regularization. The peak occurs predictably at a “critical regime,” where the models are barely able to fit the training set. As we increase the number of parameters in a neural network, the test error initially decreases, increases, and, just as the model is able to fit the train set, undergoes a second descent.

Neither the classical statisticians' conventional wisdom that too-large models are worse nor the modern ML paradigm that bigger models are better holds universally. We find that double descent also occurs over train epochs. Surprisingly, we show these phenomena can lead to a regime where more data hurts: training a deep network on a larger train set can actually perform worse.

Model-wise double descent

1. There is a regime where bigger models are worse.

[Figure: test error vs. model size; the peak occurs at the interpolation threshold, where models can just barely fit the train set]

The model-wise double descent phenomenon can lead to a regime where training on more data hurts. In the chart above, the peak in test error occurs around the interpolation threshold, when the models are just barely large enough to fit the train set.

In all cases we’ve observed, changes which affect the interpolation threshold (such as changing the optimization algorithm, the number of train samples, or the amount of label noise) also shift the location of the test error peak correspondingly. The double descent phenomenon is most prominent in settings with added label noise; without it, the peak is smaller and easy to miss. Adding label noise amplifies this general behavior and lets us investigate it easily.
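
Although our experiments use deep networks, the same qualitative curve is easy to reproduce in a toy setting. The following numpy sketch is an illustration, not our experimental code: it sweeps the width of a minimum-norm random-features regression on noisy labels, and test error should fall, spike near the interpolation threshold (width ≈ number of train points), and fall again.

import numpy as np

rng = np.random.default_rng(0)
n_train, d = 40, 1                                # tiny 1-D regression task
x_tr = rng.uniform(-1, 1, (n_train, d))
y_tr = np.sin(3 * x_tr).ravel() + 0.3 * rng.normal(size=n_train)  # noisy labels
x_te = rng.uniform(-1, 1, (500, d))
y_te = np.sin(3 * x_te).ravel()

# Random Fourier features; the first p columns define a model of "width" p.
W = 3 * rng.normal(size=(200, d))
b = rng.uniform(0, 2 * np.pi, 200)
def features(x, p):
    return np.cos(x @ W[:p].T + b[:p])

for p in [5, 10, 20, 40, 80, 160]:                # sweep past p = n_train
    beta = np.linalg.pinv(features(x_tr, p)) @ y_tr  # minimum-norm least squares
    mse = np.mean((features(x_te, p) @ beta - y_te) ** 2)
    print(f"width {p:4d}: test MSE {mse:.3f}")    # expect a peak near p = 40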

Sample-wise non-monotonicity

2. There is a regime where more samples hurt.

[Figure: test error vs. model size for transformers trained on varying numbers of samples]

The above chart shows transformers trained on a language-translation task with no added label noise. As expected, increasing the number of samples shifts the curve downwards towards lower test error. However, since more samples require larger models to fit, increasing the number of samples also shifts the interpolation threshold (and peak in test error) to the right.

For intermediate model sizes, these two effects combine, and we see that training on 4.5x more samples actually hurts test performance.

Epoch-wise double descent

3. There is a regime where training longer reverses overfitting.

[Figures: train and test error as a function of model size (x-axis) and number of optimization steps (y-axis)]

The charts above show test and train error as a function of both model size and number of optimization steps. For a given number of optimization steps (fixed y-coordinate), test and train error exhibit model-wise double descent. For a given model size (fixed x-coordinate), as training proceeds, test and train error decrease, increase, and decrease again; we call this phenomenon epoch-wise double descent.

In general, the peak of test error appears systematically when models are just barely able to fit the train set.

Our intuition is that, for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or misspecified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set and there exist such good models. Moreover, the implicit bias of stochastic gradient descent (SGD) leads it to such good models, for reasons we don’t yet understand.

We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.


Acknowledgments

Thanks to Mikhail Belkin and Chris Olah for helpful discussions and feedback throughout this work. An expanded version of this post can also be found on Boaz Barak’s blog, Windows on Theory.


Text Feature Selection for Causal Inference

Making Causal Inferences with Text

Identifying the linguistic features that cause people to act a certain way after reading a text, regardless of confounding variables, is something people do all the time without even realizing it. For example,

  • Consider university course catalogues. Students peruse these each semester before signing up. What’s the magic 200-word blurb that resonates with students enough to make them sign up? What kind of writing style recommendations could you give to any professor, regarding any subject?
  • Consider crowdfunding campaigns [1]. We want to know which writing styles pull in the most money, but the effect of language is confounded by the subject of the campaign – a campaign for someone’s medical bills will be written differently than a campaign for building wells. We want to find writing styles that could help any campaign.
  • Consider comments on reddit, where each post has a popularity score. Say that we’re interested in finding what writing styles will help posts become popular. Some authors list their genders on reddit, and a user’s gender may also affect popularity through tone, style, or topic choices [2]. How do you decide what kind of language to recommend to any person, regardless of their gender?

Across three papers, we develop adversarial learning-based approaches for these kinds of tasks as well as a theory of causal inference to formalize the relationship between text and causality. Our method involves:

  1. Training a model which predicts outcomes from text. We control for confounds with adversarial learning [3], [4] or residualization [5].

  2. Interpreting the model’s learned parameters to identify the words and phrases that are most important for the outcome, regardless of confounders.

Compared to other feature selection methods, ours picks features that are more predictive of the outcome and less affected by confounding variables across four domains: e-commerce product descriptions (predictive of sales, regardless of brand), search advertisements (predictive of click-through rate, regardless of landing page), university course descriptions (predictive of enrollment, regardless of subject), and financial complaints (predictive of a short response time, regardless of topic).

Formalizing Textual Causality

Our goal is to find features of text(s) T which are predictive of some desired target variable(s) Y but unrelated to confounding variable(s) C. This is equivalent to picking a lexicon L such that, when the words in T belonging to L are selected, the resulting set L(T) can explain Y but not C.

In the paper, we formalize this intuitive goal as maximizing an informativeness coefficient

I(L) = E[ Var( E[Y | L(T), C] | C ) ]

which measures the explanatory power of the lexicon L(T) beyond the information already contained in the confounders C. The inner expectation, E[Y | L(T), C], tells us how much variation in Y is explainable by both L(T) and C; taking its variance conditional on C fixes C, letting us focus on L(T)’s unique effects. In our paper, we show that under some conditions this coefficient is equivalent to the strength of T’s causal effects on Y! [6]

In practice I(L) can be estimated by this sequence of steps:

  1. Training a classifier A that predicts Y from L(T) and C.
  2. Training a classifier B that predicts Y from C.
  3. Measuring error(B) − error(A), as sketched below.
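
Here is a minimal sketch of that estimation using scikit-learn, where L_T (counts of lexicon words), C (encoded confounds), and y (outcomes) are toy arrays; the names are illustrative and not the released package’s API.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def informativeness(L_T, C, y):
    """Estimate I(L) as error(B) - error(A): how much adding the lexicon
    features L(T) improves predictions of Y beyond the confounds C."""
    X_A = np.hstack([L_T, C])     # classifier A sees L(T) and C
    X_B = C                       # classifier B sees only C
    tr, te = train_test_split(np.arange(len(y)), test_size=0.3, random_state=0)
    errors = {}
    for name, X in (("A", X_A), ("B", X_B)):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        errors[name] = 1.0 - clf.score(X[te], y[te])
    return errors["B"] - errors["A"]  # larger value => more informative lexicon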

We continue by introducing two methods for coming up with the best lexicon L(T).

Method 1: Adversarial Learning

First, we encode T into a vector e via an attentional bi-LSTM. We then feed e into a series of feedforward neural networks which are trained to predict each target and confounding variable using a cross-entropy loss function. As gradients back-propagate from the confound prediction heads to the encoder, we pass them through a gradient reversal layer. In other words, if the cumulative loss of the target variables is L_t and that of the confounds is L_c, then the loss which is implicitly used to train the encoder is L_e = L_t - L_c. The encoder is encouraged to learn representations of the text which are unrelated to the confounds.
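
A minimal PyTorch sketch of the gradient reversal layer follows; this is an illustrative reconstruction of the idea, not our released implementation.

import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward pass,
    so the encoder is pushed to make the confounds harder to predict."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def grad_reverse(x):
    return GradReverse.apply(x)

# Inside the model (schematically):
#   e = encoder(text)                          # attentional bi-LSTM output
#   y_logits = target_head(e)                  # contributes L_t as usual
#   c_logits = confound_head(grad_reverse(e))  # contributes L_c with reversed sign
# so the encoder is effectively trained with L_e = L_t - L_c.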

To get the “importance” of each feature, we simply look at the attention scores of the model, since ngrams the model focused on while making Y-predictions in a C-invariant way are themselves predictive of Y but not C!

Method 2: Deep Residualization

Recall that we can estimate I(L) by measuring the amount by which L can further improve predictions of Y compared to predictions of Y made from just C. Our Deep Residualization algorithm is directly motivated by this. It first predicts Y from C as well as possible, and then seeks to fine-tune those predictions using a bag-of-words representation of the text T. The parameters are then updated using the loss from both prediction steps. This two-stage prediction process implicitly controls for C because T is being used to explain the part of Y’s variance that the confounds can’t explain.
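
A sketch of this two-stage architecture in PyTorch; the module name and layer sizes are hypothetical, and the details differ from the paper.

import torch
import torch.nn as nn

class DeepResidualizer(nn.Module):
    """Stage 1 predicts Y from the confounds C alone; stage 2 refines that
    prediction with a bag-of-words representation of the text T."""
    def __init__(self, n_confounds, vocab_size, hidden=64):
        super().__init__()
        self.confound_net = nn.Sequential(
            nn.Linear(n_confounds, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.text_net = nn.Sequential(
            nn.Linear(vocab_size + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, c, t_bow):
        y_from_c = self.confound_net(c)          # stage 1: Y from C
        y_final = self.text_net(                 # stage 2: refine with T
            torch.cat([t_bow, y_from_c], dim=-1))
        return y_from_c, y_final

# Both stages contribute to the loss, so T is pushed to explain only the
# variance in Y that the confounds cannot:
#   loss = mse(y_from_c, y) + mse(y_final, y)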

Then to get the “importance” of each feature, we trace all possible paths between the feature and output, multiply weights along these paths, then sum across paths.
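
For a small network like the hypothetical DeepResidualizer above, summing weight products over all input-to-output paths collapses to a matrix product. The sketch below treats the ReLU as pass-through, which is a simplification of the procedure described in the paper.

import torch

def path_importance(model):
    """For each input feature, sum the products of weights along every
    input -> hidden -> output path, then take magnitudes to rank features."""
    W1 = model.text_net[0].weight      # (hidden, vocab_size + 1)
    w2 = model.text_net[2].weight      # (1, hidden)
    return (w2 @ W1).abs().squeeze(0)  # one importance score per input feature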

Social Science Applications

Armed with our theoretical framework and algorithms, we can now pick words and phrases that are strongly associated with arbitrary outcomes, regardless of confounding information. In our papers, we do this for four domains:

  • Product descriptions for chocolate and health products on the Japanese e-commerce website Rakuten. We want to find language that explains sales, but not brand or price.
  • Written complaints to the Consumer Financial Protection Bureau (CFPB). We want to find language that predicts short response time, regardless of the financial product the complaint is about.
  • Search advertisements for real estate, job listings, and apparel on the website Google.com. We want to find language that predicts a high click-through rate (CTR), regardless of the landing page the ad points to.
  • Course descriptions and enrollment figures for 6 years of undergraduate offerings at Stanford University. We want to find language that boosts enrollment, regardless of subject and requirements.

Across all four settings, one or both of our proposed methods outperform a number of existing feature selection algorithms: Residualized Regressions (RR), Regression with Confound features (RC), Mixed-effects Regression (MR), Mutual Information (MI), and Log-Odds Ratio (OR).

Furthermore, we can interpret the features these algorithms select to learn about the linguistic dynamics of the associated domains!

  • Appeals to politeness and seasonality appear to help make for successful Japanese product descriptions – an interesting intersection of language and culture.
  • Concrete details (“multiple”, “xx/xx/xxxx”) and evidence of steps already taken (“submitted”, “ago”) appear important for writing a complaint that will get handled quickly.
  • Appeals to authority (“®”, “Official site”) and personalization (“your”, “personalized”) are helpful for search advertising creatives.
  • Student choice (“or”) and dynamic activities (“eating”, “doing”, “guest”, “project”) make for successful course descriptions.

Conclusion

This work presented two methods for identifying the text features that best explain an outcome while controlling for confounding variables we are not interested in. These methods are generally applicable to a variety of data science and social science applications. In the future, we hope to strengthen the methods’ theoretical guarantees in a causal inference framework.

The algorithms in this blog post have been open-sourced! Install via pip:

pip3 install causal-selection

This post was based on the following papers:

  1. Deconfounded Lexicon Induction for Interpretable Social Science

  2. Interpretable Neural Architectures for Attributing an Ad’s Performance to its Writing Style

  3. Predicting Sales from the Language of Product Descriptions

Procgen Benchmark

We’re releasing Procgen Benchmark, 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns generalizable skills.

The 16 environments: CoinRun, StarPilot, CaveFlyer, Dodgeball, FruitBot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, and BossFight.

Getting started

Using the environment is easy whether you’re a human or AI:

$ pip install procgen # install
$ python -m procgen.interactive --env-name starpilot # human
$ python <<EOF # random AI agent
import gym
env = gym.make('procgen:procgen-coinrun-v0')
obs = env.reset()
while True:
    obs, rew, done, info = env.step(env.action_space.sample())
    env.render()
    if done:
        break
EOF

We’ve found that all of the Procgen environments require training on 500–1000 different levels before they can generalize to new levels, which suggests that standard RL benchmarks need much more diversity within each environment. Procgen Benchmark has become the standard research platform used by the OpenAI RL team, and we hope that it accelerates the community in creating better RL algorithms.

Environment diversity is key

In several environments, it has been observed that agents can overfit to remarkably large training sets. This evidence raises the possibility that overfitting pervades classic benchmarks like the Arcade Learning Environment, which has long served as a gold standard in reinforcement learning (RL). While the diversity between different games in the ALE is one of the benchmark’s greatest strengths, the low emphasis on generalization presents a significant drawback. In each game the question must be asked: are agents robustly learning a relevant skill, or are they approximately memorizing specific trajectories?

CoinRun was designed to address precisely this issue, by using procedural generation to construct distinct sets of training levels and test levels. While CoinRun has helped us better quantify generalization in RL, it is still only a single environment. It’s likely that CoinRun is not fully representative of the many challenges RL agents must face. We want the best of both worlds: a benchmark comprised of many diverse environments, each of which fundamentally requires generalization. To fulfill this need, we have created Procgen Benchmark. CoinRun now serves as the inaugural environment in Procgen Benchmark, contributing its diversity to a greater whole.

Previous work, including the Obstacle Tower Challenge and the General Video Game AI framework, has also encouraged using procedural generation to better evaluate generalization in RL. We’ve designed environments in a similar spirit, with two Procgen environments drawing direct inspiration from GVGAI-based work. Other environments like Dota and StarCraft also provide lots of per-environment complexity, but these environments are hard to rapidly iterate with (and it’s even harder to use more than one such environment at a time). With Procgen Benchmark, we strive for all of the following: experimental convenience, high diversity within environments, and high diversity across environments.

Procgen Benchmark

Procgen Benchmark consists of 16 unique environments designed to measure both sample efficiency and generalization in reinforcement learning. This benchmark is ideal for evaluating generalization since distinct training and test sets can be generated in each environment. This benchmark is also well-suited to evaluate sample efficiency, since all environments pose diverse and compelling challenges for RL agents. The environments’ intrinsic diversity demands that agents learn robust policies; overfitting to narrow regions in state space will not suffice. Put differently, the ability to generalize becomes an integral component of success when agents are faced with ever-changing levels.

Design principles

We’ve designed all Procgen environments to satisfy the following criteria:

  • High Diversity: Environment generation logic is given maximal freedom, subject to basic design constraints. The diversity in the resulting level distributions presents agents with meaningful generalization challenges.

  • Fast Evaluation: Environment difficulty is calibrated such that baseline agents make significant progress after training for 200M timesteps. Moreover, the environments are optimized to perform thousands of steps per second on a single CPU core, enabling a fast experimental pipeline.

  • Tunable Difficulty: All environments support two well-calibrated difficulty settings: easy and hard. While we report results using the hard difficulty setting, we make the easy difficulty setting available for those with limited access to compute power. Easy environments require approximately an eighth of the resources to train.

  • Emphasis on Visual Recognition and Motor Control: In keeping with precedent, environments mimic the style of many Atari and Gym Retro games. Performing well primarily depends on identifying key assets in the observation space and enacting appropriate low level motor responses.

Evaluating generalization

We came to appreciate how hard RL generalization can be while conducting the Retro Contest, as agents continually failed to generalize from the limited data in the training set. Later, our CoinRun experiments painted an even clearer picture of our agents’ struggle to generalize. We’ve now expanded on those results, conducting our most thorough study of RL generalization to date using all 16 environments in Procgen Benchmark.

We first measured how the size of the training set impacts generalization. In each environment, we generated training sets ranging in size from 100 to 100,000 levels. We trained agents for 200M timesteps on these levels using Proximal Policy Optimization, and we measured performance on unseen test levels.
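
Concretely, distinct train and test level sets can be constructed through the environment’s num_levels and start_level options; per the environment documentation, num_levels=0 samples from the unrestricted level distribution.

import gym

# Train on a fixed set of 500 levels (seeds 0 through 499).
train_env = gym.make('procgen:procgen-coinrun-v0', num_levels=500, start_level=0)

# Test on unseen levels: num_levels=0 draws from the full distribution, and a
# disjoint start_level keeps the training seeds out of the test set.
test_env = gym.make('procgen:procgen-coinrun-v0', num_levels=0, start_level=500)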

[Figure: generalization performance in each of the 16 environments; test score vs. number of training levels (100 to 100k, log scale)]

We found that agents strongly overfit to small training sets in almost all environments. In some cases, agents need access to as many as 10,000 levels to close the generalization gap. We also saw a peculiar trend emerge in many environments: past a certain threshold, training performance improves as the training set grows! This runs counter to trends found in supervised learning, where training performance commonly decreases with the size of the training set. We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels. A larger training set can improve training performance if the agent learns to generalize even across levels in the training set. We previously noticed this effect with CoinRun, and have found that it often occurs in many Procgen environments as well.

An ablation with deterministic levels

We also conducted a simple ablation study to emphasize the importance of procedural generation. Instead of using a new level at the start of every episode, we trained agents on a fixed sequence of levels. The agent begins each episode on the first level, and when it successfully completes a level, it progresses to the next one. If the agent fails at any point, the episode terminates. The agent can reach arbitrarily many levels, though in practice it rarely progresses beyond the 20th level in any environment.
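
A sketch of this configuration, assuming the environment’s use_sequential_levels option; the exact flags used in our ablation may differ.

import gym

# Deterministic sequence of levels: completing a level advances to the next
# one within the same episode, and failing ends the episode.
env = gym.make('procgen:procgen-coinrun-v0',
               start_level=0,
               use_sequential_levels=True)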

[Figure: train and test performance in each of the 16 environments; score over 200M training timesteps]

At test time, we remove the determinism in the sequence of levels, instead choosing level sequences at random. We find that agents become competent over the first several training levels in most games, giving an illusion of meaningful progress. However, test performance demonstrates that the agents have in fact learned almost nothing about the underlying level distribution. We believe this vast gap between training and test performance is worth highlighting. It reveals a crucial hidden flaw in training on environments that follow a fixed sequence of levels. These results show just how essential it is to use diverse environment distributions when training and evaluating RL agents.

Next steps

We expect many insights gleaned from this benchmark to apply in more complex settings, and we’re excited to use these new environments to design more capable and efficient agents.

If you’re interested in helping develop diverse environments, we’re hiring!


Acknowledgments

Thanks to Marc Bellemare, Julian Togelius, Carles Gelada, Jacob Jackson, Alex Ray, Lilian Weng, and Joshua Achiam for their feedback on the paper.

Thanks to Mira Murati, Brooke Chan, Justin Jay Wang, Greg Brockman, Ashley Pilipiszyn and Jack Clark for their work supporting, designing, writing, and providing feedback on this post.

Special thanks to Kenney for the many high quality game assets used throughout these environments.

Additional thanks to CraftPix.net for several game backgrounds, as well as to GameArtGuppy, and ansimuz. All asset licenses can be found here.
