Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability

Many experimental works have observed that generalization in deep RL appears to be difficult: although RL agents can learn to perform very complex tasks, they don’t seem to generalize over diverse task distributions as well as the excellent generalization of supervised deep nets might lead us to expect. In this blog post, we will aim to explain why generalization in RL is fundamentally harder, and indeed more difficult even in theory.

We will show that attempting to generalize in RL induces implicit partial observability, even when the RL problem we are trying to solve is a standard fully-observed MDP. This induced partial observability can significantly complicate the types of policies needed to generalize well, potentially requiring counterintuitive strategies like information-gathering actions, recurrent non-Markovian behavior, or randomized strategies. Such strategies are ordinarily unnecessary in fully observed MDPs, but they surprisingly become necessary as soon as we consider generalization from a finite training set, even though the underlying MDP is fully observed. This blog post will walk through why partial observability can implicitly arise, what it means for the generalization performance of RL algorithms, and how methods can account for partial observability to generalize well.
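One way to make this concrete: after training on a finite set of contexts, the agent cannot be certain which environment it is actually in at test time, and acting optimally under that uncertainty is a partially observed problem. Roughly, if $P(\mathcal{M} \mid \mathcal{D})$ denotes the posterior over environments consistent with the training data $\mathcal{D}$, the best the agent can do is maximize

$$\max_{\pi} \; \mathbb{E}_{\mathcal{M} \sim P(\mathcal{M} \mid \mathcal{D})} \left[ \mathbb{E}_{\pi, \mathcal{M}} \left[ \sum_{t} r_t \right] \right],$$

which is exactly optimal control in a POMDP whose unobserved state component is the identity of $\mathcal{M}$ itself. Because the agent never observes $\mathcal{M}$, memory, information-gathering actions, and stochastic behavior can all improve expected return, even though each individual MDP is fully observed.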

RECON: Learning to Explore the Real World with a Ground Robot

RECON Exploration Teaser

An example of our method deployed on a Clearpath Jackal ground robot (left) exploring a suburban environment to find a visual target (inset). (Right) Egocentric observations of the robot.

Imagine you’re in an unfamiliar neighborhood with no house numbers and I give you a photo that I took a few days ago of my house, which is not too far away. If you tried to find my house, you might follow the streets and go around the block looking for it. You might take a few wrong turns at first, but eventually you would locate my house. In the process, you would end up with a mental map of my neighborhood. The next time you’re visiting, you will likely be able to navigate to my house right away, without taking any wrong turns.

Such exploration and navigation behavior is easy for humans. What would it take for a robotic learning algorithm to enable this kind of intuitive navigation capability? To build a robot capable of exploring and navigating like this, we need to learn from diverse prior datasets in the real world. While it’s possible to collect a large amount of data from demonstrations, or even with randomized exploration, learning meaningful exploration and navigation behavior from this data can be challenging – the robot needs to generalize to unseen neighborhoods, recognize visual and dynamical similarities across scenes, and learn a representation of visual observations that is robust to distractors like weather conditions and obstacles. Since such factors can be hard to model and transfer from simulated environments, we tackle these problems by teaching the robot to explore using only real-world data.

Designs from Data: Offline Black-Box Optimization via Conservative Training



Figure 1: Offline Model-Based Optimization (MBO): The goal of offline MBO is to optimize an unknown objective function $f(x)$ with respect to $x$, provided access only to a static, previously-collected dataset of designs.

Machine learning methods have shown tremendous promise on prediction problems: predicting the efficacy of a drug, predicting how a protein will fold, or predicting the strength of a composite material. But can we use machine learning for design? Conventionally, such problems have been tackled with black-box optimization procedures that repeatedly query an objective function. For instance, if designing a drug, the algorithm will iteratively modify the drug, test it, then modify it again. But when evaluating the efficacy of a candidate design involves conducting a real-world experiment, this can quickly become prohibitive. An appealing alternative is to create designs from data. Instead of requiring active synthesis and querying, can we devise a method that simply examines a large dataset of previously tested designs (e.g., drugs that have been evaluated before), and comes up with a new design that is better? We call this offline model-based optimization (offline MBO), and in this post, we discuss offline MBO methods and some recent advances.
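As a rough sketch of what offline MBO involves, consider the naive approach (the names and shapes below are hypothetical, assuming PyTorch): fit a surrogate model $\hat{f}_\theta(x)$ to the static dataset, then optimize the design against the surrogate by gradient ascent.

```python
import torch
import torch.nn as nn

# Static dataset of previously evaluated designs: inputs X (n x d) and scores y (n,).
# These random tensors stand in for whatever offline data is actually available.
X, y = torch.randn(1000, 32), torch.randn(1000)

# 1) Fit a surrogate model f_theta(x) ~ f(x) on the offline data.
surrogate = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 1))
model_opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
for _ in range(1000):
    model_opt.zero_grad()
    loss = ((surrogate(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    model_opt.step()

# 2) Optimize the design against the learned surrogate by gradient ascent,
#    starting from the best design in the dataset.
x = X[y.argmax()].clone().requires_grad_(True)
design_opt = torch.optim.Adam([x], lr=1e-2)
for _ in range(200):
    design_opt.zero_grad()
    (-surrogate(x).squeeze()).backward()  # ascend the predicted score
    design_opt.step()
```

The second loop is exactly where the naive approach fails: gradient ascent happily pushes $x$ into regions the surrogate has never seen, where its predictions can be arbitrarily wrong. Conservative training counteracts this by penalizing the surrogate's values on such out-of-distribution designs, so that the optimizer is steered toward designs the data actually supports.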

A First-Principles Theory of Neural Network Generalization


Fig 1. Measures of generalization performance for neural networks trained on four different boolean functions (colors) with varying training set size. For both MSE (left) and learnability (right), theoretical predictions (curves) closely match true performance (dots).

Deep learning has proven a stunning success for countless problems of interest, but this success belies the fact that, at a fundamental level, we do not understand why it works so well. Many empirical phenomena, well-known to deep learning practitioners, remain mysteries to theoreticians. Perhaps the greatest of these mysteries has been the question of generalization: why do the functions learned by neural networks generalize so well to unseen data? From the perspective of classical ML, neural nets’ high performance is a surprise given that they are so overparameterized that they could easily represent countless poorly-generalizing functions.

Making RL Tractable by Learning More Informative Reward Functions: Example-Based Control, Meta-Learning, and Normalized Maximum Likelihood



Diagram of MURAL, our method for learning uncertainty-aware rewards for RL. After the user provides a few examples of desired outcomes, MURAL automatically infers a reward function that takes into account these examples and the agent’s uncertainty for each state.

Although reinforcement learning has shown success in domains such as robotics, chip placement and playing video games, it is usually intractable in its most general form. In particular, deciding when and how to visit new states in the hopes of learning more about the environment can be challenging, especially when the reward signal is uninformative. These questions of reward specification and exploration are closely connected — the more directed and “well shaped” a reward function is, the easier the problem of exploration becomes. The answer to the question of how to explore most effectively is likely to be closely informed by the particular choice of how we specify rewards.

For unstructured problem settings such as robotic manipulation and navigation — areas where RL holds substantial promise for enabling better real-world intelligent agents — reward specification is often the key factor preventing us from tackling more difficult tasks. The challenge of effective reward specification is two-fold: we require reward functions that can be specified in the real world without significantly instrumenting the environment, but also effectively guide the agent to solve difficult exploration problems. In our recent work, we address this challenge by designing a reward specification technique that naturally incentivizes exploration and enables agents to explore environments in a directed way.
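To make the setting concrete, here is a minimal sketch of the underlying example-based-control idea, assuming PyTorch and flat state vectors (all names here are hypothetical, and this deliberately omits the normalized-maximum-likelihood machinery that makes MURAL uncertainty-aware): train a classifier to distinguish user-provided success examples from states the agent visits, and use its success probability as the reward.

```python
import torch
import torch.nn as nn

STATE_DIM = 10  # hypothetical flat state dimension

# Success classifier: maps a state to the probability that it is a desired outcome.
classifier = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def update_classifier(success_examples, visited_states):
    """Train on user-provided success examples (label 1) vs. states the agent
    has visited so far (label 0)."""
    states = torch.cat([success_examples, visited_states])
    labels = torch.cat([torch.ones(len(success_examples)),
                        torch.zeros(len(visited_states))])
    opt.zero_grad()
    loss = bce(classifier(states).squeeze(-1), labels)
    loss.backward()
    opt.step()

def reward(state):
    """Use the classifier's success probability as the RL reward."""
    with torch.no_grad():
        return torch.sigmoid(classifier(state)).item()
```

A classifier trained this way tends to be overconfident far from its training data. MURAL instead evaluates the classifier with normalized maximum likelihood, so that states the agent is genuinely uncertain about receive intermediate rewards, which both shapes the reward toward the user's examples and incentivizes directed exploration.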

Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

We consider a problem: can a machine learn from a few labeled pixels to predict every pixel in a new image?
This task is extremely challenging (see Fig. 1), as a single body part could contain visually distinctive areas
(e.g., a head consists of eyes, a nose, and a mouth); different body parts might look similar and indistinguishable
(e.g., upper arms vs. lower arms). It could be even more difficult if we do not provide any precise locations,
but only the occurrence of body parts in the image. This problem is dubbed weakly-supervised segmentation, where
the goal is to classify every pixel into semantic categories using only partial / weak supervision. There are many
forms of weak annotations which are cheap but not perfect, e.g., image-level tags, bounding boxes, points, and scribbles.
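As a point of reference, the most direct way to use such sparse annotations is to apply the usual segmentation loss only at the labeled pixels and ignore everything else. Below is a minimal sketch (assuming PyTorch, with hypothetical shapes); the pixel-to-segment contrastive objective in this work supplies additional learning signal for all the unlabeled pixels that this baseline leaves untouched.

```python
import torch
import torch.nn.functional as F

def partial_cross_entropy(logits, labels, ignore_index=-1):
    """Segmentation loss under sparse point annotations.

    logits: (B, C, H, W) class scores from any segmentation network.
    labels: (B, H, W) with a class index at the few labeled pixels and
            `ignore_index` everywhere else.
    """
    # cross_entropy with ignore_index skips every unlabeled pixel, so the
    # network receives supervision only at the annotated points.
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)

# Example: 2 images, 21 classes, 64x64 resolution, ~20 labeled pixels in the first image.
logits = torch.randn(2, 21, 64, 64, requires_grad=True)
labels = torch.full((2, 64, 64), -1, dtype=torch.long)
ys, xs = torch.randint(0, 64, (20,)), torch.randint(0, 64, (20,))
labels[0, ys, xs] = torch.randint(0, 21, (20,))
loss = partial_cross_entropy(logits, labels)
loss.backward()
```

With only a handful of labeled points per image, the gradient above touches a tiny fraction of the pixels, which is why extra training signal on the unlabeled pixels is so valuable.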

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

Recent years have demonstrated the potential of deep multi-agent reinforcement
learning (MARL) to train groups of AI agents that can collaborate to solve complex
tasks – for instance, AlphaStar achieved professional-level performance in the
Starcraft II video game, and OpenAI Five defeated the world champion in Dota2.
These successes, however, were powered by enormous amounts of computational resources:
tens of thousands of CPUs, hundreds of GPUs, and even TPUs were used to collect and train on
a large volume of data. This has motivated the academic MARL community to develop
MARL methods which train more efficiently.



DeepMind’s AlphaStar attained professional level performance in StarCraft II, but required enormous amounts of
computational power to train.

Research in developing more efficient and effective MARL algorithms has focused on off-policy methods – which store and re-use data for multiple policy updates – rather than on-policy algorithms, which use newly collected training data before each update to the agents’ policies. This is largely due to the common belief that off-policy algorithms are much more sample-efficient than on-policy methods.

In this post, we outline our recent publication in which we re-examine many of these assumptions about on-policy algorithms. In particular, we analyze the performance of PPO, a popular single-agent on-policy RL algorithm, and demonstrate that with several simple modifications, PPO achieves strong performance in 3 popular MARL benchmarks while exhibiting a similar sample efficiency to popular off-policy algorithms in the majority of scenarios. We study the impact of these modifications through ablation studies and suggest concrete implementation and tuning practices which are critical for strong performance. We refer to PPO with these modifications as Multi-Agent PPO (MAPPO).
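For reference, the core of PPO that carries over to the multi-agent setting is the clipped surrogate objective, which limits how far each policy update can move from the data-collecting policy:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.$$

MAPPO keeps this objective intact; the modifications we study concern implementation and tuning choices around it.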

BASALT: A Benchmark for
Learning from Human Feedback

TL;DR: We are launching a NeurIPS competition and benchmark called BASALT: a
set of Minecraft environments and a human evaluation protocol that we hope will
stimulate research and investigation into solving tasks with no pre-specified
reward function, where the goal of an agent must be communicated through
demonstrations, preferences, or some other form of human feedback. Sign up
to participate in the
competition!

Learning What To Do by Simulating the Past

Reinforcement learning (RL) has been used successfully for solving tasks which
have a well defined reward function – think AlphaZero for Go, OpenAI Five for
Dota, or AlphaStar for StarCraft. However, in many practical situations you
don’t have a well defined reward function. Even a task as seemingly
straightforward as cleaning a room has many subtle cases: should a business
card with a piece of gum be thrown away as trash, or might it have sentimental
value? Should the clothes on the floor be washed, or returned to the
closet? Where are notebooks supposed to be stored? Even when these aspects of
a task have been clarified, translating it into a reward is non-trivial: if you
provide rewards every time you sweep the trash, then the agent might dump the
trash back out so that it can sweep it up again.

Alternatively, we can try to learn a reward function from human feedback about
the behavior of the agent. For example, Deep RL from Human Preferences
learns a reward function from pairwise comparisons of video clips of the
agent’s behavior. Unfortunately, however, this approach can be very costly:
training a MuJoCo Cheetah to run forward requires a human to provide 750
comparisons.
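For context, the preference-based approach above fits $\hat{r}$ by treating each comparison as a noisy judgment of which clip accumulated more reward; in the commonly used Bradley-Terry-style model, the probability that the human prefers clip $\sigma^1$ over clip $\sigma^2$ is

$$\hat{P}\left[\sigma^1 \succ \sigma^2\right] = \frac{\exp\sum_t \hat{r}(s^1_t, a^1_t)}{\exp\sum_t \hat{r}(s^1_t, a^1_t) + \exp\sum_t \hat{r}(s^2_t, a^2_t)},$$

and $\hat{r}$ is trained to maximize the likelihood of the human's actual choices. Every one of those 750 comparisons is a label in this sense, which is what makes the approach costly.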


Instead, we propose an algorithm that can learn a policy without any human
supervision or reward function, by using information implicitly available in
the state of the world. For example, we learn a policy that balances this
Cheetah on its front leg from a single state in which it is balancing.

An EPIC way to evaluate reward functions

Cross-posted from the DeepMind Safety blog.

In many reinforcement learning problems the objective is too complex to be specified procedurally, and a reward function must instead be learned from user data. However, how can you tell if a learned reward function actually captures user preferences? Our method, Equivalent-Policy Invariant Comparison (EPIC), allows one to evaluate a reward function by computing how similar it is to other reward functions. EPIC can be used to benchmark reward learning algorithms by comparing learned reward functions to a ground-truth reward.
It can also be used to validate learned reward functions prior to deployment, by comparing them against reward functions learned via different techniques or data sources.



Figure 1: EPIC compares reward functions $R_a$ and $R_b$ by first mapping them to canonical representatives and then computing the Pearson distance between the canonical representatives on a coverage distribution $\mathcal{D}$. Canonicalization removes the effect of potential shaping, and Pearson distance is invariant to positive affine transformations.
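Concretely, the canonicalization step subtracts the expected effect of potential shaping under the coverage distribution; a simplified form of the canonically shaped reward (eliding some details from the paper), with $S, A, S'$ drawn independently from $\mathcal{D}$, is

$$C_{\mathcal{D}}(R)(s, a, s') = R(s, a, s') + \mathbb{E}\left[\gamma R(s', A, S') - R(s, A, S') - \gamma R(S, A, S')\right],$$

and the EPIC distance between $R_a$ and $R_b$ is then the Pearson distance $\sqrt{\tfrac{1 - \rho}{2}}$ between $C_{\mathcal{D}}(R_a)$ and $C_{\mathcal{D}}(R_b)$, where $\rho$ is the Pearson correlation computed over transitions sampled from $\mathcal{D}$.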