This post is crosslisted on the CMU ML blog.
The history of machine learning has largely been a story of increasing
abstraction. In the dawn of ML, researchers spent considerable effort
engineering features. As deep learning gained popularity, researchers then
shifted towards tuning the update rules and learning rates for their
optimizers. Recent research in metalearning has climbed one level of
abstraction higher: many researchers now spend their days manually constructing
task distributions, from which they can automatically learn good optimizers.
What might be the next rung on this ladder? In this post we introduce theory
and algorithms for unsupervised metalearning, where machine learning
algorithms themselves propose their own task distributions. Unsupervised
metalearning further reduces the amount of human supervision required to solve
tasks, potentially inserting a new rung on this ladder of abstraction.
We start by discussing how machine learning algorithms use human supervision to
find patterns and extract knowledge from observed data. The most common machine
learning setting is regression, where a human provides labels $Y$ for a set of
examples $X$. The aim is to return a predictor that correctly assigns
labels to novel examples. Another common machine learning problem setting is
reinforcement learning (RL), where an agent takes actions in an environment.
In RL, humans indicate the desired behavior through a reward function that
the agent seeks to maximize. To draw a crude analogy to regression,
the environment dynamics are the examples $X$, and the reward function gives
the labels $Y$. Algorithms for regression and RL employ many tools,
including tabular methods (e.g., value iteration), linear methods (e.g., linear
regression) kernelmethods (e.g., RBFSVMs), and deep neural networks. Broadly, we call these
algorithms learning procedures: processes that take as input a dataset
(examples with labels, or transitions with rewards) and output a function that
performs well (achieves high accuracy or large reward) on the dataset.
Machine learning research is similar to the control room for large physics
experiments. Researchers have a number of knobs they can tune which affect the
performance of the learning procedure. The right setting for the knobs depends
on the particular experiment: some settings work well for highenergy
experiments; others work well for ultracold atom experiments.
Figure Credit.
Similar to lab procedures used in physics and biology, the learning
procedures used in machine learning have many knobs^{1} that can be tuned.
For example, the learning procedure for training a neural network might be defined by an optimizer
(e.g., Nesterov, Adam) and a learning rate (e.g., 1e5).
Compared with regression, learning procedures specific to RL (e.g., DDPG) often have many more knobs, including
the frequency of data collection and how frequently the policy is updated.
Finding the right setting for the knobs can have a large effect on how quickly
the learning procedure solves a task, and a good configuration of knobs for one
learning procedure may be a bad configuration for another.
MetaLearning Optimizes Knobs of the Learning Procedure
While machine learning practitioners often carefully tune these knobs by hand,
if we are going to solve many tasks, it may be useful to automatic this
process. The process of setting the knobs of learning procedures via
optimization is called metalearning [Thrun 1998]. Algorithms that perform this optimization
problem automatically are known as metalearning algorithms.
Explicitly tuning the knobs of learning procedures is an active area of research, with various researchers looking at tuning the update rules [Andrychowicz 2016, Duan 2016, Wang 2016], weight initialization [Finn 2017], network weights [Ha 2016], network architectures [Gaier 2019], and other facets of learning procedures.
To evaluate a setting of knobs, metalearning algorithms consider not one task
but a distribution over many tasks. For example, a distribution over supervised
learning tasks may include learning a dog detector, learning a cat detector,
and learning a bird detector. In reinforcement learning, a task distribution
could be defined as driving a car in a smooth, safe, and efficient manner,
where tasks differ by the weights they place on smoothness, safety, and
efficiency. Ideally, the task distribution is designed to mirror the
distribution over tasks that we are likely to encounter in the real world.
Since the tasks in a task distribution are typically related, information from
one task may be useful in solving other tasks more efficiently. As you might
expect, a knob setting that works best on one distribution of tasks may not be
the best for another task distribution; the optimal knob setting
depends on the task distribution.
An illustration of metalearning, where tasks correspond to arranging blocks
into different types of towers. The human has a particular block tower in mind
and rewards the robot when it builds the correct tower. The robot’s aim is to
build the correct tower as quickly as possible.
In many settings we want to do well on a task distribution to which we have
only limited access. For example, in a selfdriving car, tasks may correspond
to finding the optimal balance of smoothness, safety, and efficiency for each
rider, but querying riders to get rewards is expensive. A researcher can
attempt to manually construct a task distribution that mimics the true task
distribution, but this can be quite challenging and time consuming. Can we
avoid having to manually design such task distributions?
To answer this question, we must understand where the benefits of metalearning
come from. When we define
task distributions for metalearning, we do so with some prior knowledge in
mind. Without this prior information, tuning the knobs of a learning procedure
is often a zerosum game: setting the knobs to any configuration will
accelerate learning on some tasks while slowing learning on other tasks. Does
this suggest there is no way to see the benefit of metalearning without the
manual construction of task distributions? Perhaps not! The next section
presents an alternative.
Optimizing the Learning Procedure with SelfProposed Tasks
If designing task distributions is the bottleneck in applying metalearning
algorithms, why not have metalearning algorithms propose their own tasks? At
first glance this seems like a terrible idea, because the No Free Lunch Theorem
suggests that this is impossible, without additional knowledge. However, many
realworld settings do provide a bit of additional information, albeit
disguised as unlabeled data. For example, in regression, we might have
access to an unlabeled dataset and know that the downstream tasks will be
labeled versions of this same image dataset. In a RL setting, a robot can
interact with its environment without receiving any reward, knowing that
downstream tasks will be constructed by defining reward functions for this very
environment (i.e. the real world). Seen from this perspective, the recipe for
unsupervised metalearning (doing metalearning without manually constructed
tasks) becomes clear: given unlabeled data, construct task distributions from this unlabeled data or environment, and then metalearn to quickly solve these selfproposed tasks.
In unsupervised metalearning, the agent proposes its own tasks, rather than
relying on tasks proposed by a human.
How can we use this unlabeled data to construct task distributions which will
facilitate learning downstream tasks? In the case of regression, prior work on
unsupervised metalearning [Hsu 2018, Khodadadeh 2019]
clusters an unlabeled dataset of images and then randomly chooses subsets of
the clusters to define a distribution of classification tasks. Other work
[Jabri 2019] look at an RL setting: after exploring an environment without a
reward function to collect a set of behaviors that are feasible in this
environment, these behaviors are clustered and used to define a distribution of
reward functions. In both cases, even though the tasks constructed can be
random, the resulting task distribution is not random, because all tasks share
the underlying unlabeled data — the image dataset for regression and
the environment dynamics for reinforcement learning. The underlying unlabeled
data are the inductive bias with which we pay for our free lunch.
Let us take a deeper look into the RL case. Without knowing the downstream tasks or reward functions, what is the “best” task distribution for “practicing” to solve tasks quickly? Can we measure how effective a task distribution is for solving unknown,
downstream tasks? Is there any sense in which one unsupervised task
proposal mechanism is better than another? Understanding the answers to these
questions may guide the principled development of metalearning algorithms with
little dependence on human supervision. Our work [Gupta 2018], takes a
first step towards answering these questions. In particular, we examine the
worstcase performance of learning procedures, and derive an optimal
unsupervised metareinforcement learning procedure.
Optimal Unsupervised MetaLearning
To answer the questions posed above, our first step is to define an optimal
metalearner for the case where the distribution of tasks is known.
We define an optimal metalearner as the learning procedure that achieves the
largest expected reward, averaged across the distribution of tasks. More precisely,
we will compare the expected reward for a learning procedure $f$
to that of best learning procedure $f^*$,
defining the regret of $f$ on a task distribution $p$ as follows:
Extending this definition to the case of unsupervised metalearning, an optimal
unsupervised metalearner can be defined as a metalearner that achieves the
minimum worstcase regret across all possible task distributions that may be
encountered in the environment. In the absence of any knowledge about the
actual downstream task, we resort to a worst case formulation. An unsupervised
metalearning algorithm will find a single learning procedure $f$ that has the
lowest regret against an adversarially chosen task distribution $p$:
Our work analyzes how exactly we might obtain such an optimal unsupervised
metalearner, and provides bounds on the regret that it might incur in the
worst case. Specifically, under some restrictions on the family of tasks that
might be encountered at testtime, the optimal distribution for an unsupervised
metalearner to propose is uniform over all possible tasks.
The intuition for this is straightforward: if the test time task distribution
can be chosen adversarially, the algorithm must make sure it is uniformly good
over all possible tasks that might be encountered. As a didactic example, if
testtime reward functions were restricted to the class of goalreaching tasks,
the regret for reaching a goal at testtime is inverse related to the probability
of sampling that goal during trainingtime. If any one of
the goals $g$ has lower density than the others, an adversary can propose
a task distribution solely consisting of reaching that goal $g$
causing the learning procedure to incur a higher regret. This example suggests that we can find an optimal unsupervised metalearner using a uniform distribution over goals. Our paper formalizes this idea and extends it to broader classes task distributions.
Now, actually sampling from a uniform distribution over all possible tasks is quite challenging.
Several recent papers have proposed RL exploration methods based on maximizing
mutual information [Achiam 2018, Eysenbach 2018, Gregor 2016, Lee 2019, Sharma 2019].
In this work, we show that these methods provide a tractable approximation to the uniform distribution over task distributions. To
understand why this is, we can look at the form of a mutual information
considered by [Eysenbach 2018], between states $s$ and latent
variables $z$:
In this objective, the first marginal entropy term is maximized when there is a
uniform distribution over all possible tasks. The second conditional entropy
term ensures consistency, by making sure that for each $z$, the resulting
distribution of $s$ is narrow. This suggests constructing unsupervised
taskdistributions in an environment by optimizing mutual information
gives us a provably optimal task distribution, according to our
notion of minmax optimality.
While the analysis makes some limiting assumptions about the forms of tasks
encountered, we show how this analysis can be extended to provide a bound on
the performance in the most general case of reinforcement learning. It also
provides empirical gains on several simulated environments as compared to
methods which train from scratch, as shown in the Figure below.
Summary & Discussion
In summary:

Learning procedures are recipes for converting datasets into function
approximators. Learning procedures have many knobs, which can be tuned by
optimizing the learning procedures to solve a distribution of tasks. 
Manually designing these task distributions is challenging, so a recent line
of work suggests that the learning procedure can use unlabeled data to
propose its own tasks for optimizing its knobs. 
These unsupervised metalearning algorithms allow for learning in regimes
previously impractical, and further expand that capability of machine
learning methods. 
This work closely relates to other works on unsupervised skill
discovery, exploration and representation learning, but
explicitly optimizes for transferability of the representations and skills to
downstream tasks.
A number of open questions remain about unsupervised metalearning:

Unsupervised learning is closely connected to unsupervised metalearning: the
former uses unlabeled data to learn features, while the second uses unlabeled
data to tune the learning procedure. Might there be some unifying treatment
of both approaches? 
Our analysis only proves that task proposal based on mutual information is optimal for memoryless metalearning algorithms. Metalearning algorithms with memory, which we expect will perform better, may perform best with different task proposal mechanisms.

Scaling unsupervised meta learning to leverage largescale datasets and
complex tasks holds the promise of acquiring learning procedures for solving
realworld problems more efficiently than our current learning procedures.
Check out our paper for more experiments and proofs: https://arxiv.org/abs/1806.04640
Acknowledgments
Thanks to Jake Tyo, Conor Igoe, Sergey Levine, Chelsea Finn, Misha Khodak, Daniel Seita, and Stefani Karp for their feedback.

These knobs are often known as hyperparameters, but we will stick with
the colloquial “knob” to avoid having to draw a line between parameters and
hyperparameters. ↩