BASALT: A Benchmark for
Learning from Human Feedback

TL;DR: We are launching a NeurIPS competition and benchmark called BASALT: a
set of Minecraft environments and a human evaluation protocol that we hope will
stimulate research and investigation into solving tasks with no pre-specified
reward function, where the goal of an agent must be communicated through
demonstrations, preferences, or some other form of human feedback. Sign up
to participate in the competition!

Motivation

Deep reinforcement learning takes a reward function as input and learns to
maximize the expected total reward. An obvious question is: where did this
reward come from? How do we know it captures what we want? Indeed, it often
doesn’t capture what we want, with many recent examples showing that the
provided specification often leads the agent to behave in an unintended way.


Our existing algorithms have a problem: they implicitly assume access to a
perfect specification, as though one has been handed down by God. Of course, in
reality, tasks don’t come pre-packaged with rewards; those rewards come from
imperfect human reward designers.

For example, consider the task of summarizing articles. Should the agent focus
more on the key claims, or on the supporting evidence? Should it always use a
dry, analytic tone, or should it copy the tone of the source material? If the
article contains toxic content, should the agent summarize it faithfully,
mention that toxic content exists but not summarize it, or ignore it completely?
How should the agent deal with claims that it knows or suspects to be false? A
human designer likely won’t be able to capture all of these considerations in a
reward function on their first try, and, even if they did manage to have a
complete set of considerations in mind, it might be quite difficult to translate
these conceptual preferences into a reward function the environment can directly
calculate.


Since we can’t expect a good specification on the first try, much recent work
has proposed algorithms that instead allow the designer to iteratively
communicate details and preferences about the task. Instead of rewards, we use
new types of feedback, such as demonstrations (in the above example,
human-written summaries), preferences (judgments about which of two summaries
is better), corrections (changes to a summary that would make it better), and
more. The agent may also elicit feedback by, for example, taking the first
steps of a provisional plan and seeing if the human intervenes, or by asking
the designer questions about the task. This paper provides a framework and
summary of these techniques.
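
To make the preference modality concrete, here is a minimal, illustrative sketch (not tied to any particular paper’s implementation) of fitting a reward model to pairwise preferences with a Bradley-Terry style loss; all names and sizes are placeholder assumptions:

# Sketch: learning a reward model from pairwise preference feedback.
# The preferred trajectory of each pair should receive higher predicted return.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def trajectory_return(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (T, obs_dim) -> scalar predicted return (sum of per-step rewards)
        return self.net(traj).sum()

def preference_loss(model, traj_a, traj_b, a_preferred: bool) -> torch.Tensor:
    # Bradley-Terry model: P(a preferred) = exp(R_a) / (exp(R_a) + exp(R_b)).
    returns = torch.stack([model.trajectory_return(traj_a),
                           model.trajectory_return(traj_b)])
    target = torch.tensor(0 if a_preferred else 1)
    return nn.functional.cross_entropy(returns.unsqueeze(0), target.unsqueeze(0))

model = RewardModel(obs_dim=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# `pairs` stands in for (trajectory_a, trajectory_b, a_preferred) tuples gathered
# from a human comparing pairs of agent behaviors.
pairs = [(torch.randn(100, 32), torch.randn(100, 32), True)]
for traj_a, traj_b, a_preferred in pairs:
    loss = preference_loss(model, traj_a, traj_b, a_preferred)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()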

Despite the plethora of techniques developed to tackle this problem, there have
been no popular benchmarks that are specifically intended to evaluate algorithms
that learn from human feedback. A typical paper will take an existing deep RL
benchmark (often Atari or MuJoCo), strip away the rewards, train an agent using
their feedback mechanism, and evaluate performance according to the preexisting
reward function.

This has a variety of problems, but most notably, these environments do not have
many potential goals. For example, in the Atari game Breakout, the agent must
either hit the ball back with the paddle, or lose. There are no other
options. Even if you get good performance on Breakout with your algorithm, how
can you be confident that you have learned that the goal is to hit the bricks
with the ball and clear all the bricks away, as opposed to some simpler
heuristic like “don’t die”? If this algorithm were applied to summarization,
might it still just learn some simple heuristic like “produce grammatically
correct sentences”, rather than actually learning to summarize? In the real
world, you aren’t funnelled into one obvious task above all others; successfully
training such agents will require that they be able to identify and perform a
particular task in a context where many tasks are possible.

We built the Benchmark for Agents that Solve Almost Lifelike Tasks (BASALT) to
provide a benchmark in a much richer environment: the popular video game
Minecraft. In Minecraft, players can choose among
a wide variety of things to do. Thus, to learn to do a specific task in
Minecraft, it is crucial to learn the details of the task from human feedback;
there is no chance that a feedback-free approach like “don’t die” would perform
well.

We’ve just launched the MineRL BASALT competition on Learning from Human
Feedback, as a sister competition to the existing MineRL Diamond competition on
Sample Efficient Reinforcement Learning, both of which will be presented at
NeurIPS 2021. You can sign up to participate in the competition here.

Our aim is for BASALT to mimic realistic settings as much as possible, while
remaining easy to use and suitable for academic experiments. We’ll first explain
how BASALT works, and then show its advantages over the current environments
used for evaluation.

What is BASALT?

We argued previously that we should be thinking about the specification of the
task as an iterative process of imperfect communication between the AI designer
and the AI agent. Since BASALT aims to be a benchmark for this entire process,
it specifies tasks to the designers and allows the designers to develop agents
that solve the tasks with (almost) no holds barred.


Initial provisions. For each task, we provide a Gym environment (without
rewards), and an English description of the task that must be
accomplished. The Gym environment exposes pixel observations as well as
information about the player’s inventory. Designers may then use whichever
feedback modalities they prefer, even reward functions and hardcoded
heuristics, to create agents that accomplish the task. The only restriction is
that they may not extract additional information from the Minecraft
simulator, since this approach would not be possible in most real world
tasks.

For example, for the MakeWaterfall task, we provide the following details:

Description: After spawning in a mountainous area, the agent should build
a beautiful waterfall and then reposition itself to take a scenic picture of
the same waterfall. The picture of the waterfall can be taken by orienting
the camera and then throwing a snowball when facing the waterfall at a good
angle.

Resources: 2 water buckets, stone pickaxe, stone shovel, 20 cobblestone blocks

Evaluation. How do we evaluate agents if we don’t provide reward functions?
We rely on human comparisons. Specifically, we record the trajectories of
two different agents on a particular environment seed and ask a human to
decide which of the agents performed the task better. We plan to release code
that will allow researchers to collect these comparisons from Mechanical Turk
workers. Given a few comparisons of this form, we use
TrueSkill to compute scores for
each of the agents that we are evaluating.
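
For illustration, here is a minimal sketch of how pairwise judgments can be turned into TrueSkill scores with the open-source trueskill package; the comparison data below is a placeholder, and the competition’s actual evaluation pipeline may differ in its details:

# Sketch: computing TrueSkill scores from pairwise human comparisons
# (pip install trueskill).
import trueskill

ratings = {"agent_a": trueskill.Rating(), "agent_b": trueskill.Rating()}

# Each entry is (winner, loser), as judged by a human who watched both agents'
# trajectories on the same environment seed.
comparisons = [("agent_a", "agent_b"), ("agent_a", "agent_b"), ("agent_b", "agent_a")]

for winner, loser in comparisons:
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

for name, rating in ratings.items():
    # A common conservative score estimate: mean minus three standard deviations.
    print(name, rating.mu - 3 * rating.sigma)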

For the competition, we will hire contractors to provide the comparisons. Final
scores are determined by averaging normalized TrueSkill scores across tasks. We
will validate potential winning submissions by retraining the models and
checking that the resulting agents perform similarly to the submitted agents.

Dataset. While BASALT does not place any restrictions on what types of
feedback may be used to train agents, we (and MineRL Diamond) have found that,
in practice, demonstrations are needed at the start of training to get a
reasonable starting policy. (This approach has also been used for Atari.)
Therefore, we have collected and provided a dataset of human demonstrations for
each of our tasks.
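
For illustration, iterating over these demonstrations with the MineRL data API looks roughly like the following; the environment name, data directory, and exact batch_iter signature are assumptions that may differ across MineRL versions:

# Sketch: loading BASALT demonstrations via the MineRL data API.
import minerl

# Assumes the dataset has already been downloaded to `data_dir`.
data = minerl.data.make("MineRLBasaltMakeWaterfall-v0", data_dir="path/to/data")

for obs, act, rew, next_obs, done in data.batch_iter(batch_size=4, seq_len=32, num_epochs=1):
    # obs["pov"] contains pixel observations; act is a dict of Minecraft actions.
    print(obs["pov"].shape, list(act.keys()))
    break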

The three stages of the waterfall task in one of our demonstrations: climbing to
a good location, placing the waterfall, and returning to take a scenic picture
of the waterfall.

Getting started. One of our goals was to make BASALT particularly easy to
use. Creating a BASALT environment is as simple as installing
MineRL and calling gym.make() on the appropriate
environment name. We have also provided a behavioral cloning (BC) agent in a
repository that could be submitted to the competition; it takes just a couple of
hours to train an agent on any given task.
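
For illustration, creating an environment and stepping it looks roughly like the following; the environment name below is our best guess at a registered BASALT id, and the exact API may vary across MineRL versions, so treat it as a sketch:

# Sketch: create a BASALT environment and run a short random rollout.
import gym
import minerl  # noqa: F401  (importing minerl registers the BASALT environments with gym)

env = gym.make("MineRLBasaltMakeWaterfall-v0")  # assumed environment id
obs = env.reset()

for _ in range(100):
    action = env.action_space.sample()  # replace with your trained policy
    obs, reward, done, info = env.step(action)  # reward is always 0: no reward function is provided
    # obs["pov"] holds the pixel observations the agent must act from.
    if done:
        break

env.close()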

Advantages of BASALT

BASALT has a number of advantages over existing benchmarks like MuJoCo and Atari:

Many reasonable goals. People do a lot of things in Minecraft: perhaps you
want to defeat the Ender Dragon while others try to stop you, or build a giant
floating island chained to the ground, or produce more stuff than you will ever
need. This is a particularly
important property for a benchmark where the point is to figure out what to
do: it means that human feedback is critical in identifying which task the
agent must perform out of the many, many tasks that are possible in
principle.

Existing benchmarks mostly do not satisfy this property:

  1. In some Atari games, if you do anything other than the intended gameplay, you
    die and reset to the initial state, or you get stuck. As a result, even pure
    curiosity-based agents do well on Atari.
  2. Similarly in MuJoCo, there is not much that any given simulated robot can
    do. Unsupervised skill learning methods will frequently learn policies that
    perform well on the true reward: for example,
    DADS learns locomotion policies for MuJoCo
    robots that would get high reward, without using any reward information or human
    feedback.

In contrast, there is effectively no chance of such an unsupervised method
solving BASALT tasks. When testing your algorithm with BASALT, you don’t have to
worry about whether your algorithm is secretly learning a heuristic like
curiosity that wouldn’t work in a more realistic setting.

In Pong, Breakout and Space Invaders, you either play towards winning the game,
or you die.

In Minecraft, you could battle the Ender Dragon, farm peacefully, practice
archery, and more.

Large amounts of diverse data. Recent
work has demonstrated the value of large
generative models trained on huge, diverse datasets. Such models may offer a
path forward for specifying tasks: given a large pretrained model, we can
“prompt” the model with an input such that the model then generates the
solution to our task. BASALT is an excellent test suite for such an approach,
as there are thousands of hours of Minecraft gameplay on YouTube.

In contrast, there is not much easily available diverse data for Atari or
MuJoCo. While there may be videos of Atari gameplay, in most cases these are all
demonstrations of the same task. This makes them less suitable for studying the
approach of training a large model with broad knowledge and then “targeting” it
towards the task of interest.

Robust evaluations. The environments and reward functions used in current
benchmarks have been designed for reinforcement learning, and so often include
reward shaping or termination conditions that make them unsuitable for
evaluating algorithms that learn from human feedback. It is often possible to
get surprisingly good performance with hacks that would never work in a
realistic setting. As an extreme example, Kostrikov et al. show that when
initializing the GAIL discriminator to a constant value (implying the constant
reward $R(s,a) = \log 2$), they reach 1000 reward on Hopper, corresponding to
about a third of expert performance – but the resulting policy stays still and
doesn’t do anything!
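
To see where that constant comes from: under the commonly used GAIL reward formulation, a discriminator frozen at an uninformative output of $D(s,a) = 1/2$ yields the same positive reward at every timestep, i.e. a pure survival bonus:

$$r(s,a) = -\log\bigl(1 - D(s,a)\bigr) = -\log\tfrac{1}{2} = \log 2 \quad \text{for all } (s,a),$$

so a policy can accumulate reward simply by keeping the episode going, which standing still accomplishes.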

In contrast, BASALT uses human evaluations, which we expect to be far more
robust and harder to “game” in this way. If a human saw the Hopper staying still
and doing nothing, they would correctly assign it a very low score, since it is
clearly not progressing towards the intended goal of moving to the right as fast
as possible.

No holds barred. Benchmarks often have some strategies that are implicitly
not allowed because they would “solve” the benchmark without actually solving
the underlying problem of interest. For example, there is
controversy over
whether algorithms should be allowed to rely on determinism in Atari, as many
such solutions would likely not work in more realistic settings.

However, this is an effect to be minimized as much as possible: inevitably, the
ban on strategies will not be perfect, and will likely exclude some strategies
that really would have worked in realistic settings. We can avoid this problem
by having particularly challenging tasks, such as playing Go or building
self-driving cars, where any method of solving the task would be impressive and
would imply that we had solved a problem of interest. Such benchmarks are “no
holds barred”: any approach is acceptable, and thus researchers can focus
entirely on what leads to good performance, without having to worry about
whether their solution will generalize to other real world tasks.

BASALT does not quite reach this level, but it is close: we only ban strategies
that access internal Minecraft state. Researchers are free to hardcode
particular actions at particular timesteps, or ask humans to provide a novel
type of feedback, or train a large generative model on YouTube data, etc. This
enables researchers to explore a much larger space of potential approaches to
building useful AI agents.

Harder to “teach to the test”. Suppose Alice is training an imitation
learning algorithm on HalfCheetah, using 20 demonstrations. She suspects that
some of the demonstrations are making it hard to learn, but doesn’t know which
ones are problematic. So, she runs 20 experiments. In the ith experiment, she
removes the ith demonstration, runs her algorithm, and checks how much reward
the resulting agent gets. From this, she realizes she should remove
trajectories 2, 10, and 11; doing this gives her a 20% boost.

The problem with Alice’s approach is that she wouldn’t be able to use this
strategy in a real-world task, because in that case she can’t simply “check how
much reward the agent gets” – there isn’t a reward function to check! Alice is
effectively tuning her algorithm to the test, in a way that wouldn’t generalize
to realistic tasks, and so the 20% boost is illusory.

While researchers are unlikely to exclude specific data points in this way, it
is common to use the test-time reward as a way to validate the algorithm and
to tune hyperparameters, which can have the same effect. This paper quantifies
a similar effect in few-shot
learning with large language models, and finds that previous few-shot learning
claims were significantly overstated.

BASALT ameliorates this problem by not having a reward function in the first
place. It is of course still possible for researchers to teach to the test
even in BASALT, by running many human evaluations and tuning the algorithm based
on these evaluations, but the scope for this is greatly reduced, since it is far
more costly to run a human evaluation than to check the performance of a trained
agent on a programmatic reward.

Note that this does not prevent all hyperparameter tuning. Researchers can still
use other strategies (that are more reflective of realistic settings), such as:

  1. Running preliminary experiments and looking at proxy metrics. For example,
    with behavioral cloning (BC), we could perform hyperparameter tuning to reduce
    the BC loss (a minimal sketch of this follows the list below).
  2. Designing the algorithm using experiments on environments which do have
    rewards (such as the MineRL Diamond environments).
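
As a concrete illustration of the first strategy, one could hold out a few demonstrations and pick hyperparameters by validation BC loss, with no reward involved; train_bc and bc_loss below are stand-ins for whatever behavioral cloning implementation is being tuned:

# Hypothetical sketch: tuning a BC learning rate against a held-out-demonstration
# loss (a proxy metric), rather than against any test-time reward.
def select_learning_rate(demos, candidate_lrs, train_bc, bc_loss):
    # Hold out the last few trajectories as a validation set.
    train_demos, val_demos = demos[:-5], demos[-5:]
    best_lr, best_val_loss = None, float("inf")
    for lr in candidate_lrs:
        policy = train_bc(train_demos, learning_rate=lr)  # user-supplied trainer
        val_loss = bc_loss(policy, val_demos)             # user-supplied loss
        if val_loss < best_val_loss:
            best_lr, best_val_loss = lr, val_loss
    return best_lr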

Easily available experts. Domain experts can usually be consulted when an AI
agent is built for real-world deployment. For example, the NET-VISA system used
for global seismic monitoring was built with relevant domain knowledge provided by
geophysicists. It would thus be useful to investigate techniques for building
AI agents when expert help is available.

Minecraft is well suited for this because it is extremely popular, with over
100 million active players. In addition, many of its properties are easy to
understand: for example, its tools have similar functions to real world tools,
its landscapes are somewhat realistic, and there are easily understandable goals
like building shelter and acquiring enough food to not starve. We ourselves have
hired Minecraft players both through Mechanical Turk and by recruiting Berkeley
undergrads.

Building towards a long-term research agenda. While BASALT currently focuses
on short, single-player tasks, it is set in a world that contains many avenues
for further work to build general, capable agents in Minecraft. We envision
eventually building agents that can be instructed to perform arbitrary
Minecraft tasks in natural language on public multiplayer servers, or that can
infer what large-scale project human players are working on and assist with
those projects, while adhering to the norms and customs followed on that
server.

Can we build an agent that can help recreate Middle Earth on MCME (left), and also play Minecraft
on the anarchy server 2b2t
(right) on which large-scale destruction of property (“griefing”) is the norm?

Interesting research questions

Since BASALT is quite different from past benchmarks, it allows us to study a
wider variety of research questions than we could before. Here are some
questions that seem particularly interesting to us:

  1. How do various feedback modalities compare to each other? When should each
    one be used? For example, current practice tends to train on demonstrations
    initially and preferences later. Should other feedback modalities be integrated
    into this practice?
  2. Are corrections an effective technique for focusing the agent on rare but
    important actions? For example, vanilla behavioral cloning on MakeWaterfall leads
    to an agent that moves near waterfalls but doesn’t create waterfalls of its own,
    presumably because the “place waterfall” action is such a tiny fraction of the
    actions in the demonstrations. Intuitively, we would like a human to “correct”
    these problems, e.g. by specifying when in a trajectory the agent should have
    taken a “place waterfall” action. How should this be implemented, and how
    powerful is the resulting technique? (The past work we are aware of does not
    seem directly applicable, though we have not done a thorough literature review.)
  3. How can we best leverage domain expertise? If for a given task, we have (say)
    five hours of an expert’s time, what is the best use of that time to train a
    capable agent for the task? What if we have a hundred hours of expert time
    instead?
  4. Would the “GPT-3 for Minecraft” approach work well for BASALT? Is it
    sufficient to simply prompt the model appropriately? For example, a sketch of
    such an approach (a hypothetical code skeleton follows this list) would be:

    • Create a dataset of YouTube videos paired with their automatically
      generated captions, and train a model that predicts the next video frame from
      previous video frames and captions.
    • Train a policy that takes actions which lead to observations predicted by
      the generative model (effectively learning to imitate human behavior,
      conditioned on previous video frames and the caption).
    • Design a “caption prompt” for each BASALT task that induces the policy to
      solve that task.
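
Purely as a hypothetical skeleton of that data flow (every function below is a stand-in for a substantial research effort, not an existing API):

# Hypothetical skeleton of the “prompt a large generative model” approach.
def build_dataset(youtube_videos):
    # Pair each video's frames with its automatically generated captions.
    return [(video.frames, video.captions) for video in youtube_videos]

def train_video_model(dataset):
    # Predict the next video frame from previous frames and the caption.
    ...

def train_policy(video_model):
    # Learn actions whose resulting observations match the model's predictions,
    # i.e. imitate human behavior conditioned on previous frames and the caption.
    ...

def solve_basalt_task(policy, caption_prompt):
    # e.g. caption_prompt = "building a waterfall and taking a picture of it"
    return policy.run(prompt=caption_prompt)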

FAQ

If there are really no holds barred, couldn’t participants record themselves
completing the task, and then replay those actions at test time?

Participants wouldn’t be able to use this strategy because we keep the seeds of
the test environments secret. More generally, while we allow participants to
use, say, simple nested-if strategies, Minecraft worlds are sufficiently random
and diverse that we expect that such strategies won’t have good performance,
especially given that they have to work from pixels.

Won’t it take far too long to train an agent to play Minecraft? After all, the
Minecraft simulator must be really slow relative to MuJoCo or Atari.

We designed the tasks to be in the realm of difficulty where it should be
feasible to train agents on an academic budget. Our behavioral cloning baseline
trains in a couple of hours on a single GPU. Algorithms that require environment
simulation like GAIL will take longer, but we expect that a day or two of
training will be enough to get decent results (during which you can get a few
million environment samples).

Won’t this competition just reduce to “who can get the most compute and human
feedback”?

We impose limits on the amount of compute and human feedback that submissions
can use to prevent this scenario. We will retrain the models of any potential
winners using these budgets to verify adherence to this rule.

Conclusion

We hope that BASALT will be used by anyone who aims to learn from human
feedback, whether they are working on imitation learning, learning from
comparisons, or some other method. It mitigates many of the issues with the
standard benchmarks used in the field. The current baseline has lots of obvious
flaws, which we hope the research community will soon fix.

Note that, so far, we have worked on the competition version of BASALT. We aim
to release the benchmark version shortly. You can get started now by simply
installing MineRL from pip and loading up the BASALT
environments. The code to run your own human evaluations will be added in the
benchmark release.

If you would like to use BASALT in the very near future and would like beta
access to the evaluation code, please email the lead organizer, Rohin Shah, at
rohinmshah@berkeley.edu.

This post is based on the paper “The MineRL BASALT Competition on Learning from
Human Feedback”, accepted at the NeurIPS 2021 Competition Track. Sign up to
participate in the competition!

