Model-based reinforcement learning (MBRL) is a variant of the iterative
learning framework, reinforcement learning, that includes a structured
component of the system that is solely optimized to model the environment
dynamics. Learning a model is broadly motivated from biology, optimal control,
and more – it is grounded in natural human intuition of planning before acting. This intuitive
grounding, however, results in a more complicated learning process. In this
post, we discuss how model-based reinforcement learning is more susceptible to
parameter tuning and how AutoML can help in finding very well performing
parameter settings and schedules. Below, left is the expected behavior of an
agent maximizing velocity on a “Half Cheetah” robotic task, and to the right is
what our paper with hyperparameter tuning finds.

MBRL

Model-based reinforcement learning (MBRL) is an iterative framework for solving
tasks in a partially understood environment. There is an agent that repeatedly
tries to solve a problem, accumulating state and action data. With that data,
the agent creates a structured learning tool – a dynamics model – to reason
about the world. With the dynamics model, the agent decides how to act by
predicting into the future. With those actions, the agent collects more data,
improves said model, and hopefully improves future actions.

AutoML

Humans are pretty poor at internalizing higher-dimensional relationships.
Unfortunately, all ML systems come with hyperparameters that have complex
higher-dimensional relationships. Manually searching for configurations or
schedules that work well is a tedious and unrewarding task, so let’s let a
computer do it for us. Automated Machine Learning (AutoML) is a field
dedicated to the study of using machine learning algorithms to tune our machine
learning tools. However, there have not been many attempts in using AutoML
methods for RL so far, (for more on AutoRL see this blog post) even
though, given the success of AutoML in Supervised Learning, one could expect a
bigger impact in higher-dimensional RL. This is partly due to the harder
problem of dynamic hyperparameter tuning (where hyperparameters can change
within a run), but more on that later.

Why MBRL is challenging to tune

Why may we see even more impact of AutoML on MBRL when compared to model-freel RL?
First off, more machine learning parts equals a harder problem,
so there is a greater potential to make hyperparameter tuning way more
impactful to MBRL. Normally, a graduate student fine tunes one problem at a
time with a handful of variables, but in MBRL there are two weirdly intertwined
systems: the model and the planner (the objectives are mismatched). So, no human is likely to be able
to find the perfect parameters. Much research progress, still, is yielding to
luck when it comes to finding the right hyperparameters. Even though it is an
open problem to see if computers can find the true optimum, as mentioned
before, computers are much better at optimizing in high-dimensional spaces such
as RL.

Secondly, the benchmark where most RL algorithms are tested in recent work –
a simulator called Mujoco – has been used for years and performance of
the best algorithms is close to maxing out the realistic behaviors in the
simulator. Such a realistic solution for Half Cheetah can be seen in the video
on the left at the beginning of the post.

Mujoco showed up as a favorite of Deep RL because it was available when a
massive growth phase was coming through. It is a decent simulator, relatively
lightweight, and easy-enough to use (though, Mujoco is expensive, has fairly
strict licensing, and not a completely accurate portrayal of the real world).
Mujoco has worked for individual researchers and teams growing into a new area,
but not necessarily great for the long-term health of the field. Researchers
have tracked the so-called state-of-the-art (SOTA) performance of algorithms
across the group of tasks available on the benchmark. This has led to relying
on this benchmark as a surrogate objective for what humans truly want to
optimize – performance in the real world – and left progress in the research
field vulnerable to being gamed. And in such problems, where optimal
solutions may be unintuitive, computers are even more likely to substantially
outperform humans.

Static vs. Dynamic Parameter Tuning

Reinforcement learning, or any iterative framework for that matter, poses an
interesting challenge for AutoML and parameter tuning research: the best
parameters might change over time. The ideal parameters shift because the
data used to train any model changes over time. This is different from a more
classical approach to AutoML – on a Supervised Learning task, static
parameter configurations are much more likely to perform well (such as the
weights and biases of a deployed vision model).

When the distribution of the data we are using changes over time, we look to
dynamic hyperparameter tuning where the hyperparameters of the model or
optimizer are adapted over time. In the case of MBRL, this can have an elegant
interpretation: as the agent gets more data, it can train a more accurate
model, and then it may want to use that model to plan further into the future.
This would translate to a dynamic tuning of the model predictive horizon
hyperparameter variable used in the action optimization. Static hyperparameters –
which most RL algorithms report in tables in the appendix of their papers –
are likely not going to be able to deal with shifting distributions. The
algorithm used to demonstrate this is Hyperband (more later in the post). It is
a static tuning method and the chosen parameters show a low correlation with
reward across a longer run or to another task. For more details on the
specifics of static tuning, see the paper, but the crucial question is: how
much can an RL agent gain with dynamics tuning. Normally, this gain is measured
on the performance of an algorithm on one task.

This finding, however, does not mean that static tuning does not have its uses.
In this paper, we further studied what aspects of static and dynamic are most
important for a practitioner depending on what they want to achieve –
transferability (across runs on the same task or across tasks) or final
performance. We showed that static parameter configurations learn
hyperparameter settings that are more robust to transferring. By design,
dynamic configurations make many more choices about the parameter settings than
static configurations, thus making it very challenging to tune dynamic
configurations by hand and without automatic HPO methods. With an increasing
number of decision points, it becomes more and more likely that each choice we
make is specifically tailored to the environment and even the current run at
hand.

Weak or even negative correlation across different budgets are found in most of
the environments with static and transferred tunings, which shows different
configurations perform best for different training durations and that dynamic
tuning may be needed. Above shows the Spearman rank correlation of
hyperparameters sampled by Hyperband across trials when training. $cor$ is the
correlation coefficient, $p$ is the p-value for the hypothesis test and $n$ gives
the number of configurations trained in both fidelities.

Breaking the simulator

Combining AutoML with MBRL dramatically beats the SOTA results on a couple of
Mujoco tasks by using existing MBRL algorithms with both dynamic and static
parameter tuning. With sufficient hyperparameter tuning, the MBRL algorithm
(PETS) literally breaks Mujoco. The Mujoco log reads something akin to:

WARNING: Nan, Inf or huge value in QACC at DOF 0. The simulation is unstable. Time = 40.4200.

This correlates to the task being solved in a non-intuitive, exploitative
manner: the Half Cheetah cartwheels.

Normally, the Half Cheetah is supposed to run. However, through hyperparameter
tuning, the agent can experience that stumbling and flipping over can be
converted into a successful cartwheel. Chaining such cartwheels back to back
allows it to build up much more speed than otherwise possible. So much, in
fact, that the simulator can’t keep up and breaks. This shows that we have not
yet explored the full potential of existing agent implementations and that
AutoML can be a key component in doing so. The paper has a much, much wider
range of results on multiple environments, but I leave that to the reader. The
paper has interesting tradeoffs between optimizing the model (learning the
dynamics) and the controller (solving the reward-maximization problem).
Additionally, it shows how dynamically changing the parameters throughout a
trial can be useful, such as increasing your model horizon as the algorithm
collects data and the model becomes more accurate. We analyse in-depth the
impact of design decisions on the following tuning methods:

Population Based Training (PBT): An evolutionary approach to hyperparameter tuning, where the best performing members are modified and replace worse performing configurations (link).
Population Based Training with Backtracking (PBT-BT): Population-based training where the agents can return to past configurations during the learning process.
Hyperband: A Bandit-Based Approach to Hyperparameter Optimization (link).
Random Search: A method where configurations are generated within the hyperparameter search space (example).

Conclusion

Ultimately, this is a paper that can really move the field forward by showing
the ceiling is much higher than expected for current deep RL algorithms and
that new benchmarking tasks are needed to facilitate continued development of
the research area. Knowing that the last generation of MBRL algorithms still
has substantial performance to be cultivated motivates more interesting
numerical research revisiting past methods as the future is designed.

As the Mujoco simulator seems to be reaching its maximum in terms of logged
performance, it is time for a new generation of simulators and tasks to
benchmark developments in deep reinforcement learning. In creating new tasks,
there is an exciting opportunity to move our research methods closer to real
world applications — providing a harder challenge with a growing potential
payoff.

This post is based on the following paper:

On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning
Baohe Zhang, Raghu Rajan, Luis Pineda, Nathan Lambert, André Biedenkapp, Kurtland Chua, Frank Hutter, Roberto Calandra
Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
A video explaining the paper

Vedere AI

The Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

MBRL

AutoML

Why MBRL is challenging to tune

Static vs. Dynamic Parameter Tuning

Breaking the simulator

Conclusion

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.