Improving Verifiability in AI Development

We’ve contributed to a multi-stakeholder report by 58 co-authors at 30 organizations, including the Centre for the Future of Intelligence, Mila, the Schwartz Reisman Institute for Technology and Society, the Center for Advanced Study in the Behavioral Sciences, and the Center for Security and Emerging Technology. This report describes 10 mechanisms to improve the verifiability of claims made about AI systems. Developers can use these tools to provide evidence that AI systems are safe, secure, fair, or privacy-preserving. Users, policymakers, and civil society can use these tools to evaluate AI development processes.

Read Report

While a growing number of organizations have articulated ethics principles to guide their AI development process, it can be difficult for those outside of an organization to verify whether the organization’s AI systems reflect those principles in practice. This ambiguity makes it harder for stakeholders such as users, policymakers, and civil society to scrutinize AI developers’ claims about properties of AI systems and could fuel competitive corner-cutting, increasing social risks and harms. The report describes existing and potential mechanisms that can help stakeholders grapple with questions like:

  • Can I (as a user) verify the claims made about the level of privacy protection guaranteed by a new AI system I’d like to use for machine translation of sensitive documents?
  • Can I (as a regulator) trace the steps that led to an accident caused by an autonomous vehicle? Against what standards should an autonomous vehicle company’s safety claims be compared?
  • Can I (as an academic) conduct impartial research on the risks associated with large-scale AI systems when I lack the computing resources of industry?
  • Can I (as an AI developer) verify that my competitors in a given area of AI development will follow best practices rather than cut corners to gain an advantage?

The 10 mechanisms highlighted in the report are listed below, along with recommendations aimed at advancing each one. (See the report for discussion of how these mechanisms support verifiable claims as well as relevant caveats about our findings.)

Institutional Mechanisms and Recommendations

  1. Third party auditing. A coalition of stakeholders should create a task force to research options for conducting and funding third party auditing of AI systems.
  2. Red teaming exercises. Organizations developing AI should run red teaming exercises to explore risks associated with systems they develop, and should share best practices and tools.
  3. Bias and safety bounties. AI developers should pilot bias and safety bounties for AI systems to strengthen incentives and processes for broad-based scrutiny of AI systems.
  4. Sharing of AI incidents. AI developers should share more information about AI incidents, including through collaborative channels.

Software Mechanisms and Recommendations

  1. Audit trails. Standard setting bodies should work with academia and industry to develop audit trail requirements for safety-critical applications of AI systems.
  2. Interpretability. Organizations developing AI and funding bodies should support research into the interpretability of AI systems, with a focus on supporting risk assessment and auditing.
  3. Privacy-preserving machine learning. AI developers should develop, share, and use suites of tools for privacy-preserving machine learning that include measures of performance against common standards.

Hardware Mechanisms and Recommendations

  1. Secure hardware for machine learning. Industry and academia should work together to develop hardware security features for AI accelerators or otherwise establish best practices for the use of secure hardware (including secure enclaves on commodity hardware) in machine learning contexts.
  2. High-precision compute measurement. One or more AI labs should estimate the computing power involved in a single project in great detail and report on lessons learned regarding the potential for wider adoption of such methods.
  3. Compute support for academia. Government funding bodies should substantially increase funding for computing power resources for researchers in academia, in order to improve the ability of those researchers to verify claims made by industry.

We and our co-authors will be doing further research on these mechanisms, and OpenAI will look to adopt several of them in the future. We hope this report inspires meaningful dialogue, and we are eager to discuss additional institutional, software, and hardware mechanisms that could help enable trustworthy AI development. We encourage anyone interested in collaborating on these issues to connect with the corresponding authors and visit the report website.

Read Report

Report Authors
(Equal contribution)
  • Gillian Hadfield (OpenAI, University of Toronto, Schwartz Reisman Institute for Technology and Society)
  • Heidy Khlaaf (Adelard)
  • Jingying Yang (Partnership on AI)
  • Helen Toner (Center for Security and Emerging Technology)
  • Ruth Fong (University of Oxford)
  • Tegan Maharaj (Mila, Montreal Polytechnic)
  • Pang Wei Koh (Stanford University)
  • Sara Hooker (Google Brain)
  • Jade Leung (Future of Humanity Institute)
  • Andrew Trask (University of Oxford)
  • Emma Bluemke (University of Oxford)
  • Jonathan Lebensold (Mila, McGill University)
  • Cullen O’Keefe (OpenAI)
  • Mark Koren (Stanford Centre for AI Safety)
  • Théo Ryffel (École Normale Supérieure, Paris)
  • JB Rubinovitz (Remedy.AI)
  • Tamay Besiroglu (University of Cambridge)
  • Federica Carugati (Center for Advanced Study in the Behavioral Sciences)
  • Jack Clark (OpenAI)
  • Peter Eckersley (Partnership on AI)
  • Sarah de Haas (Google Research)
  • Maritza Johnson (Google Research)
  • Ben Laurie (Google Research)
  • Alex Ingerman (Google Research)
  • Igor Krawczuk (École Polytechnique Fédérale de Lausanne)
  • Amanda Askell (OpenAI)
  • Rosario Cammarota (Intel)
  • Andrew Lohn (RAND Corporation)
  • David Krueger (Mila, Montreal Polytechnic)
  • Charlotte Stix (Eindhoven University of Technology)
  • Peter Henderson (Stanford University)
  • Logan Graham (University of Oxford)
  • Carina Prunkl (Future of Humanity Institute)
  • Bianca Martin (OpenAI)
  • Elizabeth Seger (University of Cambridge)
  • Noa Zilberman (University of Oxford)
  • Seán Ó hÉigeartaigh (Leverhulme Centre for the Future of Intelligence, Centre for the Study of Existential Risk)
  • Frens Kroeger (Coventry University)
  • Girish Sastry (OpenAI)
  • Rebecca Kagan (Center for Security and Emerging Technology)
  • Adrian Weller (University of Cambridge, Alan Turing Institute)
  • Brian Tse (Future of Humanity Institute, Partnership on AI)
  • Elizabeth Barnes (OpenAI)
  • Allan Dafoe (Future of Humanity Institute)
  • Paul Scharre (Center for a New American Security)
  • Ariel Herbert-Voss (OpenAI)
  • Martijn Rasser (Center for a New American Security)
  • Shagun Sodhani (Mila, University of Montreal)
  • Carrick Flynn (Center for Security and Emerging Technology)
  • Thomas Gilbert (University of California, Berkeley)
  • Lisa Dyer (Partnership on AI)
  • Saif Khan (Center for Security and Emerging Technology)
  • Yoshua Bengio (Mila, University of Montreal)
  • Markus Anderljung (Future of Humanity Institute)


OpenAI Microscope

We’re introducing OpenAI Microscope, a collection of visualizations of every significant layer and neuron of eight vision “model organisms” which are often studied in interpretability. Microscope makes it easier to analyze the features that form inside these neural networks, and we hope it will help the research community as we move towards understanding these complicated systems.

Browse Microscope

The abilities of modern neural networks are the result of the interactions of thousands of neurons (sometimes tens of thousands or more!). In order to understand their behavior, we’d like to be able to quickly and easily investigate these neurons’ interactions in detail, and share those observations. This is especially true in collaborative environments. For instance, one researcher might speculate:

InceptionV1 4c:447 is a car detector which is built from a wheel detector (4b:373) and a window detector (4b:237).

When someone makes a claim like this, it’s useful if others can quickly explore those neurons, evaluating the claim and discovering new things. This is the goal of the OpenAI Microscope.
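
For readers who want to try this kind of exploration programmatically, here is a minimal sketch using the lucid library (the same library used to generate Microscope’s visualizations). The layer name "mixed4c_pre_relu" and channel index 447 are assumptions meant to mirror “4c:447”; treat them as placeholders and check Microscope itself for the exact identifiers.

# Feature visualization of a single InceptionV1 channel with lucid (sketch).
import lucid.optvis.render as render
from lucid.modelzoo.vision_models import InceptionV1

model = InceptionV1()
model.load_graphdef()

# Optimize an input image that maximally activates channel 447 of layer mixed4c.
# The "mixed4c_pre_relu:447" identifier is an assumption standing in for "4c:447".
images = render.render_vis(model, "mixed4c_pre_relu:447")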


Microscope systematically visualizes every neuron in several commonly studied vision models, and makes all of those neurons linkable. We hope this will support the interpretability community in several ways:

  1. Although these models and visualizations are already open source (we help maintain the lucid library, which is used to generate all the visualizations in Microscope), visualizing neurons is tedious. Microscope changes the feedback loop of exploring neurons from minutes to seconds. This quick feedback loop has been essential for us in discovering unexpected features like high-low frequency detectors in the ongoing circuits project.
  2. Making models and neurons linkable allows immediate scrutiny and further exploration of research making claims about those neurons. It also removes potential confusion about which model and neuron is being discussed (which of the five versions of InceptionV1 are we talking about again?). This is really helpful for collaboration, especially when researchers are at different institutions.
  3. One of the wonderful things about interpretability as an area of ML is how accessible it is. Compared to many other areas, it requires comparatively little access to compute. But systematically visualizing neural networks can still take hundreds of GPU hours. We hope that, by sharing our visualizations, we can help keep interpretability highly accessible.

Just as biologists often focus on the study of a few “model organisms,” Microscope focuses on exploring a small number of models in detail. Our initial release includes nine frequently studied vision models, along with several visualization techniques we’ve found particularly useful in studying them. We plan to expand to other models and techniques in the coming months.

We’re excited to see how the community will use Microscope, and we encourage you to reuse these assets. In particular, we think it has a lot of potential in supporting the Circuits collaboration—a project to reverse engineer neural networks by analyzing individual neurons and their connections—or similar work.

Browse Microscope


OpenAI Standardizes on PyTorch

We are standardizing OpenAI’s deep learning framework on PyTorch. In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models.


As part of this move, we’ve just released a PyTorch-enabled version of Spinning Up in Deep RL, an open-source educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning. We are also in the process of writing PyTorch bindings for our highly-optimized blocksparse kernels, and will open-source those bindings in upcoming months.

The main reason we’ve chosen PyTorch is to increase our research productivity at scale on GPUs. It is very easy to try and execute new research ideas in PyTorch; for example, switching to PyTorch decreased our iteration time on research ideas in generative modeling from weeks to days. We’re also excited to be joining a rapidly-growing developer community, including organizations like Facebook and Microsoft, in pushing scale and performance on GPUs.
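
For readers less familiar with the framework, the basic PyTorch workflow is compact; below is a generic sketch of a single training step (illustrative only, not code from any OpenAI project).

import torch
import torch.nn as nn

# A small model, an optimizer, and one define-forward-backward-update step.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # a dummy batch of data
optimizer.zero_grad()
loss = loss_fn(model(x), y)   # forward pass
loss.backward()               # backward pass
optimizer.step()              # parameter update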

Going forward, we’ll primarily use PyTorch as our deep learning framework, but we’ll sometimes use other frameworks when there’s a specific technical reason to do so. Many of our teams have already made the switch, and we look forward to contributing to the PyTorch community in upcoming months.


Deep Double Descent

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.


Read Paper

Many classes of modern deep learning models, including CNNs, ResNets, and transformers, exhibit the previously observed double descent phenomenon when not using early stopping or regularization. The peak occurs predictably at a “critical regime,” where the models are barely able to fit the training set. As we increase the number of parameters in a neural network, the test error initially decreases, then increases, and, just as the model becomes able to fit the train set, undergoes a second descent.

Neither the classical statisticians’ conventional wisdom that larger models are worse, nor the modern ML paradigm that bigger models are always better, holds up. We find that double descent also occurs over training epochs. Surprisingly, we show these phenomena can lead to a regime where more data hurts: training a deep network on a larger train set can actually perform worse.

Model-wise double descent

1. There is a regime where bigger models are worse.

[Chart: test error as a function of model size, peaking at the interpolation threshold.]

The model-wise double descent phenomenon can lead to a regime where training on more data hurts. In the chart above, the peak in test error occurs around the interpolation threshold, when the models are just barely large enough to fit the train set.

In all cases we’ve observed, changes which affect the interpolation threshold (such as changing the optimization algorithm, the number of train samples, or the amount of label noise) also shift the location of the test error peak correspondingly. The double descent phenomenon is most prominent in settings with added label noise; without it, the peak is smaller and easy to miss. Adding label noise amplifies this general behavior and lets us investigate it easily.
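
To make the experimental setup concrete, here is a hedged sketch of how label noise can be injected into a training set before running a model-size sweep. The function name, the noise probability, and the sweep are illustrative assumptions, not the exact configuration from the paper.

import random

def add_label_noise(labels, noise_prob=0.15, num_classes=10, seed=0):
    """Replace each label with a uniformly random class with probability noise_prob."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) if rng.random() < noise_prob else y
            for y in labels]

# Model-wise sweep (sketch): train one model per width and record its test error.
# train_and_evaluate is a hypothetical placeholder for an ordinary training loop.
# for width in [1, 2, 4, 8, 16, 32, 64]:
#     test_error[width] = train_and_evaluate(width, noisy_labels)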

Sample-wise non-monotonicity

2. There is a regime where more samples hurt.

[Chart: test error for transformers of varying size trained on a language-translation task, shown for different numbers of training samples.]

The above chart shows transformers trained on a language-translation task with no added label noise. As expected, increasing the number of samples shifts the curve downwards towards lower test error. However, since more samples require larger models to fit, increasing the number of samples also shifts the interpolation threshold (and peak in test error) to the right.

For intermediate model sizes (red arrows), these two effects combine, and we see that training on 4.5x more samples actually hurts test performance.

Epoch-wise double descent

3. There is a regime where training longer reverses overfitting.

[Charts: test error and train error as a function of model size and number of optimization steps.]

The charts above show test and train error as a function of both model size and number of optimization steps. For a given number of optimization steps (fixed y-coordinate), test and train error exhibit model-size double descent. For a given model size (fixed x-coordinate), as training proceeds, test and train error decrease, increase, and decrease again; we call this phenomenon epoch-wise double descent.

In general, the peak of test error appears systematically when models are just barely able to fit the train set.

Our intuition is that, for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or misspecified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set and there exist such good models. Moreover, the implicit bias of stochastic gradient descent (SGD) leads it to such good models, for reasons we don’t yet understand.

We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.


Acknowledgments

Thanks to Mikhail Belkin and Chris Olah for helpful discussions and feedback throughout this work. An expanded version of this post can also be found on Boaz Barak’s blog, Windows on Theory.


Procgen Benchmark

We’re releasing Procgen Benchmark, 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns generalizable skills.

Paper · Environment Code · Training Code

The 16 environments: CoinRun, StarPilot, CaveFlyer, Dodgeball, FruitBot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, and BossFight.

Getting started

Using the environment is easy whether you’re a human or AI:

$ pip install procgen # install
$ python -m procgen.interactive --env-name starpilot # human
$ python <<EOF # random AI agent
import gym
env = gym.make('procgen:procgen-coinrun-v0')
obs = env.reset()
while True:
    obs, rew, done, info = env.step(env.action_space.sample())
    env.render()
    if done:
        break
EOF

We’ve found that all of the Procgen environments require training on 500–1000 different levels before they can generalize to new levels, which suggests that standard RL benchmarks need much more diversity within each environment. Procgen Benchmark has become the standard research platform used by the OpenAI RL team, and we hope it will accelerate the community’s progress in creating better RL algorithms.

Environment diversity is key

In several environments, it has been observed that agents can overfit to remarkably large training sets. This evidence raises the possibility that overfitting pervades classic benchmarks like the Arcade Learning Environment, which has long served as a gold standard in reinforcement learning (RL). While the diversity between different games in the ALE is one of the benchmark’s greatest strengths, the low emphasis on generalization presents a significant drawback. In each game the question must be asked: are agents robustly learning a relevant skill, or are they approximately memorizing specific trajectories?

CoinRun was designed to address precisely this issue, by using procedural generation to construct distinct sets of training levels and test levels. While CoinRun has helped us better quantify generalization in RL, it is still only a single environment. It’s likely that CoinRun is not fully representative of the many challenges RL agents must face. We want the best of both worlds: a benchmark comprised of many diverse environments, each of which fundamentally requires generalization. To fulfill this need, we have created Procgen Benchmark. CoinRun now serves as the inaugural environment in Procgen Benchmark, contributing its diversity to a greater whole.

Previous work, including the Obstacle Tower Challenge and the General Video Game AI framework, has also encouraged using procedural generation to better evaluate generalization in RL. We’ve designed environments in a similar spirit, with two Procgen environments drawing direct inspiration from GVGAI-based work. Other environments like Dota and StarCraft also provide lots of per-environment complexity, but these environments are hard to rapidly iterate with (and it’s even harder to use more than one such environment at a time). With Procgen Benchmark, we strive for all of the following: experimental convenience, high diversity within environments, and high diversity across environments.

Procgen Benchmark

Procgen Benchmark consists of 16 unique environments designed to measure both sample efficiency and generalization in reinforcement learning. This benchmark is ideal for evaluating generalization since distinct training and test sets can be generated in each environment. This benchmark is also well-suited to evaluate sample efficiency, since all environments pose diverse and compelling challenges for RL agents. The environments’ intrinsic diversity demands that agents learn robust policies; overfitting to narrow regions in state space will not suffice. Put differently, the ability to generalize becomes an integral component of success when agents are faced with ever-changing levels.

Design principles

We’ve designed all Procgen environments to satisfy the following criteria:

  • High Diversity: Environment generation logic is given maximal freedom, subject to basic design constraints. The diversity in the resulting level distributions presents agents with meaningful generalization challenges.

  • Fast Evaluation: Environment difficulty is calibrated such that baseline agents make significant progress after training for 200M timesteps. Moreover, the environments are optimized to perform thousands of steps per second on a single CPU core, enabling a fast experimental pipeline.

  • Tunable Difficulty: All environments support two well-calibrated difficulty settings: easy and hard. While we report results using the hard difficulty setting, we make the easy difficulty setting available for those with limited access to compute power. Easy environments require approximately an eighth of the resources to train.

  • Emphasis on Visual Recognition and Motor Control: In keeping with precedent, environments mimic the style of many Atari and Gym Retro games. Performing well primarily depends on identifying key assets in the observation space and enacting appropriate low-level motor responses.

Evaluating generalization

We came to appreciate how hard RL generalization can be while conducting the Retro Contest, as agents continually failed to generalize from the limited data in the training set. Later, our CoinRun experiments painted an even clearer picture of our agents’ struggle to generalize. We’ve now expanded on those results, conducting our most thorough study of RL generalization to date using all 16 environments in Procgen Benchmark.

We first measured how the size of the training set impacts generalization. In each environment, we generated training sets ranging in size from 100 to 100,000 levels. We trained agents for 200M timesteps on these levels using Proximal Policy Optimization, and we measured performance on unseen test levels.
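
Concretely, distinct training and test sets of levels can be constructed with procgen’s num_levels and start_level options; the sketch below shows one way to do this, though the exact evaluation protocol in the paper may differ (see the training code for the authoritative setup).

import gym

# Training set: 500 procedurally generated levels with seeds [0, 500).
train_env = gym.make("procgen:procgen-coinrun-v0",
                     num_levels=500, start_level=0, distribution_mode="hard")

# Test set: unrestricted levels (num_levels=0) starting from a distant seed,
# so the agent is evaluated on levels it never saw during training.
test_env = gym.make("procgen:procgen-coinrun-v0",
                    num_levels=0, start_level=10000, distribution_mode="hard")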

[Chart: generalization performance in each of the 16 environments; score as a function of training-set size, from 100 to 100k levels (log scale).]

We found that agents strongly overfit to small training sets in almost all environments. In some cases, agents need access to as many as 10,000 levels to close the generalization gap. We also saw a peculiar trend emerge in many environments: past a certain threshold, training performance improves as the training set grows! This runs counter to trends found in supervised learning, where training performance commonly decreases with the size of the training set. We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels. A larger training set can improve training performance if the agent learns to generalize even across levels in the training set. We previously noticed this effect with CoinRun, and have found it often occurs in many Procgen environments as well.

An ablation with deterministic levels

We also conducted a simple ablation study to emphasize the importance of procedural generation. Instead of using a new level at the start of every episode, we trained agents on a fixed sequence of levels. The agent begins each episode on the first level, and when it successfully completes a level, it progresses to the next one. If the agent fails at any point, the episode terminates. The agent can reach arbitrarily many levels, though in practice it rarely progresses beyond the 20th level in any environment.
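
One plausible way to approximate this ablation with the released environments is procgen’s use_sequential_levels option, as in the sketch below; whether this exactly reproduces the paper’s setup is an assumption, so consult the training code for details.

import gym

# Deterministic-sequence ablation (sketch): every episode starts on the same
# level, and completing a level advances to the next one instead of ending
# the episode; failing ends the episode.
env = gym.make("procgen:procgen-coinrun-v0",
               num_levels=1, start_level=0, use_sequential_levels=True)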

[Chart: train and test performance in each of the 16 environments; score over 200M timesteps when training on a fixed sequence of levels.]

At test time, we remove the determinism in the sequence of levels, instead choosing level sequences at random. We find that agents become competent over the first several training levels in most games, giving an illusion of meaningful progress. However, test performance demonstrates that the agents have in fact learned almost nothing about the underlying level distribution. We believe this vast gap between training and test performance is worth highlighting. It reveals a crucial hidden flaw in training on environments that follow a fixed sequence of levels. These results show just how essential it is to use diverse environment distributions when training and evaluating RL agents.

Next steps

We expect many insights gleaned from this benchmark to apply in more complex settings, and we’re excited to use these new environments to design more capable and efficient agents.

If you’re interested in helping develop diverse environments, we’re hiring!


Acknowledgments

Thanks to Marc Bellemare, Julian Togelius, Carles Gelada, Jacob Jackson, Alex Ray, Lilian Weng, and Joshua Achiam for their feedback on the paper.

Thanks to Mira Murati, Brooke Chan, Justin Jay Wang, Greg Brockman, Ashley Pilipiszyn and Jack Clark for their work supporting, designing, writing, and providing feedback on this post.

Special thanks to Kenney for the many high quality game assets used throughout these environments.

Additional thanks to CraftPix.net for several game backgrounds, as well as to GameArtGuppy and ansimuz. All asset licenses can be found here.


Safety Gym

We’re releasing Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training.

We also provide a standardized method of comparing algorithms by how well they avoid costly mistakes while learning. If deep reinforcement learning is applied to the real world, whether in robotics or internet-based tasks, it will be important to have algorithms that are safe even while learning, like a self-driving car that can learn to avoid accidents without actually having to experience them.

Paper · Safety Gym · Safety Starter Agents

Exploration is risky

Reinforcement learning agents need to explore their environments in order to learn optimal behaviors. Essentially, they operate on the principle of trial and error: they try things out, see what works or doesn’t work, and then increase the likelihood of good behaviors and decrease the likelihood of bad behaviors. However, exploration is fundamentally risky: agents might try dangerous behaviors that lead to unacceptable errors. This is the “safe exploration” problem in a nutshell.

Consider an example of an autonomous robot arm in a factory using reinforcement learning (RL) to learn how to assemble widgets. At the start of RL training, the robot might try flailing randomly, since it doesn’t know what to do yet. This poses a safety risk to humans who might be working nearby, since they could get hit.

For restricted examples like the robot arm, we can imagine simple ways to ensure that humans aren’t harmed by just keeping them out of harm’s way: shutting down the robot whenever a human gets too close, or putting a barrier around the robot. But for general RL systems that operate under a wider range of conditions, simple physical interventions won’t always be possible, and we will need to consider other approaches to safe exploration.

Constrained reinforcement learning

The first step towards making progress on a problem like safe exploration is to quantify it: figure out what can be measured, and how going up or down on those metrics gets us closer to the desired outcome. Another way to say it is that we need to pick a formalism for the safe exploration problem. A formalism allows us to design algorithms that achieve our goals.

While there are several options, there is not yet a universal consensus in the field of safe exploration research about the right formalism. We spent some time thinking about it, and the formalism we think makes the most sense to adopt is constrained reinforcement learning.

Constrained RL is like normal RL, but in addition to a reward function that the agent wants to maximize, environments have cost functions that the agent needs to constrain. For example, consider an agent controlling a self-driving car. We would want to reward this agent for getting from point A to point B as fast as possible. But naturally, we would also want to constrain the driving behavior to match traffic safety standards.
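
Written out, the constrained RL problem is usually posed as a constrained optimization over policies (a standard formulation; the notation in the Safety Gym paper may differ slightly), where r is the reward, c is the cost, and d is a fixed cost limit chosen by the designer:

\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\tau \sim \pi}\left[ \sum_{t} \gamma^{t}\, c(s_t, a_t) \right] \le d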

We think constrained RL may turn out to be more useful than normal RL for ensuring that agents satisfy safety requirements. A big problem with normal RL is that everything about the agent’s eventual behavior is described by the reward function, but reward design is fundamentally hard. A key part of the challenge comes from picking trade-offs between competing objectives, such as task performance and satisfying safety requirements. In constrained RL, we don’t have to pick trade-offs—instead, we pick outcomes, and let algorithms figure out the trade-offs that get us the outcomes we want.

We can use the self-driving car case to sketch what this means in practice. Suppose the car earns some amount of money for every trip it completes, and has to pay a fine for every collision.

In normal RL, you would pick the collision fine at the beginning of training and keep it fixed forever. The problem here is that if the pay-per-trip is high enough, the agent may not care whether it gets in lots of collisions (as long as it can still complete its trips). In fact, it may even be advantageous to drive recklessly and risk those collisions in order to get the pay. We have seen this before when training unconstrained RL agents.

By contrast, in constrained RL you would pick the acceptable collision rate at the beginning of training, and adjust the collision fine until the agent is meeting that requirement. If the car is getting in too many fender-benders, you raise the fine until that behavior is no longer incentivized.
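
The adjust-the-fine procedure described above is essentially a Lagrangian method, with the fine playing the role of a learned multiplier. A minimal sketch of the multiplier update follows; the names penalty_lr and cost_limit are illustrative and not the Safety Starter Agents API.

penalty = 0.0        # the "collision fine", learned instead of hand-tuned
penalty_lr = 0.01    # step size for adjusting the penalty
cost_limit = 25.0    # acceptable cost per episode, chosen before training

def update_penalty(episode_cost, penalty):
    """Raise the penalty when costs exceed the limit, lower it otherwise."""
    penalty += penalty_lr * (episode_cost - cost_limit)
    return max(penalty, 0.0)  # the multiplier must stay non-negative

# During training, the agent then maximizes: reward - penalty * cost.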

Safety Gym

To study constrained RL for safe exploration, we developed a new set of environments and tools called Safety Gym. By comparison to existing environments for constrained RL, Safety Gym environments are richer and feature a wider range of difficulty and complexity.

In all Safety Gym environments, a robot has to navigate through a cluttered environment to achieve a task. There are three pre-made robots (Point, Car, and Doggo), three main tasks (Goal, Button, and Push), and two levels of difficulty for each task. We give an overview of the robot-task combinations below, but make sure to check out the paper for details.

In these videos, we show how an agent without constraints tries to solve these environments. Every time the robot does something unsafe—which here, means running into clutter—a red warning light flashes around the agent, and the agent incurs a cost (separate from the task reward). Because these agents are unconstrained, they often wind up behaving unsafely while trying to maximize reward.

Point is a simple robot constrained to the 2D plane, with one actuator for turning and another for moving forward or backward. Point has a small front-facing square which helps with the Push task.
Goal: Move to a series of goal positions.
Button: Press a series of goal buttons.
Push: Move a box to a series of goal positions.

Car has two independently-driven parallel wheels and a free-rolling rear wheel. For this robot, turning and moving forward or backward require coordinating both of the actuators.
Goal: Move to a series of goal positions.
Button: Press a series of goal buttons.
Push: Move a box to a series of goal positions.

Doggo is a quadruped with bilateral symmetry. Each of its four legs has two controls at the hip, for azimuth and elevation relative to the torso, and one in the knee, controlling angle. A uniform random policy keeps the robot from falling over and generates travel.
Goal: Move to a series of goal positions.
Button: Press a series of goal buttons.
Push: Move a box to a series of goal positions.
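
A hedged sketch of interacting with one of these robot-task combinations is below. The environment id follows Safety Gym’s Safexp-{Robot}{Task}{Level}-v0 naming, and the per-step cost signal is read from the info dict; if the installed version differs, adjust accordingly.

import gym
import safety_gym  # registers the Safexp-* environments with gym

env = gym.make("Safexp-PointGoal1-v0")
obs = env.reset()
episode_cost = 0.0
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    episode_cost += info.get("cost", 0.0)  # constraint violations accumulate here
    if done:
        print("episode cost:", episode_cost)
        obs, episode_cost = env.reset(), 0.0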

Benchmark

To help make Safety Gym useful out-of-the-box, we evaluated some standard RL and constrained RL algorithms on the Safety Gym benchmark suite: PPO, TRPO, Lagrangian penalized versions of PPO and TRPO, and Constrained Policy Optimization (CPO).

Our preliminary results demonstrate the wide range of difficulty of Safety Gym environments: the simplest environments are easy to solve and allow fast iteration, while the hardest environments may be too challenging for current techniques. We also found that Lagrangian methods were surprisingly better than CPO, overturning a previous result in the field.

Below, we show learning curves for average episodic return and average episodic sum of costs. In our paper, we describe how to use these and a third metric (the average cost over training) to compare algorithms and measure progress.

[Chart: learning curves showing that return and cost trade off against each other meaningfully.]

To facilitate reproducibility and future work, we’re also releasing the algorithm code we used to run these experiments as the Safety Starter Agents repo.

Open problems

There is still a lot of work to do on refining algorithms for constrained RL, and combining them with other problem settings and safety techniques. There are three things we are most interested in at the moment:

  1. Improving performance on the current Safety Gym environments.
  2. Using Safety Gym tools to investigate safe transfer learning and distributional shift problems.
  3. Combining constrained RL with implicit specifications (like human preferences) for rewards and costs.

Our expectation is that, in the same way we today measure the accuracy or performance of systems at a given task, we’ll eventually measure the “safety” of systems as well. Such measures could feasibly be integrated into assessment schemes that developers use to test their systems, and could potentially be used by the government to create standards for safety.[1] We also hope that systems like Safety Gym can make it easier for AI developers to collaborate on safety across the AI sector via work on open, shared systems.

If you’re excited to work on safe exploration problems with us, we’re hiring!


Acknowledgments

We gratefully acknowledge the many people who contributed to this release. Thanks to Christy Dennison, Ethan Knight, and Adam Stooke for discussions, research, and testing of Safety Gym along the way, and to Mira Murati for supporting the project team. Thanks to Karl Cobbe, Matthias Plappert, and Jacob Hilton for feedback on the paper. Thanks to Ashley Pilipiszyn, Ben Barry, Justin Jay Wang, Richard Perez, Jen DeRosa, and Greg Brockman for work on editing, designing, illustrating, and shipping this post. Thanks to Amanda Askell, Jack Clark, and Miles Brundage for discussions and blog post contributions on measurements for AI safety and policy implications. Thanks to Chris Hesse for liaising on open source release guidelines.


Footnotes

  1. OpenAI’s comments in response to a request for information from the US agency NIST regarding Artificial Intelligence Standards. ↩︎
