Using Selective Attention in Reinforcement Learning Agents

Posted by Yujin Tang, Research Software Engineer and David Ha, Staff Research Scientist, Google Research, Tokyo

Inattentional blindness is the psychological phenomenon that causes one to miss things in plain sight, and is a consequence of the selective attention that enables one to remain focused on important parts of the world without distraction from irrelevant details. It is believed that this selective attention mechanism enables people to condense broad sensory information into a form that is compact enough to be used for future decision making. While this may seem to be a limitation, such “bottlenecks” observed in nature can also inspire the design of machine learning systems that hope to mimic the success and efficiency of biological organisms. For example, while most methods presented in the deep reinforcement learning (RL) literature allow an agent to access the entire visual input, and some even incorporate modules for predicting future sequences of visual inputs, could reducing an agent’s access to its visual inputs via an attention constraint actually be beneficial to its performance?

In our recent GECCO 2020 paper, “Neuroevolution of Self-Interpretable Agents” (AttentionAgent), we investigate the properties of agents that employ such a self-attention bottleneck. We show that not only are they able to solve challenging vision-based tasks from pixel inputs with 1000x fewer learnable parameters than conventional methods, they also generalize better to unseen modifications of their tasks, simply due to their ability to “not see details” that can confuse them. Furthermore, looking at where the agent focuses its attention provides visual interpretability into its decision-making process. The following diagram illustrates how the agent learned to deal with its attention bottleneck:

AttentionAgent learned to attend to task-critical regions in its visual inputs. In a car driving task (CarRacing, top row), the agent mostly attends to the road borders, but shifts its focus to the turns before it changes heading directions. In a fireball dodging game (DoomTakeCover, bottom row), the agent focuses on fireballs and enemy monsters. Left: Visual inputs to the agent. Center: The agent’s attention overlaid on the visual inputs; the white patches indicate where the agent focuses its attention. Right: Visual cues on which the agent bases its decisions.

Agent with Artificial Attention
While there have been several works that explore how constraints such as sparsity may play a role in actually shaping the abilities of reinforcement learning agents, AttentionAgent takes inspiration from concepts related to inattentional blindness: when the brain is involved in effort-demanding tasks, it assigns most of its attention capacity only to task-relevant elements and is temporarily blind to other signals. To achieve this, we segment the input image into several patches and then rely on a modified self-attention architecture to simulate voting between patches to elect a subset to be considered important. The patches of interest are elected at each time step and, once determined, AttentionAgent makes decisions based solely on these patches, ignoring the rest.
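To make the patch election step concrete, below is a minimal NumPy sketch of one way to implement it: slide a window over the frame to obtain flattened patches, score them with a single self-attention pass, and keep only the top-k. The patch size, stride, value of k, and the projection matrices w_key and w_query are illustrative placeholders rather than the exact settings of the released implementation.

```python
import numpy as np

def extract_patches(image, patch_size=7, stride=4):
    """Slide a window over the image and flatten each patch into a row vector."""
    h, w, _ = image.shape
    patches, centers = [], []
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(image[i:i + patch_size, j:j + patch_size].reshape(-1))
            centers.append((i + patch_size // 2, j + patch_size // 2))
    return np.array(patches), np.array(centers)   # (N, patch_dim), (N, 2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_importance(patches, w_key, w_query):
    """Self-attention 'voting': a patch's importance is the total attention it receives."""
    keys, queries = patches @ w_key, patches @ w_query          # (N, d) each
    attn = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))  # (N, N), rows sum to 1
    return attn.sum(axis=0)                                     # (N,) importance vector

def elect_patches(importance, centers, k=10):
    """Keep only the k most important patches; the controller sees nothing else."""
    top = np.argsort(importance)[::-1][:k]
    return top, centers[top]
```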

In addition to extracting key factors from visual inputs, the ability to contextualize these factors as they change in time is just as crucial. For example, a batter in the game of baseball must use visual signals to continuously keep track of the baseball’s location in order to predict its position and be able to hit it. In AttentionAgent, a long short-term memory (LSTM) model accepts information from the important patches and generates an action at each time step. The LSTM keeps track of changes in the input sequence, and can thus use that information to track how critical factors evolve over time.
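As a sketch of how the controller consumes the elected patches, assume for illustration that the feature passed on for each patch is simply its center location in the frame; the function names, shapes, and the tanh action head below are likewise illustrative rather than the released architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, w, b):
    """One LSTM step; w has shape (input_dim + hidden_dim, 4 * hidden_dim)."""
    z = np.concatenate([x, h]) @ w + b
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the cell memory
    h_new = sigmoid(o) * np.tanh(c_new)                # expose it through the output gate
    return h_new, c_new

def act(elected_centers, h, c, lstm_w, lstm_b, out_w, out_b):
    """Map the coordinates of the elected patches to an action, carrying (h, c)
    across time steps so the controller can track how the scene evolves."""
    x = elected_centers.reshape(-1).astype(np.float64)  # k patches -> 2k inputs
    h, c = lstm_step(x, h, c, lstm_w, lstm_b)
    action = np.tanh(h @ out_w + out_b)                 # bounded continuous action, e.g. steer/gas/brake
    return action, h, c
```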

It is conventional to optimize a neural network with backpropagation. However, because AttentionAgent contains non-differentiable operations for the generation of important patches, like sorting and slicing, it is not straightforward to apply such techniques for training. We therefore turn to derivative-free optimization algorithms to overcome this difficulty.
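The sketch below shows the overall shape of such a training loop, using CMA-ES (via the pycma package) as one example of a derivative-free optimizer. Here episode_return is a hypothetical placeholder for rolling the agent out in the environment, and the parameter count and population size are illustrative only.

```python
import numpy as np
import cma  # pycma: pip install cma

def episode_return(params):
    """Placeholder: decode the flat parameter vector into the attention and LSTM
    weights, roll out one episode in the environment, and return the total reward."""
    return 0.0

num_params = 3000  # illustrative; the agent has only a few thousand learnable parameters
es = cma.CMAEvolutionStrategy(np.zeros(num_params), 0.1, {"popsize": 64})

for generation in range(1000):
    candidates = es.ask()                                # sample a population of parameter vectors
    fitness = [-episode_return(p) for p in candidates]   # CMA-ES minimizes, so negate the reward
    es.tell(candidates, fitness)                         # update the search distribution
```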

Overview of our method and illustration of data processing flow in AttentionAgent. Top: Input transformation — A sliding window segments an input image into smaller patches, and then “flattens” them for future processing. Middle: Patch election — The modified self-attention module holds votes between patches to generate a patch importance vector. Bottom: Action generation — AttentionAgent picks the patches of the highest importance, extracts corresponding features and makes decisions based on them.

Generalization to Unseen Modifications of the Environment
We demonstrate that AttentionAgent learned to attend to a variety of regions in the input images. Visualizing the important patches provides a peek into how the agent makes decisions, illustrating that most selections make sense and are consistent with human intuition; it is also a powerful tool for analyzing and debugging an agent in development. Furthermore, since the agent learned to ignore information that is not critical to the core task, it can generalize to tasks where small environmental modifications are applied.

Here, we show that restricting the agent’s decision-making controller’s access to important patches only while ignoring the rest of the scene can result in better generalization, simply due to how the agent is restricted from “seeing things” that can confuse it. Our agent is trained to survive in the VizDoom TakeCover environment only, but it can also survive in unseen settings with higher walls, different floor textures, or when confronted with a distracting sign.

DoomTakeCover Generalization: The AttentionAgent is trained in the environment with no modifications (left). It is able to adapt to changes in the environment, such as a higher wall (middle, left), a different floor texture (middle, right), or floating text (right).

When one learns to drive during a sunny day, one also can transfer those skills (to some extent) to driving at night, on a rainy day, in a different car, or in the presence of bird droppings on the windshield. AttentionAgent is not only able to solve CarRacing-v0, it can also achieve similar performance in unseen conditions, such as brighter or darker scenery, or having its vision modified by artifacts such as side bars or background blobs, while requiring 1000x fewer parameters than conventional methods that fail to generalize.

CarRacing Generalization: No modification (left); color perturbation (middle, left); vertical bars on left and right (middle, right); added red blob (right).

Limitations and Future Work
While AttentionAgent is able to cope with various modifications of the environment, there are limitations to this approach, and much more work to be done to further enhance the generalization capabilities of the agent. For example, AttentionAgent does not generalize to cases where dramatic background changes are involved. The agent trained on the original car racing environment with the green grass background fails to generalize when the background is replaced with distracting YouTube videos. When we take this one step further and replace the background with pure uniform noise, we observe that the agent’s attention module breaks down and attends only to random patches of noise, rather than to the road-related patches. If we train an agent from scratch in the noisy background environment, it manages to get around the track, although the performance is mediocre. Interestingly, the agent still attends only to the noise rather than to the road; it appears to have learned to drive by estimating where the lane is based on the number of selected patches on the left and right of the screen.

AttentionAgent fails to generalize to drastically modified environments. Left: The background suddenly becomes a cat (Creative Commons video). Middle: The background suddenly becomes an arcade game (Creative Commons video). Right: AttentionAgent learned to drive on pure noise background by avoiding noise patches.

The simple method we use to extract information from the important patches may be inadequate for more complicated tasks. Learning more meaningful features, and perhaps even extracting symbolic information from the visual input, is an exciting future direction. In addition to open-sourcing the code to the research community, we have also released CarRacingExtension, a suite of car racing tasks that involve various environmental modifications, as a testbed and benchmark for ML researchers interested in agent generalization.

Acknowledgements
This research was conducted by Yujin Tang, Duong Nguyen, and David Ha. We would like to thank Yingtao Tian, Lana Sinapayen, Shixin Luo, Krzysztof Choromanski, Sherjil Ozair, Ben Poole, Kai Arulkumaran, Eric Jang, Brian Cheung, Kory Mathewson, Ankur Handa, and Jeff Dean for valuable discussions.