Reinforcement learning is a promising avenue of AI development, producing agents that can handle extremely complex tasks. Reinforcement learning algorithms are used in the creation of mobile robotics systems and self-driving cars, among other applications. However, because reinforcement learning agents are trained by trial and error, they can occasionally manifest bizarre and unexpected behaviors. These behaviors can be dangerous, and AI researchers refer to this as the "safe exploration" problem: while learning, the agent may wander into and explore unsafe states.
Recently, Google's AI research lab DeepMind released a paper proposing new methods for dealing with the safe exploration problem and training reinforcement learning agents in a safer fashion. The method suggested by DeepMind also corrects for reward hacking, where an agent exploits loopholes in the reward criteria.
DeepMind's new method uses two systems to guide the behavior of the agent in situations where unsafe behavior could arise: a generative model and a forward dynamics model. Both models are trained on a variety of data, such as demonstrations by safety experts and completely random vehicle trajectories. A supervisor labels the data with specific reward values, and the agent picks up on the patterns of behavior that enable it to collect the greatest reward. Unsafe states are also labeled, and once the model can successfully predict rewards and unsafe states, it is deployed to carry out the targeted actions.
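The reward-learning step described above can be sketched in a few lines. The following is a minimal, illustrative sketch, not DeepMind's actual code: a tiny neural reward model fit by gradient descent to supervisor-labeled states, with all names, shapes, and labeling conventions assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reward_model(state_dim, hidden=32):
    """A tiny one-hidden-layer network mapping a state to a scalar reward."""
    return {
        "w1": rng.normal(0, 0.1, (state_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.1, (hidden, 1)), "b2": np.zeros(1),
    }

def predict_reward(model, states):
    h = np.tanh(states @ model["w1"] + model["b1"])
    return (h @ model["w2"] + model["b2"]).squeeze(-1)

def train_step(model, states, labels, lr=0.05):
    """One gradient step on mean squared error against supervisor labels."""
    h = np.tanh(states @ model["w1"] + model["b1"])
    pred = (h @ model["w2"] + model["b2"]).squeeze(-1)
    err = (pred - labels)[:, None] * (2.0 / len(labels))
    # Gradients computed by hand, all before any parameter is updated.
    g_w2, g_b2 = h.T @ err, err.sum(0)
    d_h = err @ model["w2"].T * (1 - h * h)
    g_w1, g_b1 = states.T @ d_h, d_h.sum(0)
    for name, g in (("w1", g_w1), ("b1", g_b1), ("w2", g_w2), ("b2", g_b2)):
        model[name] -= lr * g
    return float(np.mean((pred - labels) ** 2))

# Toy labeled data: the supervisor assigns reward +1 to "safe" states and
# -1 to "unsafe" ones (here, arbitrarily, the sign of the first coordinate).
states = rng.normal(size=(64, 4))
labels = np.where(states[:, 0] > 0, 1.0, -1.0)

model = make_reward_model(state_dim=4)
losses = [train_step(model, states, labels) for _ in range(300)]
```

Once trained, `predict_reward` scores candidate states; because the supervisor's labels also cover unsafe states, the model learns to flag them before the agent is ever deployed.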
The research team explains in the paper that the idea is to generate hypothetical behaviors from scratch, to suggest the desired behaviors, and to make these hypothetical scenarios as informative as possible while avoiding direct interference with the learning environment. The DeepMind team refers to this approach as ReQueST, or reward query synthesis via trajectory optimization.
ReQueST synthesizes four different types of trajectories. The first maximizes uncertainty across an ensemble of reward models. The second and third minimize and maximize predicted reward, respectively: predicted reward is minimized in order to surface behaviors the model may be scoring incorrectly, while it is maximized in order to elicit labels with the highest information value. Finally, the fourth maximizes the novelty of trajectories, so that the model continues to explore regardless of the projected rewards.
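The four objectives can be pictured as scoring functions that a trajectory optimizer would maximize. The sketch below is illustrative only, not the paper's implementation; `reward_models` stands in for the trained ensemble, `seen_states` for previously visited states, and all names are assumptions.

```python
import numpy as np

def ensemble_uncertainty(traj, reward_models):
    """Objective 1: disagreement (std. dev.) across the reward-model ensemble."""
    preds = np.stack([m(traj) for m in reward_models])  # (n_models, T)
    return preds.std(axis=0).sum()

def min_predicted_reward(traj, reward_models):
    """Objective 2: negated mean reward, to find behaviors the model may mis-score."""
    return -np.stack([m(traj) for m in reward_models]).mean(axis=0).sum()

def max_predicted_reward(traj, reward_models):
    """Objective 3: mean reward, to elicit the most informative positive labels."""
    return np.stack([m(traj) for m in reward_models]).mean(axis=0).sum()

def novelty(traj, seen_states):
    """Objective 4: mean distance from each trajectory state to its nearest
    previously visited state, encouraging exploration regardless of reward."""
    d = np.linalg.norm(traj[:, None, :] - seen_states[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Example scoring: a trajectory of 3 two-dimensional states against a
# two-model ensemble (both models are toy stand-ins for trained networks).
traj = np.ones((3, 2))
ensemble = [lambda t: t.sum(axis=-1), lambda t: 2.0 * t.sum(axis=-1)]
```

In the full method, the generative model proposes trajectories and each objective is optimized in turn, so the supervisor only ever labels synthesized behavior rather than anything the agent actually executes.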
Once the reward model predicts rewards well enough, a planning agent is used to make decisions based on the learned rewards. This model-predictive control scheme lets the agent avoid unsafe states by using the dynamics model to predict the possible consequences of its actions before taking them, in contrast to algorithms that learn through pure trial and error.
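The planning step can be illustrated with a simple random-shooting model-predictive controller. This is a hedged sketch of the general technique, not DeepMind's planner: `dynamics`, `reward`, and `is_unsafe` stand in for the learned dynamics model, the learned reward model, and the unsafe-state predictor, and all names and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_action(state, dynamics, reward, is_unsafe,
                horizon=5, n_candidates=64, action_dim=1, unsafe_penalty=1e3):
    """Sample random action sequences, roll each through the learned
    dynamics model, score them with the learned reward model, heavily
    penalize sequences predicted to enter unsafe states, and return the
    first action of the best-scoring sequence."""
    actions = rng.uniform(-1, 1, (n_candidates, horizon, action_dim))
    scores = np.zeros(n_candidates)
    for i in range(n_candidates):
        s = state
        for t in range(horizon):
            s = dynamics(s, actions[i, t])
            scores[i] += reward(s)
            if is_unsafe(s):
                # Unsafe states are avoided in imagination, never visited.
                scores[i] -= unsafe_penalty
    return actions[scores.argmax(), 0]

# Toy 1-D example: the agent prefers states near zero and must never let
# its predicted state exceed an unsafe threshold of 2.
a = plan_action(
    state=1.0,
    dynamics=lambda s, u: s + u[0],
    reward=lambda s: -abs(s),
    is_unsafe=lambda s: s > 2.0,
)
```

Only the first action of the winning sequence is executed; the agent then replans from the new state, which is what lets dangerous outcomes be rejected inside the model rather than discovered by trial and error.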
As reported by VentureBeat, the DeepMind researchers believe that their project is the first reinforcement learning system that is capable of learning in a controlled, safe fashion:
“To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states. So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment.”