stub DeepMind and Google Brain Aim Create Methods to Improve Efficiency of Reinforcement Learning - Unite.AI
Connect with us

Artificial Intelligence

DeepMind and Google Brain Aim Create Methods to Improve Efficiency of Reinforcement Learning

Updated on

Reinforcement learning systems can be powerful and robust, able to carry out extremely complex tasks through thousands of iterations of training. While reinforcement learning algorithms are capable of enabling sophisticated and occasionally surprising behavior, they take a long time to train and require vast amounts of data. These factors make reinforcement learning techniques rather inefficient, and recently research teams from Alphabet DeepMind and Google Brain have endeavored to find more efficient methods of creating reinforcement learning systems.

As reported by VentureBeat, the combined research group recently proposed methods of making reinforcement learning training more efficient. One of the proposed improvements was an algorithm dubbed Adaptive Behavior Policy Sharing (ABPS), while the other was a framework called Universal Value Function Approximators (UVFA). ABPS lets pools of AI agents share their adaptively selected experiences, while UVFA lets those AI simultaneously investigate directed exploration policies.

ABPS is intended to expedite the customization of hyperparameters when training a model. ABPS makes finding the optimal hyperparameters quicker by allowing several different agents with different hyperparameters to share their behavior policy experiences. To be more precise, ABPS lets reinforcement learning agents select actions from those actions that a policy has deemed okay and afterward it's granted a reward and observation based on the following state.

AI reinforcement agents are trained with various combinations of possible hyperparameters, like decay rate and learning rate. When training a model, the goal is that the model converges on the combination of hyperparameters that gives it the best performance, and in this case those that also improve data efficiency. The efficiency is increased by training many agents at one time and choosing the behavior of only one agent to be deployed during the next time step. The policy that the target agent has is used to sample actions. The transitions are then logged within a shared space, and this space is constantly evaluated so that policy selection doesn’t have to occur as often. At the end of the training, an ensemble of agents is chosen and the top performing agents are selected to undergo final deployment.

In terms of UVFA, it attempts to deal with one of the common problems of reinforcement learning, that weakly reinforced agents often don’t learn tasks. UVFA attempts to solve the issue by having the agent learn a separate set of exploitation and exploration policies at the same time. Separating the tasks creates a framework that allows the exploratory policies to keep exploring the environment while the exploitation policies continue to try and maximize the reward for the current task. The exploratory policies of UVFA serve as a baseline architecture that will continue to improve even if there are no natural rewards being found. In such a condition, a function which corresponds to intrinsic rewards is approximated, which pushes the agents to explore all states in an environment, even if they often return to familiar states.

As VentureBeat explained, when the UVFA framework is in play, the intrinsic rewards of the system are given directly to the agent as inputs. The agent then keeps track of a representation of all inputs (such as rewards, action, and state) during a given episode. The result is that the reward is preserved over time and the agent’s policy is at least somewhat informed by it at all times.

This is accomplished with the utilization of an “episodic novelty” and a “life-long novelty” module.  The function of the first module is to hold the current, episodic memory and map the current findings to the previously mentioned representation, letting the agent determine an intrinsic episodic reward for every step of training. Afterward, the state-linked with the current observation is added into memory. Meanwhile, the life-long novelty module is responsible for influencing how often the agent explores over the course of many episodes.

According to the Alphabet/Google teams, the new training techniques have already demonstrated the potential for substantial improvement while training a reinforcement learning system. UVFA was able to double the performance of some of the base agents that played various Atari games. Meanwhile, ABPS was able to increase performance on some of the same Atari games, decreasing variance amongst the top performing agents by approximately 25%. The UVFA trained algorithm was able to achieve a high score in Pitfall by itself, lacking any engineered features of human demos.