One of the interesting facts about researching AI is that AI systems can often execute actions and pursue strategies that surprise the very researchers designing them. This happened during a recent virtual game of hide and seek where multiple AI agents were pitted against one another. Researchers at OpenAI, an AI firm based out of San Francisco, were surprised to find that their AI agents started exploiting strategies in the game world that the researchers didn’t even know existed.
OpenAI has trained a group of AI agents to play a hide and seek game with each other. The AI programs are trained with reinforcement learning, a technique where the desired behavior is elicited from the AI algorithms by providing the algorithms with feedback. The AI starts out by taking random actions, and every time it takes an action that gets it closer to its goal, the agent is rewarded. The AI desires to gain the maximum amount of reward possible, so it will experiment to see which actions gain it more reward. Through trial and error, the agents learn to identify the strategies that bring them victory, which are the ones that earn them the most reward.
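The trial-and-error loop described above can be sketched with a toy example. This is a minimal illustration, not OpenAI's training setup: the three-option "bandit" problem, the reward values, and names such as `run_bandit` and `epsilon` are assumptions made for the sketch. The agent mostly repeats the action it currently believes is best, but occasionally tries a random one, and it updates its reward estimates from the feedback it receives.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a toy problem (all values hypothetical)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # the agent's current reward estimates
    counts = [0] * len(true_means)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # explore: try a random action
            action = rng.randrange(len(true_means))
        else:
            # exploit: repeat the action that currently looks best
            action = max(range(len(true_means)), key=lambda a: estimates[a])
        # noisy feedback from the environment
        reward = rng.gauss(true_means[action], 0.5)
        counts[action] += 1
        # move the estimate toward the observed reward (running average)
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 1.0])
best_action = max(range(len(counts)), key=lambda a: counts[a])
print("most-chosen action:", best_action)
```

The exploration rate `epsilon` controls the trade-off: with no exploration the agent can get stuck on the first action that happens to pay off, while too much exploration wastes time on actions it already knows are poor.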
Reinforcement learning has already demonstrated impressive success at learning the rules of games. OpenAI recently trained a team of AI agents to play the multiplayer online battle arena (MOBA) game Dota 2, and the AI defeated a world-champion team of human players last year. A similar thing happened with the game StarCraft when an AI was trained on the game by DeepMind. Reinforcement learning has also been used to teach AI programs to play Pictionary with humans, learning to interpret pictures and use basic common sense reasoning.
In the hide and seek video game created by the researchers, multiple AI agents were pitted against one another. The result was an arms race of sorts, where each agent wants to outperform the other and obtain the most reward points. A new strategy adopted by one agent will cause its opponent to seek a new strategy to counter it, and vice-versa. Igor Mordatch, a researcher at OpenAI, explained to IEEE Spectrum that the experiment demonstrates that this process of trial and error playing between agents “is enough for the agents to learn surprising behaviors on their own—it’s like children playing with each other.”
What were the surprising behaviors exactly? The researchers had four basic strategies that they expected the AI agents to learn, and they learned these fairly quickly, becoming competent in them after just 25 million simulated games. The game took place in a 3D environment full of ramps, blocks, and walls. The AI agents learned to chase each other around, move blocks to build forts they could hide in, and move ramps around. The AI seekers learned to drag ramps around to get inside the hiders’ forts, while the hiders learned to try and take the ramps into their forts so the seekers couldn’t use them.
However, around the benchmark of 380 million games, something unexpected happened. The AI agents learned to use two strategies the researchers didn’t expect. The seeker agents learned that by jumping onto a box and tilting/riding the box towards a nearby fort, they could jump into the fort and find the hider. The researchers hadn’t even realized that this was possible within the physics of the game environment. The hiders learned to deal with this issue by dragging the boxes into place within their fort.
While the unexpected behavior of agents trained on reinforcement learning algorithms is harmless in this instance, it does raise some potential concerns about how reinforcement learning is applied to other situations. A member of the OpenAI research team, Bowen Baker, explained to IEEE Spectrum that these unexpected behaviors could be potentially dangerous. After all, what if robots started behaving in unexpected ways?
“Building these environments is hard,” Baker explained. “The agents will come up with these unexpected behaviors, which will be a safety problem down the road when you put them in more complex environments.”
However, Baker also explained that reinforcement strategies could lead to innovative solutions to current problems. Systems trained with reinforcement learning could solve a wide array of problems with solutions we may not even be able to imagine.
DeepMind and Google Brain Aim to Create Methods to Improve Efficiency of Reinforcement Learning
Reinforcement learning systems can be powerful and robust, able to carry out extremely complex tasks through thousands of iterations of training. While reinforcement learning algorithms are capable of enabling sophisticated and occasionally surprising behavior, they take a long time to train and require vast amounts of data. These factors make reinforcement learning techniques rather inefficient, and recently research teams from Alphabet DeepMind and Google Brain have endeavored to find more efficient methods of creating reinforcement learning systems.
As reported by VentureBeat, the combined research group recently proposed methods of making reinforcement learning training more efficient. One of the proposed improvements was an algorithm dubbed Adaptive Behavior Policy Sharing (ABPS), while the other was a framework called Universal Value Function Approximators (UVFA). ABPS lets pools of AI agents share their adaptively selected experiences, while UVFA lets those AI simultaneously investigate directed exploration policies.
ABPS is intended to expedite the tuning of hyperparameters when training a model. ABPS makes finding the optimal hyperparameters quicker by allowing several different agents with different hyperparameters to share their behavior policy experiences. To be more precise, ABPS lets reinforcement learning agents select actions from those that a behavior policy permits, after which the agent receives a reward and an observation based on the resulting state.
AI reinforcement agents are trained with various combinations of possible hyperparameters, like decay rate and learning rate. When training a model, the goal is that the model converges on the combination of hyperparameters that gives it the best performance, and in this case those that also improve data efficiency. The efficiency is increased by training many agents at one time and choosing the behavior of only one agent to be deployed during the next time step. The policy that the target agent has is used to sample actions. The transitions are then logged within a shared space, and this space is constantly evaluated so that policy selection doesn’t have to occur as often. At the end of the training, an ensemble of agents is chosen and the top performing agents are selected to undergo final deployment.
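The idea of a pool of agents sharing one behavior policy's experience can be sketched as follows. This is a loose illustration under stated assumptions, not the algorithm from the paper: the toy corridor environment, the tabular `QAgent` class, the hyperparameter grid, the epsilon-greedy rule for picking the behavior agent, and the choice to pick it once per episode are all hypothetical simplifications.

```python
import random

N_STATES, GOAL, MAX_STEPS = 5, 4, 20  # hypothetical toy corridor

def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 on reaching the goal."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

class QAgent:
    """Tabular Q-learner; each pool member has its own hyperparameters."""
    def __init__(self, lr, epsilon):
        self.lr, self.epsilon = lr, epsilon
        self.q = [[0.0, 0.0] for _ in range(N_STATES)]

    def act(self, state, rng):
        left, right = self.q[state]
        if rng.random() < self.epsilon or left == right:
            return rng.randrange(2)
        return 0 if left > right else 1

    def learn(self, s, a, r, s2, done, gamma=0.9):
        target = r if done else r + gamma * max(self.q[s2])
        self.q[s][a] += self.lr * (target - self.q[s][a])

def train(episodes=300, seed=0):
    rng = random.Random(seed)
    pool = [QAgent(lr, eps) for lr in (0.05, 0.5) for eps in (0.05, 0.3)]
    avg_return = [0.0] * len(pool)  # running return per behavior agent
    uses = [0] * len(pool)
    for _ in range(episodes):
        # pick the agent whose policy generates behavior for this episode
        if rng.random() < 0.2:
            b = rng.randrange(len(pool))
        else:
            b = max(range(len(pool)), key=lambda i: avg_return[i])
        state, total, shared = 0, 0.0, []
        for _ in range(MAX_STEPS):
            action = pool[b].act(state, rng)
            nxt, reward, done = step(state, action)
            shared.append((state, action, reward, nxt, done))  # shared log
            total += reward
            state = nxt
            if done:
                break
        uses[b] += 1
        avg_return[b] += (total - avg_return[b]) / uses[b]
        # every agent in the pool learns from the same shared transitions
        for agent in pool:
            for transition in shared:
                agent.learn(*transition)
    return pool, uses

pool, uses = train()
print("episodes driven by each agent:", uses)
```

Because Q-learning is off-policy, every agent can learn from transitions generated by whichever agent was acting, so only one agent interacts with the environment at a time while the whole pool trains.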
UVFA, for its part, attempts to deal with one of the common problems of reinforcement learning: agents that receive only weak reinforcement often fail to learn tasks. UVFA attempts to solve the issue by having the agent learn a separate set of exploitation and exploration policies at the same time. Separating the tasks creates a framework that allows the exploratory policies to keep exploring the environment while the exploitation policies continue to try and maximize the reward for the current task. The exploratory policies of UVFA serve as a baseline architecture that will continue to improve even if there are no natural rewards being found. In such a condition, a function which corresponds to intrinsic rewards is approximated, which pushes the agents to explore all states in an environment, even if they often return to familiar states.
As VentureBeat explained, when the UVFA framework is in play, the intrinsic rewards of the system are given directly to the agent as inputs. The agent then keeps track of a representation of all inputs (such as rewards, action, and state) during a given episode. The result is that the reward is preserved over time and the agent’s policy is at least somewhat informed by it at all times.
This is accomplished with the use of an “episodic novelty” module and a “life-long novelty” module. The function of the first module is to hold the current, episodic memory and map the current findings to the previously mentioned representation, letting the agent determine an intrinsic episodic reward for every step of training. Afterward, the state linked with the current observation is added into memory. Meanwhile, the life-long novelty module is responsible for influencing how often the agent explores over the course of many episodes.
According to the Alphabet/Google teams, the new training techniques have already demonstrated the potential for substantial improvement while training a reinforcement learning system. UVFA was able to double the performance of some of the base agents that played various Atari games. Meanwhile, ABPS was able to increase performance on some of the same Atari games, decreasing variance amongst the top performing agents by approximately 25%. The UVFA-trained algorithm was able to achieve a high score in Pitfall by itself, without any engineered features or human demonstrations.
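The episodic part of this mechanism can be sketched in a few lines. This is a simplified illustration rather than the researchers' implementation: the `EpisodicNovelty` class, the nearest-neighbor similarity formula, and the parameters `k` and `eps` are assumptions, and the state representation (embedding) is taken as given. The life-long module is not modeled here.

```python
import math

class EpisodicNovelty:
    """Hypothetical episodic memory that rewards states unlike those already seen."""
    def __init__(self, k=3, eps=1e-3):
        self.k, self.eps = k, eps
        self.memory = []  # embeddings seen so far in the current episode

    def reset(self):
        """Clear the memory at the start of each episode."""
        self.memory.clear()

    def reward(self, embedding):
        # distances to the k nearest embeddings stored this episode
        dists = sorted(math.dist(embedding, m) for m in self.memory)[: self.k]
        # similarity is high when the stored neighbors are very close
        similarity = sum(self.eps / (d * d + self.eps) for d in dists)
        # add the current state to memory after scoring it
        self.memory.append(tuple(embedding))
        # novel states score near 1; revisited states score lower
        return 1.0 / math.sqrt(similarity + 1.0)

novelty = EpisodicNovelty()
first = novelty.reward((0.0, 0.0))   # nothing in memory yet
repeat = novelty.reward((0.0, 0.0))  # same state again
far = novelty.reward((5.0, 5.0))     # a state far from anything seen
print(first, repeat, far)
```

Returning to a familiar state within an episode earns a smaller bonus than reaching an unfamiliar one, which nudges the agent toward states it has not yet visited.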
DeepMind Discovers AI Training Technique That May Also Work In Our Brains
DeepMind just recently published a paper detailing how a newly developed type of reinforcement learning could potentially explain how reward pathways within the human brain operate. As reported by NewScientist, the machine learning training method is called distributional reinforcement learning and the mechanisms behind it seem to plausibly explain how dopamine is released by neurons within the brain.
Neuroscience and computer science have a long history together. As far back as 1951, Marvin Minsky used a system of rewards and punishments to create a computer program capable of solving a maze. Minsky was inspired by the work of Ivan Pavlov, a physiologist who demonstrated that dogs could learn through a series of rewards and punishments. DeepMind’s new paper adds to the intertwining history of neuroscience and computer science by applying a type of reinforcement learning to gain insight into how dopamine neurons might function.
Whenever a person, or animal, is about to carry out an action, the collections of neurons in their brain responsible for the release of dopamine make a prediction about how rewarding the action will be. Once the action has been carried out and the consequences (rewards) of that action made apparent, the brain releases dopamine. However, this dopamine release is scaled in accordance with the size of the error in prediction. If the reward is larger/better than expected, a stronger surge of dopamine is triggered. In contrast, a worse reward leads to less dopamine being released. The dopamine serves as a corrective function that makes the neurons tune their predictions until they converge on the actual rewards being earned. This is very similar to how reinforcement learning algorithms operate.
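The prediction-correcting process described above can be written as a simple update rule. The sketch below is illustrative only: the function name `update_prediction`, the learning rate, and the fixed reward of 1.0 are assumptions chosen for the example. The prediction is nudged by a fraction of the prediction error, so it converges on the reward actually received.

```python
def update_prediction(prediction, reward, learning_rate=0.1):
    """Move the prediction toward the observed reward by a fraction of the error."""
    error = reward - prediction  # positive if the reward beat expectations
    return prediction + learning_rate * error, error

prediction = 0.0
for trial in range(200):
    # the same reward is delivered on every trial (hypothetical value)
    prediction, error = update_prediction(prediction, reward=1.0)

print(prediction, error)
```

Early trials produce large errors and large corrections; as the prediction approaches the true reward, the error shrinks toward zero and the updates become negligible.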
The year 2017 saw DeepMind researchers release an enhanced version of a commonly used reinforcement learning algorithm, and this superior learning method was able to boost performance on many reinforcement learning tasks. The DeepMind team thought that the mechanisms behind the new algorithm could be used to better explain how dopamine neurons operate within the human brain.
In contrast to older reinforcement learning algorithms, DeepMind’s newer algorithm represents rewards as a distribution. Older reinforcement learning approaches represented estimated rewards as just a single number that stood for the average expected result. This change allowed the model to more accurately represent possible rewards and perform better as a result. The superior performance of the new training method prompted the DeepMind researchers to investigate if dopamine neurons in the human brain operate in a similar fashion.
In order to investigate the workings of dopamine neurons, DeepMind worked alongside Harvard to research the activity of dopamine neurons in mice. The researchers had the mice perform various tasks and gave them rewards based on the roll of dice, recording how their dopamine neurons fired. Different neurons seemed to predict different potential results, releasing different amounts of dopamine. Some neurons predicted lower than the actual reward while some predicted rewards higher than the actual reward. After graphing out the distribution of the reward predictions, the researchers found that the distribution of predictions was fairly close to the genuine reward distribution. This suggests that the brain does make use of a distributional system when making predictions and adjusting predictions to better match reality.
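One way to picture a population of "optimistic" and "pessimistic" predictors is to give each unit a different sensitivity to positive versus negative prediction errors. The sketch below is an assumption-laden illustration, not the study's analysis code: the `learn_expectiles` function, the asymmetry values in `taus`, the learning rate, and the six-sided die reward are all chosen for the example.

```python
import random

def learn_expectiles(taus, steps=20000, lr=0.02, seed=0):
    """Each unit weights positive errors by tau and negative errors by 1 - tau."""
    rng = random.Random(seed)
    values = [0.0] * len(taus)
    for _ in range(steps):
        reward = rng.randint(1, 6)  # reward decided by a die roll
        for i, tau in enumerate(taus):
            delta = reward - values[i]  # this unit's prediction error
            # optimistic units (high tau) respond more to good surprises
            scale = tau if delta > 0 else 1.0 - tau
            values[i] += lr * scale * delta
    return values

# pessimistic, balanced, and optimistic units (hypothetical asymmetries)
values = learn_expectiles([0.1, 0.5, 0.9])
print(values)
```

The balanced unit settles near the average reward of 3.5, while the pessimistic and optimistic units settle below and above it. Taken together, the set of predictions carries information about the spread of rewards rather than just their mean.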
The study could inform both neuroscience and computer science. The study supports the use of distributional reinforcement learning as a method of creating more advanced AI models. Beyond that, it could have implications for our theories of how the brain operates regarding reward systems. If dopamine neurons are distributed and some are more pessimistic or optimistic than others, understanding these distributions could alter how we approach aspects of psychology like mental health and motivation.
As MIT Technology Review reported, Matt Botvinick, the director of neuroscience research at DeepMind, explained the importance of the findings at a press briefing. Botvinick said:
“If the brain is using it, it’s probably a good idea. It tells us that this is a computational technique that can scale in real-world situations. It’s going to fit well with other computational processes. It gives us a new perspective on what’s going on in our brains during everyday life.”
Ubisoft Trains AI Agent To Drive A Car In A Racing Game
The term “AI” is used a lot in discussions of video games, but it typically refers to the logic that controls non-player characters rather than to any system driven by what computer scientists would recognize as AI. Actual applications of AI utilizing artificial neural networks are fairly rare within the video game industry, but as VentureBeat reports, gaming company Ubisoft has recently published a paper investigating possible uses for an AI agent trained with reinforcement learning.
While entities like DeepMind and OpenAI have investigated how AIs perform in a variety of video games, like StarCraft 2, Dota 2, and Minecraft, very little research has been done on the use of AI under the specific constraints often faced by game developers. Ubisoft La Forge, the prototyping arm of Ubisoft, just recently published a paper detailing an algorithm capable of carrying out predictable actions within a commercial video game. According to the report, the AI algorithms were capable of hitting current benchmarks and performing complex tasks reliably.
The authors of the paper note that while reinforcement learning has been used to great effect in the context of certain video games, often achieving parity with the best human players of said games, the systems created by OpenAI and DeepMind are rarely useful for game developers. The authors note that lack of accessibility is a large issue and that the most impressive results are obtained by research groups with access to large scale computational resources, resources that typically go well beyond what the average game developer has access to. Wrote the researchers:
“These systems have comparatively seen little use within the video game industry, and we believe lack of accessibility to be a major reason behind this. Indeed, really impressive results … are produced by large research groups with computational resources well beyond what is typically available within video game studios.”
The research team from Ubisoft aimed to remedy some of these problems by creating a reinforcement learning approach that optimized for issues like data sample collection and runtime budget constraints. Ubisoft’s solution was adapted from research done at the University of California, Berkeley. The Soft Actor-Critic approach developed by UC Berkeley researchers produces models that can effectively generalize to new conditions and is much more sample-efficient than most approaches. The Ubisoft team took this approach and adapted it for both discrete and continuous actions.
The Ubisoft research team evaluated the performance of their algorithm on three different games. There were two soccer games used to test the algorithm, as well as a simple platformer-style game. While the results for these games were slightly worse than the state-of-the-art industry results, another test was conducted in which the algorithms performed much better. The researchers used a driving video game as their test case, having the AI agent follow a given path and negotiate obstacles in an environment the agent hadn’t witnessed during training. There were two continuous actions, steering and acceleration, as well as one binary action (braking).
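A mixed action space like the driving one, with two continuous controls and one binary control, can be sampled as sketched below. This is not Ubisoft's implementation: the function `sample_hybrid_action`, its parameters, and the example numbers are hypothetical, and the policy outputs are passed in by hand rather than produced by a trained network. Squashing the continuous samples with `tanh` follows a common Soft Actor-Critic convention for keeping actions within bounds.

```python
import math
import random

def sample_hybrid_action(steer_mean, accel_mean, log_std, brake_logit, rng):
    """Sample two bounded continuous actions and one binary action."""
    std = math.exp(log_std)
    # continuous actions: Gaussian sample squashed into [-1, 1]
    steering = math.tanh(rng.gauss(steer_mean, std))
    acceleration = math.tanh(rng.gauss(accel_mean, std))
    # binary action: Bernoulli sample from a sigmoid probability
    brake_prob = 1.0 / (1.0 + math.exp(-brake_logit))
    braking = rng.random() < brake_prob
    return {"steering": steering, "acceleration": acceleration, "braking": braking}

rng = random.Random(0)
# hypothetical policy outputs for a single game frame
action = sample_hybrid_action(0.2, 0.8, -1.0, -2.0, rng)
print(action)
```

Treating the brake as a separate discrete draw, rather than thresholding a continuous value, lets the policy express an explicit probability of braking alongside its continuous steering and throttle choices.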
The researchers summarized their results in the paper, declaring that the hybrid Soft Actor-Critic approach was successful when training an AI agent to drive at high speeds in a commercially available video game. According to the researchers, their training approach can potentially work for a wide variety of possible interaction approaches. These include instances where the AI agent has the exact same input options that the player has, demonstrating the “practical usefulness of such an algorithm for the video game industry.”